Toward the Production of Spatiotemporally Consistent Annual Land Cover Maps Using Sentinel-2 Time Series

Land cover (LC) maps generated by the classification of remote-sensing (RS) data allow for monitoring Earth processes and the dynamics of objects and phenomena. For accurate LC variability quantification in environmental monitoring, maps need to be spatiotemporally consistent, continually updated, and able to indicate permanent changes. However, producing frequent and spatiotemporally consistent LC maps is challenging because it involves balancing the need for temporal consistency with the risk of missing real changes. In this work, we propose a scalable and semiautomatic method for generating annual LC maps with labels that are consistently applied from one year to the next. It uses a Transformer deep-learning (DL) model as a classifier, which is trained on satellite time series (TS) of images using high-performance computing (HPC). The trained model can generate stable maps by shifting the prediction window along the temporal direction. The effectiveness of the proposed approach is tested qualitatively and quantitatively on a multiannual Sentinel-2 dataset acquired over a three-year period in a study area located in the southern Italian Alps.


I. INTRODUCTION
Classification maps are vital for analyzing patterns of natural disasters as well as for tracking urban construction trends. Brown et al. [1] used Google Earth Engine to retrieve Sentinel-2 data and train a convolutional neural network (CNN) model to generate frequent LC maps. The European Space Agency (ESA) has also recently released a global LC product based on Sentinel-2 and Sentinel-1 data with a spatial resolution of 10 m. Although these approaches can produce accurate LC products, they do not avoid the risk of generating spatiotemporally inconsistent classification maps that show unrealistic year-to-year LC changes.
To guarantee spatiotemporal consistency across temporal updates, LC maps need to be generated with algorithms that are robust to noise (i.e., false alterations do not hinder real changes) and capable of generalizing on a temporal level. Furthermore, classification schemes must use consistent setups over time to ensure that LC classifications are semantically interoperable and can be compared. A method was proposed in [2] for mapping global LC types using multiyear time series (TS) of MODIS data. The method involves generating a yearly map using data from the preceding and subsequent years. Abercrombie and Friedl [3] proposed using a hidden Markov model (HMM) as a postprocessing step on a TS of LC maps to differentiate real LC change from spurious changes caused by classification errors. Recent works have focused on improving the temporal consistency of multiyear classification maps while maintaining sensitivity to LC changes [4], [5].
While these methods have improved the temporal consistency of classification maps, they may also risk obscuring inter- and intraannual LC changes, particularly when analyzing long TS of data. In addition, these methods are typically computationally demanding. This letter presents a scalable, semiautomatic approach using a Transformer deep-learning (DL) model to produce spatiotemporally consistent LC maps by continually ingesting satellite data. To this end, the proposed approach decouples the stable component of the LC map from the detection of changes, leveraging the ability of the self-attention Transformer model to accurately classify long TS of RS data. Finally, a user-guided analysis is requested to validate the obtained LC mapping result from the semantic viewpoint. The main contribution of this work is the proposal of a method that can use high-performance computing (HPC) systems to produce high-resolution LC maps that are: 1) regularly updated (e.g., annually); 2) able to identify permanent changes accurately; and 3) spatiotemporally consistent to facilitate the monitoring of ongoing environmental processes (e.g., desertification, urbanization, and deforestation).

Fig. 1. Proposed workflow that classifies the intrayear RS data stream to generate spatiotemporally consistent LC maps. The three main steps of the method are highlighted in green. In the final step, the user is requested to validate the automatically detected changes only from the semantic viewpoint.

II. CONSISTENT MULTIYEAR MAPS PRODUCTION
Fig. 1 shows the workflow of the proposed method (PM), which is based on three main steps: 1) intrayear RS data stream classification; 2) analysis of the LC label sequence; and 3) production of the multiyear LC maps. To generate spatiotemporally coherent thematic products, the PM produces an up-to-date LC map every time new RS data are acquired. This condition allows us to generate an intrayear sequence of LC labels that can be used to: 1) reduce spurious year-to-year changes; 2) strengthen the consistency of the annual maps; and 3) continually improve the accuracy of previously generated maps via a backpropagation strategy. To ensure high-quality mapping, the user is involved in the final step to revise the obtained classification results only from the semantic viewpoint, that is, to approve/discard the presence of a change and assign the new LC label.

A. Problem Formulation and Notation
Annual LC maps are typically generated by classifying TS of RS images instead of individual RS images, as this yields more accurate results. Although this approach is effective, it has two main limitations. First, the classifier assigns each pixel the LC label that best fits most of the RS images in the TS. This leads to accurate results for the stable LC component and land surface seasonality (e.g., crop phenology), while it may discard permanent changes visible in only a small portion of the TS (e.g., changes that occurred at the end of the year). Second, the spatiotemporal consistency of annual LC maps generated by separately classifying different TS of images is affected by classification errors that do not indicate real changes.
Let us focus on the first target year γ1, where a training set representing the k LC classes present in the scene is used to train the classifier C_γ1. Let TS_{t_1}^{t_n} = (X_1, X_2, . . ., X_n) be the TS of n RS images of γ1, where X_1 ∈ R^{r×q×b} is the RS image acquired at time t_1 having r × q pixels and b spectral channels. By applying C_γ1 to the TS of images acquired from t_1 to t_n, we generate the LC map M_{t_1}^{t_n}. Let TS_{t_{n+1}}^{t_m} = (X_{n+1}, X_{n+2}, . . ., X_m) be the TS of RS images acquired for the subsequent year γ2, and M_{t_{n+1}}^{t_m} be the corresponding annual map. Let x_i be the ith pixel of the considered TS, which is associated with the annual LC labels y_i^{t_n} and y_i^{t_m} visible in M_{t_1}^{t_n} and M_{t_{n+1}}^{t_m}, respectively, where y_i^{t_n}, y_i^{t_m} belong to the set of k LC classes. For simplicity, we focus here on two years. However, the PM can be easily scaled to multiyear data.
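The notation above can be made concrete with a small array sketch. The dimensions below are illustrative assumptions (not taken from the letter); the per-pixel view shows how each pixel x_i becomes a temporal sequence of b-dimensional spectra:

```python
import numpy as np

# Hypothetical dimensions, assumed for illustration (not from the letter)
r, q, b = 4, 4, 10   # rows, columns, and spectral channels of each image
n, m = 15, 30        # acquisitions up to the end of gamma_1 and gamma_2

# TS_{t_1}^{t_n}: the n images of the first year, each X_j in R^{r x q x b}
ts_year1 = np.random.rand(n, r, q, b)
# TS_{t_{n+1}}^{t_m}: the m - n images of the second year
ts_year2 = np.random.rand(m - n, r, q, b)

# Per-pixel view: pixel x_i becomes a sequence of m spectra of dimension b
full_ts = np.concatenate([ts_year1, ts_year2], axis=0)   # shape (m, r, q, b)
pixel_sequences = full_ts.reshape(m, r * q, b).transpose(1, 0, 2)
print(pixel_sequences.shape)   # (16, 30, 10), i.e., (r * q, m, b)
```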

B. Intrayear RS Data Stream Classification
To strengthen the consistency of the annual LC maps, the intrayear RS data stream is classified using the same model C_γ1 by shifting the prediction window along the temporal direction. For instance, at the beginning of γ2, the updated version of M_{t_1}^{t_n} (i.e., M_{t_2}^{t_{n+1}}) is generated by classifying the TS of images TS_{t_2}^{t_{n+1}}, which includes the first image of γ2 and excludes the first image of γ1, that is, X_{n+1} and X_1, respectively. In this letter, we use a Transformer DL model as a classifier, which has rapidly challenged recurrent networks in tasks using sequences of data [6]. Transformers can be scaled efficiently because they are based on the concept of attention, which is inherently nonrecurrent, allowing sequence parallelization during training. Moreover, Transformers can model both short- and long-term dependencies since each attention head attends to information from different positions in the sequence. Multiple attention heads are used to obtain different representations of this attention information. This leads to stable classification results even under varying cloud cover conditions, thus enabling the
classification of the continuous stream of RS data without needing any data preparation step. At the end of γ2, the method generates a TS of LC maps (M_{t_1}^{t_n}, M_{t_2}^{t_{n+1}}, . . ., M_{t_{n+1}}^{t_m}). At the end of this step, for each pixel x_i, we have an LC label sequence (y_i^{t_n}, y_i^{t_{n+1}}, . . ., y_i^{t_m}) (where n and m − n are the numbers of RS images acquired in the years γ1 and γ2, respectively) that allows us to model the temporal trajectory of each pixel, independent of neighboring pixels (see Fig. 1).
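As a minimal sketch of the shifting prediction window, the following code slides a window of length w along the per-pixel TS and records one label per window position, yielding the per-pixel label sequence described above. The classifier here is a dummy stand-in for the trained Transformer C_γ1 (an assumption for illustration only):

```python
import numpy as np

def classify_window(window):
    """Dummy stand-in for the trained classifier C_gamma1 (assumption):
    maps a (w, num_pixels, b) window to one LC label per pixel."""
    return (window.mean(axis=(0, 2)) * 3).astype(int)

def intrayear_label_sequences(pixel_ts, w):
    """Shift a prediction window of length w along the temporal axis,
    producing one LC label per pixel at each window position."""
    m = pixel_ts.shape[0]
    labels = [classify_window(pixel_ts[s:s + w]) for s in range(m - w + 1)]
    return np.stack(labels, axis=1)   # shape (num_pixels, m - w + 1)

pixel_ts = np.random.rand(30, 100, 10)  # 30 acquisitions, 100 pixels, 10 bands
seq = intrayear_label_sequences(pixel_ts, w=15)
print(seq.shape)   # (100, 16): one label sequence per pixel
```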

C. Analysis of the LC Label Sequence
Once the label sequences are completed for at least two years, we can start the production of the spatiotemporally consistent annual LC maps. LC maps can be divided into three main components: 1) the stable component φ_un; 2) the seasonal component φ_s (e.g., snow and crop phenology); and 3) the permanent changes φ_c (e.g., deforestation and urbanization). In this work, we focus our attention on annual maps that aim to represent the stable component and the interannual permanent changes, neglecting the seasonality of the land surface. For instance, for the "Snow" LC class, only the permanent snowfields are represented in an annual thematic product, while the snow patches visible only in the winter period are typically neglected.
To automatically discriminate the stable LC component from the land surface seasonality and the permanent changes, we analyze the LC label sequence generated in the previous step, that is, (M_{t_1}^{t_n}, M_{t_2}^{t_{n+1}}, . . ., M_{t_{n+1}}^{t_m}). First, the labels are mapped from categorical into numerical data. Then, we apply a 1-D median filter ∈ R^{1×p} of length p to the LC label sequence of each pixel, that is, (y_i^{t_n}, y_i^{t_{n+1}}, . . ., y_i^{t_m}), where i ∈ [1, r × q], to reduce noisy classification results. The binary difference between the obtained annual maps M_{t_1}^{t_n} and M_{t_{n+1}}^{t_m} is then computed (i.e., the standard method used in the literature to detect changes between maps). This binary output provides all the changes that occurred on the ground, that is, [φ_c, φ_s]. To remove noisy changes and determine when they occurred, the binary intrayear LC label sequence is convolved with a step function to retrieve the maximum response. The date of a permanent change can then be automatically computed by shifting this index backward along the temporal axis by half of the inference window. Finally, we disentangle the permanent changes φ_c from the seasonal ones φ_s. In the considered implementation of the method, we assume that a change can be identified as permanent if: 1) it is visible in the TS of LC maps for at least half of the inference window (stable change); 2) the LC label sequence does not return to its initial LC class more than once, since such patterns typically belong to seasonal changes φ_s; and 3) it does not occur at the very beginning of the TS of LC maps, where detections are discarded as unreliable.
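The label-sequence analysis above can be sketched for a single pixel as follows. This is a simplified implementation under stated assumptions: the smoothing uses SciPy's 1-D median filter, and the margin used for rule 3 (discarding changes at the start of the sequence) is a hypothetical choice, since the letter does not specify it:

```python
import numpy as np
from scipy.signal import medfilt

def permanent_change(labels, p=5, w=15):
    """Return the estimated change date (sequence index) for one pixel's
    LC label sequence, or None if no permanent change is detected."""
    smoothed = medfilt(labels.astype(float), kernel_size=p).astype(int)
    if smoothed[-1] == smoothed[0]:
        return None          # no net change between the two annual maps
    changed = smoothed != smoothed[0]
    # Rule 2: more than one return to the initial label -> seasonal pattern
    if np.sum(np.diff(changed.astype(int)) == -1) > 1:
        return None
    first = int(np.argmax(changed))
    # Rule 1: the change must persist for at least half the inference window
    if changed[first:].sum() < w // 2:
        return None
    # Rule 3 (assumed margin): discard changes at the start of the sequence
    if first < p:
        return None
    # Change date: shift the detection index back by half the inference window
    return max(first - w // 2, 0)

# A change appearing at index 10 and persisting: detected, dated at index 3
print(permanent_change(np.array([1] * 10 + [2] * 20)))   # 3
```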

D. Multiyear LC Maps Production
In the final step, the method generates the spatiotemporally consistent TS of LC maps by embedding the permanent changes identified in the previous step. Changes having an area lower than a given threshold a_th are discarded. By shifting the prediction window along the temporal direction, the method can capture permanent changes that occur at any time of the year. The changes are embedded in the annual LC maps according to the date estimated in the previous step. The estimated change date is injected into the yearly LC output to: 1) increase the accuracy of previously generated LC maps and 2) avoid incorrectly identifying the same changes in subsequent years. The new yearly map is produced by adding the detected permanent changes onto the initial map. The integration of the automatically detected permanent changes is finally confirmed by the user, who can visually check their presence in the RS data and assign the new LC label. We would like to remark that the user is involved only in checking the change from the semantic viewpoint (i.e., the presence of the change and the new LC label). The PM identifies the changes and their geometrical extent automatically, in an unsupervised way. The proposed quality control procedure carried out by the user aims to increase the reliability of the final thematic products.
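The area-threshold filtering of the detected changes can be sketched with connected-component labeling. This is an illustrative implementation, assuming that a_th is expressed in pixels and that change regions are 4-connected (SciPy's default connectivity):

```python
import numpy as np
from scipy import ndimage

def filter_small_changes(change_mask, ath=100):
    """Keep only connected change regions with at least ath pixels."""
    labeled, num = ndimage.label(change_mask)   # 4-connected components
    keep = np.zeros_like(change_mask, dtype=bool)
    for region in range(1, num + 1):
        component = labeled == region
        if component.sum() >= ath:
            keep |= component
    return keep

change_mask = np.zeros((50, 50), dtype=bool)
change_mask[0:12, 0:12] = True    # 144-pixel change: kept
change_mask[30:35, 30:35] = True  # 25-pixel change: discarded as noise
filtered = filter_small_changes(change_mask, ath=100)
print(int(filtered.sum()))   # 144
```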

III. DATASET DESCRIPTION AND EXPERIMENTAL SETUP
The considered study area is located in the northeastern part of Italy, where a major deforestation event occurred in 2018 due to storm Vaia, which caused significant forest damage. To generate the TS of annual LC maps, we considered Sentinel-2 data (Tile TPS32), which provide dense TS of images at high spatial resolution (10 m). The considered TS comprises 45 images acquired from January 2018 to December 2020, that is, 15 yearly acquisitions equally distributed over the three years. The length of the median filter is p = 5, while the area threshold a_th is equal to 100 pixels, according to the minimum mapping unit of the changes that we aim to detect. Only the images having cloud coverage lower than 40% were automatically downloaded through the Sentinelsat API. The available training set, representative of 2018, includes nine LC classes, namely, "Artificial land," "Grass," "Crops," "Bareland," "Broadleaves," "Conifers," "Shrub," "Water," and "Snow." The 2018 training dataset is made up of 10 000 points per class (a total of 90 000 samples), located in an area spatially disjoint from the predicted one used in the experiments.
We used a Transformer with five encoding layers, each with two attention heads. We set the number of expected features in the encoder layers to d_model = 128. No positional embedding was used. We trained the model for 100 epochs. The Adam optimizer [7] was used with a cyclic learning rate schedule, with the initial value set to 0.003. The code is based on a PyTorch implementation of the Transformer [8]. The experiments were run on the JURECA-DC HPC system at the Jülich Supercomputing Center (JSC), in particular, on the partition where each node is equipped with four Nvidia A100 graphics processing units (GPUs). The Transformer was trained from scratch using a training dataset representing the LC present in the scene, associated with CORINE LC labels, for the year 2018 (i.e., on the sequence of 15 acquisitions), and it was used to generate the map of the first target year, that is, M_{t_1}^{t_n}, and the intrayear LC maps. We used PyTorch DistributedDataParallel to scale the training of the Transformer on multiple GPUs, speeding it up from ∼500 s on one GPU to ∼45 s on 16 GPUs. The trained model is then used for inference, shifting the prediction window of length equal to 15 acquisitions along the temporal direction from the end of 2018 to the end of 2020. Given the availability of 30 Sentinel-2 images acquired in 2019 and 2020, 30 intraannual maps have been generated. We performed the inference for each pixel of the TPS32 tile (∼120 000 000 pixels) to generate an output label for each element of the input sequence. Reference data created by expert annotators were used to validate the results.
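A minimal PyTorch sketch of the stated configuration (five encoder layers, two attention heads, d_model = 128, no positional embedding) is given below. The input projection, classification head, and band count are assumptions, since the letter does not detail them; the model emits one logit vector per element of the input sequence, as described above:

```python
import torch
import torch.nn as nn

class TSTransformerClassifier(nn.Module):
    """Sketch of the letter's setup: 5 encoder layers, 2 attention heads,
    d_model = 128, no positional embedding. Projection and head are
    illustrative assumptions."""
    def __init__(self, bands=10, d_model=128, nhead=2, layers=5, classes=9):
        super().__init__()
        self.proj = nn.Linear(bands, d_model)       # per-time-step embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(d_model, classes)     # per-time-step logits

    def forward(self, x):                           # x: (batch, seq_len, bands)
        h = self.encoder(self.proj(x))
        return self.head(h)                         # (batch, seq_len, classes)

model = TSTransformerClassifier()
logits = model(torch.rand(4, 15, 10))   # 4 pixels, 15 acquisitions, 10 bands
print(logits.shape)   # torch.Size([4, 15, 9])
```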

IV. EXPERIMENTS RESULTS
The multiyear annual maps generated by the PM are compared with those generated by separately classifying the annual TS of Sentinel-2 images with: 1) the Transformer and 2) random forest (RF). While the comparison with the Transformer allows us to emphasize the added value of the PM, RF is typically used as a baseline method due to its robustness to noise. Fig. 2 shows a qualitative comparison of the agreement between the LC maps produced by the baseline methods and the PM. As expected, the baseline methods' interyear agreement appears noisier than that produced by the PM, which can correctly detect major changes while keeping the stable LC component invariant between years. These qualitative results are confirmed by the quantitative ones reported in Table I. As expected, for the "Bareland" and "Broadleaves" classes, all the methods show an agreement lower than 90% due to the impact of the permanent deforestation event that occurred in November 2018, which also had consequences in subsequent years. However, although no changes occurred for most of the LC classes, the baseline methods show lower interyear agreement than that achieved by the PM. For instance, the "Artificial land" class presents an agreement lower than 90% for the baselines, while the PM correctly keeps this class in the stable LC component. Similarly, the "Shrub" class is frequently confused with "Grass," leading to LC variations in the annual maps that do not correspond to real changes. Table II reports quantitative results of the detected changes for the three methods considering the reference change map. The PM shows significantly better scores for the detected changes than the two baseline methods because of the sharp reduction of false-positive changes. Fig.
3 shows a small portion of the study area. The Sentinel-2 images acquired in 2018-2020 clearly show deforestation in late 2018 (i.e., November 2018). Unlike the PM, neither the RF nor the baseline Transformer can detect it in the 2018 annual maps. This is a common problem when generating annual LC maps: changes that occur at the end of the year are not identified by the classifier. Note that the LC changes that occurred at the end of 2018 are correctly identified by the PM when moving the prediction window into 2019 (i.e., when the change is present in at least eight images out of the 15 of the considered prediction window). The date of the change can then be correctly backpropagated into the 2018 annual map once the user validates the presence of the change. Moreover, from the results obtained, one can see the spatiotemporal consistency of the LC maps generated by the PM, in which classification errors present at the pixel level are neglected. Furthermore, thanks to the visual check of the user, the correct LC label can be assigned to the changed areas, that is, "Bareland," thus ensuring high-quality mapping. In contrast, both baseline methods misclassify the deforestation areas as "Artificial land" because of the similarity between the urban and bare-rock spectral signatures.
V. CONCLUSION

In future work, we will investigate how to extract additional information from the Transformer model, such as visualizing the attention weights, to provide users with more detailed insights into the temporal dynamics of the LC products. Although we did not apply any manual correction to the produced maps in this work, active learning methods could be investigated to continually provide feedback to the model and produce more accurate LC maps. Dynamically adjusting the set of LC classes is vital for enhancing model robustness and reliability by addressing unpredictability in real operative scenarios, ensuring consistent performance and adaptability to variations on the ground. Moreover, we plan to update the initial training data to maintain reliable LC products and, in future developments, to test the method on datasets from various areas to address different classification challenges.

Fig. 2. Comparison of the multiyear (2018-2020) LC map agreement: (a) 2018 LC map M_{t_1}^{t_n} produced by the Transformer, (b) RF change map, (c) Transformer change map, and (d) PM change map. The true positives (green), true negatives (white), false positives (red), and false negatives (blue) are computed considering the reference change map.

Fig. 3. Examples of multiyear maps and changes obtained for a small portion of the considered study area for 2018-2020 using: 1) the Transformer, 2) the RF, and 3) the PM. The Sentinel-2 images acquired in 2018-2020 are reported together with the reference change map.

TABLE I
INTERYEAR AGREEMENT OF THE 2018-2020 LC MAPS EVALUATED AT THE CLASS LEVEL BY COMPARING THE PM WITH THE BASELINES: 1) RF AND 2) TRANSFORMER

TABLE II
PRECISION, RECALL, F1 SCORE, AND INTERSECTION OVER UNION (IOU) FOR THE RF, TRANSFORMER, AND PM AGAINST THE REFERENCE MAP REPRESENTING THE LC CHANGES