DeepSatData: Building large scale datasets of satellite images for training machine learning models

This report presents design considerations for automatically generating satellite imagery datasets for training machine learning models, with emphasis placed on dense classification tasks, e.g. semantic segmentation. The implementation presented makes use of freely available Sentinel-2 data, which allows generation of the large scale datasets required for training deep neural networks. We discuss issues faced from the point of view of deep neural network training and evaluation, such as checking the quality of ground truth data, and comment on the scalability of the approach. Accompanying code is provided at https://github.com/michaeltrs/DeepSatData.


Introduction
Currently there are more than 150 satellites in orbit equipped with dedicated instruments gathering data for a variety of Earth Observation (EO) tasks. An ever increasing amount of these data is made freely accessible to the public; for example, approximately 20 Tb of new data are made available every day through the European Space Agency's Sentinel 1-3 satellites alone. The Copernicus Open Access Hub (COAH) provides free and open access to data captured by the European Space Agency's Sentinel missions starting from the In-Orbit Commissioning Review (IOCR). These data are made available directly through COAH by use of a graphical user interface [17], through a variety of platforms from the Copernicus Data and Information Access Services (DIAS) [1, 2, 3, 4, 5] or through mirror sites [6,7,8,9,10,11,12,13,14,15,16]. However, to the best of our knowledge, there are no publicly available tools for downloading and processing Sentinel products at the scale required for successfully training machine learning models with satellite images. In this report we present DeepSatData, a simple tool for downloading and processing Sentinel products from the point of view of training deep neural networks (DNN). With DeepSatData it is possible to automatically download available satellite imagery for a given area of interest (AOI) and time period of interest (POI) and to couple these with available ground truth data to create fully annotated datasets. In addition, we present some general considerations for generating satellite imagery datasets suitable for training DNNs, with particular emphasis on dense classification tasks, e.g. semantic segmentation.

Densely annotated data
Typically, a dense classification dataset consists of input arrays and dense annotations matching two or more dimensions of the inputs. In general, obtaining annotations for dense classification tasks is a time consuming process. For example, it is estimated that annotating a single image from the fine set of the Cityscapes dataset [22] takes about 90 min of work. For datasets where annotations are not included for all objects found in the inputs, it is common practice to assign all unknown objects to a single class, which is treated either as an unknown class or as part of a background class. Depending on the formulation of the task, it is possible to treat the background class as another regular class, or to mask its influence during training and only learn to recognise the remaining classes.
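Masking the influence of the background class during training usually amounts to excluding the corresponding pixels from the loss. The following is a minimal sketch of that idea for a cross-entropy loss, written in plain Python for clarity; the function name and the choice of 0 as the background id are assumptions made for illustration, not part of DeepSatData.

```python
import math

IGNORE_LABEL = 0  # hypothetical id of the background / unknown class


def masked_cross_entropy(probs, labels, ignore_label=IGNORE_LABEL):
    """Mean cross-entropy over pixels whose label is not `ignore_label`.

    probs:  list of per-pixel class-probability lists (each sums to 1)
    labels: list of per-pixel integer class ids
    """
    total, count = 0.0, 0
    for pixel_probs, label in zip(probs, labels):
        if label == ignore_label:
            continue  # masked pixel: contributes nothing to the loss
        total += -math.log(pixel_probs[label])
        count += 1
    # If every pixel is masked there is nothing to learn from this sample.
    return total / count if count else 0.0
```

Treating the background as a regular class instead corresponds to passing an `ignore_label` that never occurs in the data.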

Dense classification tasks
Similar to the general classification problem, where the goal is to assign one of N known classes to an input array, dense classification aims to assign a class to every location, e.g. pixel, of an input array. Distinguishing between the different types of input arrays and the type of information encoded by the output classes leads to the definition of several problems in computer vision. Inputs in general contain 2 or 3 spatial dimensions or a time dimension, each with a fixed number of channels. For satellite imagery we are interested in either 2D images, i.e. a single image, or timeseries of images. In the second case, each image is typically accompanied by a timestamp giving its capture time. The interval between successive captures by a satellite is generally not constant. This is in contrast to video data, which also consist of timeseries of 2D images but have a fixed time-step between successive frames. The model output most commonly encodes semantic information, identity information or both, leading to the tasks of semantic segmentation [35,19,20,21,42,31,26,23,25], instance segmentation [28,36,41,18,27,34,33] and joint semantic-instance segmentation [29,30,37].

Downloading satellite data
Downloading all required data for an AOI and POI can be a lengthy process. That is particularly the case for data captured more than 12 months in the past, which need to be accessed through the COAH's Long-Term Archive (LTA). This means that the data first have to be requested from the LTA and are then made available to download within 24 h. Additionally, the number of requests each user can make to the LTA is limited to 1 product request every 30 min. In fact, when working with annotated data it is most likely that these correspond to a period in the past, thus all imagery products will need to be downloaded through the LTA. The limit on the amount of data that can be requested from the LTA poses a hard constraint on the number of products that can realistically be downloaded, forcing us to optimize our selection process. Given the importance of selecting the right products in space and time, we propose to spend some time manually selecting the products to download and to automate the remaining part of the dataset generation process. Below are some general criteria for optimizing the product selection process.
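To get a feel for how restrictive these limits are, the sketch below estimates a worst-case waiting time for a batch of archived products. It assumes exactly the limits quoted above (one request every 30 min, retrieval within 24 h of each request); the function name and its defaults are illustrative only.

```python
def lta_worst_case_hours(num_products, minutes_per_request=30, retrieval_hours=24):
    """Upper bound on the time until the last requested product is available.

    Requests are issued back to back at the allowed rate, so the last
    request goes out after (num_products - 1) request intervals and may
    then take up to `retrieval_hours` to be served.
    """
    if num_products <= 0:
        return 0.0
    last_request_hours = (num_products - 1) * minutes_per_request / 60
    return last_request_hours + retrieval_hours
```

Under these assumptions at most 48 products can be requested per day, so a batch of 100 products may take roughly three days to become available.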

Low cloud cover ratio
Cloud cover percentage is calculated for the full extent of a Sentinel product. While a clear view of the AOI can sometimes be found in a cloudy product (especially for a small AOI), clear images are more likely to be obtained from products with low cloud coverage. Thus, we prioritize downloading the less cloudy images over the more cloudy ones. This parameter is controlled by the user defined variable "cloudcoverpercentage" at the start of each product selection script.
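A minimal sketch of this criterion is given below. It assumes each candidate product is described by a dictionary with a "cloudcoverpercentage" entry; this record layout, the function name and the default threshold are hypothetical and do not reflect the actual selection scripts.

```python
def rank_by_cloud_cover(products, max_cloud=20.0):
    """Drop products above `max_cloud` percent and sort the rest, clearest first."""
    kept = [p for p in products if p["cloudcoverpercentage"] <= max_cloud]
    return sorted(kept, key=lambda p: p["cloudcoverpercentage"])


# Hypothetical candidate list for one tile.
candidates = [
    {"title": "product_a", "cloudcoverpercentage": 35.2},
    {"title": "product_b", "cloudcoverpercentage": 4.1},
    {"title": "product_c", "cloudcoverpercentage": 12.7},
]
ranked = rank_by_cloud_cover(candidates)
```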

Large overlap with the AOI
Each Sentinel-2 tile covers a region of 100 km x 100 km, which is large enough that a single tile can be used to build a dataset. For example, using striding windows of 240 m x 240 m (24x24 pixels for the highest resolution band) results in approximately 200k samples. If the AOI is small, it is quite likely that it will be covered by a single Sentinel tile, in which case there is 100% coverage of the AOI by that tile. If this is not the case, then more than one product will need to be downloaded to cover the full extent of the AOI, and it is convenient to start with the products that cover most of the AOI.
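The sample count quoted above can be checked with a short calculation. The sketch assumes non-overlapping windows and that the 10 m bands of a tile are delivered as 10980 x 10980 pixel rasters (tiles extend slightly beyond the nominal 100 km so that neighbouring tiles overlap); `count_windows` is a hypothetical helper, not part of DeepSatData.

```python
def count_windows(tile_px, window_px, stride_px=None):
    """Number of square windows of side `window_px` that fit in a square tile."""
    stride = stride_px if stride_px is not None else window_px
    per_side = (tile_px - window_px) // stride + 1
    return per_side * per_side


full_tile = count_windows(10980, 24)     # delivered 10 m raster
nominal_tile = count_windows(10000, 24)  # nominal 100 km at 10 m per pixel
```

With these assumptions `full_tile` is 208849 and `nominal_tile` is 173056, both in line with the approximate figure given in the text. A stride smaller than the window size increases the count further.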

Large product size
As described on the S2 product description website, "Tiles can be fully or partially covered by image data. Partially covered tiles correspond to those at the edge of the swath." In products partially covered with image data, only part of the image contains information, with the remaining part filled with zero values. We prioritize downloading products with a small proportion of zero valued regions.
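Once a product has been downloaded, the proportion of zero valued pixels can be verified directly. The sketch below does this for a single band represented as a nested list; the representation and the function name are assumptions for illustration, and a real pipeline would more likely operate on arrays read with a raster library.

```python
def nodata_fraction(band, nodata=0):
    """Fraction of pixels in a 2D band that equal the `nodata` value."""
    total = sum(len(row) for row in band)
    empty = sum(value == nodata for row in band for value in row)
    return empty / total


# Toy band: the left part lies outside the swath and is filled with zeros.
band = [
    [0, 0, 512],
    [0, 731, 644],
]
fraction = nodata_fraction(band)
```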

Uniformly spread along the time period of interest
Modern Earth observation satellites can have a very short revisit time. For example, the two satellites which form the Sentinel-2 constellation can have a combined revisit time as short as 5 days. Rather than downloading products for all available dates during a POI, we may need to subsample from the available dates. Unless otherwise required by experimental settings, we select products such that they are spread as uniformly as possible over the POI.
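One simple way to spread a fixed number of products over the period is to place equally spaced target dates between the first and last available acquisition and pick the nearest unused acquisition for each target. The sketch below implements this greedy heuristic; it is an assumption about how such a selection could be done, not a description of the accompanying scripts.

```python
from datetime import date, timedelta


def select_uniform(dates, k):
    """Pick `k` dates spread as evenly as possible over the available range."""
    dates = sorted(dates)
    if k >= len(dates):
        return dates
    if k == 1:
        return [dates[len(dates) // 2]]
    start, end = dates[0].toordinal(), dates[-1].toordinal()
    targets = [start + i * (end - start) / (k - 1) for i in range(k)]
    remaining = list(dates)
    chosen = []
    for target in targets:
        # Nearest acquisition to the target that has not been picked yet.
        best = min(remaining, key=lambda d: abs(d.toordinal() - target))
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)


# Usage: acquisitions every 5 days, keep 4 of the 13 available dates.
available = [date(2020, 1, 1) + timedelta(days=5 * i) for i in range(13)]
picked = select_uniform(available, 4)
```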

Data generation pipeline
Having downloaded a set of satellite imagery products, the next step is to extract small image patches of constant size that can fit into hardware accelerator memory and to group and sort these patches by location into timeseries objects that can be used to train temporal models. Fig.1 shows this process and also indicates the relative size of typically extracted patches compared to the size of downloaded satellite products. When ground truth annotations are available, we may choose to process only the locations they cover. These steps are further elaborated in the following sections.

From vector to raster ground truth data
This step is only relevant for cases when there are available ground truth annotations. We assume these collections are in the form of geo-polygons whose vertices are GPS coordinates in a given coordinate reference system (CRS), as this is the way agricultural ground truth data are typically collected. To ensure consistency, we define a canonical form of representing such collections which includes the following fields for each agricultural parcel:
• geometry is a geo-polygon containing GPS coordinates for all the vertices of the agricultural parcel
• crs denotes the geographic CRS used
• ground_truth indicates the class corresponding to the area defined by geometry. This is typically of type int for semantic or identity classes and of type float in the case of regression tasks
• year denotes the year for which the ground truth is valid for the given geometry
Using these data we perform a rasterization step. Here we first define a grid which is initialized with a value corresponding to a background class. For each pixel in the grid we calculate the ratio of the pixel area that is covered by the geo-polygon. All pixels partly or fully covered by the geo-polygon are assigned the ground_truth corresponding to that polygon. We note here that it is typical to define the grid size such that it equals that of the highest resolution satellite image available; however, this need not necessarily be the case. Using CNNs it is straightforward to control the output resolution of our model to match the rasterization resolution. Also, for crop-type semantic segmentation, [40] showed that it is possible to successfully learn to distinguish crop types at a higher resolution than satellite pixels. An example of performed rasterization is shown in Fig.2.
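The rasterization rule can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the DeepSatData implementation: polygon vertices are assumed to be already expressed in the grid's projected coordinates with the grid origin at (0, 0), and the covered area ratio is approximated by testing a small lattice of sample points per pixel, so very thin slivers can be missed. All function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting test; `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def coverage_ratio(col, row, polygon, pixel_size, samples=4):
    """Approximate fraction of a pixel's area that lies inside the polygon."""
    hits = 0
    for i in range(samples):
        for j in range(samples):
            x = (col + (i + 0.5) / samples) * pixel_size
            y = (row + (j + 0.5) / samples) * pixel_size
            hits += point_in_polygon(x, y, polygon)
    return hits / samples ** 2


def rasterize(parcels, height, width, pixel_size, background=0):
    """Burn (polygon, ground_truth) pairs into a grid initialized to background."""
    labels = [[background] * width for _ in range(height)]
    for polygon, ground_truth in parcels:
        for row in range(height):
            for col in range(width):
                # Partly or fully covered pixels receive the parcel's label.
                if coverage_ratio(col, row, polygon, pixel_size) > 0:
                    labels[row][col] = ground_truth
    return labels
```

For instance, a 20 m x 20 m square parcel with label 5 rasterized on a 3 x 3 grid of 10 m pixels labels the four pixels it covers and leaves the rest as background. Increasing `samples` tightens the area estimate at the cost of more point tests.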

Masking ground truth inconsistencies
The process of generating dense ground truth annotations for geodata is unique w.r.t. other densely labelled data, e.g. natural images [22,32,24], in that source images and ground truths are first collected separately and are then aligned by geolocation. While a human annotator working on a semantic segmentation dataset will draw semantic classes on top of captured images, ground truth collection for remote sensing involves a step of gathering GPS coordinates in the field and a separate step of matching these with source images. This introduces the possibility of systematic geolocation errors: the gathered GPS coordinates might not be in complete agreement with the geolocation corresponding to the satellite images. While it is possible to identify by inspection some cases where there are noticeable offsets between inputs and ground truths (e.g. [39] identify single pixel offset errors), in general it is impossible for a human to correct all such mistakes. For this reason we use boolean masks to mark inconsistencies where they can be identified, such as pixels that fall inside multiple polygons during the rasterization step. We distinguish between pixels that are partly claimed and pixels that are fully claimed by two or more polygons. While the former case is a natural outcome of the rasterization step and is improved by using a higher resolution grid, as shown in Fig.2 (low vs high resolution), the latter case clearly indicates a geocoding error in one of the polygons.
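The two kinds of overlap can be expressed as two boolean masks. The sketch below assumes that a per-pixel coverage ratio in [0, 1] has already been computed for every polygon and stored as one 2D list per polygon; this input format and the function name are hypothetical.

```python
def overlap_masks(coverages, full=1.0):
    """Return (partial_overlap, full_overlap) boolean masks.

    coverages: list of 2D lists, one per polygon, holding the fraction of
               each pixel's area covered by that polygon.
    """
    height, width = len(coverages[0]), len(coverages[0][0])
    partial = [[False] * width for _ in range(height)]
    complete = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            touched = sum(cov[r][c] > 0 for cov in coverages)
            filled = sum(cov[r][c] >= full for cov in coverages)
            # Fully claimed by two or more polygons: likely geocoding error.
            complete[r][c] = filled >= 2
            # Shared by several polygons without being fully claimed twice:
            # expected along parcel borders at coarse grid resolution.
            partial[r][c] = touched >= 2 and filled < 2
    return partial, complete
```

Pixels flagged in either mask can then be excluded from the loss or from evaluation, depending on how conservative the experiment needs to be.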

Splitting a Sentinel product to small windows and making timeseries objects
The main reason for choosing to split a satellite imagery product into smaller, equal size patches is the requirement to load multiple timeseries objects into hardware accelerator memory for efficient training. While the size of a satellite image is in the order of tens or hundreds of km, e.g. a single Sentinel-2 tile covers a 100 × 100 km² area, sizes typically used for semantic segmentation are in the order of hundreds of m, e.g. 240 m [39], 480 m [39,40], 640 m [38]. An example of that scale difference can be seen in Fig.1. We may choose to split the AOI only for locations where ground-truth annotations are available or, as is the case for unsupervised learning tasks, we may choose to split the entire AOI.
The end result of the data generation process is a set of time-series objects containing image patches corresponding to the same location at different timestamps. Even though it would be possible to load all satellite images for all timestamps in memory and split and save them to disk in one step, this can be prohibitive in terms of memory consumption for a long POI. For this reason we choose to separate the steps of extracting patches and grouping and sorting these in time to create the final outputs. An example of the data included in a single sample point extracted using the DeepSatData pipeline can be seen in Fig.3.
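The separation into two steps can be sketched as follows. Images are represented as nested lists and the intermediate patch records are kept in a Python list purely for illustration; in practice the first step would write patches to disk one product at a time so that only a single product is held in memory. The function names and the record layout are assumptions.

```python
from collections import defaultdict


def extract_patches(timestamp, image, size):
    """Step 1: cut one image into non-overlapping size x size patches."""
    height, width = len(image), len(image[0])
    records = []
    for r in range(0, height - size + 1, size):
        for c in range(0, width - size + 1, size):
            patch = [line[c:c + size] for line in image[r:r + size]]
            records.append(((r, c), timestamp, patch))
    return records


def group_by_location(records):
    """Step 2: collect patches per location and sort each series by time."""
    series = defaultdict(list)
    for location, timestamp, patch in records:
        series[location].append((timestamp, patch))
    for samples in series.values():
        samples.sort(key=lambda sample: sample[0])
    return dict(series)
```

Each value of the resulting dictionary is a time-ordered list of (timestamp, patch) pairs for one location, which corresponds to one time-series object in the sense used above.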

Conclusion
This report presented DeepSatData, a pipeline for automatically generating data for training machine learning models on Earth observation tasks, and explained the main considerations behind its design. While particular emphasis was placed on generating datasets for dense classification tasks using time-series of satellite images, it is trivial to extend the provided code to extract single images for dense or global classification. We intend to provide such capabilities in future updates.