Cloud-Free, Country-Scale Mosaic in Under 3 Hours: A Tilebox Workflow
Creating large-scale, cloud-free satellite imagery mosaics demands efficient data handling and powerful processing. Inspired by a data scientist's approach to generating a quarterly cloud-free Sentinel-2 10m resolution RGB mosaic over Ireland, we replicated the process using Tilebox to demonstrate how such use cases can be streamlined. This post highlights how Tilebox simplifies complex geospatial workflows, leveraging multi-environment execution and parallel writes to a Zarr datacube for optimal performance.
Our workflow involves four key steps:
Locating relevant Sentinel-2 input granules over Ireland within a three-month period
Reading the Red, Green, and Blue (RGB) bands, plus the scene classification layer (cloud mask), from each located Sentinel-2 product
Reprojecting every product onto a common grid
Aggregating data across the time dimension to produce a single cloud-free measurement for every pixel
If you want to skip ahead and see our results for yourself, check out the interactive visualization down below.
Efficient Granule Discovery with Tilebox Datasets
The first step, locating the necessary Sentinel-2 granules, is straightforward with Tilebox's spatio-temporal query capabilities. Our tilebox.datasets client allows for rapid discovery of relevant data.
In just milliseconds, we located around 700 Sentinel-2A granules needed for our mosaic. Tilebox Open Data provides instant access to public datasets, and its robust spatial indexing handles complex geometries like antimeridian crossings seamlessly.
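For illustration, a query for one quarter over Ireland might look roughly like the sketch below. The dataset path, collection name, and query parameters are assumptions based on the Tilebox datasets client, and the bounding box and date range are placeholders; the exact code lives in the linked GitHub example.

```python
from shapely.geometry import box
from tilebox.datasets import Client

client = Client()  # picks up the TILEBOX_API_KEY environment variable

# Rough bounding box around Ireland (placeholder area of interest)
ireland = box(-11.0, 51.3, -5.3, 55.5)

# Dataset path and collection name are assumptions; check the Tilebox
# Open Data catalog for the exact Sentinel-2 L2A dataset and collection.
datasets = client.datasets()
sentinel2 = datasets.open_data.copernicus.sentinel2_msi
collection = sentinel2.collection("S2A_S2MSI2A")

# Spatio-temporal query: one quarter, clipped to the area of interest
granules = collection.query(
    temporal_extent=("2024-04-01", "2024-07-01"),
    spatial_extent=ireland,
)
print(granules)  # one entry per matching granule
```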
Managing Data Volumes
For our true-color mosaic, we need the Red, Green, and Blue bands at 10m resolution, plus the 20m resolution cloud mask for filtering. A single Sentinel-2 L2A granule, with these four bands, amounts to around 315MB. The 700 granules we located amount to a total of approximately 155GB of data. Efficiently handling this volume requires a smart approach to storage and processing.
To avoid costly data transfers and manage large intermediate products, we chose Zarr as our intermediate storage format. Zarr is a highly efficient, chunked array format ideal for parallel I/O and cloud-native workflows. It allows us to persist reprojected data in a way that supports easy spatial chunking and parallel access across the time dimension.
We initialized an empty Zarr cube for each band, with spatial dimensions covering the full extent of our area of interest over Ireland and a time dimension with one layer per located product. This resulted in a data cube shape of time=716, y=37151, x=45419. By setting the time dimension chunk size to 1, we enable parallel writing of individual timestamps. Configuring a spatial chunk size (in our case 2048x2048 pixels) also allows Zarr to automatically skip writing empty chunks for each time layer, which provides immediate efficiency gains, especially when reprojecting smaller granules onto a large target grid. The same chunking is what later lets us process individual, smaller spatial chunks across the entire time dimension during the subsequent temporal aggregation.
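As a minimal sketch (assuming zarr-python v2 semantics, a placeholder bucket path, and uint16 storage for all four bands), the cube initialization could look like this:

```python
import zarr

shape = (716, 37151, 45419)   # (time, y, x): one layer per located granule
chunks = (1, 2048, 2048)      # time chunks of 1 enable parallel per-granule writes

store = "gs://your-bucket/ireland_mosaic.zarr"  # placeholder GCS path

root = zarr.open_group(store, mode="w")
for band in ("red", "green", "blue", "scl"):
    # fill_value=0 marks nodata; write_empty_chunks=False lets Zarr skip
    # chunks that a reprojected granule does not touch.
    root.create_dataset(
        band,
        shape=shape,
        chunks=chunks,
        dtype="uint16",
        fill_value=0,
        write_empty_chunks=False,
    )
```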
Orchestrating Multi-Environment Workflows
The process of reading, reprojecting, and writing each Sentinel-2 product to a Zarr cube is inherently parallel. Tilebox Workflows are designed to leverage this parallelism by letting us define individual tasks that are automatically parallelized. All that is required is to structure our processing logic as tasks.
The GranuleToZarr task spawns multiple GranuleProductToZarr subtasks, one for each band. These subtasks can then execute in parallel on available task runners. A higher-level task then iterates through the list of all 700 Sentinel-2 granules in a similar fashion, submitting a GranuleToZarr task for each.
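Structurally, the task hierarchy might be expressed along these lines. The class fields, the reproject_band_to_cube helper, and the top-level IngestGranules task are illustrative stand-ins; the actual implementation is in the linked GitHub repository.

```python
from tilebox.workflows import Task, ExecutionContext

BANDS = ["red", "green", "blue", "scl"]

def reproject_band_to_cube(granule_path: str, band: str, time_index: int) -> None:
    """Hypothetical helper: read one band, reproject it onto the target grid
    and write it into the Zarr cube at the given time index."""
    ...

class GranuleProductToZarr(Task):
    granule_path: str
    band: str
    time_index: int

    def execute(self, context: ExecutionContext) -> None:
        reproject_band_to_cube(self.granule_path, self.band, self.time_index)

class GranuleToZarr(Task):
    granule_path: str
    time_index: int

    def execute(self, context: ExecutionContext) -> None:
        # one subtask per band; subtasks run in parallel on available runners
        for band in BANDS:
            context.submit_subtask(
                GranuleProductToZarr(self.granule_path, band, self.time_index)
            )

class IngestGranules(Task):
    granule_paths: list[str]

    def execute(self, context: ExecutionContext) -> None:
        # fan out across all located granules, one GranuleToZarr task each
        for i, path in enumerate(self.granule_paths):
            context.submit_subtask(GranuleToZarr(path, i))
```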
Co-locating Compute with Multi-Environment Capabilities
As you may have noticed, the above GranuleToZarr task assumes Sentinel-2 product files are available on the local file system. Traditionally, we would need to adapt this logic to also support reading products via an S3-compatible object store interface. However, a significant advantage of Tilebox is its ability to run workflows across diverse compute environments. To avoid downloading 155GB of Sentinel-2 data from the Copernicus archive (hosted on CloudFerro), we instead executed our workflow in a multi-environment fashion.
Distributing work across various compute environments is a built-in feature of our workflow orchestrator, so the only requirement for setting this up was to start task runners in the right locations; no adaptations to the source code were necessary.
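In practice that means starting an ordinary task runner process in each environment and registering the tasks it should pick up, roughly like the following sketch (the cluster slug and task list are placeholders; on the MacBooks, the same pattern is used with the aggregation task registered instead):

```python
from tilebox.workflows import Client

client = Client()  # TILEBOX_API_KEY from the environment

# On the CloudFerro VM: run the read/reproject/write tasks close to the data.
runner = client.runner(
    "cloudferro-ingest",  # placeholder cluster slug
    tasks=[IngestGranules, GranuleToZarr, GranuleProductToZarr],
)
runner.run_forever()
```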
Data Access & Reprojection (CloudFerro): We deployed a Tilebox task runner on a virtual machine within CloudFerro's infrastructure. This runner's sole responsibility is to read Copernicus data, reproject it, and write the intermediate Zarr cube directly to Google Cloud Storage. This minimizes egress from CloudFerro, and it allows us to read the Sentinel products directly from a filesystem, since VMs on CloudFerro have the whole Copernicus archive mounted at /eodata.
Temporal Aggregation (Local Cluster): For the final temporal aggregation (producing the cloud-free mosaic), we used a makeshift cluster of three of our developer MacBooks. Each MacBook's task runner fetches a specific 2048x2048 spatial chunk of the Zarr cube across the entire time range, performs the aggregation (a sketch of this per-chunk step follows below), and writes the output for that chunk back into the final mosaic layer of our Zarr store. This is made possible by Zarr's powerful rechunking capabilities and Tilebox's flexible task distribution. This strategy let us avoid setting up multiple expensive VMs on CloudFerro just for compute. Alternatively, we could just as well have used cheap spot instances, which would be the ideal solution for even larger processing jobs, such as a mosaic of the entire globe, and would also help minimize AWS egress costs.
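The per-chunk aggregation on the MacBooks can be sketched as follows. The SCL classes treated as clouds and the per-pixel median are assumptions for illustration, and the whole time stack is loaded into memory for brevity; the actual aggregation logic is in the GitHub example.

```python
import numpy as np
import zarr

# SCL classes we treat as unusable (cloud shadow, clouds, thin cirrus)
BAD_SCL = [3, 8, 9, 10]

def aggregate_chunk(store: str, y0: int, x0: int, size: int = 2048) -> dict:
    """Compute one cloud-free value per pixel for a single spatial chunk."""
    root = zarr.open_group(store, mode="r")

    scl = root["scl"][:, y0:y0 + size, x0:x0 + size]
    valid = (scl != 0) & ~np.isin(scl, BAD_SCL)  # 0 = nodata / empty chunk

    result = {}
    for band in ("red", "green", "blue"):
        stack = root[band][:, y0:y0 + size, x0:x0 + size].astype("float32")
        stack[~valid] = np.nan
        # per-pixel median over all cloud-free observations in the quarter
        result[band] = np.nanmedian(stack, axis=0)
    return result
```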
This multi-environment approach demonstrates Tilebox's flexibility: seamlessly integrating specialized compute resources (CloudFerro for data proximity, MacBooks for distributed final processing) into a single, cohesive workflow without rewriting code.
Figure 1: Architecture diagram of the multi-environment workflow and the Zarr chunking mechanics that enable parallelization. Shown are task runners deployed on a CloudFerro VM as well as locally on developer notebooks.
Visualizing the Results
The final output is a mosaic, which we converted from Zarr to GeoTIFF and then uploaded to Ellipsis Drive for visualization.
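For completeness, a conversion sketch using rasterio is shown below. The grid origin, CRS, and mosaic array names are placeholders for the parameters of the actual target grid, and a real export would typically write windowed blocks rather than holding the full mosaic in memory.

```python
import numpy as np
import rasterio
import zarr
from rasterio.transform import from_origin

# Placeholder grid parameters; the real origin and CRS come from the target grid.
transform = from_origin(400000.0, 6200000.0, 10.0, 10.0)  # west, north, 10 m pixels
crs = "EPSG:32629"  # UTM zone 29N, an assumption for Ireland

root = zarr.open_group("ireland_mosaic.zarr", mode="r")
# "red_mosaic" etc. are hypothetical names for the final aggregated layers
rgb = np.stack([root[f"{band}_mosaic"][:] for band in ("red", "green", "blue")])

with rasterio.open(
    "ireland_mosaic.tif",
    "w",
    driver="GTiff",
    height=rgb.shape[1],
    width=rgb.shape[2],
    count=3,
    dtype=str(rgb.dtype),
    crs=crs,
    transform=transform,
    compress="deflate",
    tiled=True,
) as dst:
    dst.write(rgb)
```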
This workflow showcases how Tilebox simplifies complex, large-scale geospatial processing by combining efficient data access, parallel processing with Zarr, and flexible multi-environment workflow orchestration. The full code for this example is available on GitHub, and you can try it out yourself using our free Community Access tier.