Cloud-Free, Country-Scale Mosaic in Under 3 Hours: A Tilebox Workflow
Creating large-scale, cloud-free satellite imagery mosaics demands efficient data handling and powerful processing. Inspired by a data scientist's approach to generating a quarterly cloud-free Sentinel-2 10m resolution RGB mosaic over Ireland, we replicated the process using Tilebox to demonstrate how such use cases can be streamlined. This post highlights how Tilebox simplifies complex geospatial workflows, leveraging multi-environment execution and parallel writes to a Zarr datacube for optimal performance.
Our workflow involves four key steps:
Locating relevant Sentinel-2 input granules over Ireland within a three-month period
Reading the Red, Green, and Blue (RGB) bands, plus the scene classification layer (cloud mask), from each located Sentinel-2 product
Reprojecting every product onto a common grid
Aggregating data across the time dimension to produce a single cloud-free measurement for every pixel
If you want to skip ahead and see our results for yourself, check out the interactive visualization down below.
Efficient Granule Discovery with Tilebox Datasets
The first step, locating the necessary Sentinel-2 granules, is straightforward with Tilebox's spatio-temporal query capabilities. Our tilebox.datasets client allows for rapid discovery of relevant data.
In just milliseconds, we located around 700 Sentinel-2A granules needed for our mosaic. Tilebox Open Data provides instant access to public datasets, and its robust spatial indexing handles complex geometries like antimeridian crossings seamlessly.
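For illustration, a query for one quarter over Ireland might look roughly like the sketch below. The dataset path, collection name, and query parameters are assumptions based on the Tilebox datasets client, and the bounding box and date range are placeholders; the exact code lives in the linked GitHub example.

```python
from shapely.geometry import box
from tilebox.datasets import Client

client = Client()  # picks up the TILEBOX_API_KEY environment variable

# Rough bounding box around Ireland (placeholder area of interest)
ireland = box(-11.0, 51.3, -5.3, 55.5)

# Dataset path and collection name are assumptions; check the Tilebox
# Open Data catalog for the exact Sentinel-2 L2A dataset and collection.
datasets = client.datasets()
sentinel2 = datasets.open_data.copernicus.sentinel2_msi
collection = sentinel2.collection("S2A_S2MSI2A")

# Spatio-temporal query: one quarter, clipped to the area of interest
granules = collection.query(
    temporal_extent=("2024-04-01", "2024-07-01"),
    spatial_extent=ireland,
)
print(granules)  # one entry per matching granule
```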
Managing Data Volumes
For our true-color mosaic, we need the Red, Green, and Blue bands at 10m resolution, plus the 20m resolution cloud mask for filtering. A single Sentinel-2 L2A granule, with these four bands, amounts to around 315MB. The 700 granules we located amount to a total of approximately 155GB of data. Efficiently handling this volume requires a smart approach to storage and processing.
To avoid costly data transfers and manage large intermediate products, we chose Zarr as our intermediate storage format. Zarr is a highly efficient, chunked array format ideal for parallel I/O and cloud-native workflows. It allows us to persist reprojected data in a way that supports easy spatial chunking and parallel access across the time dimension.
We initialized an empty Zarr cube for each band, with spatial dimensions covering the full extent of our area of interest over Ireland and a time dimension with one layer per located product. This resulted in a data cube shape of time=716, y=37151, x=45419. By setting the time dimension chunk size to 1, we enable parallel writing of individual timestamps. Configuring a spatial chunk size (in our case 2048x2048 pixels) also allows Zarr to automatically skip writing empty chunks for each time layer, which provides immediate efficiency gains, especially when reprojecting smaller granules onto a large target grid. The same chunking is what later lets us process individual, smaller spatial chunks across the entire time dimension during the subsequent temporal aggregation.
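As a minimal sketch (assuming zarr-python v2 semantics, a placeholder bucket path, and uint16 storage for all four bands), the cube initialization could look like this:

```python
import zarr

shape = (716, 37151, 45419)   # (time, y, x): one layer per located granule
chunks = (1, 2048, 2048)      # time chunks of 1 enable parallel per-granule writes

store = "gs://your-bucket/ireland_mosaic.zarr"  # placeholder GCS path

root = zarr.open_group(store, mode="w")
for band in ("red", "green", "blue", "scl"):
    # fill_value=0 marks nodata; write_empty_chunks=False lets Zarr skip
    # chunks that a reprojected granule does not touch.
    root.create_dataset(
        band,
        shape=shape,
        chunks=chunks,
        dtype="uint16",
        fill_value=0,
        write_empty_chunks=False,
    )
```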
Orchestrating Multi-Environment Workflows
The process of reading, reprojecting, and writing each Sentinel-2 product to a Zarr cube is inherently parallel. Tilebox Workflows are designed to leverage this parallelism by letting us define individual tasks that are automatically parallelized. All that is required is to structure our processing logic as tasks.
The GranuleToZarr task spawns multiple GranuleProductToZarr subtasks, one for each band. These subtasks can then execute in parallel on available task runners. A higher-level task then iterates through the list of all 700 Sentinel-2 granules in a similar fashion, submitting a GranuleToZarr task for each.
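Structurally, the task hierarchy might be expressed along these lines. The class fields, the reproject_band_to_cube helper, and the top-level IngestGranules task are illustrative stand-ins; the actual implementation is in the linked GitHub repository.

```python
from tilebox.workflows import Task, ExecutionContext

BANDS = ["red", "green", "blue", "scl"]

def reproject_band_to_cube(granule_path: str, band: str, time_index: int) -> None:
    """Hypothetical helper: read one band, reproject it onto the target grid
    and write it into the Zarr cube at the given time index."""
    ...

class GranuleProductToZarr(Task):
    granule_path: str
    band: str
    time_index: int

    def execute(self, context: ExecutionContext) -> None:
        reproject_band_to_cube(self.granule_path, self.band, self.time_index)

class GranuleToZarr(Task):
    granule_path: str
    time_index: int

    def execute(self, context: ExecutionContext) -> None:
        # one subtask per band; subtasks run in parallel on available runners
        for band in BANDS:
            context.submit_subtask(
                GranuleProductToZarr(self.granule_path, band, self.time_index)
            )

class IngestGranules(Task):
    granule_paths: list[str]

    def execute(self, context: ExecutionContext) -> None:
        # fan out across all located granules, one GranuleToZarr task each
        for i, path in enumerate(self.granule_paths):
            context.submit_subtask(GranuleToZarr(path, i))
```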
Co-locating Compute with Multi-Environment Capabilities
As you may have noticed, the above GranuleToZarr task assumes Sentinel-2 product files are available on the local file system. Traditionally, we would need to adapt this logic to also support reading products via an S3-compatible object store interface. However, a significant advantage of Tilebox is its ability to run workflows across diverse compute environments. To avoid downloading 155GB of Sentinel-2 data from the Copernicus archive (hosted on CloudFerro), we instead executed our workflow in a multi-environment fashion.
Distributing work across various compute environments is a built-in feature of our workflow orchestrator, so the only requirement for setting this up was to start task runners in the right locations; no adaptations to the source code were necessary.
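In practice that means starting an ordinary task runner process in each environment and registering the tasks it should pick up, roughly like the following sketch (the cluster slug and task list are placeholders; on the MacBooks, the same pattern is used with the aggregation task registered instead):

```python
from tilebox.workflows import Client

client = Client()  # TILEBOX_API_KEY from the environment

# On the CloudFerro VM: run the read/reproject/write tasks close to the data.
runner = client.runner(
    "cloudferro-ingest",  # placeholder cluster slug
    tasks=[IngestGranules, GranuleToZarr, GranuleProductToZarr],
)
runner.run_forever()
```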
Data Access & Reprojection (CloudFerro): We deployed a Tilebox task runner on a virtual machine within CloudFerro's infrastructure. This runner's sole responsibility is to read Copernicus data, reproject it, and write the intermediate Zarr cube directly to Google Cloud Storage. This minimizes egress from CloudFerro, and it allows us to read the Sentinel products directly from a filesystem, since VMs on CloudFerro have the whole Copernicus archive mounted at /eodata.
Temporal Aggregation (Local Cluster): For the final temporal aggregation (producing the cloud-free mosaic), we used a makeshift cluster of three of our developer MacBooks. Each MacBook's task runner fetches a specific 2048x2048 spatial chunk of the Zarr cube across the entire time range, performs the aggregation (a sketch of this per-chunk step follows below), and writes the output for that chunk back into the final mosaic layer of our Zarr store. This is made possible by Zarr's powerful rechunking capabilities and Tilebox's flexible task distribution. This strategy let us avoid setting up multiple expensive VMs on CloudFerro just for compute. Alternatively, we could just as well have used cheap spot instances, which would be the ideal solution for even larger processing jobs, such as a mosaic of the entire globe, and would also help minimize AWS egress costs.
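The per-chunk aggregation on the MacBooks can be sketched as follows. The SCL classes treated as clouds and the per-pixel median are assumptions for illustration, and the whole time stack is loaded into memory for brevity; the actual aggregation logic is in the GitHub example.

```python
import numpy as np
import zarr

# SCL classes we treat as unusable (cloud shadow, clouds, thin cirrus)
BAD_SCL = [3, 8, 9, 10]

def aggregate_chunk(store: str, y0: int, x0: int, size: int = 2048) -> dict:
    """Compute one cloud-free value per pixel for a single spatial chunk."""
    root = zarr.open_group(store, mode="r")

    scl = root["scl"][:, y0:y0 + size, x0:x0 + size]
    valid = (scl != 0) & ~np.isin(scl, BAD_SCL)  # 0 = nodata / empty chunk

    result = {}
    for band in ("red", "green", "blue"):
        stack = root[band][:, y0:y0 + size, x0:x0 + size].astype("float32")
        stack[~valid] = np.nan
        # per-pixel median over all cloud-free observations in the quarter
        result[band] = np.nanmedian(stack, axis=0)
    return result
```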
This multi-environment approach demonstrates Tilebox's flexibility: seamlessly integrating specialized compute resources (CloudFerro for data proximity, MacBooks for distributed final processing) into a single, cohesive workflow without rewriting code.
Figure 1: Architecture diagram of the multi-environment workflow and the Zarr chunking mechanics that enable parallelization. Shown are task runners deployed on a CloudFerro VM as well as locally on developer notebooks.
Visualizing the Results
The final output is a mosaic, which we converted from Zarr to GeoTIFF and then uploaded to Ellipsis Drive for visualization.
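For completeness, a conversion sketch using rasterio is shown below. The grid origin, CRS, and mosaic array names are placeholders for the parameters of the actual target grid, and a real export would typically write windowed blocks rather than holding the full mosaic in memory.

```python
import numpy as np
import rasterio
import zarr
from rasterio.transform import from_origin

# Placeholder grid parameters; the real origin and CRS come from the target grid.
transform = from_origin(400000.0, 6200000.0, 10.0, 10.0)  # west, north, 10 m pixels
crs = "EPSG:32629"  # UTM zone 29N, an assumption for Ireland

root = zarr.open_group("ireland_mosaic.zarr", mode="r")
# "red_mosaic" etc. are hypothetical names for the final aggregated layers
rgb = np.stack([root[f"{band}_mosaic"][:] for band in ("red", "green", "blue")])

with rasterio.open(
    "ireland_mosaic.tif",
    "w",
    driver="GTiff",
    height=rgb.shape[1],
    width=rgb.shape[2],
    count=3,
    dtype=str(rgb.dtype),
    crs=crs,
    transform=transform,
    compress="deflate",
    tiled=True,
) as dst:
    dst.write(rgb)
```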
This workflow showcases how Tilebox simplifies complex, large-scale geospatial processing by combining efficient data access, parallel processing with Zarr, and flexible multi-environment workflow orchestration. The full code for this example is available on GitHub, and you can try it out yourself using our free Community Access tier.