Beyond General-Purpose: The Need for Space-Data Native Frameworks

In-house space data pipelines are finicky, to say the least. A patchwork that is often rebuilt for every major upgrade. Workflows that are impossible to iterate because they are baked into Infrastructure as Code. Infrastructure redundancies that consume unnecessary time and resources, like fixed cluster sizes, the overhead of booting up Docker containers for every task, or difficulties in rolling out updates while keeping production up. The inefficiencies are many, but they are all a result of incompatible design across environments, temporary fixes, and data type limitations.

There have been no standards or best practices for space data pipelines. To reach profitability and meet the expectations of real-world applications, the industry needs a space-data native framework.

Here’s how we designed ours:

Datasets

Let’s start with the critical foundation of data access and storage. Common databases such as Postgres and even specialized time-series databases like InfluxDB are amazing pieces of engineering, but on their own they come with constraints that become real obstacles for space data pipelines, particularly for Earth Observation. One lacks support for textual data; the other supports geometries but has limited support for spatial indexing, a critical capability of any data catalog and a very common cause of performance limitations in data catalogs.

Instead of wasting time waiting for results or consuming all your CPU on table scans, a framework designed intentionally for these pipelines needs to have:

Support for real-time metadata indexing with customizable data types
Support for arbitrary data types, including strings, polygons, coordinates
Fast, reliable spatio-temporal indexing
A high-performance API to support large queries of telemetry, metadata, or lightweight payload data
Backwards-compatible typing and customizable datasets, or catalogs
Standards-compliant (STAC) output interfaces where desired
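To make the indexing requirement concrete, here is a minimal sketch of a spatio-temporal index that buckets items by grid cell and time window, so a query only touches the buckets that overlap the requested area and interval instead of scanning every row. This is not Tilebox's implementation: the class name, cell size, and bucket length are assumptions chosen for illustration, and a production index would use a proper spatial structure rather than a fixed grid.

```python
from collections import defaultdict
from datetime import datetime, timezone

CELL_DEG = 1.0         # assumed grid cell size in degrees
BUCKET_SECONDS = 3600  # assumed time bucket length (one hour)


class SpatioTemporalIndex:
    """Hypothetical grid + time-bucket index for point observations."""

    def __init__(self):
        self._buckets = defaultdict(list)

    @staticmethod
    def _key(lon, lat, t):
        # Map a point and timestamp to (grid column, grid row, time bucket).
        return (
            int(lon // CELL_DEG),
            int(lat // CELL_DEG),
            int(t.timestamp() // BUCKET_SECONDS),
        )

    def insert(self, item_id, lon, lat, t):
        self._buckets[self._key(lon, lat, t)].append((item_id, lon, lat, t))

    def query(self, bbox, start, end):
        """Return ids inside bbox (min_lon, min_lat, max_lon, max_lat) with start <= t < end."""
        min_lon, min_lat, max_lon, max_lat = bbox
        x0, y0, t0 = self._key(min_lon, min_lat, start)
        x1, y1, t1 = self._key(max_lon, max_lat, end)
        hits = []
        # Visit only the candidate buckets, then filter exactly within each one.
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                for tb in range(t0, t1 + 1):
                    for item_id, lon, lat, t in self._buckets.get((x, y, tb), ()):
                        in_space = min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
                        if in_space and start <= t < end:
                            hits.append(item_id)
        return hits


utc = timezone.utc
index = SpatioTemporalIndex()
index.insert("scene-a", 16.37, 48.21, datetime(2024, 1, 1, 12, tzinfo=utc))
index.insert("scene-b", -122.40, 37.77, datetime(2024, 1, 1, 12, tzinfo=utc))
index.insert("scene-c", 16.40, 48.20, datetime(2024, 1, 2, 12, tzinfo=utc))

result = index.query(
    (16.0, 48.0, 17.0, 49.0),
    datetime(2024, 1, 1, tzinfo=utc),
    datetime(2024, 1, 2, tzinfo=utc),
)
```

Only "scene-a" matches both the bounding box and the one-day window; the other two items sit in buckets the query never needs to return from.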

Tilebox affords engineers the freedom to create and edit their own data types – instead of the database schema – without breaking existing software and datasets. Spatio-temporal queries are orders of magnitude faster than comparable systems. This is essential for streamlined scalability. As operations expand, so will capabilities.
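One common way to evolve a data type without breaking existing software is to only add optional fields with defaults, and to have readers ignore fields they do not know. The sketch below illustrates that general idea with plain Python dataclasses. It is not the Tilebox API: the type names, field names, and the decode helper are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SceneV1:
    """Hypothetical original data type."""
    granule_id: str
    cloud_cover: float


@dataclass
class SceneV2:
    """Hypothetical evolved type: one new optional field, nothing removed."""
    granule_id: str
    cloud_cover: float
    processing_level: Optional[str] = None  # added later, defaults to None


def decode(cls, record: dict):
    # Keep only the fields this version of the type knows about.
    known = {f.name for f in fields(cls)}
    return cls(**{key: value for key, value in record.items() if key in known})


old_record = {"granule_id": "granule-001", "cloud_cover": 12.5}
new_record = {"granule_id": "granule-002", "cloud_cover": 3.0, "processing_level": "L2A"}

# New software reads old data: the missing field falls back to its default.
upgraded = decode(SceneV2, old_record)

# Old software reads new data: the unknown field is ignored.
downgraded = decode(SceneV1, new_record)
```

Under these assumptions, both old and new readers keep working as the type grows, which is the property the paragraph above describes.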

Workflows

While many companies are still figuring it out on the ground, others are preparing for edge computing. Any space-data native framework has to work in orbit. And, we think, everywhere else. Across clouds, mixed environments, and multiple clusters. Data fusion requires this level of flexibility, plus it empowers any team to work with the exact data they need no matter where it’s stored. This expedites processing and analysis, which creates increased revenue opportunities for time-sensitive data.

Efficiency is a key requirement for space data, as its source is far away and its volume is massive. Reducing downlink, transfer, and storage costs is one side of the equation; the other is job resilience. Executing workflows on Spot Instances is a necessary cost saving for large-scale satellite data processing. But what happens when that Spot Instance goes down? What if a job breaks?

Resiliency affects everything: compute costs, latency, data security.

Hand-built monitoring tools are one of those inefficient redundancies that software teams shouldn’t waste their time building and managing. Even Tilebox relies on a service for this: Axiom.co is pre-integrated with Tilebox to deliver the fastest, most thorough distributed observability across your workflows, and a generic OpenTelemetry exporter is available as well.

Identifying the break is the first step; re-entrant processing is the second. Tilebox saves all work up to the breakpoint and automatically initiates a retry, keeping you up and running without manual intervention and rescheduling. And for large workflows, Tilebox supports auto-scaling clusters.
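The general pattern behind re-entrant processing can be sketched in a few lines: record each finished task in a checkpoint, and on retry skip everything already recorded. The example below is a simplified illustration using a local JSON file, not how Tilebox implements it. The function name, the Interrupted exception, and the task names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class Interrupted(Exception):
    """Stands in for a lost Spot Instance or a crashed worker."""


def run_reentrant(tasks, checkpoint: Path, process, max_attempts=3):
    """Run tasks in order, persisting progress so a retry resumes at the break."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    for attempt in range(1, max_attempts + 1):
        try:
            for task in tasks:
                if task in done:
                    continue  # finished before the break, do not redo it
                process(task)
                done.add(task)
                checkpoint.write_text(json.dumps(sorted(done)))
            return attempt
        except Interrupted:
            continue  # automatic retry, resuming from the checkpoint
    raise RuntimeError(f"gave up after {max_attempts} attempts")


processed = []
state = {"failed_once": False}


def flaky_process(task):
    # Simulate a single interruption while working on the second tile.
    if task == "tile-2" and not state["failed_once"]:
        state["failed_once"] = True
        raise Interrupted(task)
    processed.append(task)


with tempfile.TemporaryDirectory() as tmp:
    attempts = run_reentrant(
        ["tile-1", "tile-2", "tile-3"],
        Path(tmp) / "checkpoint.json",
        flaky_process,
    )
```

In this run the first attempt is interrupted on "tile-2", and the second attempt skips "tile-1" and finishes the remaining tiles, so no completed work is repeated.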

These are not conveniences; they are vital capabilities for a successful space data pipeline.

Future-Proof Your Pipeline

The right tooling makes all the difference. Space data engineers need flexibility, resilience, and purpose-built functionality to stay focused on delivering rather than on maintaining and repairing pipelines. We are building Tilebox for you and the future of space data services. So, create an account and explore our work – and give us feedback on your ideal tooling. It’s time to evolve space data management with high-performance software made for space data.

Join our Discord community for updates on our latest releases.
