Skip the Respawn Delay: Space Data Workflows with Built-in Resilience

Natural checkpoints, versioned tasking, and other smart workflow design decisions

In large processing workflows, things can and will go wrong: memory errors, missing data, and network issues, to name a few. These errors cost time and money through inefficient use of compute resources, reduced developer productivity, and increased egress costs from re-fetching external data dependencies.

Many workflow orchestrators don’t account for re-entrant execution at all. If something goes wrong, the entire job has to be re-run. Alternatively, for every failure a new workflow has to be developed that can resume from that particular partial state, and explicit checkpoints have to be added to persist partial states. This takes valuable time, and it is a problem that belongs in the past; there is no need to settle for this model.

Anticipating every possible error is not feasible, so it’s vital to have the right tools ready to investigate a failure, fix the code, and deploy the fix.

This is precisely why we designed our workflow orchestrator differently. Tilebox Tasks form natural checkpoints, and their idempotent nature enables efficient re-entrant execution. If something goes wrong, full observability lets developers quickly dive in and find the problem. And it all works out of the box.
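To make the idea of tasks as natural checkpoints concrete, here is a minimal sketch in plain Python. It is not the Tilebox API: the `ToyJob` class, the task names, and the `state` flag are all hypothetical, and results live in memory only. It shows the general pattern, where each completed task is recorded so that a second run skips it and picks up at the task that failed.

```python
from typing import Callable, Dict, List, Tuple


class ToyJob:
    """Hypothetical in-memory job runner for illustration; not the Tilebox API."""

    def __init__(self, tasks: List[Tuple[str, Callable[[], str]]]) -> None:
        self.tasks = tasks
        self.results: Dict[str, str] = {}  # completed task name -> result (the checkpoints)
        self.executions: List[str] = []    # every task body that was actually started

    def run(self) -> bool:
        """Run tasks in order, skipping completed ones. Returns True if the job finished."""
        for name, fn in self.tasks:
            if name in self.results:
                continue  # already completed in an earlier run, do not repeat the work
            self.executions.append(name)
            try:
                self.results[name] = fn()
            except Exception as exc:
                print(f"task {name!r} failed: {exc}")
                return False  # stop here; completed results are kept for the next run
        return True


# Simulated bug that a developer fixes between the two runs (assumption for the demo).
state = {"fixed": False}


def fetch() -> str:
    return "scene-data"


def process() -> str:
    if not state["fixed"]:
        raise MemoryError("out of memory")
    return "processed"


def publish() -> str:
    return "published"


job = ToyJob([("fetch", fetch), ("process", process), ("publish", publish)])

first_run_ok = job.run()   # fails at "process"; the "fetch" result is kept
state["fixed"] = True      # the fix is deployed
second_run_ok = job.run()  # resumes at "process"; "fetch" is not executed again
```

Because each task returns the same result for the same input, skipping a completed task is safe. In this sketch `fetch` runs only once across both runs, which is where the savings in compute and egress come from.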

A long-running workflow may fail after much of the processing work has already been completed.

One approach to recovering from such failures is to simply re-run the entire job, duplicating all the work that has already been done.

Another approach is to develop a new workflow capable of picking up partial results from a previously failed run. While this saves compute resources, it takes valuable development time.

Tilebox re-entrant execution removes the need for manual recovery work, reduces downtime, and simplifies the workflow.

After the issue is identified and fixed, versioned tasks allow a gradual rollout without disrupting existing clusters or workflows. The workflow orchestrator can then resume the failed job from the state it was in before the failure. Re-entrant processing can be triggered manually or configured with automated retries, giving developers another layer of operational resilience.
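The interplay of versioned tasks and automated retries can be sketched in a few lines of plain Python. Again, this is not the Tilebox API: the `REGISTRY` dictionary, the `register` decorator, the `run_with_retries` helper, the task name, and the version strings are hypothetical names chosen for illustration. The sketch keeps the old and new version of a task side by side, retries a failing task a bounded number of times, and shows that a deterministic bug only goes away once the fixed version is used.

```python
from typing import Callable, Dict, Optional, Tuple

TaskFn = Callable[[int], int]

# Hypothetical registry mapping (task name, version) -> implementation.
REGISTRY: Dict[Tuple[str, str], TaskFn] = {}


def register(name: str, version: str) -> Callable[[TaskFn], TaskFn]:
    """Register a task implementation under a name and version."""
    def decorator(fn: TaskFn) -> TaskFn:
        REGISTRY[(name, version)] = fn
        return fn
    return decorator


@register("tiles_per_scene", "v1.0")
def tiles_v1_0(tile_size: int) -> int:
    return 10_000 // tile_size  # bug: raises ZeroDivisionError when tile_size is 0


@register("tiles_per_scene", "v1.1")
def tiles_v1_1(tile_size: int) -> int:
    if tile_size <= 0:
        return 0  # fixed: invalid tile sizes yield zero tiles instead of crashing
    return 10_000 // tile_size


def run_with_retries(
    name: str, version: str, arg: int, max_retries: int = 2
) -> Tuple[Optional[int], int]:
    """Run a task, retrying on failure. Returns (result or None, attempts made)."""
    fn = REGISTRY[(name, version)]
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        try:
            return fn(arg), attempts
        except Exception:
            continue  # automated retry; helps with transient errors, not with real bugs
    return None, attempts


old_bad_input = run_with_retries("tiles_per_scene", "v1.0", 0)     # all attempts fail
new_bad_input = run_with_retries("tiles_per_scene", "v1.1", 0)     # fixed version succeeds
old_good_input = run_with_retries("tiles_per_scene", "v1.0", 100)  # old version still serves valid input
```

Keeping both versions registered is what makes a gradual rollout possible in this sketch: work already running against `v1.0` is unaffected, while the failed task can be resumed against `v1.1`.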

Join us on Discord for the latest releases, challenges, and Tilebox tips.
