Sail, the Last Piece of the Composable Data Stack
The LakeSail Team
November 10, 2025
6 min read
The modern data ecosystem has been undergoing a “great decoupling” in recent years.[1] Core capabilities (storage, query planning, execution, and analytics) are no longer locked into single monolithic systems. Instead, open-source projects such as Apache Arrow, Apache DataFusion, DuckDB, Velox, Substrait, Ibis, Apache Iceberg, and Delta Lake have made it possible to mix and match best-of-breed components. This modularity has unleashed innovation, but it has also exposed a final gap: a distributed compute engine built for this new paradigm and for the demands of the AI era.
Sail fills that gap.
The Composable Data Stack So Far
In the canonical decomposition, systems separate into (a) language frontends, (b) an intermediate representation (IR), (c) a query optimizer, (d) a local execution engine, and (e) a distributed execution runtime. Most systems reuse similar data structures and algorithms inside these layers.[2][3]
Over the past decade, the data ecosystem has become increasingly modular through a series of key breakthroughs. Apache Arrow established a universal in-memory columnar format so data can move freely across languages and hardware without serialization cost. Building on that foundation, DuckDB, Velox[4], and DataFusion introduced high-performance, vectorized query engines that bring OLAP execution directly into applications and commoditize analytic processing.
Backend-agnostic programming interfaces such as Ibis abstract away SQL dialect differences, enabling data engineers to describe transformations once and run them anywhere. The emergence of Substrait provides a portable intermediate representation (IR) so analytical queries can be shared across engines without manual rewriting. At the storage layer, Delta Lake and Apache Iceberg manage data versioning, enforce ACID compliance, and handle schema evolution on top of low-cost object storage.
Together these breakthroughs enable a modular stack where storage, metadata, and query interfaces are interchangeable, yet one essential layer has remained stubbornly monolithic: distributed computation frameworks.
The Last Missing Piece
Modern data infrastructure has embraced modularity across nearly every layer (storage, data interchange, query planning, and APIs), yet distributed execution has remained tied to legacy frameworks.
The broader landscape shows the same pattern. Distributed computation has moved from early MapReduce engines to DAG-oriented designs such as Apache Spark, and more recently to Ray, which pushes flexibility further by letting arbitrary Python functions run on every worker. Apache Flink excels at continuous streaming workloads. Yet despite all of this innovation, engineers from Meta, Databricks, Voltron Data, and Sundeck wrote in The Composable Data Management System Manifesto: “there is currently no standardization to the level of abstraction that strikes the best balance between ease of programming and tight control of execution.”[2] Instead, each framework reinvents critical pieces of scheduling, resource management, and execution semantics, forcing organizations to maintain siloed codebases and leaving users to cope with inconsistent behavior.
This is precisely the gap Sail targets. Completing the composable stack requires a distributed engine designed for this new reality: one that treats the language frontend as pluggable and meets it at a stable IR boundary, speaks Arrow natively at every stage, executes complex query DAGs on a vectorized DataFusion core at near bare-metal speed, and runs stateless workers that integrate cleanly with today’s elastic deployment infrastructure (e.g., Kubernetes and Unikraft). All of this while letting organizations reuse 15+ years of Spark code without changes, simply by redirecting it through Spark Connect to the Sail server.
Sail provides exactly this missing architecture, bringing the execution layer into alignment with the modular design now common across the rest of the open data ecosystem.
You may ask, doesn’t Apache Spark already fill this gap? Spark, while groundbreaking in its day, was created at the end of the pre-cloud era, before Rust, before Apache Arrow, and before today’s demands for AI and elastic cloud computing. Its reliance on row-based processing, its coupling of state with execution, and its dependence on the JVM introduce unavoidable inefficiencies, including garbage-collection pauses and costly serialization between Python and the execution core. As a result, it is mismatched to today’s zero-copy, Arrow-native standards with vectorized execution and decoupled storage and state.
Sail is Rust-Native, Arrow-First, and Spark-Compatible
Sail was built to fit into this composable data stack with ease, reliability, and adaptability.
Rust-Native Core – Rust’s ownership model ensures memory safety at compile time and eliminates garbage collection entirely, giving deterministic resource management and predictable latency.
Arrow as the Standard – All stages, from ingestion to shuffle, use Arrow columnar buffers. No row conversions or extra copies are required.
DataFusion Foundation – DataFusion provides the core abstractions for logical and physical query plans. Sail enhances these with custom execution strategies and full support for Spark features rarely found in SQL engines, including PySpark APIs for Python UDFs, UDAFs, UDWFs, and UDTFs. It further extends DataFusion with distributed query processing, native lakehouse format support, custom object store integration, extra file formats, and workload-specific optimizations.
Fun fact: every month DataFusion’s pre-releases are validated directly against Sail’s codebase to ensure stability and compatibility.
Lightning-Fast Python UDFs – Python code runs inside Sail with zero serialization overhead, thanks to direct Arrow array pointers.
Elastic, Stateless Workers – Workers start in seconds and consume only megabytes when idle. They communicate via Arrow Flight for in-cluster shuffle and scale horizontally in cloud-native environments such as Kubernetes.
Unified Shuffle Architecture – Sail implements a unified shuffle framework that supports multiple execution patterns without depending on Spark’s traditional shuffle service. This design removes a key performance bottleneck and makes Sail a natural fit for cloud-native environments with decoupled storage and compute.
Spark Connect Compatibility – Sail exposes the Spark Connect protocol so PySpark DataFrame and Spark SQL workloads run unchanged, letting teams adopt Sail without rewriting their existing Spark code.
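In practice, pointing existing PySpark code at Sail is a one-line change when the session is built. The sketch below assumes a Sail server is already running and reachable over Spark Connect; the host and port are placeholders, so consult Sail’s documentation for how to start the server.

```python
# Sketch: redirecting unmodified PySpark code to a Sail server over
# Spark Connect. Assumes pyspark>=3.4 (the Spark Connect client) and a
# running Sail server; the URL below is a placeholder address.
SAIL_URL = "sc://localhost:50051"  # placeholder host/port for a Sail server

def make_session(url=SAIL_URL):
    """Build a SparkSession whose queries execute on the remote server."""
    from pyspark.sql import SparkSession  # Spark Connect client
    return SparkSession.builder.remote(url).getOrCreate()

# From here on, existing DataFrame and SQL code runs unchanged, e.g.:
# spark = make_session()
# spark.sql("SELECT 1 AS one").show()
```

Because the protocol boundary is Spark Connect, no application code below the session-construction line needs to know that Sail, rather than a Spark cluster, is executing the query.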
The end result is a unified and distributed multimodal computation framework purpose-built for the modular era, fast enough for ad hoc, AI, and agentic workloads while being simple enough for daily analytics, and open enough to interoperate with every Arrow-based tool.
Why This Matters
With Sail in place, teams can compose their data architecture like Lego bricks without the headaches that come with unnecessary complexity:
- Ingest and Store with Delta Lake or Iceberg.
- Model, Transform, Query, and Visualize using Ibis or PySpark.
- Execute at Scale with Sail.
No rewrites. No lock-in. The distributed engine is no longer a proprietary black box, but an interchangeable, open component ready for production use from day one.
Looking Ahead
Open source has already delivered standardized data interchange, portable query languages, and robust table formats. What was missing was a truly modular distributed compute engine. Sail now provides it, completing the composable data stack.
References
1. Wes McKinney. The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future. Wes McKinney Blog, 2023. https://wesmckinney.com/blog/looking-back-15-years/
2. Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satya R. Valluri, Mohamed Zait, and Jacques Nadeau. The Composable Data Management System Manifesto. Proceedings of the VLDB Endowment 16, 10 (2023), 2679–2685. https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf
3. Biswapesh Chattopadhyay, Pedro Pedreira, Sameer Agarwal, Yutian James Sun, Suketu Vakharia, Peng Li, Weiran Liu, and Sundaram Narayanan. Shared Foundations: Modernizing Meta’s Data Lakehouse. Conference on Innovative Data Systems Research (CIDR), 2023. https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf
4. Pedro Pedreira, Orri Erling, Masha Basmanova, Kevin Wilfong, Laith Sakka, Krishna Pai, Wei He, and Biswapesh Chattopadhyay. Velox: Meta’s Unified Execution Engine. Proceedings of the VLDB Endowment 15, 12 (2022), 3372–3384. https://www.vldb.org/pvldb/vol15/p3372-pedreira.pdf