How Sail Utilizes and Extends Apache DataFusion
The LakeSail Team
November 19, 2025
Sail has reimagined distributed compute, combining a fresh Rust-native execution layer with deep use of Apache DataFusion’s planning and execution capabilities. Instead of treating DataFusion as a black-box query engine, Sail drives every query through DataFusion’s logical and physical planning and optimization APIs before applying its own distributed scheduling. This lets Sail build on DataFusion’s mature, production-tested core while concentrating engineering effort where it matters most: high-performance Rust execution, Spark-compatible semantics, large-scale distributed reliability, native Python execution, stream processing, and lakehouse integration.
Choosing DataFusion
When designing Sail, we considered writing a query engine from scratch, but ultimately chose to build on DataFusion due to its proven, production-grade stability and wide adoption at companies such as Apple and Cloudflare. Its core abstractions (e.g., logical and physical plans) form the backbone for defining, optimizing, and executing queries at scale.
This strong, well-maintained core means Sail inherits stability, active community support, and rich SQL capabilities. Additionally, as part of pre-release testing, each DataFusion release is tested against Sail alongside other well-known downstream projects. We also contribute improvements upstream, ensuring that both projects advance together and that Sail remains compatible with the evolving DataFusion ecosystem.
Inside Sail’s Query Planning
When a user submits a query or workload, Sail first performs syntactic and semantic analysis. Sail parses SQL with its custom-built SQL parser and represents the result as a Sail specification (spec) describing the intended operations. The spec is then translated into a DataFusion logical plan. During this translation, Sail consults catalogs and function registries to resolve all references to tables and functions.
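To make the spec-to-plan step concrete, here is a minimal sketch in plain Python. It is not Sail’s or DataFusion’s code: the classes ReadSpec, ProjectSpec, TableScan, and Projection, the CATALOG and FUNCTIONS registries, and the resolve function are all hypothetical names invented for illustration. The sketch only shows the general idea of turning an unresolved spec into a resolved plan by looking up tables and functions.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration only; none of these names come
# from Sail or DataFusion.

@dataclass(frozen=True)
class ReadSpec:
    table: str  # table name exactly as the user wrote it (unresolved)

@dataclass(frozen=True)
class ProjectSpec:
    input: ReadSpec
    function: str  # unresolved function name
    column: str    # unresolved column name

@dataclass(frozen=True)
class TableScan:
    table: str
    columns: tuple  # schema obtained from the catalog

@dataclass(frozen=True)
class Projection:
    input: TableScan
    function: str
    column: str

# Toy catalog and function registry (assumed contents).
CATALOG = {"orders": ("id", "amount")}
FUNCTIONS = {"abs", "upper"}

def resolve(spec: ProjectSpec) -> Projection:
    """Translate an unresolved spec into a resolved plan, failing early
    when a table, column, or function reference cannot be resolved."""
    table = spec.input.table
    if table not in CATALOG:
        raise LookupError(f"table not found: {table}")
    columns = CATALOG[table]
    if spec.column not in columns:
        raise LookupError(f"column not found: {spec.column}")
    if spec.function not in FUNCTIONS:
        raise LookupError(f"function not found: {spec.function}")
    return Projection(TableScan(table, columns), spec.function, spec.column)

# Roughly: SELECT abs(amount) FROM orders
plan = resolve(ProjectSpec(ReadSpec("orders"), "abs", "amount"))
```

Keeping the spec separate from the resolved plan means every name is checked against the registries once, before any optimization runs.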
The logical optimizer applies rewrite rules that remove redundancies and rearrange operations for efficient execution. The optimized logical plan captures the full intent of the query in a DAG of operations. After logical optimization, Sail converts the logical plan into a physical plan with its own conversion rules in addition to DataFusion’s built-in ones. This physical plan undergoes a final round of optimization to ensure efficient execution.
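As an illustration of what a rewrite rule does, the following sketch applies two toy rules to a tiny plan tree until nothing changes. The plan nodes (Scan, Filter, Limit) and the rules (remove_trivial_filters, merge_limits) are assumptions made up for this example. They are not DataFusion’s actual optimizer rules or APIs; they only show the redundancy-removal pattern described above.

```python
from dataclasses import dataclass
from typing import Union

# Toy plan nodes, invented for this example.

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    input: "Plan"
    predicate: str  # the string "true" stands for an always-true predicate

@dataclass(frozen=True)
class Limit:
    input: "Plan"
    n: int

Plan = Union[Scan, Filter, Limit]

def remove_trivial_filters(plan):
    """Drop filters whose predicate always holds."""
    if isinstance(plan, Scan):
        return plan
    child = remove_trivial_filters(plan.input)
    if isinstance(plan, Filter):
        return child if plan.predicate == "true" else Filter(child, plan.predicate)
    return Limit(child, plan.n)

def merge_limits(plan):
    """Collapse directly nested limits into the smaller of the two."""
    if isinstance(plan, Scan):
        return plan
    child = merge_limits(plan.input)
    if isinstance(plan, Limit):
        if isinstance(child, Limit):
            return Limit(child.input, min(plan.n, child.n))
        return Limit(child, plan.n)
    return Filter(child, plan.predicate)

def optimize(plan, rules):
    """Apply all rules repeatedly until the plan stops changing."""
    while True:
        rewritten = plan
        for rule in rules:
            rewritten = rule(rewritten)
        if rewritten == plan:
            return plan
        plan = rewritten

original = Limit(Filter(Limit(Scan("t"), 10), "true"), 5)
optimized = optimize(original, [remove_trivial_filters, merge_limits])
# optimized == Limit(Scan("t"), 5)
```

Removing the always-true filter is what places the two limits next to each other, so the second rule can only fire after the first. That interaction is why rules are usually run to a fixed point or in a deliberate order.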
Sail’s custom extensions are key to logical and physical planning. They enable advanced PySpark-compatible features (Python UDFs, data sources, etc.) and guarantee that data remains in the Apache Arrow columnar format, which minimizes copies, maximizes CPU cache efficiency, and ensures consistent performance.
How Sail Extends DataFusion
While Sail builds on DataFusion, LakeSail’s mission and our commitment to Spark compatibility demand capabilities outside DataFusion’s original scope:
Semantic Layer for Spark SQL
Spark SQL follows the Hive dialect, while DataFusion defaults to PostgreSQL semantics. Sail builds a custom semantic layer that translates Spark SQL and DataFrame operations into DataFusion logical plans. This is a true semantic rewrite, not just a mechanical mapping.
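One well-known example of the gap is integer division: in Spark SQL the `/` operator on two integers yields a fractional result, whereas in PostgreSQL it truncates toward zero. The sketch below shows how a semantic layer might bridge that difference by inserting casts during translation. It is a simplified illustration, not Sail’s implementation; the expression classes and the functions spark_divide and evaluate_pg_like are hypothetical.

```python
from dataclasses import dataclass

# Toy expression nodes, invented for this example.

@dataclass(frozen=True)
class IntLit:
    value: int

@dataclass(frozen=True)
class CastToDouble:
    expr: object

@dataclass(frozen=True)
class Divide:
    left: object
    right: object

def spark_divide(left, right):
    """Translate Spark's `/` so that it keeps Spark's meaning on an engine
    with PostgreSQL-like rules: cast both operands to double first."""
    return Divide(CastToDouble(left), CastToDouble(right))

def evaluate_pg_like(expr):
    """Evaluate with PostgreSQL-like rules: int / int truncates toward zero."""
    if isinstance(expr, IntLit):
        return expr.value
    if isinstance(expr, CastToDouble):
        return float(evaluate_pg_like(expr.expr))
    left = evaluate_pg_like(expr.left)
    right = evaluate_pg_like(expr.right)
    if isinstance(left, int) and isinstance(right, int):
        quotient = abs(left) // abs(right)
        return quotient if (left < 0) == (right < 0) else -quotient
    return left / right

# A mechanical mapping keeps the operator but silently changes the answer.
mechanical = evaluate_pg_like(Divide(IntLit(5), IntLit(2)))   # 2
# The semantic rewrite preserves what the Spark user asked for.
semantic = evaluate_pg_like(spark_divide(IntLit(5), IntLit(2)))  # 2.5
```

The same operator symbol maps to different plan nodes depending on the source dialect, which is the difference between a semantic rewrite and a mechanical mapping.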
Extended Logical and Physical Plans
Sail adds custom nodes to handle Spark operations beyond typical SQL engines, including full PySpark API support (UDFs, UDAFs, UDWFs, and UDTFs) so Python transformations execute efficiently on Arrow-native columnar data.
Distributed Processing
Sail replaces DataFusion’s single-node executor with a distributed driver–worker architecture for scalable execution.
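The driver–worker pattern can be sketched in a few lines of Python. This is a conceptual toy that assumes threads in one process stand in for workers and a list of integers stands in for a table. It says nothing about how Sail actually schedules tasks or moves data, and the names worker_task and driver are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def worker_task(partition):
    """Work done independently on one partition: sum the even values."""
    return sum(x for x in partition if x % 2 == 0)

def driver(rows, num_partitions, num_workers):
    """Split the input into partitions, hand one task per partition to the
    worker pool, then merge the partial results into the final answer."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_task, partitions))
    return sum(partials)

# Roughly: SELECT sum(x) FROM t WHERE x % 2 = 0, over the values 0..9
total = driver(list(range(10)), num_partitions=4, num_workers=2)  # 20
```

The driver only plans and merges, while each worker sees a single partition. This is why partial aggregation on the workers followed by a final aggregation on the driver is a common shape for distributed plans.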
Integrations
Sail implements numerous integrations to support external catalogs, lakehouse table formats such as Delta Lake and Apache Iceberg, and a wide range of object stores.
For a closer look, explore the concepts in the Sail documentation and our related posts on SQL parsing, Delta Lake, Iceberg, Python UDFs, and distributed processing.
If you’d like to follow our journey, join our Slack community and drop us a GitHub star!