Sail Turns One
The LakeSail Team
September 3, 2025
6 min read
We have just reached the one-year anniversary of Sail’s very first public release. When we launched version 0.1.0.dev0, our goal was simple but ambitious: to offer a new kind of distributed compute framework, one that’s faster, more reliable, and built to unify the disparate world of data and AI workloads. Spark fundamentally transformed the landscape for big data, but we believed the next leap required rethinking the architecture from the ground up by evolving beyond the JVM and leveraging Rust’s advantages.
Why We Started LakeSail
In today’s data-driven world, organizations of every size face a common challenge: efficiently and effectively leveraging vast amounts of complex data from diverse sources. Inadequate data tooling hinders business innovation and growth, a challenge that’s only intensified in the AI era. System fragmentation, inefficient resource usage, and the difficulty of integrating AI with data workloads remain unsolved bottlenecks. Most existing solutions focus on either data processing or AI workloads, but not both, forcing teams to manage multiple systems and limiting insights and agility.
The idea behind LakeSail began with a question of first principles: What are the fundamental reasons distributed data processing still feels heavy, slow, and hard to optimize, especially when it comes to unifying data and AI workloads? While this generation of big data tooling has driven significant progress, we noticed persistent inefficiencies in JVM-based execution models, memory handling, and support for multimodal workloads. We believed there must be a better path forward, one that has only recently become technically viable.
When Spark launched in 2009, the JVM was the most pragmatic choice: it offered portability, concurrency, and support for Scala, which brought functional programming to distributed compute. But the JVM also imposed unforeseen trade-offs, especially in handling memory-intensive workloads and executing Python code efficiently.
Why It’s Possible Now
Only recently has Rust matured into a stable, production-grade systems language with a thriving ecosystem. It combines low-level control with memory safety, compile-time guarantees, and modern tooling. Crucially, it eliminates the runtime costs and unpredictability of the JVM: no garbage collection, no hidden allocations, and full control over execution.
We realized that with Rust it’s finally possible to build a distributed engine that’s fast, predictable, safe, and deeply interoperable with Python, without the architectural bottlenecks Spark users have come to accept as the norm. Check out our blog on Rust vs the JVM.
By building every single piece of the compute framework in Rust, we achieved results that surpassed even our own expectations. On an industry-standard benchmark derived from TPC-H, Sail outperformed Spark by ~4x at only 6% of the hardware cost. The outcome offered strong validation of the research and intuition that guided our early decisions.
Our mission remains clear: to unify batch, streaming, and AI compute in a single composable, high-performance multimodal framework built for the future of data and AI.
A Year of Progress
In just twelve months, Sail went from being a prototype built on conviction to a mature, production-grade solution capable of powering serious workloads across multiple verticals. Here are some of the notable milestones our team has accomplished in the last year:
- Distributed Architecture: Sail runs reliably on Kubernetes, with full cluster-level task scheduling and resource management. Our enterprise platform will also support Unikernels via Unikraft. Learn more about our distributed architecture.
- SQL Parsing Built In-House: We designed our own SQL parser to ensure parity with the Spark SQL syntax, enabling seamless workload migration from Spark while giving us direct control over query parsing for optimization and future extensibility.
- Performant PySpark UDF Support: The PySpark APIs for user-defined functions are powered by Arrow’s in-memory format and an embedded Python interpreter in the Rust worker (see the first sketch after this list).
- Expanding Spark Parity: Compatibility across Spark SQL, Spark DataFrame APIs, and Spark functions continues to expand with each release.
- MCP Server: Our Model Context Protocol (MCP) server allows users to interact with data using natural language queries.
- Delta Lake Support: Read and write Delta Lake tables natively with predicate pushdown, schema evolution, and time travel (see the second sketch after this list).
- Cloud Storage Integration: Sail now has native integration with cloud storage services including AWS S3, Azure, Google Cloud Storage (GCS), and Cloudflare R2.
- Stream Processing Foundation: We began the groundwork for native streaming this month. Not only is the foundation in place, but it already fits cleanly into Sail’s architecture, a strong endorsement of the design decisions we’ve made over the past year.
- Community Growth: We’ve seen a surge of contributions from engineers around the world. Open-source development has become central to our roadmap. Be sure to join our Slack community if you haven’t yet!
- Production Success: Most importantly, Sail is being used to power massive pipelines. Hearing from teams running Sail in production has been the highlight of our year. If you’ve been using Sail in production, please let us know and we’ll send you some free merch!
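To make a couple of these milestones concrete, here is a minimal sketch of the PySpark UDF workflow. It assumes a Sail server is already running and reachable over Spark Connect, with sc://localhost:50051 as a placeholder address; the client side is standard PySpark.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Connect to a running Sail server over Spark Connect.
# The address is a placeholder for wherever your Sail instance listens.
spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

# An ordinary Python UDF; on Sail, it runs in the Python interpreter
# embedded in the Rust worker, with data exchanged via Arrow.
@udf(returnType=StringType())
def shout(name):
    return name.upper() if name is not None else None

df = spark.createDataFrame([("sail",), ("spark",)], ["name"])
df.select(shout(df["name"]).alias("loud_name")).show()
```

Because Sail speaks the Spark Connect protocol, an existing PySpark script typically only needs its connection string changed to run against Sail.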
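And here is a second sketch covering the Delta Lake and cloud storage milestones together. The bucket path is a placeholder, object-store credentials are assumed to be configured in your environment, and the versionAsOf time-travel option follows Delta Lake’s standard reader API:

```python
# Reuses the `spark` session from the sketch above.
events = spark.createDataFrame(
    [(1, "signup"), (2, "login")], ["user_id", "event"]
)

# Write a Delta Lake table directly to cloud storage (here, S3).
events.write.format("delta").mode("overwrite").save("s3://my-bucket/events")

# Read it back with a filter that can be pushed down to the Delta scan.
logins = (
    spark.read.format("delta")
    .load("s3://my-bucket/events")
    .where("event = 'login'")
)
logins.show()

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("s3://my-bucket/events")
v0.show()
```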
Looking Ahead
This past year has been exciting, but the team is still growing, and we’re just getting started. Here’s what we have in store:
- Sail UI and Improved Observability: We aim to provide better tools for users to troubleshoot jobs and understand performance characteristics.
- Continued Spark Parity Expansion: Maintaining compatibility with Spark remains a priority, ensuring that Sail can serve as a reliable drop-in replacement as Spark evolves.
- Stream Processing: When we launch stream processing, users will be able to handle continuously arriving data with all the key streaming features, including change data feeds, watermarks, and checkpoints.
Join the Community
Sail wouldn’t be where it is today without the community around it. From GitHub issues and pull requests to community benchmarks and real-world use cases, the open-source ecosystem around Sail has become a core part of our identity.
If you’re excited about the future of distributed compute, we’d love to have you involved! You can:
- Star the Sail GitHub repo to follow our development and show your support.
- Submit issues, bug reports, or feature suggestions to help improve the roadmap.
- Contribute directly to the codebase. Check out the open issues or drop into the discussions.
- Join our Slack community to ask questions, discuss ideas, and connect with other contributors.
If you’re already using Sail or exploring adoption at scale, our Enterprise Support offering provides flexible support and custom integrations, helping you optimize workloads and scale with confidence, backed by our team’s expertise.
Get in Touch to Learn More