Why It’s Possible Now: Sail 0.2 and the Evolution of Distributed Compute Frameworks
The LakeSail Team
December 3, 2024
We are thrilled to officially announce the release of Sail 0.2, introducing distributed processing on Kubernetes. With its unified shuffle architecture and seamless compatibility with existing PySpark workflows, Sail 0.2 represents a major leap forward in redefining distributed computing and advancing our vision of a unified, efficient, and modern data framework. For a deeper technical dive, check out our preview release blog post.
The evolution of data processing frameworks has closely paralleled the rapid advancements in programming languages and computational requirements. Since the release of Spark in 2009, the industry has seen wave after wave of innovation, and yet a truly unified framework for distributed compute—encompassing batch, stream, and AI workloads—has remained elusive. Today, LakeSail’s Sail represents a natural progression in distributed computing, harnessing Rust’s speed, safety, and stability to finally unify these workloads into one seamless framework.
Tools Available at the Time
When Apache Spark debuted in 2009 and Apache Flink followed in 2011, both frameworks were built in JVM programming languages, which are known for high resource usage and limited language interoperability. At the time, Scala was likely the optimal language choice, bringing functional programming into the JVM ecosystem with strong concurrency support, ideal for distributed batch processing. Flink, on the other hand, was particularly adept at real-time, low-latency stream processing, offering useful capabilities for managing continuous data flows. Since Rust wasn’t released until 2015, Spark and Flink developers initially had limited language options available to them.
When Spark and Flink were designed, the JVM provided a stable, versatile platform. However, the JVM falls short for today’s Python-centric AI needs, where integrating Python code with the JVM is highly inefficient. Python UDFs (user-defined functions) in Spark suffer from poor performance because data must be serialized between the JVM and isolated Python worker processes, which prevents memory sharing and slows execution. This fragmentation discourages using JVM-based systems for modern AI and machine learning workflows, where high-speed data exchange and resource sharing are paramount.
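To make that overhead concrete, here is a minimal sketch of a classic row-at-a-time Python UDF in PySpark (the function and column names are purely illustrative). On JVM-based Spark, every row below is pickled, shipped to a separate Python worker process, evaluated, and shipped back:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-overhead").getOrCreate()

# A plain Python UDF: each input row crosses the JVM/Python process
# boundary individually, with pickling on both sides.
@udf(returnType=DoubleType())
def fahrenheit(celsius):
    return celsius * 9.0 / 5.0 + 32.0

df = spark.range(1_000_000).selectExpr("CAST(id AS DOUBLE) AS celsius")
df.select(fahrenheit("celsius").alias("fahrenheit")).show(5)
```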
In 2017, Ray by Anyscale entered the distributed processing ecosystem with a new vision: to cater specifically to compute-heavy AI workloads. Developed in C++, Ray offered native support for Python, bypassing some of the limitations that Spark and Flink encountered in AI applications. Ray’s architecture allowed for flexibility and scalability in tasks like distributed training, making it a favorite among machine learning engineers.
However, when Ray was developed, Rust had not yet reached maturity, making C++ an understandable choice. If Ray were developed today, however, it would likely be built in Rust. Although it’s technically possible to unify general-purpose ETL/data processing (I/O-bound) and AI workloads (compute-intensive) in C++, the complexity involved makes it an impractical choice. C++ lacks memory safety, introducing potential issues such as memory leaks, null pointer access, and memory corruption. These risks not only complicate development but also raise significant security concerns.
In fact, this issue is so pressing that the FBI and CISA (Cybersecurity and Infrastructure Security Agency) recently released a joint report warning against memory-unsafe languages like C/C++:
The development of new product lines for use in service of critical infrastructure or NCFs in a memory-unsafe language (e.g., C or C++) where there are readily available alternative memory-safe languages that could be used is dangerous and significantly elevates risk to national security, national economic security, and national public health and safety.
— Product Security and Best Practices. FBI and CISA, Oct 16, 2024.
Additionally, earlier this year, the White House released a statement advocating for the adoption of memory-safe programming languages to mitigate vulnerabilities associated with unsafe memory management:
Despite rigorous code reviews as well as other preventive and detective controls, up to 70 percent of security vulnerabilities in memory unsafe languages patched and assigned a CVE designation are due to memory safety issues. When large code bases are migrated to a memory safe language, evidence shows that memory safety vulnerabilities are nearly eliminated.
— White House Office of the National Cyber Director (ONCD), Press Release, Feb 26, 2024; see also the full report.
With Rust now mature and widely adopted, these security and stability concerns associated with C++ can be largely mitigated. Rust’s memory-safe design makes it an ideal foundation for a unified framework, paving the way for a secure, high-performance system that avoids the pitfalls of legacy languages.
The Evolution of Distributed Processing
Spark’s 2009 release revolutionized distributed data processing, establishing a high-performance framework for batch processing. When Flink followed, it expanded distributed compute’s scope by unifying batch and stream processing, capitalizing on a similar architecture. Later, as AI computation became critical, Ray provided a solution for compute-heavy tasks like training and inference across distributed systems.
Despite these advancements, no framework has unified all three paradigms—batch, stream, and AI processing—within a single, cohesive system. The reason lies in historical software and programming language limitations rather than in the frameworks themselves. The JVM-based architectures of Spark and Flink were ill-suited to the demands of modern AI, where Python plays a central role. Running Python code against the JVM is challenging and highly inefficient, primarily due to the need for separate processes, limited memory sharing, and the JVM’s high resource usage, all of which introduce significant overhead.
Rust now enables this unification because it provides memory safety alongside C++-level performance. Python code can run in the same process as the Rust engine, without the limitations seen with the JVM, enabling zero-copy data exchange with Apache Arrow. As a result, Sail offers unprecedented performance and simplicity in a single framework.
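As a sketch of what the Arrow-based path looks like at the API level, consider PySpark’s pandas_udf, which moves data across the engine/Python boundary as Arrow columnar batches rather than pickled rows. Sail speaks the same PySpark API, so code like this is exactly the kind that benefits from an Arrow-native engine (this snippet illustrates the API; it is not a Sail-specific benchmark):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("arrow-udf").getOrCreate()

# A vectorized pandas UDF: data is transferred as Apache Arrow record
# batches and processed a whole column (pandas Series) at a time,
# avoiding per-row serialization.
@pandas_udf("double")
def fahrenheit(celsius: pd.Series) -> pd.Series:
    return celsius * 9.0 / 5.0 + 32.0

df = spark.range(1_000_000).selectExpr("CAST(id AS DOUBLE) AS celsius")
df.select(fahrenheit("celsius").alias("fahrenheit")).show(5)
```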
Sail: The Next Generation of Distributed Compute
Sail is the first solution to unify distributed batch, stream, and AI workloads in a way that’s both efficient and practical, using Rust to its full potential. In 2024, Rust has matured into a globally adopted language that provides the speed of C++ along with the memory safety C++ lacks, enabling Sail to deliver unprecedented performance without the memory management challenges associated with C++.
By re-engineering Spark’s backend in Rust, Sail leverages the Spark Connect protocol for compatibility, allowing seamless integration for existing PySpark users. This approach facilitates zero-copy execution with Apache Arrow, enhancing Python use cases without the performance constraints of the JVM. As a result, Python processes gain direct memory access, significantly speeding up execution times and unlocking new AI and data processing possibilities.
The beauty of it all? We make it a no-brainer to try out Sail. Transitioning over is remarkably smooth: if you’re already using a PySpark client, you don’t need to change a single line of code (beyond updating the host URL, of course). Sail’s Rust-powered engine is ~4x faster than Spark, with a 94% reduction in hardware costs (figures derived from the industry-standard TPC-H benchmark), allowing you to consolidate data processing and AI frameworks without compromising performance or reliability.
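For example, here is a minimal sketch of the client-side switch (the endpoint below is a placeholder; substitute the URL your Sail deployment exposes):

```python
from pyspark.sql import SparkSession

# Point the stock PySpark Spark Connect client at a Sail endpoint
# instead of a Spark cluster. The remote URL is the only change;
# "sc://localhost:50051" is a placeholder for your deployment.
spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

# Everything from here on is ordinary, unmodified PySpark code.
df = spark.sql("SELECT 42 AS answer")
df.show()
```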
In today’s landscape, Sail represents a transformative shift in distributed compute, unifying batch, stream, and AI processing into a single, cohesive framework and eliminating the need to set up multiple disparate data processing frameworks stitched together into a Frankenstein-like architecture. Built in Rust, Sail not only solves the limitations of legacy JVM systems but also opens the door to a new realm of data processing possibilities—an essential and inevitable evolution in distributed compute solutions.
Sail’s Support for Spark
As of now, we have mined 3,809 PySpark tests from the Spark codebase, of which 2,590 (~68%) pass on Sail.
We have focused first on the most commonly used features of Spark. The features implemented so far already support a wide range of data analytics tasks, including all 22 queries in the derived TPC-H benchmark as well as 79 of the 99 queries in the derived TPC-DS benchmark.
The remaining failures form a long tail involving less-used SQL functions, edge cases, and formatting discrepancies, which we will continue to address in future releases.
Getting Started
Check out the deployment docs to get started with Sail 0.2 on Kubernetes. Don’t hesitate to reach out if you have any questions or would like to share your experiences with us!
Get in Touch
LakeSail offers flexible enterprise support options, including managing Sail on Kubernetes. To discover how Sail can enhance your data workflows and significantly reduce costs, get in touch with us today!