Sail 0.2 and the Future of Distributed Processing
The LakeSail Team
November 19, 2024
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing. With a high-performance, Rust-based implementation, Sail 0.2 takes another bold step in creating a unified solution for Big Data and AI workloads. Designed to remove the limitations of JVM-based frameworks and elevate performance with Rust’s inherent efficiency, Sail 0.2 builds on our commitment to support modern data infrastructure needs—spanning batch, streaming, and AI.
In an era where powerful single machines can process substantial data volumes, the question arises—why distributed processing? Many applications require data handling that goes far beyond what isolated hardware can achieve. Distributed processing enables scalability across multiple nodes, fault tolerance, and optimized resource allocation, all crucial for handling diverse and dynamic workloads. This approach supports efficient and resilient workflows, especially for businesses managing real-time and geographically dispersed data needs.
High-Level Architecture
The architecture of Sail 0.2 is built on a separation between control and data planes, allowing for fine-grained resource management and optimized data movement. For the control plane, we use gRPC alongside the actor model, creating a robust system that manages distributed workflows in our framework. The data plane leverages both gRPC and the Arrow IPC protocol, establishing an efficient pipeline for shuffle data within the cluster. Support for cloud storage APIs for remote shuffle data is planned for future releases.
Sail’s distributed processing framework organizes workloads into jobs, stages, and tasks. A job is a self-contained unit of work managed by the driver, which breaks it down into stages at data shuffle boundaries. Each stage is further divided into tasks that each handle a partition of data, with the driver assigning these tasks to workers that handle multiple tasks in parallel, maximizing efficiency.
Rust Over JVM: Why It Matters
Rust has proven instrumental in realizing Sail’s highly efficient and performant design, particularly across the control and data planes.
In the control plane, Rust, combined with the Tokio runtime, ensures correctness by providing memory safety and eliminating race conditions. This reduces complexity and improves productivity for Sail developers. Unlike Java and Scala, which lack async
/await
as built-in language features, Rust’s support for asynchronous programming simplifies concurrent programming, making it easier to implement and maintain.
For the data plane, Rust with Tokio and Apache Arrow enable efficiency through serialization-free data transfer, resulting in high throughput and low memory usage. Arrow’s columnar data format further enhances Sail’s performance, allowing Sail to handle data-intensive workloads with minimal resource consumption and maximum speed.
Unified Shuffle: Best Fit for Cloud-Native
In Sail 0.2 we have built the basis for a unified shuffle architecture that will support both blocking and pipelined shuffle for unified batch and stream processing in future releases.
We see potential to eliminate the traditional “shuffle service” found in Spark’s architecture, positioning Sail as a better fit for cloud-native use cases with decoupled storage and compute.
In the preview release, Sail supports pipelined shuffle (a concept popularized by Flink for real-time data handling in streaming workloads) with in-memory shuffle data, avoiding local and remote data persistence. Future releases will introduce additional shuffle mechanisms, further enhancing Sail’s versatility and scalability.
Leveraging the Actor Model for Robust Concurrency
The actor model forms the backbone of Sail’s control plane, offering a concurrency model that ensures state is safely managed without locks. The event loop pattern, common in frameworks like Spark, Flink, and other distributed data processing systems, influences Sail’s design. The driver and worker are modeled as actors within a single-threaded, non-blocking event loop. While Flink relies on the JVM based Apache Pekko framework for its actor library, we were able to implement our own minimal actor framework in Rust. By focusing only on essential requirements—without actor hierarchy or remote actors—Sail’s actor model has proven lightweight, efficient, and highly effective in supporting distributed processing needs.
This approach resonates with the concurrency philosophy of the Go programming language: “Don’t communicate by sharing memory; share memory by communicating.” Applying this principle eliminates shared-state pitfalls, keeping the control plane implementation clean and adaptable for future expansion.
LakeSail’s Mission
LakeSail is redefining what’s possible in distributed data processing by introducing a high-performance, Rust-based solution that streamlines complexity while enhancing efficiency. Sail is a complete, Rust-native solution, eliminating JVM constraints entirely and enabling seamless migration of existing Spark code without complex rewrites. We believe this evolution is both natural and inevitable. Just as hardware advances have driven exponential gains in speed and efficiency, Sail pushes the boundaries in software, striving to match these innovations with a streamlined, high-performance solution. By unifying the management of batch, streaming, and AI workloads within a single framework, Sail brings a new level of simplicity and power that fits seamlessly with modern data processing requirements.
This streamlined framework delivers significant performance improvements—running ~4x faster than Spark and reducing hardware costs by 94% (derived from the industry-standard TPC-H benchmark)—establishing LakeSail as the next generation of data processing frameworks.
Sail’s Support for Spark
Admittedly, the path we have taken is not easy. Spark has developed rich APIs over the past 15 years, and building compatibility for it requires a huge engineering effort. We have been taking systematic approaches to this problem. As of now, we have mined 3,809 PySpark tests from the Spark codebase, among which 2,504 (65.7%) are successful on Sail.
We have focused on the most commonly used features of Spark first. The features we have implemented already support a wide range of data analytics tasks, including all 22 queries in the derived TPC-H benchmark as well as 79 out of the 99 queries in the derived TPC-DS benchmark.
When looking at the test coverage numbers alone, Sail’s capability may seem limited, but we have found that there is a long tail of failed tests due to formatting discrepancies, edge cases, and less-used SQL functions, which we will continue tackling in future releases.
Getting Started
Check out the deployment docs to get started with Sail 0.2 on Kubernetes. Don’t hesitate to reach out if you have any questions or would like to share your experiences with us!
What’s Next
We are stabilizing the user interface and polishing the documentation as we gather feedback for this preview release (0.2.0.dev0
). We will continue feature developments for distributed processing, around themes such as unified shuffle, YARN support, fault tolerance, and observability.
Have a specific feature in mind? Let us know—we’re committed to building Sail to meet your needs. LakeSail offers flexible enterprise support options, including managing Sail on Kubernetes.
Last Updated: December 3, 2024