The First PySail Release

The LakeSail Team
August 30, 2024

We are thrilled to announce the 0.1 release of Sail. Sail is a computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads. Sail is now available as a Python package on PyPI, with documentation accessible here.

In this first milestone, Sail features a drop-in replacement for Spark SQL and the Spark DataFrame API in single-process settings. To see Sail in action, you can install it in your Python environment.

```bash
pip install 'pysail[spark]'
```

With a few lines of code, you can start exploring data with a regular PySpark client, while the computation is powered by a Sail server.

```python
from pysail.spark import SparkConnectServer
from pyspark.sql import SparkSession

server = SparkConnectServer()
server.start()
_, port = server.listening_address

spark = SparkSession.builder.remote(f"sc://localhost:{port}").getOrCreate()

spark.sql("SELECT 1 + 1").show()

# Please configure AWS credentials in your environment.
# Please replace the S3 path with your dataset path.
spark.read.parquet("s3://bucket/key").show()
```

Behind the scenes, the Spark session communicates with the Sail server using the Spark Connect protocol. Sail is written in Rust and built on top of Apache Arrow and Apache DataFusion. These design choices let Sail stand out for its robustness, performance, and resource efficiency compared with the JVM-based compute engine in Spark.
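Because the client side is ordinary PySpark, DataFrame code can be pointed at Sail through the same connection string. The following is a minimal sketch rather than an official example: the helper names `spark_connect_url` and `run_dataframe_demo` and the in-memory rows are hypothetical, and it assumes the DataFrame operations used here are among those Sail supports in this release.

```python
def spark_connect_url(host: str, port: int) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def run_dataframe_demo() -> None:
    """Start a local Sail server and run a small DataFrame aggregation on it."""
    # Imported lazily so the URL helper above works without pysail installed.
    from pysail.spark import SparkConnectServer
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    server = SparkConnectServer()
    server.start()
    _, port = server.listening_address

    url = spark_connect_url("localhost", port)
    spark = SparkSession.builder.remote(url).getOrCreate()

    # Hypothetical rows, used only for illustration.
    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
    df.groupBy("key").agg(F.sum("value").alias("total")).orderBy("key").show()

    spark.stop()
```

Calling `run_dataframe_demo()` in an environment where `pysail[spark]` is installed should run the aggregation on the Sail server; the PySpark code itself does not change.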

We believe the best way to innovate in Big Data is by working backwards from the existing user experience and improving it piece by piece. We picked Apache Spark as Sail’s first integration point because of its wide adoption. As you have seen, existing Spark workloads can benefit from Sail with ease. A smooth transition means smooth sailing for you as a user: migration costs in production stay minimal, while performance improves and costs drop significantly.

Admittedly, the path we have taken is not easy. Spark has developed rich APIs over the past 15 years, and building compatibility with them requires a huge engineering effort. We have taken a systematic approach to this problem. As of now, we have mined 1,580 PySpark tests from the Spark codebase, of which 838 (53.0%) pass on Sail. We have also mined 2,230 Spark SQL statements or expressions, of which 1,396 (62.6%) can be parsed by Sail. The test suites run on every code push to the Sail repository. It is motivating to see the number of passing tests grow over time. We have also developed tools to aggregate error messages for failed tests, which provide valuable insights when we prioritize features in project planning.
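The percentages above follow directly from the reported counts, and the idea of aggregating error messages can be illustrated in a few lines. The sketch below is not the tooling used in the Sail repository: the functions `pass_rate`, `normalize`, and `aggregate_errors` are hypothetical, and the sample error messages are made up for illustration rather than taken from real Sail output.

```python
import re
from collections import Counter


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def normalize(message: str) -> str:
    """Replace quoted names and numbers so similar errors group together."""
    message = re.sub(r"'[^']*'", "'<name>'", message)
    return re.sub(r"\d+", "<n>", message)


def aggregate_errors(messages):
    """Count failures by normalized message, most frequent first."""
    return Counter(normalize(m) for m in messages).most_common()


# Counts reported in this post.
print(f"PySpark tests passing: {pass_rate(838, 1580)}%")
print(f"SQL statements parsed: {pass_rate(1396, 2230)}%")

# Made-up error messages, for illustration only.
sample_errors = [
    "unsupported function 'foo'",
    "unsupported function 'bar'",
    "parse error at line 3",
    "unsupported function 'baz'",
]
for message, count in aggregate_errors(sample_errors):
    print(count, message)
```

Grouping by a normalized message is one simple way to see which missing feature accounts for the most failures; a real tool would likely need more careful normalization.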

Judged by the test coverage numbers alone, Sail’s capability may seem limited. But we have found that there is a long tail of failed tests due to formatting discrepancies, edge cases, and less-used SQL functions, which we will continue tackling in future releases. The features we have implemented already support a wide range of data analytics tasks, including all 22 queries of a benchmark derived from TPC-H. Our earlier experiments show that Sail runs nearly 4x faster than the JVM version of Spark, with a 94% reduction in hardware cost. You can refer to our previous post for more details.

There is a lot more to imagine after the first Sail release. We will continue expanding Spark compatibility for Sail. We will also investigate big ideas such as data streaming, distributed data processing, and AI model inference. We firmly believe that all of these will fit together and fulfill the mission of Sail. Sail is an open-source project, and we would like to invite you to shape its future with us. You can check out its GitHub repository here. Feel free to create issues or pull requests, and give us a star if you like the project.

The Big Data landscape is changing in the AI era, and we are honored to be with you on the journey ahead.

Get started with Sail today!

Last Updated: September 2, 2024

LakeSail, Inc. © 2024