The First PySail Release
The LakeSail Team
August 30, 2024
We are thrilled to announce the 0.1 release of Sail. Sail is a computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads. Sail is now available as a Python package on PyPI, with documentation accessible here.
In this first milestone, Sail features a drop-in replacement for Spark SQL and the Spark DataFrame API in single-process settings. To see Sail in action, you can install it in your Python environment.
pip install 'pysail[spark]'
With a few lines of code, you can start exploring data with a PySpark client while a Sail server powers the computation.
from pysail.spark import SparkConnectServer
from pyspark.sql import SparkSession

# Start a Sail server and look up the port it is listening on.
server = SparkConnectServer()
server.start()
_, port = server.listening_address

# Connect a regular PySpark session to the Sail server.
spark = SparkSession.builder.remote(f"sc://localhost:{port}").getOrCreate()

# Run a SQL query through Sail.
spark.sql("SELECT 1 + 1").show()

# Read a Parquet dataset from S3.
# Please configure AWS credentials in your environment.
# Please replace the S3 path with your dataset path.
spark.read.parquet("s3://bucket/key").show()
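The same session also accepts DataFrame API calls. The following is a minimal sketch rather than an official example: the `spark_connect_url` and `run_dataframe_demo` helpers and the sample data are hypothetical names introduced for illustration, and the sketch assumes that the DataFrame operations it uses (`createDataFrame`, `groupBy`, `agg`) are covered by the Sail version you have installed.

```python
def spark_connect_url(host: str, port: int) -> str:
    """Build a Spark Connect URL such as sc://localhost:50051 (hypothetical helper)."""
    return f"sc://{host}:{port}"


def run_dataframe_demo() -> None:
    """Start a Sail server and run a small DataFrame aggregation against it."""
    # Imported lazily so that the URL helper above works without pysail installed.
    from pysail.spark import SparkConnectServer
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Same startup steps as in the snippet above.
    server = SparkConnectServer()
    server.start()
    _, port = server.listening_address

    spark = SparkSession.builder.remote(
        spark_connect_url("localhost", port)
    ).getOrCreate()

    # Hypothetical sample data, for illustration only.
    orders = spark.createDataFrame(
        [("apple", 3), ("pear", 5), ("apple", 4)],
        schema=["item", "quantity"],
    )

    # Standard PySpark DataFrame code: total quantity per item.
    totals = orders.groupBy("item").agg(F.sum("quantity").alias("total"))
    totals.show()
```

Calling `run_dataframe_demo()` in an environment where `pysail[spark]` is installed should print one row per item with its summed quantity. The client code is ordinary PySpark; only the remote URL points at the Sail server.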
Behind the scenes, the Spark session communicates with the Sail server using the Spark Connect protocol. Sail is written in the Rust programming language and built on top of Apache Arrow and Apache DataFusion. These design choices let Sail stand out for its robustness, performance, and resource efficiency compared with the JVM-based compute engine in Spark.
We believe the best way to innovate in Big Data is by working backwards from the existing user experience and improving it piece by piece. We picked Apache Spark as Sail’s first integration point because of its wide adoption. As you have seen, existing Spark workloads can benefit from Sail with ease. A smooth transition means smooth sailing for you as a user: production migration costs stay minimal, while you gain enhanced performance and significant cost savings.
There is a lot more to imagine after the first Sail release. We will continue expanding Spark compatibility for Sail. We will also investigate big ideas such as data streaming, distributed data processing, and AI model inference. We firmly believe that all of these will fit together and fulfill the mission of Sail. Sail is an open-source project, and we would like to invite you to shape its future with us. You can check out its GitHub repository here. Feel free to create issues or pull requests, and give us a star if you like the project.
The Big Data landscape is changing in the AI era, and we are honored to be with you on the journey ahead.
Get started with Sail today!