Sail 0.6 is an Arrow-focused release. Two of the additions keep pace with Spark 4’s own Arrow work, and one lets clients query Sail without a Spark client at all: Arrow UDFs from Spark 4 in the Python API, the Variant data type in the SQL type system, and an Arrow Flight SQL server on the wire. Each one carries Arrow end to end, so user code, SQL types, and client protocols see the same in-memory format the engine has always run on.
Arrow UDFs
In 0.6, Sail supports Spark 4’s Arrow UDF decorators. A Python function decorated with @arrow_udf runs against Arrow data directly, with zero-copy access to the same in-memory batches Sail is already processing and no serialization crossing the boundary between user code and query execution. The same decorator works across both engines, but the path it follows inside Sail is different from the path it follows inside Spark.
The classic form of PySpark UDF processes data row by row, with each row arriving at the function as a tuple of Python objects. That gives great flexibility but comes with significant per-row overhead. Spark 2.3 (February 2018) introduced Pandas UDFs, which process batches of data using the Pandas library and partially resolved this problem.
The journey for Arrow UDFs, however, took a bit longer:
- Spark 3.3 (June 2022) introduced the mapInArrow() method, which lets you apply a transformation to an iterator of data batches.
- Spark 4.0 (May 2025) introduced the applyInArrow() method, which calls Arrow UDFs on grouped or co-grouped data. You can chain this method after groupby() or cogroup() in the DataFrame API.
- Spark 4.1 (December 2025) introduced the arrow_udf() and arrow_udtf() decorators for defining Arrow UDFs and UDTFs. The UDF can transform data batches or aggregate grouped data, while the UDTF lets you produce Arrow data as a table.
The user interface for Arrow UDFs is now comparable to its Pandas counterpart in terms of completeness. However, because Spark runs Python UDF code in separate worker processes and represents data row-wise inside the JVM, Arrow UDFs still suffer a significant performance penalty for crossing that boundary.
In Sail, the Python interpreter is embedded in the Rust process via PyO3, and all of Sail's in-memory data is already in the Arrow format, so a Python Arrow UDF interacts with query execution with no serialization and zero-copy data access. You can define custom data processing logic in Python that runs as close to the engine as Rust-native code, as the example below shows. Sail's architecture is therefore a natural complement to the Arrow UDF interface introduced in Spark 4.
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql.functions import arrow_udf

@arrow_udf("long")
def square(v: pa.Array) -> pa.Array:
    return pc.multiply(v, v)

spark.range(3).select(square("id")).show()
# +----------+
# |square(id)|
# +----------+
# |         0|
# |         1|
# |         4|
# +----------+

Variant Data Type
Semi-structured data has always been an awkward fit for fixed-schema table formats. Variant is the open binary standard the lakehouse ecosystem is converging on for that case: compact, schema-free, and much faster to query than JSON strings because path lookups run against binary data instead of raw text. Sail 0.6 adds it to the SQL type system.
The integration builds directly on the Arrow Rust community's great work on Variant support. Sail exposes the type through SQL functions that convert JSON into variant values and query them using path expressions. The minimal example below parses a JSON literal into a variant value, reads a path inside it, and casts the result to the requested type:
SELECT variant_get(parse_json('{"a": [42]}'), '$.a[0]', 'int') AS v

The parse_json function converts a JSON string to the variant data type, which is then queried via the variant_get function and its path syntax. Both functions operate on Sail's native variant encoding, which means path lookups run against binary data rather than re-parsing a string on every call.
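As a mental model for the path syntax only (Sail's variant_get runs against the binary variant encoding, not Python objects), a toy evaluator for simple paths like '$.a[0]' can be written in a few lines:

```python
import json
import re

# Illustrative only: a toy evaluator for simple variant paths such as '$.a[0]'.
# Sail's variant_get operates on the binary variant encoding, not Python dicts.
def toy_variant_get(json_text: str, path: str):
    value = json.loads(json_text)
    # Each path step is either a field access (.name) or an array index ([n]).
    for field, index in re.findall(r"\.(\w+)|\[(\d+)\]", path):
        value = value[field] if field else value[int(index)]
    return value

print(toy_variant_get('{"a": [42]}', '$.a[0]'))  # 42
```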
We’re continuing to work on adding the last few remaining Spark variant functions in upcoming releases. Stay tuned!
Arrow Flight SQL Server
SQL has been the core user interface for Sail since its very inception. This design philosophy shows in the effort we put into developing our own SQL parser and in the recent addition of a system catalog for observability in Sail 0.5.
Arrow Flight SQL is an open gRPC protocol for executing SQL statements and exchanging Arrow data between clients and servers. The specification is maintained by the Apache Arrow project and is implemented by multiple database engines, so clients and servers can be built independently of each other. A Flight SQL client library written against the specification works against any conformant server, which is what makes the protocol a portable entry point into the query engine.
Sail joins those implementations in 0.6 as a Flight SQL server. This is the first protocol we support besides Spark Connect, so any client that speaks Flight SQL can now query Sail directly, without a Spark Connect runtime in the middle.
Here’s how you can start the Flight SQL server via the Sail CLI:
sail flight server --ip 127.0.0.1 --port 32010

For programmatic access, you can use the pysail.flight.FlightSqlServer class, which starts the same server from Python. See the documentation for more details.
Once the server is running, you can send SQL queries from any Flight SQL client. Here is a code snippet for connecting to the server using the adbc-driver-flightsql Python package.
from adbc_driver_flightsql import dbapi
conn = dbapi.connect("grpc://127.0.0.1:32010")
cur = conn.cursor()
cur.execute("SELECT 1 + 1")
cur.fetchall()
conn.close()

Building Toward a Composable Data Stack
The three additions in 0.6 deepen Sail’s role in the composable data stack. Every surface that carries Arrow end to end removes a place where data has to be reshaped on the way through, and a place where Sail would otherwise need a custom adapter to fit into the larger stack. Fewer conversions, fewer glue layers, and more of the system speaking the same in-memory format.
Getting Started
Sail 0.6 is available on PyPI. Install or upgrade with pip install "pysail==0.6.0", or see the installation guide for standalone binary and Docker options.
We’d like to extend our special thanks to @davidlghellin and @tamirkifle, who contributed to the features highlighted in this post. Sail is a community-driven project. If you’d like to shape the future of big data compute, come and build with us!
If there’s a Spark Arrow function or an integration you need sooner, open an issue on GitHub or join our Slack Community to start a chat. We’d love to hear from you.
Managed Sail in Your Cloud
If you’d rather not manage your own infrastructure, you can request early access to the LakeSail Platform and use it for free until its official launch.