Spark’s Python Problem and How Sail Solves It
By Shehab Amin, Heran Lin, and Everett Roeth
December 16, 2025
5 min read
User Defined Functions (UDFs) are a critical extension point in distributed data systems. They provide a mechanism to incorporate custom, application-specific transformations and domain logic directly into execution pipelines. The performance characteristics of UDFs, however, depend entirely on how the underlying engine executes them: specifically, whether the UDF runs in-process or across an inter-process boundary.
In Apache Spark and other JVM-based systems, this distinction produces two very different execution paths. Scala UDFs run inside the JVM, side by side with Spark’s physical plan, benefiting from JIT compilation, operator fusion, and unified memory management.
Spark’s Python UDFs, in contrast, execute out of process, incurring serialization, deserialization, and inter-process communication on every batch. These architectural differences, inherent to the JVM, dictate the practical performance gap that teams experience in production.
Sail takes a different approach. By implementing the engine in Rust and using Apache Arrow as a shared memory layer for zero-copy execution, Sail can execute Python directly in the same Rust process. Instead of treating Python as an external process, Sail executes Python UDFs natively in-process inside the engine, eliminating the overhead seen in Spark entirely.
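The zero-copy idea behind a shared memory layer can be illustrated in plain Python with `memoryview`, which stands in here for Arrow’s shared columnar buffers (a toy analogy, not Sail’s actual mechanism, which is Arrow record batches shared between Rust and Python):

```python
# Several "views" of one buffer share the same underlying memory,
# so handing data to another consumer copies nothing.
buf = bytearray(b"spark->sail")

view_a = memoryview(buf)   # one consumer's view of the buffer
view_b = view_a[0:5]       # a slice: still zero-copy, same memory

buf[0:5] = b"SPARK"        # mutate the underlying buffer once

# Both views observe the change, proving no data was copied.
assert bytes(view_a[0:5]) == b"SPARK"
assert bytes(view_b) == b"SPARK"
```

Arrow applies the same principle across language boundaries: Rust and Python agree on one in-memory layout, so neither side ever needs to re-encode the data for the other.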
Scala UDFs Within Spark’s Execution Engine
Spark’s design centers on the JVM. Its query planner, optimizer, and physical execution engine all run as JVM processes, so the most natural extension mechanism has always been a UDF written in Scala (or Java) that executes directly alongside those components.
Because Scala UDFs run inside the same JVM process as the physical plan, they avoid cross-language boundaries and therefore deliver the most predictable execution profile in Spark. Their performance is limited only by the UDF logic and the JVM itself, not by inter-process communication. For this reason, some teams maintain performance by placing critical UDF logic in Scala and exposing it to Python through thin wrapper functions. This keeps Python as the orchestration layer while ensuring the actual UDF executes inside the JVM.
The tradeoff is clear. Scala UDFs perform well because they live inside Spark’s execution engine, but that same tight coupling to the JVM makes them misaligned with modern Python-first, AI-heavy workloads.
Python UDFs and Spark’s Boundary Problem
Although Spark supports Python UDFs, their execution is fundamentally constrained by the JVM–Python boundary. When a Python UDF runs, data must move between the JVM executor and a separate Python process, crossing this boundary repeatedly:
- Serialize JVM data.
- Transfer batches to the Python worker across processes.
- Deserialize data into Python objects.
- Execute the UDF under the Python interpreter.
- Serialize results.
- Send results back to the JVM process.
- Deserialize results into JVM memory.
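The steps above can be sketched as a toy round trip in pure Python, with `pickle` standing in for Spark’s actual serializers (Spark uses its own Pickle- and Arrow-based wire formats; this is a conceptual model, not Spark internals):

```python
import pickle

def python_udf(x):
    # The user's logic: the only step that does useful work.
    return x * 2

def spark_style_round_trip(jvm_batch):
    # 1. Serialize "JVM" data for transport.
    payload = pickle.dumps(jvm_batch)
    # 2-3. Transfer the batch to the Python worker and deserialize
    #      it into Python objects.
    py_batch = pickle.loads(payload)
    # 4. Execute the UDF under the Python interpreter.
    results = [python_udf(v) for v in py_batch]
    # 5-6. Serialize the results and send them back across the
    #      process boundary.
    wire = pickle.dumps(results)
    # 7. Deserialize the results into "JVM" memory.
    return pickle.loads(wire)

batch = list(range(5))
assert spark_style_round_trip(batch) == [python_udf(v) for v in batch]
```

Of the seven steps, only step 4 performs the user’s computation; the other six exist solely to move bytes across the process boundary, and they repeat for every batch.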
Each step adds overhead. CPU cycles are consumed not by computation but by serialization, deserialization, and inter-process communication. Even with Apache Arrow or pandas UDFs, which batch data column-wise, the underlying boundary remains.
This boundary exists because the JVM and Python cannot practically share the same process space. Spark has no choice but to run Python in separate worker processes and coordinate with them over an inter-process channel. Native accelerators like Databricks’ Photon can sometimes speed up physical plan execution by replacing JVM operators with optimized C++ code, but the JVM still owns the control plane. Because that coordination layer remains JVM-based, Python UDFs still cross the same inter-process boundary. This is why Photon does not support Python UDFs.
Sail’s Native Python Execution
Sail takes advantage of a more modern execution foundation. Because the engine is implemented natively in Rust, it can integrate PyO3, a library that allows Python and Rust to share the same process space and memory directly.
In a Rust-native engine, that boundary does not exist. When the physical plan references a Python UDF, Sail binds the function through PyO3 and invokes it inline from Rust, with no extra serialization and no inter-process hop. Because Python runs natively within the engine process, Sail can apply the same vectorized execution strategies and operator-fusion techniques to workloads that include Python UDFs, allowing Python code to execute at its native speed while avoiding the separate-process overhead inherent to Spark’s architecture.
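The practical difference can be observed in plain Python (a toy sketch: `pickle` again stands in for the IPC serializers, and a direct function call stands in for the engine invoking the UDF inline):

```python
import pickle

def udf(batch):
    # Identity UDF, used only to observe whether data is copied in transit.
    return batch

data = [[1, 2], [3, 4]]

# Out-of-process style: a serialize/deserialize round trip in each
# direction always materializes new Python objects.
via_ipc = pickle.loads(pickle.dumps(udf(pickle.loads(pickle.dumps(data)))))
assert via_ipc == data            # values survive the trip...
assert via_ipc[0] is not data[0]  # ...but every object was copied

# In-process style: the engine calls the UDF inline, so the batch is
# passed by reference and nothing is copied.
via_inprocess = udf(data)
assert via_inprocess[0] is data[0]
```

Equal values, different identities: the out-of-process path pays for copies that the in-process path never makes, and that cost scales with the data, not with the UDF’s logic.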
A Python Path Built for AI Workloads
The demands of AI-driven pipelines look very different from the ETL-focused workloads Spark was originally optimized for. Today, distributed systems frequently combine SQL transformations with model training and inference, embedding generation, vector operations, feature engineering, and integration with specialized Python libraries. These steps rely on frequent, fine-grained interaction with Python and its ecosystem—NumPy, PyArrow compute, model runtimes, tokenizers, and vector databases.
In these workflows, Python is not an auxiliary scripting layer; it is often the main execution environment. AI pipelines depend on tight feedback loops, high-frequency calls into C-extension kernels, and predictable access to columnar memory. They assume Python is on the fast path, not separated behind a runtime boundary.
Sail’s execution model supports this natively. By embedding Python within the same runtime as native operators and operating on shared Arrow buffers, Python logic executes under the same vectorized, multithreaded engine as SQL and Rust-based operations. There is no external Python process, no Py4J coordination, and no repeated marshaling of Arrow batches across runtimes. Python becomes part of the continuous execution plan.
This enables common AI tasks (e.g., embedding computations, feature extraction, lightweight inference, pre-processing steps, or vector-store interactions) to run at their native speed and scale across a cluster. The feedback patterns used in modern frameworks, including iterative tool-driven reasoning found in agent-style applications, benefit naturally from in-process Python execution because they no longer incur cross-runtime penalties.
If you’d like to follow our journey, join our Slack community and drop us a GitHub star!
Get in Touch to Learn More