Spark Rebuilt in Rust

Next-gen lakehouse. Zero rewrites. Every workload.

Try it out · Talk to us →

[Product screenshot: a marimo notebook, "wau-growth-by-region", running on platform.lakesail.com/notebooks with a Python 3.12 kernel. A "Regions" slider and a "Window (weeks)" control drive a chart of weekly active users by region over the last 15 weeks (US-East, EU-West, US-West, APAC, EU-Central, LATAM). Summary: total 268k, growth +139%.]

01 / Spark Connect

Plug in. Nothing else changes.

LakeSail implements the Spark Connect protocol. Point your existing Spark code at a new endpoint and that's it. Same DataFrame API, same libraries, same pipelines. The engine upgrades; your code doesn't.

Update one config line

Swap the .remote() endpoint string. That's the migration. No code changes, no schema conversions, no library updates.

sc://lakesail.yourdomain.com
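A minimal sketch of what that one-line change can look like, assuming the standard PySpark Spark Connect client (`SparkSession.builder.remote`). The `LAKESAIL_ENDPOINT` environment variable and the `resolve_endpoint` and `connect` helpers are hypothetical names used for illustration; only the `sc://` endpoint string comes from the text above.

```python
import os

DEFAULT_ENDPOINT = "sc://lakesail.yourdomain.com"  # endpoint string from the text above


def resolve_endpoint(env=None):
    """Return the Spark Connect URL, preferring an environment override."""
    env = os.environ if env is None else env
    url = env.get("LAKESAIL_ENDPOINT", DEFAULT_ENDPOINT)  # hypothetical variable name
    if not url.startswith("sc://"):
        raise ValueError(f"Spark Connect URLs start with 'sc://', got {url!r}")
    return url


def connect(url=None):
    """Open a Spark Connect session; requires the PySpark Connect client."""
    # Imported lazily so resolve_endpoint() has no third-party dependencies.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url or resolve_endpoint()).getOrCreate()
```

Keeping the endpoint in configuration means the rest of the pipeline code never mentions which engine sits behind the URL.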

Run your existing jobs

Every PySpark DataFrame operation, Spark SQL query, and Python UDF you already have continues to work without modification. LakeSail's Sail engine executes them natively, with no JVM translation layer.

Watch the cost drop

Rust-native execution means no GC pauses, no JVM startup overhead, and scale-to-zero workers. On the derived TPC-H benchmark, that translates to roughly 94% lower compute cost vs JVM-based Spark. Results will vary with your workload.

Add new capabilities at your pace

Once you're running, unlock the agent layer, lakehouse branching, and native Python workloads, none of which require any changes to your existing pipelines.

bash

$ pip install pysail
✓ pysail installed
$ sail spark server --port 50051
✓ Spark Connect server listening on sc://127.0.0.1:50051
$ python pipeline.py
✓ Pipeline complete
→ 8x speedup vs Spark · 94% lower compute cost (derived TPC-H)
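As a rough worked example of how the two headline figures relate, the sketch below assumes that compute cost is simply runtime multiplied by an hourly rate. That pricing model is an assumption of this sketch, not a statement of how the benchmark was costed, and the 1000-unit baseline bill is a made-up number.

```python
def projected_cost(baseline_cost, cost_reduction=0.94):
    """Cost after applying a fractional reduction to a baseline Spark bill."""
    return baseline_cost * (1.0 - cost_reduction)


def implied_rate_ratio(speedup=8.0, cost_reduction=0.94):
    """Hourly-rate ratio implied if cost = runtime x rate.

    new_cost / old_cost = (1 / speedup) * rate_ratio, so
    rate_ratio = (1 - cost_reduction) * speedup.
    """
    return (1.0 - cost_reduction) * speedup


print(round(projected_cost(1000.0), 2))  # 60.0
print(round(implied_rate_ratio(), 2))    # 0.48
print(round(1.0 - 1.0 / 8.0, 3))         # 0.875, the saving from the speedup alone
```

Under that assumption, an 8x speedup by itself accounts for an 87.5% saving. Reaching 94% implies the per-hour resource cost is also roughly halved.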

Inside Spark Connect compatibility →

02 / Open Lakehouse

No lock-in. Your formats, your cloud.

LakeSail is built on open standards end to end. Native Apache Iceberg and Delta Lake support means your tables stay exactly where they are, with no conversion and no copying. Your data stays in open formats in your own AWS account. Switching engines should never require migrating terabytes of data.

ICEBERG

Apache Iceberg

Native read and write support. Time travel, schema evolution, and partition evolution are all supported. LakeSail does not require you to convert or copy Iceberg tables.

DELTA

Delta Lake

Native read and write support for Delta tables. Keep Delta indefinitely or migrate to another format at your own pace: your choice, not ours.
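A hedged sketch of reading both table formats in place through the PySpark DataFrame API. The format names and time-travel options (`versionAsOf` for Delta Lake, `snapshot-id` for Iceberg) follow the conventions of the upstream Spark connectors. Whether LakeSail accepts exactly these option names is an assumption here, and `read_delta` and `read_iceberg` are hypothetical helper names.

```python
def read_delta(spark, path, version=None):
    """Load a Delta table in place, optionally pinned to a table version."""
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)  # Delta Lake time-travel option
    return reader.load(path)


def read_iceberg(spark, table, snapshot_id=None):
    """Load an Iceberg table in place, optionally pinned to a snapshot."""
    reader = spark.read.format("iceberg")
    if snapshot_id is not None:
        reader = reader.option("snapshot-id", snapshot_id)  # Iceberg time-travel option
    return reader.load(table)
```

Both helpers take an existing Spark Connect session as `spark` and return a DataFrame. No table is converted or copied along the way.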

SPARK

Spark Connect protocol

Full Apache Spark Connect compatibility. Any code that runs on Spark 3.5 or Spark 4.x against the Connect protocol runs unchanged on LakeSail. This is not a partial implementation.

ARROW & DF

Apache Arrow & DataFusion

The Sail engine is built on Apache Arrow and DataFusion, both Apache Software Foundation projects with large, independent ecosystems.

Your data stays with you. LakeSail doesn't own your data, doesn't move it, and doesn't require proprietary formats to achieve its performance numbers.

03 / The Engine

What changes when the runtime is Rust

Sail is the open-source Rust engine at LakeSail's core. Built on Apache Arrow and DataFusion. No JVM, no GC, no serialization overhead. One engine handles every workload type.

Rust-native runtime

No JVM startup. No garbage collection pauses. No JVM memory tuning. Sail boots instantly and scales to zero between jobs, so you only pay for the compute you use.

No JVM · Scale-to-zero · Instant startup

Vectorized query execution

Built on Apache Arrow's columnar format and DataFusion's vectorized query engine. Processes data with SIMD acceleration where available.

Columnar execution · Vectorized queries · SIMD acceleration

Unified batch + stream + AI

One engine, one API, one cost model. Run batch ETL, Python workloads, and interactive SQL queries without switching tools, re-learning APIs, or managing separate clusters.

Batch ETL · Stream processing · AI/ML workloads

Native Python at engine speed

Python UDFs and workloads execute natively in-process, with no inter-process serialization and no JVM-to-Python IPC overhead. AI/ML pipelines that previously paid a heavy tax run at native speed.

Python UDFs · In-process execution · No JVM bridge

Stateless, secure workers

Workers are fully stateless: no shuffle data on disk between runs, no leftover JVM processes consuming memory. Each job gets a clean, isolated execution environment. Easier security audits, simpler ops.

Stateless workers · Clean job isolation · No leftover processes

Transparent cost model

You are charged for actual compute hours: transparent, predictable, with no opaque credits. Autoscales to zero between jobs. No minimum spend. No contract lock-in. You see exactly what you're paying for.

Compute-hour billing · No minimum spend · Scale-to-zero
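To make the scale-to-zero arithmetic concrete, here is a small sketch comparing pay-per-use billing with an always-on cluster. The hourly rate and the job runtimes are hypothetical numbers chosen for illustration, not LakeSail prices, and billing is assumed to be proportional to runtime.

```python
def job_cost(runtime_seconds, hourly_rate):
    """Cost of one job when billed only for the time it runs."""
    return runtime_seconds / 3600.0 * hourly_rate


def always_on_cost(hours, hourly_rate):
    """Cost of keeping equivalent capacity running for the whole period."""
    return hours * hourly_rate


RATE = 2.0  # hypothetical price per compute hour, arbitrary currency units
daily_jobs = [90, 300, 45, 600]  # hypothetical job runtimes in seconds

scale_to_zero = sum(job_cost(s, RATE) for s in daily_jobs)
idle_cluster = always_on_cost(24, RATE)
print(f"{scale_to_zero:.3f} vs {idle_cluster:.2f}")
```

The gap between the two numbers is the cost of idle time, so it narrows as a cluster gets busier.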

How data engineering runs on LakeSail →

04 / Python & AI Workloads

Native Python. No JVM tax.

Other engines run Python work in separate worker processes and move data across the JVM boundary. LakeSail runs Python natively at engine speed, with no inter-process overhead and no tuning required.

Runtime UDFs without serialization

Define Python UDFs inline in your PySpark code. Sail executes them in the Rust engine via PyO3, with no pickling, no JVM bridge, and no data copying between processes.

AI/ML pipelines as first-class workloads

LLM inference, embedding generation, and model scoring are native workload types, not workarounds. Feed your lakehouse data directly into ML pipelines without building data bridges.

Multimodal lakehouse

Process PDFs, images, and video as first-class lakehouse data types. Structured and unstructured data live in one query, with no ETL step to a separate vector store or object store pipeline.

Scale from laptop to cloud

Develop locally on the same Sail runtime, then point the same workload at production.

Serialization overhead

Faster vs JVM Spark (per derived TPC-H benchmark)

94%

Lower compute cost vs Spark (per derived TPC-H benchmark)

Migration cost

PYTHON model_scoring.py

from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType
import my_model

# UDF runs at engine speed via PyO3
@udf(returnType=FloatType())
def score(text):
    return my_model.predict(text)

# `spark` is an existing Spark Connect session.
# The chain is parenthesized so it parses, and col("content") replaces
# df.content, which referenced df before it was assigned.
(
    spark.read.parquet("s3://events/")
    .withColumn("score", score(col("content")))
    .write.saveAsTable("scores")
)

How Python workloads run natively →

05 / Agent Layer

Built for AI agents from day one.

Competitors retrofit agent support onto a JVM platform that was never designed for it. LakeSail ships an MCP server, lakehouse branching, and full audit trails as core engine features, not add-ons.

Native MCP server

LakeSail exposes a Model Context Protocol server out of the box. Connect any MCP-compatible AI agent (Claude, GPT, or a custom agent) directly to your lakehouse. Query, transform, and write data without building a custom tool layer.

MCP · Tool use · Any LLM
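For readers unfamiliar with MCP, the sketch below shows the shape of the messages an agent exchanges with any MCP server. MCP messages are JSON-RPC 2.0, and `tools/list` and `tools/call` are methods defined by the protocol. The tool name `execute_sql` and its `query` argument are hypothetical: the tools LakeSail actually exposes are not described here. In practice an MCP SDK builds these messages and performs the `initialize` handshake for you.

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC request ids must be unique per session


def mcp_request(method, params=None):
    """Build a JSON-RPC 2.0 request as used by the Model Context Protocol."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it exposes, then call one of them.
list_tools = mcp_request("tools/list")
run_query = mcp_request(
    "tools/call",
    {
        "name": "execute_sql",  # hypothetical tool name
        "arguments": {"query": "SELECT region, COUNT(*) FROM events GROUP BY region"},
    },
)
print(json.dumps(run_query, indent=2))
```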

Lakehouse branching

Agents can branch your lakehouse like a git repo. Create an isolated sandbox for a transformation or analysis, run it, review the diff, and commit or discard, all without touching production data.

Branching · Sandbox · Reversible

Elastic agent compute

Compute provisions per agent workload, scales with execution, and releases when work is done. Sub-second cold starts on the Rust-native engine mean short-lived agent loops never pay JVM warm-up, and there are no idle clusters between calls.

Scale-to-zero · Sub-second start

Dynamic Python tooling

Agents can define Python tools and data sources at runtime: custom logic that runs at engine speed against lakehouse data. No pre-registration, no redeployment. The agent writes it, Sail executes it.

Dynamic tools · Data sources · PyO3

Agent execution flow

Step 01

Agent receives task

LLM agent gets context via MCP server tools

Step 02

Branch created

Isolated sandbox branched from production lakehouse

Step 03

Actions executed

Queries, transforms, and writes, all audited in real time

Step 04

Human reviews diff

Exact changes surfaced for approval before any commit

Step 05

Commit or discard

One-click merge to production, or clean rollback
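The five steps above can be modeled with a toy, in-memory sketch. This is not the LakeSail API: `ToyLakehouse` and its methods are hypothetical, and a deep copy stands in for branching purely to show that production is untouched until commit. A real engine would presumably branch table metadata rather than copy data.

```python
import copy


class ToyLakehouse:
    """In-memory stand-in: table name -> list of rows."""

    def __init__(self, tables):
        self.tables = tables

    def branch(self):
        # Step 02: isolated sandbox; a deep copy keeps production untouched.
        return ToyLakehouse(copy.deepcopy(self.tables))

    def diff(self, branch):
        # Step 04: report tables whose contents differ on the branch.
        names = set(self.tables) | set(branch.tables)
        return {
            name: (self.tables.get(name), branch.tables.get(name))
            for name in sorted(names)
            if self.tables.get(name) != branch.tables.get(name)
        }

    def commit(self, branch):
        # Step 05: merge by adopting the branch state.
        self.tables = copy.deepcopy(branch.tables)


prod = ToyLakehouse({"events": [{"region": "US-East", "users": 10}]})
sandbox = prod.branch()
sandbox.tables["events"].append({"region": "APAC", "users": 4})  # Step 03: agent write

changes = prod.diff(sandbox)  # (production rows, branch rows) per changed table
approved = True  # stands in for the human review
if approved:
    prod.commit(sandbox)
# Discarding is simply dropping `sandbox` without calling commit().
```

The point of the sketch is the ordering: writes land on the branch, the diff is computed against production, and nothing reaches production without an explicit commit.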

Inside the agent layer →

06 / Deployment Options

From open source to fully managed.

Three ways to run Sail. OSS if you want to self-host, LakeSail Platform if you want managed, and Enterprise for organizations with procurement requirements. All three use the same engine.

Open Source

Sail OSS

The Rust engine, open source on GitHub under Apache 2.0. Self-host on your own infrastructure. No managed services, no support SLA, but full access to the engine.

Apache 2.0 license
Full Spark Connect compatibility
Community support via GitHub & Slack
No contracts, no commitments

Free. Star on GitHub →

Managed Platform

LakeSail Platform

Fully managed Sail, deployed into your AWS account (BYOC). We handle the infrastructure; you keep data sovereignty and security. No cluster management, no ops overhead.

Deploys into your AWS account (BYOC)
Managed upgrades, patches, monitoring
Autoscale to zero between jobs
Per-second billing, no minimums

View pricing →

Enterprise

For organizations with procurement, security review, and custom SLA requirements. Includes dedicated support, private deployment options, and enterprise SSO.

Everything in Platform, plus:
Dedicated support with SLA
Enterprise SSO (SAML, OIDC)
Custom licensing models

Talk to us →

Your Spark workloads.
A better engine.

Get a 30-minute demo and a benchmark of LakeSail against your existing Spark workloads.