Sail MCP Server: Spark Analytics for LLM Agents

The LakeSail Team
March 25, 2025

The 0.2.3 release of Sail includes a server for the Model Context Protocol (MCP)—an open protocol that standardizes how applications provide context to LLMs. The MCP server in Sail exposes tools that allow LLM agents to register datasets and execute Spark SQL queries in Sail.
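To make this concrete, here is a minimal sketch of what an agent-side call to these tools could look like, written against the MCP Python SDK. The tool names and parameters (create_parquet_view, describe_view, and execute_query) are taken from the sample chat in the appendix; the server address and the SSE transport are placeholder assumptions, so consult the documentation for how to actually connect to your Sail MCP server.

    import asyncio

    from mcp import ClientSession
    from mcp.client.sse import sse_client

    # Placeholder endpoint; the real address and transport depend on how
    # you deploy the Sail MCP server.
    SAIL_MCP_URL = "http://localhost:8000/sse"

    async def main() -> None:
        async with sse_client(SAIL_MCP_URL) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()

                # Register a Parquet dataset as a named view.
                await session.call_tool(
                    "create_parquet_view",
                    arguments={
                        "name": "openai_humaneval",
                        "path": "hf://datasets/openai/openai_humaneval@~parquet/openai_humaneval/test/",
                    },
                )

                # Run a Spark SQL query against the registered view.
                result = await session.call_tool(
                    "execute_query",
                    arguments={
                        "query": "SELECT COUNT(*) AS total_samples FROM openai_humaneval",
                        "limit": 1,
                    },
                )
                print(result.content)

    asyncio.run(main())

In practice you rarely write these calls by hand: an MCP-capable LLM agent discovers the tools and invokes them on your behalf, as the sample chat in the appendix shows.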

Sail is a drop-in replacement for Spark SQL and DataFrame APIs in both single-host and distributed settings. Written in Rust, Sail runs Spark workloads ~4x faster while cutting hardware costs by 94%. Our goal is to achieve full compatibility with Spark. You might have already migrated some of your Spark workloads to Sail—bringing lightning-fast execution to your data processing tasks with zero code changes. Thanks to the MCP server, Sail’s power now benefits AI agents as well.
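For readers who have not tried Sail yet: Sail speaks the Spark Connect protocol, so pointing an existing PySpark session at a Sail server is usually all it takes. A minimal sketch, with a placeholder address:

    from pyspark.sql import SparkSession

    # Placeholder address; replace it with the endpoint of your Sail server.
    spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

    spark.sql("SELECT 1 AS answer").show()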

Bridging Big Data and AI Tooling

With Sail, LLMs or other AI-driven tools can interact directly with the compute engine, issuing queries and retrieving insights as part of a natural dialogue. This represents a paradigm shift in data interaction—merging the power of AI with the precision of modern data processing.

By enabling LLMs to engage in interactive, context-aware conversations with data systems, Sail dismantles the traditional barriers of complex query languages and manual integrations. Instead of describing data processing via SQL or DataFrame APIs, you can talk to Sail in a more narrative style—for example, “Show me the total sales for the last quarter”, or “Compare transaction volumes between Region A and Region B”. Your LLM agents translate these natural-language commands into SQL queries and send the queries via MCP to Sail for execution.
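For instance, given the first request above, an agent might produce a query along these lines and submit it through the execute_query tool (see the client sketch above and the sample chat in the appendix); the sales table, its columns, and the quarter boundaries here are hypothetical and serve only to illustrate the flow:

    # Hypothetical SQL an agent could generate for
    # "Show me the total sales for the last quarter".
    query = """
        SELECT SUM(amount) AS total_sales
        FROM sales
        WHERE sale_date >= DATE '2024-10-01'
          AND sale_date < DATE '2025-01-01'
    """

    # The agent passes this string as the "query" argument of the
    # execute_query MCP tool; Sail executes it and returns the result.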

The result? Even complex analytical tasks feel like a simple back-and-forth conversation, removing barriers and enabling more intuitive interaction with data. This approach doesn’t just improve developer productivity; it also empowers business teams to quickly derive insights, accelerating decision-making and driving tangible value across the entire organization.

Filling the Agentic Gap for Spark

While Spark is a titan in the data ecosystem and was an incredible technological achievement when it was created over fifteen years ago, its complex architecture and reliance on large, distributed clusters can make interactive, AI-driven conversations feel sluggish—and expensive. By integrating MCP with Sail (similar to existing MCP servers for DuckDB, PostgreSQL, etc.), your AI-generated queries can run at unparalleled speed for a fraction of the cost.

Beyond the usual compute workloads, the MCP server makes it straightforward for any AI agent to query massive datasets and receive live responses, all in a natural, chat-like workflow. That means developers can plug in their favorite LLM agents, ask a question in plain English (or their language of choice), and watch Sail sift through vast amounts of data with unmatched speed. No obscure syntax, no clunky setup—just seamless, interactive analytics, fronted by an AI agent that feels almost human. We see this as an opportunity to help advance MCP in the Big Data space, offering an on-ramp for users who want to apply the full power of AI to massive, real-world datasets in a timely and cost-efficient manner.

Sample Chat

To give you a sense of how the MCP server in Sail works in practice, we present a sample chat with Claude in the appendix. In this chat, we asked Claude to analyze a Hugging Face dataset for us. Claude correctly understood the intent, generated and executed several SQL queries via MCP, and summarized the results accurately. It even adjusted its SQL queries when the Sail server returned errors due to unsupported SQL syntax or functions!

In fact, this is the very first chat we had with Claude after implementing the MCP server. Success on the first try is a strong signal that the MCP integration for Sail just works out of the box.

Where We Are Headed

What excites us most is how these innovations bring us closer to our mission: to unify batch processing, stream processing, and compute-intensive AI workloads, empowering organizations to tackle modern data challenges with unprecedented speed, efficiency, and cost-effectiveness. By integrating diverse workloads into a single framework, we provide the flexibility and scalability needed to drive innovation and meet the ever-evolving demands of the AI revolution.

Sail’s progress continues to accelerate. Besides supporting MCP, the 0.2.3 release adds features such as Hugging Face dataset support and glob pattern support in data source URLs. These additions make it easier to bring your data to Sail. Out of 3,839 tests mined from the Spark codebase, Sail now passes 3,075 (80.1%). While we acknowledge some gaps in full Spark parity, these tests cover the bulk of commonly used Spark functionality, making migration to Sail smoother than ever.
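As a rough sketch of what these two additions look like from the client side (the hf:// path is the one used in the sample chat below; the globbed S3 path and the Sail server address are placeholders):

    from pyspark.sql import SparkSession

    # Placeholder address for a running Sail server.
    spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

    # Read a Hugging Face dataset directly from its hf:// URL.
    humaneval = spark.read.parquet(
        "hf://datasets/openai/openai_humaneval@~parquet/openai_humaneval/test/"
    )

    # Use a glob pattern in the data source URL to read many files at once
    # (bucket name and layout are made up for illustration).
    events = spark.read.parquet("s3://my-bucket/events/2025-*/*.parquet")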

To get started with the MCP server in Sail, you can visit the documentation site.

If you’re searching for a faster and more cost-effective alternative to Spark, get in touch with our team. We can help you experience firsthand how Sail, now with MCP integration, can transform your AI and data workflows.

Appendix: Sample Chat with Claude

Here we show an unmodified sample of the Claude chat history. We keep the Markdown styling of the conversation as is. The only adjustment we made is to the formatting of each tool-use section, so that it displays nicely on this page.

I'd like to analyze a Parquet dataset located at hf://datasets/openai/openai_humaneval@~parquet/openai_humaneval/test/. Could you explore what this dataset is about?

I'll help you analyze the Parquet dataset from the path you provided. Let's start by loading the dataset and exploring its structure.

View result from create_parquet_view
  • Input
    name: openai_humaneval
    path: hf://datasets/openai/openai_humaneval@~parquet/openai_humaneval/test/
  • Output
    {}

Now that I've created a view for the dataset, let's examine its schema to understand the structure:

View result from describe_view
  • Input
    name: openai_humaneval
  • Output
    {"name": "openai_humaneval", "description": null, "columns": [{"name": "task_id", "description": null, "dataType": "string", "nullable": true}, {"name": "prompt", "description": null, "dataType": "string", "nullable": true}, {"name": "canonical_solution", "description": null, "dataType": "string", "nullable": true}, {"name": "test", "description": null, "dataType": "string", "nullable": true}, {"name": "entry_point", "description": null, "dataType": "string", "nullable": true}]}

Now let's look at a few rows from the dataset to better understand its content:

View result from execute_query
  • Input
    limit: 5
    query: SELECT * FROM openai_humaneval LIMIT 5
  • Output
    [{"task_id":"HumanEval\/0","prompt":"from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n","canonical_solution":" for idx, elem in enumerate(numbers):\n for idx2, elem2 in enumerate(numbers):\n if idx != idx2:\n distance = abs(elem - elem2)\n if distance < threshold:\n return True\n\n return False\n","test":"\n\nMETADATA = {\n 'author': 'jt',\n 'dataset': 'test'\n}\n\n\ndef check(candidate):\n assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n\n","entry_point":"has_close_elements"},{"task_id":"HumanEval\/1","prompt":"from typing import List\n\n\ndef separate_paren_groups(paren_string: str) -> List[str]:\n \"\"\" Input to this function is a string containing multiple groups of nested parentheses. Your goal is to\n separate those group into separate strings and return the list of those.\n Separate groups are balanced (each open brace is properly closed) and not nested within each other\n Ignore any spaces in the input string.\n >>> separate_paren_groups('( ) (( )) (( )( ))')\n ['()', '(())', '(()())']\n \"\"\"\n","canonical_solution":" result = []\n current_string = []\n current_depth = 0\n\n for c in paren_string:\n if c == '(':\n current_depth += 1\n current_string.append(c)\n elif c == ')':\n current_depth -= 1\n current_string.append(c)\n\n if current_depth == 0:\n result.append(''.join(current_string))\n current_string.clear()\n\n return result\n","test":"\n\nMETADATA = {\n 'author': 'jt',\n 'dataset': 'test'\n}\n\n\ndef check(candidate):\n assert candidate('(()()) ((())) () ((())()())') == [\n '(()())', '((()))', '()', '((())()())'\n ]\n assert candidate('() (()) ((())) (((())))') == [\n '()', '(())', '((()))', '(((())))'\n ]\n assert candidate('(()(())((())))') == [\n '(()(())((())))'\n ]\n assert candidate('( ) (( )) (( )( ))') == ['()', '(())', '(()())']\n","entry_point":"separate_paren_groups"},{"task_id":"HumanEval\/2","prompt":"\n\ndef truncate_number(number: float) -> float:\n \"\"\" Given a positive floating point number, it can be decomposed into\n and integer part (largest integer smaller than given number) and decimals\n (leftover part always smaller than 1).\n\n Return the decimal part of the number.\n >>> truncate_number(3.5)\n 0.5\n \"\"\"\n","canonical_solution":" return number % 1.0\n","test":"\n\nMETADATA = {\n 'author': 'jt',\n 'dataset': 'test'\n}\n\n\ndef check(candidate):\n assert candidate(3.5) == 0.5\n assert abs(candidate(1.33) - 0.33) < 1e-6\n assert abs(candidate(123.456) - 0.456) < 1e-6\n","entry_point":"truncate_number"},{"task_id":"HumanEval\/3","prompt":"from typing import List\n\n\ndef below_zero(operations: List[int]) -> bool:\n \"\"\" You're given a list of deposit and withdrawal operations on a bank account that starts with\n zero balance. Your task is to detect if at any point the balance of account fallls below zero, and\n at that point function should return True. 
Otherwise it should return False.\n >>> below_zero([1, 2, 3])\n False\n >>> below_zero([1, 2, -4, 5])\n True\n \"\"\"\n","canonical_solution":" balance = 0\n\n for op in operations:\n balance += op\n if balance < 0:\n return True\n\n return False\n","test":"\n\nMETADATA = {\n 'author': 'jt',\n 'dataset': 'test'\n}\n\n\ndef check(candidate):\n assert candidate([]) == False\n assert candidate([1, 2, -3, 1, 2, -3]) == False\n assert candidate([1, 2, -4, 5, 6]) == True\n assert candidate([1, -1, 2, -2, 5, -5, 4, -4]) == False\n assert candidate([1, -1, 2, -2, 5, -5, 4, -5]) == True\n assert candidate([1, -2, 2, -2, 5, -5, 4, -4]) == True\n","entry_point":"below_zero"},{"task_id":"HumanEval\/4","prompt":"from typing import List\n\n\ndef mean_absolute_deviation(numbers: List[float]) -> float:\n \"\"\" For a given list of input numbers, calculate Mean Absolute Deviation\n around the mean of this dataset.\n Mean Absolute Deviation is the average absolute difference between each\n element and a centerpoint (mean in this case):\n MAD = average | x - x_mean |\n >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])\n 1.0\n \"\"\"\n","canonical_solution":" mean = sum(numbers) \/ len(numbers)\n return sum(abs(x - mean) for x in numbers) \/ len(numbers)\n","test":"\n\nMETADATA = {\n 'author': 'jt',\n 'dataset': 'test'\n}\n\n\ndef check(candidate):\n assert abs(candidate([1.0, 2.0, 3.0]) - 2.0\/3.0) < 1e-6\n assert abs(candidate([1.0, 2.0, 3.0, 4.0]) - 1.0) < 1e-6\n assert abs(candidate([1.0, 2.0, 3.0, 4.0, 5.0]) - 6.0\/5.0) < 1e-6\n\n","entry_point":"mean_absolute_deviation"}]

Let's also check how many samples are in this dataset:

View result from execute_query
  • Input
    limit: 1
    query: SELECT COUNT(*) AS total_samples FROM openai_humaneval
  • Output
    [{"total_samples":164}]

Based on my analysis, this dataset is the OpenAI HumanEval benchmark, which is a collection of programming problems designed to evaluate code generation capabilities. Here's what I've found:

  1. Dataset Content: The dataset contains 164 programming problems, each with:

    • task_id: A unique identifier for each problem (e.g., "HumanEval/0")
    • prompt: The problem description and function signature in Python
    • canonical_solution: The reference solution code
    • test: Test cases to verify the correctness of solutions
    • entry_point: The name of the function to be implemented
  2. Purpose: This dataset is specifically designed to evaluate the ability of AI systems to generate functional code that correctly solves programming problems. It's widely used as a benchmark for code generation capabilities.

  3. Structure: Each problem includes:

    • A function signature with type hints
    • A docstring explaining the problem, often with examples
    • Test cases to verify correctness
  4. Problem Types: From the samples I've seen, the problems cover various programming concepts like:

    • List manipulation
    • String processing
    • Mathematical operations
    • Logic and control flow

This dataset is valuable for evaluating and comparing code generation models, as it provides a standardized set of problems with well-defined inputs, outputs, and test cases. Would you like me to analyze any specific aspects of this dataset in more detail?

Could you share some statistics for the prompt length?

I'll analyze the prompt lengths in the dataset and provide you with some statistics. Let's run a query to calculate this:

View result from execute_query
  • Input
    limit: 1
    query:
      SELECT
        MIN(LENGTH(prompt)) AS min_length,
        MAX(LENGTH(prompt)) AS max_length,
        AVG(LENGTH(prompt)) AS avg_length,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY LENGTH(prompt)) AS median_length,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY LENGTH(prompt)) AS q1_length,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY LENGTH(prompt)) AS q3_length,
        STDDEV(LENGTH(prompt)) AS std_dev_length
      FROM openai_humaneval
  • Output
    Error executing tool execute_query: unknown function: PERCENTILE_CONT

Let me adjust the query to use functions that are available:

View result from execute_query
  • Input
    limit: 1
    query:
      SELECT
        MIN(LENGTH(prompt)) AS min_length,
        MAX(LENGTH(prompt)) AS max_length,
        AVG(LENGTH(prompt)) AS avg_length,
        STDDEV(LENGTH(prompt)) AS std_dev_length
      FROM openai_humaneval
  • Output
    [{"min_length":115,"max_length":1360,"avg_length":450.5975609756,"std_dev_length":230.3037663159}]

Now, let's get a better understanding of the distribution by looking at the length frequency:

View result from execute_query
  • Input
    limit: 10
    query:
      SELECT
        FLOOR(LENGTH(prompt) / 100) * 100 AS length_bucket,
        COUNT(*) AS count
      FROM openai_humaneval
      GROUP BY length_bucket
      ORDER BY length_bucket
  • Output
    Error executing tool execute_query: cannot resolve attribute: ObjectName([Identifier("length_bucket")])

Let me adjust the query:

View result from execute_query
  • Input
    limit: 10
    query:
      SELECT
        CAST(FLOOR(LENGTH(prompt) / 100) * 100 AS INTEGER) AS length_bucket,
        COUNT(*) AS count
      FROM openai_humaneval
      GROUP BY CAST(FLOOR(LENGTH(prompt) / 100) * 100 AS INTEGER)
      ORDER BY length_bucket
  • Output
    [{"length_bucket":100,"count":11},{"length_bucket":200,"count":34},{"length_bucket":300,"count":38},{"length_bucket":400,"count":34},{"length_bucket":500,"count":13},{"length_bucket":600,"count":12},{"length_bucket":700,"count":8},{"length_bucket":800,"count":5},{"length_bucket":900,"count":3},{"length_bucket":1000,"count":3}]

Let's also get some more specific percentiles by ordering and selecting specific rows:

View result from execute_query
  • Input
    limit: 164
    query:
      SELECT LENGTH(prompt) AS prompt_length
      FROM openai_humaneval
      ORDER BY prompt_length
  • Output
    [{"prompt_length":115},{"prompt_length":130},{"prompt_length":133},{"prompt_length":138},{"prompt_length":149},{"prompt_length":164},{"prompt_length":167},{"prompt_length":168},{"prompt_length":172},{"prompt_length":179},{"prompt_length":185},{"prompt_length":206},{"prompt_length":210},{"prompt_length":212},{"prompt_length":213},{"prompt_length":213},{"prompt_length":217},{"prompt_length":219},{"prompt_length":224},{"prompt_length":224},{"prompt_length":225},{"prompt_length":228},{"prompt_length":230},{"prompt_length":231},{"prompt_length":234},{"prompt_length":234},{"prompt_length":236},{"prompt_length":242},{"prompt_length":243},{"prompt_length":243},{"prompt_length":244},{"prompt_length":259},{"prompt_length":260},{"prompt_length":262},{"prompt_length":264},{"prompt_length":267},{"prompt_length":269},{"prompt_length":269},{"prompt_length":285},{"prompt_length":286},{"prompt_length":287},{"prompt_length":288},{"prompt_length":293},{"prompt_length":295},{"prompt_length":296},{"prompt_length":300},{"prompt_length":309},{"prompt_length":312},{"prompt_length":313},{"prompt_length":316},{"prompt_length":325},{"prompt_length":330},{"prompt_length":331},{"prompt_length":332},{"prompt_length":333},{"prompt_length":338},{"prompt_length":345},{"prompt_length":346},{"prompt_length":348},{"prompt_length":348},{"prompt_length":348},{"prompt_length":350},{"prompt_length":359},{"prompt_length":360},{"prompt_length":361},{"prompt_length":361},{"prompt_length":363},{"prompt_length":367},{"prompt_length":372},{"prompt_length":372},{"prompt_length":376},{"prompt_length":382},{"prompt_length":382},{"prompt_length":383},{"prompt_length":383},{"prompt_length":385},{"prompt_length":386},{"prompt_length":387},{"prompt_length":387},{"prompt_length":392},{"prompt_length":394},{"prompt_length":396},{"prompt_length":396},{"prompt_length":400},{"prompt_length":414},{"prompt_length":415},{"prompt_length":419},{"prompt_length":425},{"prompt_length":426},{"prompt_length":426},{"prompt_length":427},{"prompt_length":430},{"prompt_length":431},{"prompt_length":435},{"prompt_length":435},{"prompt_length":436},{"prompt_length":440},{"prompt_length":441},{"prompt_length":448},{"prompt_length":449},{"prompt_length":451},{"prompt_length":453},{"prompt_length":454},{"prompt_length":462},{"prompt_length":462},{"prompt_length":465},{"prompt_length":465},{"prompt_length":467},{"prompt_length":472},{"prompt_length":472},{"prompt_length":475},{"prompt_length":476},{"prompt_length":480},{"prompt_length":484},{"prompt_length":490},{"prompt_length":491},{"prompt_length":499},{"prompt_length":506},{"prompt_length":520},{"prompt_length":533},{"prompt_length":542},{"prompt_length":560},{"prompt_length":567},{"prompt_length":571},{"prompt_length":576},{"prompt_length":578},{"prompt_length":580},{"prompt_length":586},{"prompt_length":590},{"prompt_length":598},{"prompt_length":608},{"prompt_length":609},{"prompt_length":613},{"prompt_length":613},{"prompt_length":613},{"prompt_length":636},{"prompt_length":641},{"prompt_length":647},{"prompt_length":660},{"prompt_length":672},{"prompt_length":673},{"prompt_length":676},{"prompt_length":702},{"prompt_length":723},{"prompt_length":754},{"prompt_length":759},{"prompt_length":766},{"prompt_length":775},{"prompt_length":776},{"prompt_length":794},{"prompt_length":805},{"prompt_length":809},{"prompt_length":812},{"prompt_length":843},{"prompt_length":851},{"prompt_length":926},{"prompt_length":927},{"prompt_length":987},{"prompt_length":1035},{"prompt_length":1050},{"prompt_length":1053},{"pro
mpt_length":1167},{"prompt_length":1265},{"prompt_length":1360}]

Based on the analysis of prompt lengths in the OpenAI HumanEval dataset, here are the key statistics:

Basic Statistics

  • Minimum length: 115 characters
  • Maximum length: 1,360 characters
  • Average length: 450.6 characters
  • Standard deviation: 230.3 characters

Distribution of Prompt Lengths

  • 100-199 characters: 11 prompts (6.7%)
  • 200-299 characters: 34 prompts (20.7%)
  • 300-399 characters: 38 prompts (23.2%)
  • 400-499 characters: 34 prompts (20.7%)
  • 500-599 characters: 13 prompts (7.9%)
  • 600-699 characters: 12 prompts (7.3%)
  • 700-799 characters: 8 prompts (4.9%)
  • 800-899 characters: 5 prompts (3.0%)
  • 900-999 characters: 3 prompts (1.8%)
  • 1000+ characters: 6 prompts (3.7%)

Percentiles (estimated from ordered data)

  • 25th percentile (Q1): ~260 characters
  • 50th percentile (median): ~385 characters
  • 75th percentile (Q3): ~600 characters

Insights

  1. The majority of prompts (64.6%) are between 200-499 characters in length.
  2. The distribution is right-skewed, with a few very long prompts extending beyond 1,000 characters.
  3. There's considerable variation in prompt lengths (as indicated by the standard deviation), suggesting varying complexity of the programming problems.
  4. The median prompt length (~385 characters) is less than the mean (450.6), further confirming the right-skewed distribution.

These statistics suggest that the HumanEval dataset contains a mix of concise and more detailed programming problems, with most problems being of moderate length and complexity.
