Writing a Rust SQL Parser in One Week

The LakeSail Team
March 6, 2025

OK, there is a bit of exaggeration in the title. A production-ready SQL parser definitely takes more than a week to polish. But this post tells the story of why Sail decided to switch to an in-house solution for SQL parsing, and how we built the foundation of our SQL parser in a single week.

Sail’s mission is bold: to seamlessly unify batch processing, stream processing, and AI workloads under one powerful framework. As a drop-in replacement for Apache Spark in both single-host and distributed environments, Sail aims to deliver complete coverage of all Spark SQL and PySpark DataFrame capabilities.

Written in Rust, Sail leverages the language’s strengths that naturally align with the demands of compute frameworks. Rust eliminates whole categories of memory and concurrency bugs while delivering system-level performance thanks to its language design. This foundation positions Sail effectively for the ever-evolving landscape of Big Data and AI compute.

TL;DR

  1. The 0.2.2 version of Sail comes with its own SQL parser built with Rust procedural macros and the parser combinator library chumsky. We define the SQL grammar in Rust, and the parser code is derived automatically.
  2. With this SQL parser design, we have implemented the majority of Spark SQL syntax. For the 1,373 Spark SQL parsing tests we have, the test success rate has increased from 79.6% to 94.5%, compared with the third-party solution we used before.
  3. Thanks to comprehensive SQL syntax support and various features we have implemented in the past two months, for the first time, Sail supports over 80% of the ~3,800 tests we have mined from the Spark codebase.

The History of SQL Parsing in Sail

To parse SQL from text to an abstract syntax tree (AST) in Rust, Sail has been using the sqlparser-rs library, the go-to choice for data systems written in Rust. sqlparser-rs aims to understand all SQL dialects and is maintained by the community behind Apache DataFusion—the library that powers Sail’s query execution.

Our experience with sqlparser-rs was mostly satisfactory. sqlparser-rs did the heavy lifting of SQL syntactic analysis, which allowed us to spend all our time on SQL semantic analysis and execution planning. This was crucial for Sail in its infancy, since we needed to deliver a complete solution fast enough to show the feasibility of our mission.

Over time, however, we encountered a few limitations of the sqlparser-rs library. Apache Spark, due to its historical connection to Hive, supports many Hive SQL syntax elements not seen in other data systems. To accommodate this, we maintained our own fork of sqlparser-rs, with the hope of contributing our patches upstream over time.

Another inconvenience is that sqlparser-rs covers SQL for both transactional and analytical databases. Many SQL constructs for transactional databases are irrelevant for Sail. We have to skip those nodes as we traverse the SQL AST generated by sqlparser-rs. This adds noise and maintenance burden to our codebase.

The drawbacks mentioned above are acceptable given the benefits of sqlparser-rs. We thought we would stay with sqlparser-rs longer, since developing our own SQL parser is a huge undertaking. However, we changed our minds as we evaluated more requirements around SQL syntactic analysis. Sail does not have SQL documentation yet, and keeping hand-written documentation consistent with our code would be unsustainable. We may also want to generate SQL statements from the grammar for fuzz testing. These tasks require a SQL grammar that matches, and only matches, the SQL syntax supported by Sail (which is the full Spark SQL syntax, potentially with our own extensions). The sqlparser-rs AST data structure does not fulfill such a requirement, since there is no obvious way to carve out a subset of it for our needs.

Parser Combinators: A Promising First Experience

We explored numerous SQL parsing alternatives. The first few solutions we evaluated were parser generators. This is the approach taken by Apache Spark and Apache Flink (the latter using the Apache Calcite library internally), which leverage the well-known ANTLR Java library. There are a few Rust libraries, such as pest, lalrpop, and peg, but we felt that maintaining a separate grammar definition in the library’s DSL (domain-specific language) would create barriers for Sail’s contributors. Such DSLs lack IDE support, and writing the grammar may require compiler knowledge (e.g., left-recursion elimination).

sqlparser-rs implements its SQL parser by hand. This results in a performant recursive descent parser that is easy to understand and can elegantly handle operator precedence in expression parsing using techniques such as Pratt parsing. However, if we took the same approach, it would take a long time to catch up with the years of development effort in sqlparser-rs.

We then explored another category of solutions, which turned out to be an exciting discovery. There are Rust libraries built around the idea of parser combinators. Such libraries offer building blocks for writing parsers manually and provide utilities to combine smaller parsers into more complex ones that can handle more intricate constructs. The top choice is nom, but it seems to focus more on parsing binary formats than on parsing programming languages. There is a successor project, winnow, that aims to address language parsing concerns and is worth watching for the future.

We ultimately chose chumsky, a parser combinator library that offers a well-designed API with a focus on rich error reporting. Our experience with chumsky started off smoothly. We were able to implement the SQL lexer (tokenizer) in about 300 lines of code. In contrast, the tokenizer in sqlparser-rs has about 1,300 lines of code. One optimization we made was using &'a str (a string reference with a lifetime 'a that matches the lifetime of the input SQL string) in our token data structure. This ensures no memory allocation for strings as we tokenize the SQL string into a list of tokens.
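To make the zero-allocation idea concrete, here is a minimal sketch of what such a borrowed token type could look like. The variants and their names are hypothetical, not Sail’s actual token definitions; the point is that every textual payload borrows from the input SQL string.

rust
/// A hypothetical token type for illustration only; Sail’s actual tokens differ.
/// Every textual payload is a `&'a str` slice borrowed from the input SQL string,
/// so tokenizing never copies string data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token<'a> {
    /// A keyword or identifier, e.g. `SELECT` or `my_table`.
    Word(&'a str),
    /// A numeric literal, kept as raw text until semantic analysis.
    Number(&'a str),
    /// A string literal, including its surrounding quotes.
    String(&'a str),
    /// A punctuation character such as `(`, `,`, or `;`.
    Punctuation(char),
}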

Rust Procedural Macros: Reliable Code Generation without AI

The Manual Approach

We eagerly set out to define the AST data structure for Sail’s SQL parser. We wanted to create a high-fidelity AST that fully matches the SQL grammar, allowing us to generate SQL documentation from the AST data structure definition. Here is a toy example that defines nested lists, where a list is either a number or a bracketed sequence of nested lists. For those familiar with compiler theory, this resembles a context-free grammar (CFG) definition in Backus-Naur Form (BNF), but written using Rust types.

rust
/// A nested list, such as `1`, `[1 2]`, or `[1 [2 3]]`.
pub enum List {
    Value(Number),
    Nested(LeftBracket, Vec<List>, RightBracket),
}

We define distinct types for the number literal, each operator (LeftBracket and RightBracket here), and each keyword (although the example here does not use keywords). Notably, we do not collect all the keywords into a single enum. Instead, each keyword has its own struct type, allowing the Rust type system to ensure the validity of the AST at compile time.
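For the toy example, the terminal types can be as simple as the following sketch (these exact definitions are ours, for illustration only):

rust
/// A number literal terminal. The payload type here is illustrative.
pub struct Number(pub u64);

/// The `[` operator as its own unit struct.
pub struct LeftBracket;

/// The `]` operator as its own unit struct.
pub struct RightBracket;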

We also define a trait, TreeParser, for all the types involved in the AST, ensuring that each type implements a parser method that returns a chumsky::Parser implementation. In Rust, a trait defines a common interface that implementing types must support.

rust
pub trait TreeParser<'a, A = ()>: Sized {
    // `args` is an opaque argument required to create the parser.
    // The actual type of `args` varies depending on the type of `Self`.
    // By default, the type of `args` is `()`, which means nothing is needed
    // for creating the parser.
    fn parser(args: A) -> impl Parser<'a, &'a [Token<'a>], Self>;
}

Suppose that the TreeParser trait has been implemented for Number, LeftBracket, and RightBracket; we can then implement the trait for List.

rust
impl<'a, P> TreeParser<'a, P> for List
where
    P: Parser<'a, &'a [Token<'a>], List>,
{
    fn parser(itself: P) -> impl Parser<'a, &'a [Token<'a>], List> {
        choice((
            Number::parser(()).map(List::Value),
            LeftBracket::parser(())
                .then(itself.repeated().collect())
                .then(RightBracket::parser(()))
                .map(|((left, items), right)| List::Nested(left, items, right)),
        ))
    }
}

And we can parse the List AST using the following function.

rust
// `ParserResult` is assumed to be an alias for the parser’s result type.
fn parse_list<'a>(tokens: &'a [Token<'a>]) -> ParserResult<List> {
    let mut list = Recursive::declare();

    list.define(List::parser(list.clone()));

    list.parse(tokens)
}

The Taxonomy of SQL Grammar

Now, the real heavy lifting is to implement the TreeParser trait (the parser method) for all the types in the AST. We soon realized that the SQL grammar falls into four categories, whose parsers range from the easiest to the hardest to implement.

  1. There are parts of the SQL grammar that correspond to terminals in the CFG, which match one or a few consecutive tokens. Examples include identifiers, keywords, literals, and operators.
  2. There is a large body of SQL grammar that can be defined as a sequence of patterns (e.g., CREATE TABLE <name>), or choices among patterns (e.g., CREATE TABLE or CREATE VIEW). Most SQL DDL statements belong to this category. In compiler theory, such patterns define a regular grammar, a restricted form of context-free grammar with no nested, self-referential structure. An analogy is that every regular expression defines a regular grammar over strings. For such a grammar, you do not actually need a recursive descent parser.
  3. There are parts of the SQL grammar that are defined in terms of themselves. For example, a data type can contain nested data types for ARRAY or STRUCT. As another example, a query depends on expressions in the SELECT clause, while expressions such as <value> IN (<subquery>) contain a nested query.
  4. The only parts of the SQL grammar whose parsing logic does not directly mirror the AST structure are SQL expressions and set operations in queries (e.g., UNION and INTERSECT). Their grammar is left-recursive, and we need to account for operator precedence during parsing.

For the first category, the parsers can be generated via Rust macro_rules!. For the last category, despite its difficulty, chumsky comes with built-in support—the parser can be implemented using the pratt() method by specifying a precedence and associativity table for all prefix, infix, and postfix operators. Our focus for the SQL parser is on the second and third categories. Can we handle them intelligently?
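As a taste of the first category, here is a minimal, hypothetical sketch of how keyword types could be stamped out with macro_rules!. The keyword names and the TEXT constant are our own illustration; a real macro would also generate the TreeParser implementation for each keyword, which is omitted here.

rust
// A hypothetical sketch, not Sail’s actual macro: it defines a unit struct per
// keyword plus the SQL text it matches. A real macro would also generate the
// `TreeParser` implementation for each keyword, which is omitted here.
macro_rules! keyword {
    ($name:ident, $text:literal) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub struct $name;

        impl $name {
            /// The SQL text matched by this keyword (matching would typically be case-insensitive).
            pub const TEXT: &'static str = $text;
        }
    };
}

keyword!(Create, "CREATE");
keyword!(Table, "TABLE");
keyword!(View, "VIEW");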

The Magic of Procedural Macros

In fact, the second and third parts of the grammar share some similarities. The grammar can be written using Rust enum or struct types, regardless of whether the grammar is recursive. As you can see, the parser code shown above matches exactly the AST data structure. This important observation led to one critical step in our SQL parser journey: we implemented parser code generation using Rust procedural macros. In Rust, procedural macros allow you to write a Rust function that takes Rust source code as input and generates more Rust code at compile time. We are going to omit the details of our SQL parser procedural macro implementation (only about 400 lines of code), but you can see what it looks like in use.

rust
// Invokes the `TreeParser` procedural macro to generate the SQL parser code.
#[derive(TreeParser)]
// The dependency can be a single type or a tuple of types.
// The `TreeParser::parser` method now accepts *declared* parsers as the input,
// one for each type in the dependency. The `TreeParser::parser` method then
// returns the *defined* parser for the type.
// For regular grammars, the `#[parser]` attribute is not needed.
#[parser(dependency = "List")]
pub enum List {
    Value(Number),
    Nested(
        LeftBracket,
        // Overrides the parser function since this field involves recursion.
        #[parser(function = |itself| itself.repeated().collect())]
        Vec<List>,
        RightBracket,
    ),
}

With such an annotated AST data structure, the TreeParser trait implementation is generated automatically!
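For the second category (regular grammars), the same derive works without any #[parser] attribute. The statement below is a made-up illustration rather than Sail’s actual AST; CreateStatement and ObjectName are hypothetical names. The point is that a plain enum of keyword and name sequences is all the macro needs to derive the parser.

rust
// A made-up illustration, not Sail’s actual AST: a non-recursive statement
// expressed as choices among sequences of keyword and name types.
// No `#[parser]` attribute is needed because the grammar is regular.
#[derive(TreeParser)]
pub enum CreateStatement {
    Table(Create, Table, ObjectName),
    View(Create, View, ObjectName),
}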

One could imagine an alternative in which we define the SQL grammar in Rust AST data structures and let AI generate the parser code after feeding it a few examples. This would certainly produce code easily, but it would be hard to verify its correctness and challenging to evolve in the long term. Rust procedural macros are a more reliable approach for now. Interestingly, AI did help to some extent: AI coding tools assisted our procedural macro development.

Summary

This is the whole story of Sail’s SQL parser! We have developed the SQL parser using parser combinators and Rust procedural macros, and the prototype took us about a week. Such an approach can be extended to other DSL parsers for Big Data or AI. Moreover, the procedural macro could even be generalized to power a parser generator library, with the grammar defined fully in Rust. This is probably out of scope for Sail, so we decided that mixing #[derive(TreeParser)] with manual impl TreeParser strikes a good balance for our needs.

The parser combinator approach does have some limitations. First, the combined parser can end up with a Very<Very<Very<Complex<Type>>>> in Rust, which makes compiler error messages difficult to decipher when you make a mistake in your code, and can also slow down compilation. The longest type name we saw was over 1 million characters long, before we optimized our implementation to speed up compilation.

Further, the SQL parser may be slower than a hand-written one, though we have yet to confirm this with benchmarks. Fortunately, SQL parsing speed is not our top concern, and the tradeoff allows us to do more with the SQL grammar. The AST data structure, along with the procedural macro idea, can be used to generate data structures suitable for rendering SQL documentation, generating fuzz test cases, and more!

With this SQL parser implementation, we take one critical step forward towards Sail’s mission to unify batch processing, stream processing, and AI workloads. Among the 1,373 Spark SQL parsing tests (a subset of the 3,808 tests discussed below), 1,298 (94.5%) now pass, a significant increase from the 1,093 (79.6%) tests supported in the previous version. The remaining 75 (5.5%) failing tests are either cases for undocumented SQL statements (e.g., CREATE INDEX), or negative cases with expected semantic errors—cases for which Sail would defer error reporting to later stages of query processing.

The new SQL parser has been released in Sail 0.2.2, along with various improvements we made in the past two months. Among the 3,808 tests we have mined from the Spark codebase, 3,057 (80.3%) are now successful on Sail. All 22 queries in the derived TPC-H benchmark and all 99 queries in the derived TPC-DS benchmark are also successful on Sail.

Despite the remaining gap to full Spark parity, many companies can already migrate their workloads with ease, as this coverage includes the vast majority of commonly used Spark features. And our commitment to full Spark parity ensures that even the most complex and unique demands are met over time. Try out Sail today and join the Sail community on GitHub or Slack!

Get in Touch

LakeSail offers flexible enterprise support options, including managing Sail on Kubernetes. To discover how Sail can enhance your data workflows and significantly reduce costs, get in touch with us!

LakeSail, Inc. © 2025