Notes on the SQLite DuckDB paper
1st September 2022
SQLite: Past, Present, and Future is a newly published paper authored by Kevin P. Gaffney, Martin Prammer and Jignesh M. Patel from the University of Wisconsin-Madison and D. Richard Hipp, Larry Brasfield and Dan Kennedy from the core SQLite engineering team.
The paper compares SQLite and DuckDB, and describes some optimization work to make SQLite perform better for analytical queries.
DuckDB is a relatively new project that is frequently nicknamed (including by this paper) “SQLite for analytics”. It shares many of the characteristics of SQLite: an embedded database implemented as a header file and implementation file (using C++ as opposed to SQLite’s C) with databases that are themselves single binary files.
The primary difference is that DuckDB is optimized for analytical queries: queries that apply aggregate calculations across large numbers of rows, rather than being optimized for fast scanning and lookup of individual rows of data.
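To make that distinction concrete, here’s the difference in query shape, sketched in Python against a hypothetical `orders` table (the table and column names are mine, not from the paper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 10.0), ("south", 25.0), ("north", 7.5)],
)

# OLTP-style query: fetch a single row by its key.
# SQLite's row-oriented B-tree storage makes this very fast.
print(conn.execute("SELECT * FROM orders WHERE id = ?", (2,)).fetchone())

# OLAP-style query: aggregate across every row in the table.
# This is the shape of workload a columnar, vectorized engine
# like DuckDB is built for.
print(conn.execute(
    "SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region"
).fetchall())
```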
I’ve been tracking DuckDB for a while—it’s a very exciting project. If I ever introduce alternative database backends to Datasette it’s top of my list of things to try.
The paper spends quite a bit of time on benchmark comparisons. To very loosely summarize those:
- SQLite outperforms DuckDB on a write transactions benchmark by 10x-500x on a powerful cloud server and 2x-60x on a Raspberry Pi, across small to large database sizes.
- For analytical benchmarks using SSB (the Star Schema Benchmark), DuckDB outperforms SQLite by 30-50x at the highest margin and 3-8x at the lowest.
Neither of these results is particularly surprising: DuckDB isn’t optimized for write transactions and SQLite isn’t optimized for analytical queries.
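The two directions are easy to observe informally. Here’s a tiny, unscientific micro-benchmark sketch in Python—this is not the paper’s methodology (they used SSB and a dedicated write-transaction workload), and it assumes you’ve installed the `duckdb` package (`pip install duckdb`):

```python
import sqlite3
import time

import duckdb

rows = [(i, i % 100, float(i % 1000)) for i in range(500_000)]

# Load the same data into an in-memory database in each engine.
# isolation_level=None puts sqlite3 in autocommit mode, so each
# later INSERT below is its own transaction.
sq = sqlite3.connect(":memory:", isolation_level=None)
sq.execute("CREATE TABLE t (id INTEGER, grp INTEGER, val REAL)")
sq.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

dk = duckdb.connect()  # in-memory DuckDB database
dk.execute("CREATE TABLE t (id BIGINT, grp BIGINT, val DOUBLE)")
dk.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.4f}s")

# Analytical shape: a full-table aggregate (DuckDB's home turf).
QUERY = "SELECT grp, SUM(val), AVG(val) FROM t GROUP BY grp"
time_it("SQLite aggregate", lambda: sq.execute(QUERY).fetchall())
time_it("DuckDB aggregate", lambda: dk.execute(QUERY).fetchall())

# Transactional shape: many small single-row writes (SQLite's home turf).
time_it("SQLite writes", lambda: [
    sq.execute("INSERT INTO t VALUES (?, ?, ?)", (0, 0, 0.0)) for _ in range(1000)
])
time_it("DuckDB writes", lambda: [
    dk.execute("INSERT INTO t VALUES (?, ?, ?)", (0, 0, 0.0)) for _ in range(1000)
])
```

Don’t expect the magnitudes to match the paper’s numbers, but the directions should: the aggregate favors DuckDB, the write loop favors SQLite.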
Next, the paper explores optimizations that can be applied to SQLite to improve its analytical query performance.
One of these has already shipped! In February 2022 the SQLite 3.38.0 release notes included this:
Use a Bloom filter to speed up large analytic queries.
I had wondered at the time what the deal with this was—the paper explains it in some detail:
A key change is made to the join processing, which is to probe the Bloom filters before carrying out the rest of the join. Applying the Bloom filters early in the join pipeline dramatically reduces the number of tuples that flow through the join pipeline, and thus improves performance.
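SQLite executes joins as nested loops over B-tree indexes, and (as I understand it) the 3.38.0 optimization checks a Bloom filter built over the inner table before performing each index seek. Here’s a toy Python sketch of the general idea, using a dictionary lookup as a stand-in for the expensive inner-table probe—the class and its parameters are my own illustration, not SQLite’s implementation:

```python
class BloomFilter:
    """Toy Bloom filter: answers "definitely not present" or "maybe present"."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive several bit positions per key; Python's built-in hash()
        # is good enough for a demonstration.
        return [hash((seed, key)) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def join_with_bloom(fact_rows, dim_by_key, key_of):
    """Probe a Bloom filter before the (expensive) lookup into the
    dimension table, so most non-matching rows never reach it."""
    bloom = BloomFilter()
    for k in dim_by_key:
        bloom.add(k)
    for row in fact_rows:
        k = key_of(row)
        if not bloom.might_contain(k):
            continue  # guaranteed no match: skip the inner lookup entirely
        if k in dim_by_key:
            yield row, dim_by_key[k]
```

Because a Bloom filter can produce false positives but never false negatives, skipping rows that fail the probe can never drop a matching row—the win is that each probe costs a few cheap bit tests instead of a B-tree descent.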
This had a pretty big impact on the benchmark:
The performance impact of our optimizations is shown in Figure 6. On the Raspberry Pi, SQLite is now 4.2X faster on SSB. Our optimizations are particularly effective for query flight 2, resulting in 10X speedup. On the cloud server, we observed an overall speedup of 2.7X and individual query speedups up to 7X.
I decided to quote in full the first part of section 5, “future development”, because it does a great job of describing some of the key features of the culture of SQLite itself.
5 FUTURE DEVELOPMENT
The developers intend to provide support for SQLite through the year 2050, and design decisions are made accordingly. SQLite’s code and database file format are fully cross-platform, ensuring that SQLite can run on any current or future platform with an 8-bit byte, two’s complement 32-bit and 64-bit integers, and a C compiler. Every machine-code branch in the SQLite library is tested with multiple platforms and compilers, which makes the code robust for future migrations. SQLite is also extensively documented and commented, which helps new developers quickly understand SQLite’s architecture. Finally, the developers work hard to evaluate new programming trends based on merit rather than popularity.
While the performance gap has narrowed as a result of this work, DuckDB is still considerably faster than SQLite on SSB. This is somewhat expected; SQLite is a general-purpose database engine, whereas DuckDB is designed from the ground up for efficient OLAP. Although SQLite’s OLAP performance could be further improved in future work, there are several constraints that potential modifications to SQLite must satisfy.
First, modifications should cause no significant performance regression across the broad range of workloads served by SQLite. Second, the benefit of an optimization must be weighed against its impact on the size of the source code and the compiled library. Finally, modifications should not break SQLite’s backwards compatibility with previous versions and cross-compatibility with different machine architectures.
Although SQLite’s performance is a key priority, it must be balanced with these (sometimes competing) goals. We considered several means of improving value extraction in SQLite, but no single solution satisfied all the constraints above. For example, changing the data format from row-oriented to column-oriented would streamline value extraction, but it would also likely increase overhead for OLTP workloads. Moreover, drastic changes to the data format are at odds with SQLite’s goal of stability for the database file format.
That section continues with this intriguing detail:
An alternative approach to improving SQLite’s OLAP performance is a separate, yet tightly connected query engine that evaluates analytical queries on its own copy of the data, while SQLite continues to serve transactional requests, ensuring that the analytical engine stays up to date with the freshest data. If the extra space overhead is acceptable, the specialized analytical engine can provide substantial OLAP performance gains.
This design has been successfully implemented in SQLite3/HE [46], a query acceleration path for analytics in SQLite. SQLite3/HE achieves speedups of over 100X on SSB with no degradation in OLTP performance. However, the current implementation of SQLite3/HE does not persist columnar data to storage and is designed to be used in a single process. Future work may explore similar approaches without these limitations.
This is the first I had heard of SQLite3/HE. It’s described in this paper: Introducing a Query Acceleration Path for Analytics in SQLite3—by Martin Prammer, Suryadev Sahadevan Rajesh, Junda Chen, and Jignesh M. Patel from the University of Wisconsin-Madison. Martin and Jignesh are both credited as authors on the SQLite/DuckDB paper.
It’s also the subject of this 12-minute video from the CIDR DB conference.
It’s not clear to me if the code for this has been made available yet. I’d be fascinated to learn more about this project.
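In the meantime, here’s a toy sketch of the general pattern that quoted paragraph describes: SQLite handles the transactional traffic while a separate columnar engine answers analytical queries against its own copy of the data, refreshed when it goes stale. I’m using an in-memory DuckDB database as the stand-in analytical engine; everything about this class is my own invention, not how SQLite3/HE actually works (the real system integrates far more tightly and wouldn’t rebuild the whole copy on every refresh):

```python
import sqlite3

import duckdb


class HybridEngine:
    """Toy sketch: SQLite serves writes, a separate columnar engine
    (in-memory DuckDB here) serves analytics on its own copy of the
    data. Not SQLite3/HE's actual design."""

    def __init__(self, path, table, columns):
        self.table = table
        self.columns = columns  # e.g. "id BIGINT, region VARCHAR, amount DOUBLE"
        self.oltp = sqlite3.connect(path)
        self.olap = duckdb.connect()
        self.stale = True

    def write(self, sql, params=()):
        # All transactional traffic goes to SQLite.
        self.oltp.execute(sql, params)
        self.oltp.commit()
        self.stale = True  # the columnar copy is now out of date

    def analytics(self, sql):
        # Refresh the columnar copy lazily, only when it is stale.
        # A real system would apply incremental updates instead of
        # rebuilding the whole table from scratch.
        if self.stale:
            rows = self.oltp.execute(f"SELECT * FROM {self.table}").fetchall()
            self.olap.execute(f"DROP TABLE IF EXISTS {self.table}")
            self.olap.execute(f"CREATE TABLE {self.table} ({self.columns})")
            placeholders = ", ".join("?" for _ in self.columns.split(","))
            self.olap.executemany(
                f"INSERT INTO {self.table} VALUES ({placeholders})", rows
            )
            self.stale = False
        return self.olap.execute(sql).fetchall()
```

The interesting trade-off, which the paper calls out, is the extra space overhead of keeping two copies of the data in exchange for large OLAP speedups with no degradation in OLTP performance.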