Simon Willison's Weblog: performance

Using Rust in non-Rust servers to improve performance

2024-10-23T15:45:42+00:00

Using Rust in non-Rust servers to improve performance

Deep dive into different strategies for optimizing part of a web server application - in this case written in Node.js, but the same strategies should work for Python as well - by integrating with Rust in different ways.

The example app renders QR codes, initially using the pure JavaScript qrcode package. That ran at 1,464 req/sec, but switching it to calling a tiny Rust CLI wrapper around the qrcode crate using Node.js spawn() increased that to 2,572 req/sec.

This is yet another reminder to me that I need to get over my cgi-bin era bias that says that shelling out to another process during a web request is a bad idea. It turns out modern computers can quite happily spawn and terminate 2,500+ processes a second!

The article optimizes further first through a Rust library compiled to WebAssembly (2,978 req/sec) and then through a Rust function exposed to Node.js as a native library (5,490 req/sec), then finishes with a full Rust rewrite of the server that replaces Node.js entirely, running at 7,212 req/sec.

Full source code to accompany the article is available in the using-rust-in-non-rust-servers repository.

Via lobste.rs

Tags: performance, rust, javascript, nodejs, scaling

Cerebras Inference: AI at Instant Speed

2024-08-28T04:14:00+00:00

Cerebras Inference: AI at Instant Speed

New hosted API for Llama running at absurdly high speeds: "1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B".

How are they running so fast? Custom hardware. Their WSE-3 is 57x physically larger than an NVIDIA H100, and has 4 trillion transistors, 900,000 cores and 44GB of memory all on one enormous chip.

Their live chat demo just returned me a response at 1,833 tokens/second. Their API currently has a waitlist.

Via Hacker News

Tags: performance, generative-ai, llama, ai, llms, cerebras

Optimizing Datasette (and other weeknotes)

2024-08-22T15:46:43+00:00

I've been working with Alex Garcia on an experiment involving using Datasette to explore FEC contributions. We currently have a 11GB SQLite database - trivial for SQLite to handle, but at the upper end of what I've comfortably explored with Datasette in the past.

This was just the excuse I needed to dig into some optimizations! The next Datasette alpha release will feature some significant speed improvements for working with large tables - they're available on the main branch already.

Datasette tracing

Datasette has had a ?_trace=1 feature for a while. It's only available if you run Datasette with the trace_debug setting enabled - which you can do like this:

datasette -s trace_debug 1 mydatabase.db

Then any request with ?_trace=1 added to the URL will return a JSON blob at the end of the page showing every SQL query that was executed, how long it took and a truncated stack trace showing the code that triggered it.

Scroll to the bottom of https://latest.datasette.io/fixtures?_trace=1 for an example.

The JSON isn't very pretty. datasette-pretty-traces is a plugin I built to fix that - it turns that JSON into a much nicer visual representation.

As I dug into tracing I found a nasty bug in the trace mechanism. It was meant to quietly give up on pages longer than 256KB, in order to avoid having to spool potentially megabytes of data into memory rather than streaming it to the client. That code had a bug: the user would get a blank page instead! I fixed that first.

The next problem was that SQL queries that terminated with an error - including the crucial "query interrupted" error raised when a query took longer than the Datasette configured time limit - were not being included in the trace. That's fixed too, and I upgraded datasette-pretty-traces to render those errors with a pink background:

This gave me all the information I needed to track down those other performance problems.

Rule of thumb: don't scan more than 10,000 rows

SQLite is fast, but you can still run into performance problems if you ask it to scan too many rows.

Going forward, I'm introducing a new target for Datasette development: never scan more than 10,000 rows without a user explicitly requesting that scan.

The most common time this happens is with a select count(*) query. Datasette likes to display the number of rows in a table, and when you run a SQL query it likes to show you how many total rows match even when only displaying a subset of them in the paginated interface.

These counts are shown in two key places: on the list of tables in a database, and on the table view itself.

Counts are protected by Datasette's query time limit mechanism. On the table listing page this was configured such that if a count takes longer than 5ms it would be skipped and "Many rows" would be displayed. It turns out this mechanism isn't as reliable as I had hoped, maybe due to the overhead of cancelling the query. Given enough large tables those cancelled count queries could still add up to user-visible latency problems on that page.

Here's the pattern I turned to that fixed the performance problem:

select count(*) from (
    select * from libfec_SA16 limit 10001
)

This nested query first limits the table to 10,001 rows, then counts them. If the count is less than 10,001 we know that the count is entirely accurate. If it's exactly 10,001 we can show ">10,000 rows" in the UI.

Capping the number of scanned rows to 10,000 for any of these counts makes a huge difference in the performance of these pages!

But what about those table pages? Showing ">10,000 rows" is a bit of a cop-out, especially if the question the user wants to answer is "how many rows are in this table / match this filter?"

I addressed that in issue #2408: Datasette still truncates the count at 10,000 on initial page load, but users now get a "count all" link they can click to execute the full count.

The link goes to a SQL query page that runs the query, but I've also added a bit of progressive enhancement JavaScript to run that query and update the page in-place when the link is clicked. Here's what that looks like:

10,000 rows with a count all link. Clicking that replaces it with the text counting... which then replaces the entire count text with 23,036,621 rows." style="max-width: 100%;" />

In the future I may add various caching mechanisms so that counts that have been calculated can be displayed elsewhere in the UI without having to re-run the expensive queries. I may also incorporate SQL triggers for updating exact denormalized counts in a _counts table, as implemented in sqlite-utils.

The other feature that was really hurting performance was facet suggestions.

Datasette Facets are a really powerful way to quickly explore data. They can be applied to any column by the user, but to make the feature more visible Datasette suggests facets that might be a good fit for the current table by looking for things like columns that only contain 3 unique values.

The suggestion code was designed with performance in mind - it uses tight time limits (governed by the facet_suggest_time_limit_ms setting, defaulting to 50ms) and attempts to use other SQL tricks to quickly decide if a facet should be considered or not.

I found a couple of tricks to dramatically speed these up against larger tables as well.

First, I've started enforcing that new 10,000 limit for facet suggestions too - so each suggestion query only considers a maximum of 10,000 rows, even on tables with millions of items. These suggestions are just suggestions, so seeing a recommendation that would not have been suggested if the full table had been scanned is a reasonable trade-off.

Secondly, I spotted a gnarly bug in the way the date facet suggestion works. The previous query looked like this:

select date(column_to_test) from ( 
    select * from mytable
)
where column_to_test glob "????-??-*"
limit 100;

That limit 100 was meant to restrict it to considering 100 rows... but that didn't actually work! If a table with 20 million columns in had NO rows that matched the glob pattern, the query would still scan all 20 million rows.

The new query looks like this, and fixes the problem:

select date(column_to_test) from ( 
    select * from mytable limit 100
)
where column_to_test glob "????-??-*"

Moving the limit to the inner query causes the SQL to only run against the first 100 rows, as intended.

Thanks to these optimizations running Datasette against a database with huge tables now feels snappy and responsive. Expect them in an alpha release soon.

On the blog

I'm trying something new for the rest of my weeknotes. Since I'm investing a lot more effort in my link blog, I'm including a digest of everything I've linked to since the last edition. I updated my weeknotes Observable notebook to help generate these, after prompting Claude to help prototype a bunch of different approaches.

The following section was generated by this code - it includes everything I've posted, grouped by the most "interesting" tag assigned to each post. I'll likely iterate on this a bunch more in the future.

openai

OpenAI: Introducing Structured Outputs in the API - 2024-08-06
GPT-4o System Card - 2024-08-08
Using sqlite-vec with embeddings in sqlite-utils and Datasette - 2024-08-11

javascript

Observable Plot: Waffle mark - 2024-08-06
Reckoning - 2024-08-18

python

cibuildwheel 2.20.0 now builds Python 3.13 wheels by default - 2024-08-06
django-http-debug, a new Django app mostly written by Claude - 2024-08-08
PEP 750 – Tag Strings For Writing Domain-Specific Languages - 2024-08-11
mlx-whisper - 2024-08-13
Upgrading my cookiecutter templates to use python -m pytest - 2024-08-17
Writing your pyproject.toml - 2024-08-20
uv: Unified Python packaging - 2024-08-20
#!/usr/bin/env -S uv run - 2024-08-21
Armin Ronacher: There is an elephant in the room which is that As... - 2024-08-21
light-the-torch - 2024-08-22

security

Google AI Studio data exfiltration demo - 2024-08-07
SQL Injection Isn't Dead: Smuggling Queries at the Protocol Level - 2024-08-12
Links and materials for Living off Microsoft Copilot - 2024-08-14
Adam Newbold: [Passkeys are] something truly unique, because ba... - 2024-08-15
com2kid: Having worked at Microsoft for almost a decade, I... - 2024-08-16
Data Exfiltration from Slack AI via indirect prompt injection - 2024-08-20
SQL injection-like attack on LLMs with special tokens - 2024-08-20
The dangers of AI agents unfurling hyperlinks and what to do about it - 2024-08-21

llm

q What do I title this article? - 2024-08-07

prompt-engineering

Braggoscope Prompts - 2024-08-07
Using gpt-4o-mini as a reranker - 2024-08-11
LLMs are bad at returning code in JSON - 2024-08-16

andrej-karpathy

Andrej Karpathy: The RM [Reward Model] we train for LLMs is just a... - 2024-08-08

projects

Share Claude conversations by converting their JSON to Markdown - 2024-08-08
Datasette 1.0a15 - 2024-08-16
datasette-checkbox - 2024-08-16
Fix @covidsewage bot to handle a change to the underlying website - 2024-08-18

anthropic

Gemini 1.5 Flash price drop - 2024-08-08
Prompt caching with Claude - 2024-08-14
Alex Albert: Examples are the #1 thing I recommend people use ... - 2024-08-15
Introducing Zed AI - 2024-08-20

sqlite

High-precision date/time in SQLite - 2024-08-09
New Django {% querystring %} template tag - 2024-08-13

ethics

Where Facebook's AI Slop Comes From - 2024-08-10

jon-udell

Jon Udell: Some argue that by aggregating knowledge drawn fr... - 2024-08-10

browsers

Ladybird set to adopt Swift - 2024-08-11

explorables

Transformer Explainer - 2024-08-11

ai-assisted-programming

Tom MacWright: But [LLM assisted programming] does make me wonde... - 2024-08-12

hacker-news

dang: We had to exclude [dead] and eventually even just... - 2024-08-12

design

Help wanted: AI designers - 2024-08-13

prompt-injection

A simple prompt injection template - 2024-08-14

fly

Fly: We're Cutting L40S Prices In Half - 2024-08-16

open-source

Whither CockroachDB? - 2024-08-16

game-design

“The Door Problem” - 2024-08-18

whisper

llamafile v0.8.13 (and whisperfile) - 2024-08-19

Migrating Mess With DNS to use PowerDNS - 2024-08-19

Releases

datasette-pretty-traces 0.5 - 2024-08-21
Prettier formatting for ?_trace=1 traces
sqlite-utils-ask 0.1a0 - 2024-08-19
Ask questions of your data with LLM assistance
datasette-checkbox 0.1a2 - 2024-08-16
Add interactive checkboxes to columns in Datasette
datasette 1.0a15 - 2024-08-16
An open source multi-tool for exploring and publishing data
asgi-csrf 0.10 - 2024-08-15
ASGI middleware for protecting against CSRF attacks
datasette-pins 0.1a3 - 2024-08-07
Pin databases, tables, and other items to the Datasette homepage
django-http-debug 0.2 - 2024-08-07
Django app for creating endpoints that log incoming request and return mock data

TILs

Using sqlite-vec with embeddings in sqlite-utils and Datasette - 2024-08-11
Using pytest-django with a reusable Django application - 2024-08-07

Tags: performance, sql, sqlite, datasette, weeknotes

Quoting Nikita Melkozerov

2024-07-13T23:44:04+00:00

My architecture is a monolith written in Go (this is intentional, I sacrificed scalability to improve my shipping speed), and this is where SQLite shines. With a DB located on the local NVMe disk, a 5$ VPS can deliver a whopping 60K reads and 20K writes per second.

— Nikita Melkozerov

Tags: performance, go, sqlite

Quoting D. Richard Hipp

2024-04-30T13:59:50+00:00

Performance analysis indicates that SQLite spends very little time doing bytecode decoding and dispatch. Most CPU cycles are consumed in walking B-Trees, doing value comparisons, and decoding records - all of which happens in compiled C code. Bytecode dispatch is using less than 3% of the total CPU time, according to my measurements.

So at least in the case of SQLite, compiling all the way down to machine code might provide a performance boost 3% or less. That's not very much, considering the size, complexity, and portability costs involved.

— D. Richard Hipp

Tags: d-richard-hipp, performance, sqlite

Optimizing SQLite for servers

2024-03-31T20:16:23+00:00

Optimizing SQLite for servers

Sylvain Kerkour’s comprehensive set of lessons learned running SQLite for server-based applications.

There’s a lot of useful stuff in here, including detailed coverage of the different recommended PRAGMA settings.

There was also a tip I haven’t seen before about “BEGIN IMMEDIATE” transactions:

“By default, SQLite starts transactions in DEFERRED mode: they are considered read only. They are upgraded to a write transaction that requires a database lock in-flight, when query containing a write/update/delete statement is issued.

The problem is that by upgrading a transaction after it has started, SQLite will immediately return a SQLITE_BUSY error without respecting the busy_timeout previously mentioned, if the database is already locked by another connection.

This is why you should start your transactions with BEGIN IMMEDIATE instead of only BEGIN. If the database is locked when the transaction starts, SQLite will respect busy_timeout.”

Via lobste.rs

Tags: sqlite, sql, performance, databases

DiskCache

2024-03-19T15:43:18+00:00

DiskCache

Grant Jenks built DiskCache as an alternative caching backend for Django (also usable without Django), using a SQLite database on disk. The performance numbers are impressive—it even beats memcached in microbenchmarks, due to avoiding the need to access the network.

The source code (particularly in core.py) is a great case-study in SQLite performance optimization, after five years of iteration on making it all run as fast as possible.

Via Hacker News comment

Tags: sqlite, django, python, performance

Quoting Charlie Marsh

2024-02-04T19:41:16+00:00

Sometimes, performance just doesn't matter. If I make some codepath in Ruff 10x faster, but no one ever hits it, I'm sure it could get some likes on Twitter, but the impact on users would be meaningless.

And yet, it's good to care about performance everywhere, even when it doesn't matter. Caring about performance is cultural and contagious. Small wins add up. Small losses add up even more.

— Charlie Marsh

Tags: performance, ruff, charlie-marsh

Batch size one billion: SQLite insert speedups, from the useful to the absurd

2023-09-26T17:31:54+00:00

Batch size one billion: SQLite insert speedups, from the useful to the absurd

Useful, detailed review of ways to maximize the performance of inserting a billion integers into a SQLite database table.

Via Hacker News

Tags: performance, sqlite

How CPython Implements and Uses Bloom Filters for String Processing

2023-09-16T22:32:37+00:00

How CPython Implements and Uses Bloom Filters for String Processing

Fascinating dive into Python string internals by Abhinav Upadhyay. It turns out CPython uses very simple bloom filters in several parts of the core string methods, to solve problems like splitting on newlines where there are actually eight codepoints that could represent a newline, and a tiny bloom filter can help filter a character in a single operation before performing all eight comparisons only if that first check failed.

Tags: performance, bloom-filters, python

Quoting Andrej Karpathy

2023-02-04T00:08:18+00:00

The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.

— Andrej Karpathy

Tags: andrej-karpathy, performance, gpt-3, generative-ai, ai, llms

Data-driven performance optimization with Rust and Miri

2022-12-09T17:19:14+00:00

Data-driven performance optimization with Rust and Miri

Useful guide to some Rust performance optimization tools. Miri can be used to dump out a detailed JSON profile of a program which can then be opened and explored using the Chrome browser’s performance tool.

Via Hacker News

Tags: performance, chrome, rust

Efficient Pagination Using Deferred Joins

2022-08-16T17:35:27+00:00

Efficient Pagination Using Deferred Joins

Surprisingly simple trick for speeding up deep OFFSET x LIMIT y pagination queries, which get progressively slower as you paginate deeper into the data. Instead of applying them directly, apply them to a “select id from ...” query to fetch just the IDs, then either use a join or run a separate “select * from table where id in (...)” query to fetch the full records for that page.

Via Introducing FastPage: Faster offset pagination for Rails apps

Tags: sql, performance

Announcing Pyston-lite: our Python JIT as an extension module

2022-06-08T17:58:11+00:00

Announcing Pyston-lite: our Python JIT as an extension module

The Pyston JIT can now be installed in any Python 3.8 virtual environment by running “pip install pyston_lite_autoload”—which includes a hook to automatically inject the JIT. I just tried a very rough benchmark against Datasette (ab -n 1000 -c 10) and got 391.20 requests/second without the JIT compared to 404.10 request/second with it.

Via Hacker News

Tags: performance, jit, python

Compiling Black with mypyc

2022-05-31T23:24:16+00:00

Compiling Black with mypyc

Richard Si is a Black contributor who recently obtained a 2x performance boost by compiling Black using the mypyc tool from the mypy project, which uses Python type annotations to generate a compiled C version of the Python logic. He wrote up this fantastic three-part series describing in detail how he achieved this, including plenty of tips on Python profiling and clever optimization tricks.

Via Łukasz Langa

Tags: performance, mypy, python, black

Mypyc

2022-01-30T01:31:12+00:00

Mypyc

Spotted this in the Black release notes: “Black is now compiled with mypyc for an overall 2x speed-up”. Mypyc is a tool that compiles Python modules (written in a subset of Python) to C extensions—similar to Cython but using just Python syntax, taking advantage of type annotations to perform type checking and type inference. It’s part of the mypy type checking project, which has been using it since 2019 to gain a 4x performance improvement over regular Python.

Via Black release notes

Tags: c, performance, mypy, python

Tricking Postgres into using an insane – but 200x faster – query plan

2022-01-18T20:53:01+00:00

Tricking Postgres into using an insane – but 200x faster – query plan

Jacob Martin talks through a PostgreSQL query optimization they implemented at Spacelift, showing in detail how to interpret the results of EXPLAIN (FORMAT JSON, ANALYZE) using the explain.dalibo.com visualization tool.

Tags: performance, optimization, postgresql

Weeknotes: datasette-tiddlywiki, filters_from_request

2021-12-24T07:08:03+00:00

I made some good progress on the big refactor this week, including extracting some core logic out into a new Datasette plugin hook. I also got distracted by TiddlyWiki and released a new Datasette plugin that lets you run TiddlyWiki inside Datasette.

datasette-tiddlywiki

TiddlyWiki is a fascinating and unique project. Jeremy Ruston has been working on it for 17 years now and I've still not seen another piece of software that works even remotely like it.

It's a full-featured wiki that's implemented entirely as a single 2.3MB page of HTML and JavaScript, with a plugin system that allows it to be extended in all sorts of interesting ways.

The most unique feature of TiddlyWiki is how it persists data. You can create a brand new wiki by opening tiddlywiki.com/empty.html in your browser, making some edits... and then clicking the circle-tick "Save changes" button to download a copy of the page with your changes baked into it! Then you can open that up on your own computer and keep on using it.

There's actually a lot more to TiddlyWiki persistence than that: The GettingStarted guide lists dozens of options that vary depending on operating system and browser - it's worth browsing through them just to marvel at how much innovation has happened around the project just in the persistence space.

One of the options is to run a little server that implements the WebServer API and persists data sent via PUT requests. SQLite is an obvious candidate for a backend, and Datasette makes it pretty easy to provide APIs on top of SQLite... so I decided to experiment with building a Datasette plugin that offers a full persistant TiddlyWiki experience.

datasette-tiddlywiki is the result.

You can try it out by running datasette install datasette-tiddlywiki and then datasette tiddlywiki.db --create to start the server (with a tiddlywiki.db SQLite database that will be created if it does not already exist.)

Then navigate to http://localhost:8001/-/tiddlywiki to start interacting with your new TiddlyWiki. Any changes you make there will be persisted to the tiddlywiki database.

I had a running research issue that I updated as I was figuring out how to build it - all sorts of fun TiddlyWiki links and TILs are embedded in that thread. The issue started out in my private "notes" GitHub repository but I transferred it to the datasette-tiddlywiki repository after I had created and published the first version of the plugin.

filters_from_request() plugin hook

My big breakthrough in the ongoing Datasette Table View refactor project was a realization that I could simplify the table logic by extracting some of it out into a new plugin hook.

The new hook is called filters_from_request. It acknowledges that the primary goal of the table page is to convert query string parameters - like ?_search=tony or ?id__gte=6 or ?_where=id+in+(1,+2+,3) into SQL where clauses.

(Here's a full list of supported table arguments.)

So that's what filters_from_request() does - given a request object it can return SQL clauses that should be added to the WHERE.

Datasette now uses those internally to implement ?_where= and ?_search= and ?_through=, see datasette/filters.py.

I always try to accompany a new plugin hook with a plugin that actually uses it - in this case I've been updating datasette-leaflet-freedraw to use that hook to add a "draw a shape on a map to filter this table" interface to any table that it detects has a SpatiaLite geometry column. There's a demo of that here:

https://calands.datasettes.com/calands/CPAD_2020a_SuperUnits?_freedraw=%7B%22type%22%3A%22MultiPolygon%22%2C%22coordinates%22%3A%5B%5B%5B%5B-121.92627%2C37.73597%5D%2C%5B-121.83838%2C37.68382%5D%2C%5B-121.64063%2C37.45742%5D%2C%5B-121.57471%2C37.19533%5D%2C%5B-121.81641%2C36.80928%5D%2C%5B-122.146%2C36.63316%5D%2C%5B-122.56348%2C36.65079%5D%2C%5B-122.89307%2C36.79169%5D%2C%5B-123.06885%2C36.96745%5D%2C%5B-123.09082%2C37.33522%5D%2C%5B-123.0249%2C37.562%5D%2C%5B-122.91504%2C37.77071%5D%2C%5B-122.71729%2C37.92687%5D%2C%5B-122.58545%2C37.96152%5D%2C%5B-122.10205%2C37.96152%5D%2C%5B-121.92627%2C37.73597%5D%5D%5D%5D%7D

Note the new custom ?_freedraw={...} parameter which accepts a GeoJSON polygon and uses it to filter the table - that's implemented using the new hook.

This isn't in a full Datasette release yet, but it's available in the Datasette 0.60a1 alpha (added in 0.60a0) if you want to try it out.

Optimizing populate_table_schemas()

I introduced the datasette-pretty-traces plugin last week - it makes it much easier to see the queries that are running on any given Datasette page.

This week I realized it wasn't tracking write queries, so I added support for that - and discovered that on first page load after starting up Datasette spends a lot of time populating its own internal database containing schema information (see Weeknotes: Datasette internals from last year.)

I opened a tracking ticket and made a bunch of changes to optimize this. The new code in datasette/utils/internal_db.py uses two new documented internal methods:

db.execute_write_script() and db.execute_write_many()

These are the new methods that were created as part of the optimization work. They are documented here:

They are Datasette's async wrappers around the Python sqlite3 module's executemany() and executescript() methods.

I also made a breaking change to Datasette's existing execute_write() and execute_write_fn() methods: their block= argument now defaults to True, where it previously defaulted to False.

Prior to this change, db.execute_write(sql) would put the passed SQL in a queue to be executed once the write connection became available... and then return control to the calling code, whether or not that SQL had actually run- a fire-and-forget mechanism for executing SQL.

The block=True option would change it to blocking until the query had finished executing.

Looking at my own code, I realized I had never once used the fire-and-forget mechanism: I always used block=True to ensure the SQL had finished writing before I moved on.

So clearly block=True was a better default. I made that change in issue 1579.

This is technically a breaking change... but I used the new GitHub code search to see if anyone was using it in a way that would break and could only find one example of it in code not written by me, in datasette-webhook-write - and since they use block=True there anyway this update won't break their code.

If I'd released Datasette 1.0 I would still consider this a breaking change and bump the major version number, but thankfully I'm still in the 0.x range where I can be a bit less formal about these kinds of thing!

Releases this week

datasette-tiddlywiki: 0.1 - 2021-12-23
Run TiddlyWiki in Datasette and save Tiddlers to a SQLite database
asyncinject: 0.2 - (4 releases total) - 2021-12-21
Run async workflows using pytest-fixtures-style dependency injection
datasette: 0.60a1 - (104 releases total) - 2021-12-19
An open source multi-tool for exploring and publishing data
datasette-pretty-traces: 0.3.1 - (5 releases total) - 2021-12-19
Prettier formatting for ?_trace=1 traces

TIL this week

Tags: gis, performance, plugins, projects, tiddlywiki, datasette, weeknotes

Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool

2021-08-06T06:05:15+00:00

Earlier this week I released sqlite-utils 3.14 with a powerful new command-line tool: sqlite-utils convert, which applies a conversion function to data stored in a SQLite column.

Anyone who works with data will tell you that 90% of the work is cleaning it up. Running command-line conversions against data in a SQLite file turns out to be a really productive way to do that.

Transforming a column

Here's a simple example. Say someone gave you data with numbers that are formatted with commas - like 3,044,502 - in a count column in a states table.

You can strip those commas out like so:

sqlite-utils convert states.db states count \
    'value.replace(",", "")'

The convert command takes four arguments: the database file, the name of the table, the name of the column and a string containing a fragment of Python code that defines the conversion to be applied.

The conversion function can be anything you can express with Python. If you want to import extra modules you can do so using --import module - here's an example that wraps text using the textwrap module from the Python standard library:

sqlite-utils convert content.db articles content \
    '"\n".join(textwrap.wrap(value, 100))' \
    --import=textwrap

You can consider this analogous to using Array.map() in JavaScript, or running a transformation using a list comprehension in Python.

Custom functions in SQLite

Under the hood, the tool takes advantage of a powerful SQLite feature: the ability to register custom functions written in Python (or other languages) and call them from SQL.

The text wrapping example above works by executing the following SQL:

update articles set content = convert_value(content)

convert_value(value) is a custom SQL function, compiled as Python code and then made available to the database connection.

The equivalent code using just the Python standard library would look like this:

import sqlite3
import textwrap

def convert_value(value):
    return "\n".join(textwrap.wrap(value, 100))

conn = sqlite3.connect("content.db")
conn.create_function("convert_value", 1, convert_value)
conn.execute("update articles set content = convert_value(content)")

sqlite-utils convert works by compiling the code argument to a Python function, registering it with the connection and executing the above SQL query.

Splitting columns into multiple other columns

Sometimes when I'm working with a table I find myself wanting to split a column into multiple other columns.

A classic example is locations - if a location column contains latitude,longitude values I'll often want to split that into separate latitude and longitude columns, so I can visualize the data with datasette-cluster-map.

The --multi option lets you do that using sqlite-utils convert:

sqlite-utils convert data.db places location '
latitude, longitude = value.split(",")
return {
    "latitude": float(latitude),
    "longitude": float(longitude),
}' --multi

--multi tells the command to expect the Python code to return dictionaries. It will then create new columns in the database corresponding to the keys in those dictionaries and populate them using the results of the transformation.

If the places table started with just a location column, after running the above command the new table schema will look like this:

CREATE TABLE [places] (
    [location] TEXT,
    [latitude] FLOAT,
    [longitude] FLOAT
);

Common recipes

This new feature in sqlite-utils actually started life as a separate tool entirely, called sqlite-transform.

Part of the rationale for adding it to sqlite-utils was to avoid confusion between what that tool did and the sqlite-utils transform tool, which does something completely different (applies table transformations that aren't possible using SQLite's default ALTER TABLE statement). Somewhere along the line I messed up with the naming of the two tools!

sqlite-transform bundles a number of useful default transformation recipes, in addition to allowing arbitrary Python code. I ended up making these available in sqlite-utils convert by exposing them as functions that can be called from the command-line code argument like so:

sqlite-utils convert my.db articles created_at \
    'r.parsedate(value)'

Implementing them as Python functions in this way meant I didn't need to invent a new command-line mechanism for passing in additional options to the individual recipes - instead, parameters are passed like this:

sqlite-utils convert my.db articles created_at \
    'r.parsedate(value, dayfirst=True)'

Also available in the sqlite_utils Python library

Almost every feature that is exposed by the sqlite-utils command-line tool has a matching API in the sqlite_utils Python library. convert is no exception.

The Python API lets you perform operations like the following:

db = sqlite_utils.Database("dogs.db")

db["dogs"].convert("name", lambda value: value.upper())

Any Python callable can be passed to convert, and it will be applied to every value in the specified column - again, like using map() to apply a transformation to every item in an array.

You can also use the Python API to perform more complex operations like the following two examples:

# Convert title to upper case only for rows with id > 20
table.convert(
    "title",
    lambda v: v.upper(),
    where="id > :id",
    where_args={"id": 20}
)

# Create two new columns, "upper" and "lower",
# and populate them from the converted title
table.convert(
    "title",
    lambda v: {
        "upper": v.upper(),
        "lower": v.lower()
    }, multi=True
)

See the full documentation for table.convert() for more options.

A more sophisticated example: analyzing log files

I used the new sqlite-utils convert command earlier today, to debug a performance issue with my blog.

Most of my blog traffic is served via Cloudflare with a 15 minute cache timeout - but occasionally I'll hit an uncached page, and they had started to feel not quite as snappy as I would expect.

So I dipped into the Heroku dashboard, and saw this pretty sad looking graph:

Somehow my 50th percentile was nearly 10 seconds, and my maximum page response time was 23 seconds! Something was clearly very wrong.

I use NGINX as part of my Heroku setup to buffer responses (see Running gunicorn behind nginx on Heroku for buffering and logging), and I have custom NGINX configuration to write to the Heroku logs - mainly to work around a limitation in Heroku's default logging where it fails to record full user-agents or referrer headers.

I extended that configuration to record the NGINX request_time, upstream_response_time, upstream_connect_time and upstream_header_time variables, which I hoped would help me figure out what was going on.

After applying that change I started seeing Heroku log lines that looked like this:

2021-08-05T17:58:28.880469+00:00 app[web.1]: measure#nginx.service=4.212 request="GET /search/?type=blogmark&page=2&tag=highavailability HTTP/1.1" status_code=404 request_id=25eb296e-e970-4072-b75a-606e11e1db5b remote_addr="10.1.92.174" forwarded_for="114.119.136.88, 172.70.142.28" forwarded_proto="http" via="1.1 vegur" body_bytes_sent=179 referer="-" user_agent="Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" request_time="4.212" upstream_response_time="4.212" upstream_connect_time="0.000" upstream_header_time="4.212";

Next step: analyze those log lines.

I ran this command for a few minutes to gather some logs:

heroku logs -a simonwillisonblog --tail | grep 'measure#nginx.service' > /tmp/log.txt

Having collected 488 log lines, the next step was to load them into SQLite.

The sqlite-utils insert command likes to work with JSON, but I just had raw log lines. I used jq to convert each line into a {"line": "raw log line"} JSON object, then piped that as newline-delimited JSON into sqlite-utils insert:

cat /tmp/log.txt | \
    jq --raw-input '{line: .}' --compact-output | \
    sqlite-utils insert /tmp/logs.db log - --nl

jq --raw-input accepts input that is just raw lines of text, not yet valid JSON. '{line: .}' is a tiny jq program that builds {"line": "raw input"} objects. --compact-output causes jq to output newline-delimited JSON.

Then sqlite-utils insert /tmp/logs.db log - --nl reads that newline-delimited JSON into a new SQLite log table in a logs.db database file (full documentation here).

Update 6th January 2022: sqlite-utils 3.20 introduced a new sqlite-utils insert ... --lines option for importing raw lines, so you can now achieve this without using jq at all. See Inserting unstructured data with --lines and --text for details.

Now I had a SQLite table with a single column, line. Next step: parse that nasty log format.

To my surprise I couldn't find an existing Python library for parsing key=value key2="quoted value" log lines. Instead I had to figure out a regular expression:

([^\s=]+)=(?:"(.*?)"|(\S+))

Here's that expression visualized using Debuggex:

I used that regular expression as part of a custom function passed in to the sqlite-utils convert tool:

sqlite-utils convert /tmp/logs.db log line --import re --multi "$(cat <<EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(value):
        pairs[key] = value1 or value2
    return pairs
EOD
)"

(This uses a cat <<EOD trick to avoid having to figure out how to escape the single and double quotes in the Python code for usage in a zsh shell command.)

Using --multi here created new columns for each of the key/value pairs seen in that log file.

One last step: convert the types. The new columns are all of type text but I want to do sorting and arithmetic on them so I need to convert them to integers and floats. I used sqlite-utils transform for that:

sqlite-utils transform /tmp/logs.db log \
    --type 'measure#nginx.service' float \
    --type 'status_code' integer \
    --type 'body_bytes_sent' integer \
    --type 'request_time' float \
    --type 'upstream_response_time' float \
    --type 'upstream_connect_time' float \
    --type 'upstream_header_time' float

Here's the resulting log table (in Datasette Lite).

Once the logs were in Datasette, the problem quickly became apparent when I sorted by request_time: an army of search engine crawlers were hitting deep linked filters in my faceted search engine, like /search/?tag=geolocation&tag=offlineresources&tag=canvas&tag=javascript&tag=performance&tag=dragndrop&tag=crossdomain&tag=mozilla&tag=video&tag=tracemonkey&year=2009&type=blogmark. These are expensive pages to generate! They're also very unlikely to be in my Cloudflare cache.

Could the answer be as simple as a robots.txt rule blocking access to /search/?

I shipped that change and waited a few hours to see what the impact would be:

It took a while for the crawlers to notice that my robots.txt had changed, but by 8 hours later my site performance was dramatically improved - I'm now seeing 99th percentile of around 450ms, compared to 25 seconds before I shipped the robots.txt change!

With this latest addition, sqlite-utils has evolved into a powerful tool for importing, cleaning and re-shaping data - especially when coupled with Datasette in order to explore, analyze and publish the results.

TIL this week

Releases this week

sqlite-transform: 1.2.1 - (10 releases total) - 2021-08-02
Tool for running transformations on columns in a SQLite database
sqlite-utils: 3.14 - (82 releases total) - 2021-08-02
Python CLI utility and library for manipulating SQLite databases
datasette-json-html: 1.0.1 - (6 releases total) - 2021-07-31
Datasette plugin for rendering HTML based on JSON values
datasette-publish-fly: 1.0.2 - (5 releases total) - 2021-07-30
Datasette plugin for publishing data using Fly

Tags: performance, projects, sqlite, datasette, data-science, weeknotes, sqlite-utils

Quoting Brendan Gregg

2021-06-08T19:33:16+00:00

When I was a performance consultant I'd show up to random companies who wanted me to fix their computer performance issues. If they trusted me with a login to their production servers, I could help them a lot quicker. To get that trust I knew which tools looked but didn't touch: Which were observability tools and which were experimental tools. "I'll start with observability tools only" is something I'd say at the start of every engagement.

— Brendan Gregg

Tags: observability, performance, brendan-gregg

Cleaning Up Your Postgres Database

2021-02-03T07:32:33+00:00

Cleaning Up Your Postgres Database

Craig Kerstiens provides some invaluable tips on running an initial check of the health of a PostgreSQL database, by using queries against the pg_statio_user_indexes table to find the memory cache hit ratio and the pg_stat_user_tables table to see what percentage of queries to your tables are using an index.

Via @craigkerstiens

Tags: performance, postgresql, databases, craig-kerstiens

Making GitHub’s new homepage fast and performant

2021-01-29T19:05:34+00:00

Making GitHub’s new homepage fast and performant

A couple of really clever tricks in this article by Tobias Ahlin. The first is using IntersectionObserver in conjunction with the video preload=“none” attribute to lazily load a video when it scrolls into view. The second is an ingenious trick to create an efficiently encoded transparent JPEG image: embed the image in a SVG file twice, once as the image and once as a transparency mask.

Tags: images, performance, svg, javascript, github

How Shopify Uses WebAssembly Outside of the Browser

2020-12-19T16:46:09+00:00

How Shopify Uses WebAssembly Outside of the Browser

I’m fascinated by applications of WebAssembly outside the browser. As a Python programmer I’m excited to see native code libraries getting compiled to WASM in a way that lets me call them from Python code via a bridge, but the other interesting application is executing untrusted code in a sandbox.

Shopify are doing exactly that—they are building a kind-of plugin mechanism where partner code compiled to WASM runs inside their architecture using Fastly’s Lucet. The performance numbers are in the same ballpark as native code.

Also interesting: they’re recommending AssemblyScript, a TypeScript-style language designed to compile directly to WASM without needing any additional interpreter support, as required by dynamic languages such as JavaScript, Python or Ruby.

Via Hacker News

Tags: performance, webassembly, security

Quoting Nelson Elhage

2020-02-24T14:32:35+00:00

I’ve really come to appreciate that performance isn’t just some property of a tool independent from its functionality or its feature set. Performance — in particular, being notably fast — is a feature in and of its own right, which fundamentally alters how a tool is used and perceived.

— Nelson Elhage

Tags: performance

The Now CDN

2018-07-12T03:34:06+00:00

The Now CDN

Huge announcement from Zeit Now today: all .now.sh deployments are now served through the Cloudflare CDN, which means they benefit from 150 worldwide CDN locations that obey HTTP caching headers. This is particularly relevant for Datasette, since it serves far-future cache headers by default and uses Cloudflare-compatible HTTP/2 push hints to accelerate 302 redirects. This means that both the “datasette publish now” CLI command and the Datasette Publish web app will now result in Cloudflare-accelerated deployments.

Via @zeithq

Tags: cdn, performance, datasette, zeit-now, cloudflare

Quoting D. Richard Hipp

2018-05-10T05:15:49+00:00

The latest SQLite 3.8.7 alpha version is 50% faster than the 3.7.17 release from 16 months ago. That is to say, it does 50% more work using the same number of CPU cycles. [...] The 50% faster number above is not about better query plans. This is 50% faster at the low-level grunt work of moving bits on and off disk and search b-trees. We have achieved this by incorporating hundreds of micro-optimizations. Each micro-optimization might improve the performance by as little as 0.05%. If we get one that improves performance by 0.25%, that is considered a huge win. Each of these optimizations is unmeasurable on a real-world system (we have to use cachegrind to get repeatable run-times) but if you do enough of them, they add up.

— D. Richard Hipp

Tags: sqlite, performance, d-richard-hipp

Computer latency: 1977-2017

2017-12-26T00:28:41+00:00

Computer latency: 1977-2017

Dan Luu used a 240 fps camera to investigate the latency between hitting a key and having the character show up on the display across four decades of computing devices... and found 1983’s Apple IIe outperformed everything else. He goes to great lengths to explain why in his accompanying write-up.

Via Craig Mod

Tags: performance, dan-luu

Many Small Queries Are Efficient In SQLite

2017-11-26T16:24:01+00:00

Many Small Queries Are Efficient In SQLite

Since SQLite runs in-process rather than being accessed over a network it avoids the per-query overhead of network round trips. This means that while MySQL or PostgreSQL applications need to avoid N+1 query patterns that create 100s of queries per request, SQLite apps can be designed differently: provided you hit indexes or small tables, 200 queries just means 200 extra cheap function calls.

Tags: performance, sqlite

Your Web, Half a Second Sooner

2011-03-17T17:39:00+00:00

Your Web, Half a Second Sooner

Google AdSense now serves a tiny bit of JavaScript that loads everything else in a dynamically populated iframe, thus avoiding blocking the rest of the page load. It’s about time online advertising providers started taking page performance seriously.

Tags: adsense, advertising, performance, recovered

Google and Microsoft Cheat on Slow-Start. Should You?

2010-12-03T19:03:00+00:00

Google and Microsoft Cheat on Slow-Start. Should You?

Fascinating optimisation tricks by some of the big websites, which violate the RFC governing the TCP slow-start algorithm in order to perform better in the common case.

Tags: google, microsoft, networking, performance, recovered

Simon Willison's Weblog: performance

Using Rust in non-Rust servers to improve performance

Cerebras Inference: AI at Instant Speed

Optimizing Datasette (and other weeknotes)

Datasette tracing

Rule of thumb: don't scan more than 10,000 rows

Optimized facet suggestions

On the blog

Releases

TILs

Quoting Nikita Melkozerov

Quoting D. Richard Hipp

Optimizing SQLite for servers

DiskCache

Quoting Charlie Marsh

Batch size one billion: SQLite insert speedups, from the useful to the absurd

How CPython Implements and Uses Bloom Filters for String Processing

Quoting Andrej Karpathy

Data-driven performance optimization with Rust and Miri

Efficient Pagination Using Deferred Joins

Announcing Pyston-lite: our Python JIT as an extension module

Compiling Black with mypyc

Mypyc

Tricking Postgres into using an insane – but 200x faster – query plan

Weeknotes: datasette-tiddlywiki, filters_from_request

datasette-tiddlywiki

filters_from_request() plugin hook

Optimizing populate_table_schemas()

db.execute_write_script() and db.execute_write_many()

Releases this week

TIL this week

Apply conversion functions to data in SQLite columns with the sqlite-utils CLI tool

Transforming a column

Custom functions in SQLite

Splitting columns into multiple other columns

Common recipes

Also available in the sqlite_utils Python library

A more sophisticated example: analyzing log files

TIL this week

Releases this week

Quoting Brendan Gregg

Cleaning Up Your Postgres Database

Making GitHub’s new homepage fast and performant

How Shopify Uses WebAssembly Outside of the Browser

Quoting Nelson Elhage

The Now CDN

Quoting D. Richard Hipp

Computer latency: 1977-2017

Many Small Queries Are Efficient In SQLite

Your Web, Half a Second Sooner

Google and Microsoft Cheat on Slow-Start. Should You?