Weeknotes: Docker architectures, sqlite-utils 3.7, nearly there with Datasette 0.57
This week I learned a whole bunch about using Docker to emulate different architectures, released sqlite-utils 3.7 and made a ton of progress towards the almost-ready-to-ship Datasette 0.57.
Development environments on the Mac M1
We added another full-time developer to the VIAL project this week, which came with a new requirement: we needed our development environment to work on an M1 Mac.
“Can’t be that hard”, I thought.
After several rounds of attempted fixes, I decided I needed an M1 for myself. I ended up hiring an M1 Mac Mini from MacStadium, and dove into the horror show that is getting a relatively straight-forward Django app with a small but crucial number of C dependencies (most notably psycopg2 and cryptography) running in a Docker container on an M1 machine.
Having almost beaten all of the dependencies into shape, I hit a
free(): invalid pointer error. With no idea how to even start debugging that, I wrote up what I had learned so far and declared defeat on Twitter.
the possibly-good news is that you can emulate aarch64-linux on other platforms Docker runs on
... said Mo McRoberts. And it turns out you can!
The short version: I’m using Docker Compose, and it turned out all I had to do was add
platform: linux/amd64 to my
docker-compose.yml file and Docker built and ran my container under that architecture, using QEMU as an emulation layer. And just like that, everything on the M1 started working!
Performance isn’t brilliant, but it works. In the longer run it would be great to get the whole environment running on the native architecture for the M1, but it’s great to have found an escape hatch.
More QEMU: testing Datasette on different architectures
Blair Drummond contributed a patch for Datasette that enables building the official Datasette Docker container across multiple architectures, using the Docker buildx mechanism.
It successfully builds and publishes images, which is cool... but what’s a good way to test them?
--platform option to Docker holds the key. Here’s a one-liner that runs the Datasette image built for
linux/s390x on my laptop, via an emulation layer:
% docker run --platform linux/s390x \ -v `pwd`:/mnt -p 8001:8001 \ datasetteproject/datasette:0.57a1 datasette -p 8001 -h 0.0.0.0
And here’s a recipe that confirms the platform by outputting the result of Python’s
% docker run -it --platform linux/s390x -v `pwd`:/mnt \ -p 8001:8001 datasetteproject/datasette:0.57a1 \ python -c 'import platform; print(platform.uname())' uname_result( system='Linux', node='d14916ca91df', release='4.19.121-linuxkit', version='#1 SMP Thu Jan 21 15:36:34 UTC 2021', machine='s390x' )
Ideally I’d like to run the full Datasette test suite inside the container. That requires some extra dependencies, plus mounting the Datasette root folder (complete with tests) inside the container. Here’s a hacky way to do that:
docker run -it -v `pwd`:/mnt --platform linux/amd64 \ datasetteproject/datasette:0.57a1 bash -c ' pip install "pytest>=5.2.2,<6.3.0" \ "pytest-xdist>=2.2.1,<2.3" \ "pytest-asyncio>=0.10,<0.16" \ "beautifulsoup4>=4.8.1,<4.10.0" \ "black==21.5b1" \ "pytest-timeout>=1.4.2,<1.5" \ "trustme>=0.7,<0.8" \ && cd /mnt && pytest'
Before I release 0.57 I want to use this trick to make sure I’m only building official Docker containers for platforms on which the test suite passes.
This was an absolutely classic case of yak shaving. I noticed that exporting CSV data from covid-19.datasettes.com/covid/ny_times_us_counties—now with 1,378,190 rows—was running really slowly.
That page is sorted by date, descending—and Datasette large exports work using keyset pagination, which means they execute 1,300 SQL queries (1000 rows per query) ordered by date, descending with a filter and a limit.
There was an index on
date but it was in ascending order. SQLite also supports descending order indexes, and a micro-benchmark suggested that this could speed things up—each query could take 10ms instead of ~200ms.
So I needed to teach sqlite-utils to create descending order indexes. And then since I had a bunch of functionality accumulated since version 3.6 back in February, I cleaned those up, fixed an additional bug and shipped a 3.7 release.
Then when I applied the fix to my
covid-19 project it made almost no difference to the performance at all! It turned out I had been shaving entirely the wrong yak.
The real problem was that each page of results was unneccessarily calculating facets, suggested facets and a full table count—potentially a thousand times when returning a million streaming rows. Which leads me to...
Progress towards Datasette 0.57
This was the yak I needed to shave all along. Some highlights from the past 8 days:
- I landed ?col= and ?_nocol=—you can now select exactly which columns you would like to see on the table page (and in the CSV and JSON exports).
- A new
?_facet_size=maxoption (#1337), which is now linked to from the … shown at the bottom of a truncated list of facet results. This is particularly useful for things like lists of states—50 gets truncated to 30, but the new link lets you see all 50 in one place.
- You can now opt-out of both facets and table counts with the new ?_nofacet=1 and ?_nocount=1 options—both of which are used to greatly speed up large CSV exports.
- And a fix for an issue first opened in November 2019—if your SQL query throws an error, Datasette now shows you the error in context with the SQL query so you can edit it without having to hit “back” to recover it! This may seem like a small thing but it was a surprisingly fiddly fix—I’m so glad to finally have it done though.
I hope to release 0.57 final later this week. In the meantime some of the above is available in the 0.57a1 alpha.
TIL this week
- Turning an array of arrays into objects with jq
- Running Docker on an M1 Mac
- Finding CSV files that start with a BOM using ripgrep
Releases this week
sqlite-utils: 3.7—(74 releases total)—2021-05-29
Python CLI utility and library for manipulating SQLite databases
datasette: 0.57a1—(87 releases total)—2021-05-27
An open source multi-tool for exploring and publishing data
django-sql-dashboard: 0.15—(29 releases total)—2021-05-25
Django app for building dashboards using raw SQL queries
More recent articles
- Understanding GPT tokenizers - 8th June 2023
- Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking - 4th June 2023
- It's infuriatingly hard to understand how closed models train on their input - 4th June 2023
- ChatGPT should include inline tips - 30th May 2023
- Lawyer cites fake cases invented by ChatGPT, judge is not amused - 27th May 2023
- llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs - 18th May 2023
- Delimiters won't save you from prompt injection - 11th May 2023
- Weeknotes: sqlite-utils 3.31, download-esm, Python in a sandbox - 10th May 2023
- Leaked Google document: "We Have No Moat, And Neither Does OpenAI" - 4th May 2023
- Midjourney 5.1 - 4th May 2023