Weeknotes: Docker architectures, sqlite-utils 3.7, nearly there with Datasette 0.57
3rd June 2021
This week I learned a whole bunch about using Docker to emulate different architectures, released sqlite-utils 3.7 and made a ton of progress towards the almost-ready-to-ship Datasette 0.57.
Development environments on the Mac M1
We added another full-time developer to the VIAL project this week, which came with a new requirement: we needed our development environment to work on an M1 Mac.
“Can’t be that hard”, I thought.
After several rounds of attempted fixes, I decided I needed an M1 for myself. I ended up hiring an M1 Mac Mini from MacStadium, and dove into the horror show that is getting a relatively straight-forward Django app with a small but crucial number of C dependencies (most notably psycopg2 and cryptography) running in a Docker container on an M1 machine.
Having almost beaten all of the dependencies into shape, I hit a
free(): invalid pointer error. With no idea how to even start debugging that, I wrote up what I had learned so far and declared defeat on Twitter.
the possibly-good news is that you can emulate aarch64-linux on other platforms Docker runs on
... said Mo McRoberts. And it turns out you can!
The short version: I’m using Docker Compose, and it turned out all I had to do was add
platform: linux/amd64 to my
docker-compose.yml file and Docker built and ran my container under that architecture, using QEMU as an emulation layer. And just like that, everything on the M1 started working!
Performance isn’t brilliant, but it works. In the longer run it would be great to get the whole environment running on the native architecture for the M1, but it’s great to have found an escape hatch.
More QEMU: testing Datasette on different architectures
It successfully builds and publishes images, which is cool... but what’s a good way to test them?
--platform option to Docker holds the key. Here’s a one-liner that runs the Datasette image built for
linux/s390x on my laptop, via an emulation layer:
% docker run --platform linux/s390x \ -v `pwd`:/mnt -p 8001:8001 \ datasetteproject/datasette:0.57a1 datasette -p 8001 -h 0.0.0.0
And here’s a recipe that confirms the platform by outputting the result of Python’s
% docker run -it --platform linux/s390x -v `pwd`:/mnt \ -p 8001:8001 datasetteproject/datasette:0.57a1 \ python -c 'import platform; print(platform.uname())' uname_result( system='Linux', node='d14916ca91df', release='4.19.121-linuxkit', version='#1 SMP Thu Jan 21 15:36:34 UTC 2021', machine='s390x' )
Ideally I’d like to run the full Datasette test suite inside the container. That requires some extra dependencies, plus mounting the Datasette root folder (complete with tests) inside the container. Here’s a hacky way to do that:
docker run -it -v `pwd`:/mnt --platform linux/amd64 \ datasetteproject/datasette:0.57a1 bash -c ' pip install "pytest>=5.2.2,<6.3.0" \ "pytest-xdist>=2.2.1,<2.3" \ "pytest-asyncio>=0.10,<0.16" \ "beautifulsoup4>=4.8.1,<4.10.0" \ "black==21.5b1" \ "pytest-timeout>=1.4.2,<1.5" \ "trustme>=0.7,<0.8" \ && cd /mnt && pytest'
Before I release 0.57 I want to use this trick to make sure I’m only building official Docker containers for platforms on which the test suite passes.
This was an absolutely classic case of yak shaving. I noticed that exporting CSV data from covid-19.datasettes.com/covid/ny_times_us_counties—now with 1,378,190 rows—was running really slowly.
That page is sorted by date, descending—and Datasette large exports work using keyset pagination, which means they execute 1,300 SQL queries (1000 rows per query) ordered by date, descending with a filter and a limit.
There was an index on
date but it was in ascending order. SQLite also supports descending order indexes, and a micro-benchmark suggested that this could speed things up—each query could take 10ms instead of ~200ms.
So I needed to teach sqlite-utils to create descending order indexes. And then since I had a bunch of functionality accumulated since version 3.6 back in February, I cleaned those up, fixed an additional bug and shipped a 3.7 release.
Then when I applied the fix to my
covid-19 project it made almost no difference to the performance at all! It turned out I had been shaving entirely the wrong yak.
The real problem was that each page of results was unneccessarily calculating facets, suggested facets and a full table count—potentially a thousand times when returning a million streaming rows. Which leads me to...
Progress towards Datasette 0.57
This was the yak I needed to shave all along. Some highlights from the past 8 days:
- I landed ?col= and ?_nocol=—you can now select exactly which columns you would like to see on the table page (and in the CSV and JSON exports).
- A new
?_facet_size=maxoption (#1337), which is now linked to from the … shown at the bottom of a truncated list of facet results. This is particularly useful for things like lists of states—50 gets truncated to 30, but the new link lets you see all 50 in one place.
- You can now opt-out of both facets and table counts with the new ?_nofacet=1 and ?_nocount=1 options—both of which are used to greatly speed up large CSV exports.
- And a fix for an issue first opened in November 2019—if your SQL query throws an error, Datasette now shows you the error in context with the SQL query so you can edit it without having to hit “back” to recover it! This may seem like a small thing but it was a surprisingly fiddly fix—I’m so glad to finally have it done though.
I hope to release 0.57 final later this week. In the meantime some of the above is available in the 0.57a1 alpha.
TIL this week
- Turning an array of arrays into objects with jq
- Running Docker on an M1 Mac
- Finding CSV files that start with a BOM using ripgrep
Releases this week
sqlite-utils: 3.7—(74 releases total)—2021-05-29
Python CLI utility and library for manipulating SQLite databases
datasette: 0.57a1—(87 releases total)—2021-05-27
An open source multi-tool for exploring and publishing data
django-sql-dashboard: 0.15—(29 releases total)—2021-05-25
Django app for building dashboards using raw SQL queries
More recent articles
- Weeknotes: datasette-enrichments, datasette-comments, sqlite-chronicle - 8th December 2023
- Datasette Enrichments: a new plugin framework for augmenting your data - 1st December 2023
- llamafile is the new best way to run a LLM on your own computer - 29th November 2023
- Prompt injection explained, November 2023 edition - 27th November 2023
- I'm on the Newsroom Robots podcast, with thoughts on the OpenAI board - 25th November 2023
- Weeknotes: DevDay, GitHub Universe, OpenAI chaos - 22nd November 2023
- Deciphering clues in a news article to understand how it was reported - 22nd November 2023
- Exploring GPTs: ChatGPT in a trench coat? - 15th November 2023
- Financial sustainability for open source projects at GitHub Universe - 10th November 2023
- ospeak: a CLI tool for speaking text in the terminal via OpenAI - 7th November 2023