Weeknotes: Hacking on 23 different projects
16th April 2020
I wrote a lot of code this week: 184 commits over 23 repositories! I’ve also started falling for Zeit Now v2, having found workarounds for some of my biggest problems with it.
Better Datasette on Zeit Now v2
Last week I bemoaned the loss of Zeit Now v1 and documented my initial explorations of Zeit Now v2 with respect to Datasette.
My favourite thing about Now v1 was that it ran from Dockerfiles, which gave me complete control over the versions of everything in my deployment environment.
Now v2 runs on AWS Lambda, which means you are mostly stuck with what Zeit’s flavour of Lambda gives you. This currently means Python 3.6 (not too terrible, since Datasette fully supports it) and a positively ancient SQLite: 3.7.17 from May 2013.
Lambda runs on Amazon Linux. Charles Leifer maintains a package called pysqlite3 which bundles the latest version of SQLite3 as a standalone Python package, and includes a pysqlite3-binary package precompiled for Linux. Could it work on Amazon Linux...?
It turns out it does! A one-line change (not including tests) to my datasette-publish-now and it now deploys Datasette on Now v2 with SQLite 3.31.1—the latest release from January this year, with window functions and all kinds of other goodness.
This means that Now v2 is back to being a really solid option for hosting Datasette instances. You get scale-to-zero, crazily low prices and really fast cold-boot times. It can only take databases up to around 50MB—if you need more space than that you’re better off with Cloud Run—but it’s a great option for smaller data.
I released a few versions of datasette-publish-now as a result of this research. I plan to release the first non-alpha version at the same time as Datasette 0.40.
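The pysqlite3 swap described above can be sketched roughly as follows. This is a minimal illustration of the general pattern (prefer pysqlite3 when it is installed, fall back to the standard library otherwise), not necessarily the exact change made in datasette-publish-now. The sqlite_version() helper is a hypothetical name added here to show which SQLite library ends up in use.

```python
# Prefer the SQLite bundled with pysqlite3-binary if it is installed,
# otherwise fall back to the sqlite3 module in the standard library.
try:
    import pysqlite3 as sqlite3
except ImportError:
    import sqlite3


def sqlite_version():
    """Return the version of the SQLite library actually in use.

    Hypothetical helper for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("select sqlite_version()").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    print(sqlite_version())
```

Because the rest of the code only ever refers to the name sqlite3, nothing else needs to change when the newer library is available.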
Various projects ported to Now v2 or Cloud Run
I had over 100 projects running on Now v1 that needed updating or deleting in time for that platform’s shutdown in August. I’ve been porting some of them very quickly using datasette-publish-now, but a few have been more work. Some highlights from this week:
- ftfy.now.sh, my web app that takes a string of broken unicode and figures out the sequence of transformations you can use to make sense of it (built on the incredible FTFY Python library by Robyn Speer) has been upgraded to Now v2—repo here.
- gzthermal.now.sh offers a web interface to the gzthermal gzip visualization tool, released by caveman on the encode.ru (now encode.su) forum. My repo is here.
- My crowdsourced directory of range maps of cryptozoological creatures is now running on Cloud Run (I haven’t figured out a way to run SpatiaLite on Now v2 yet).
- The datasette-sqlite-fts4.datasette.io demo instance I used for explanations in Exploring search relevance algorithms with SQLite.
- The demo instance used for datasette-jellyfish is on Now v2.
- The demo for datasette-jq had to move to Cloud Run, because I couldn’t install pyjq on Now v2.
big-local-datasette
I’ve been collaborating with the Big Local team at Stanford on a number of projects related to the Covid-19 situation. It’s not quite open to the public yet but I’ve been building a Datasette instance which shares data from the “open projects” maintained by that team.
The implementation fits a common pattern for me: a scheduled GitHub Action which fetches project data from a GraphQL API, seeks out CSV files which have changed (using HTTP HEAD requests to check their ETags), loads the CSV into SQLite tables and publishes the resulting database using datasette publish cloudrun.
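The ETag check in that pattern could look something like the sketch below. It uses only the Python standard library, and the function names (fetch_etag, find_changed) and the shape of the stored ETag mapping are assumptions made for illustration, not the code the project actually runs.

```python
import urllib.request


def fetch_etag(url):
    """Send an HTTP HEAD request and return the ETag header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("ETag")


def find_changed(urls, previous_etags, get_etag=fetch_etag):
    """Compare current ETags against those recorded on the previous run.

    Returns (changed_urls, current_etags). get_etag is injectable so the
    comparison logic can be exercised without network access.
    """
    changed = []
    current = {}
    for url in urls:
        etag = get_etag(url)
        current[url] = etag
        # A missing ETag counts as changed, since we cannot prove otherwise
        if etag is None or previous_etags.get(url) != etag:
            changed.append(url)
    return changed, current
```

Only the URLs returned in changed_urls would then need to be downloaded and loaded into SQLite, and current_etags would be saved for the next scheduled run.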
There’s one interesting new twist: I’m fetching the existing database files on every run using my new datasette-clone tool (written for this project), applying changes to them and then only publishing if the resulting MD5 sums have changed since last time.
It seems to work well, and I’m excited about this technique as a way of incrementally updating existing databases using stateless code running in a GitHub Action.
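As a rough sketch of the "only publish if the MD5 sums have changed" step, assuming the sums from the previous run are available as a dictionary keyed by file path (the function names and that storage format are hypothetical):

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def databases_to_publish(paths, previous_sums):
    """Return the paths whose MD5 differs from the sum recorded last time.

    A path with no recorded sum is treated as changed.
    """
    return [path for path in paths if md5_of_file(path) != previous_sums.get(path)]
```

If databases_to_publish() returns an empty list, the publish step can be skipped for that run.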
Datasette Cloud
I continue to work on the invite-only alpha of my SaaS Datasette platform, Datasette Cloud. This week I ported the CI and deployment scripts from GitLab to GitHub Actions, mainly to try and reduce the variety of CI systems I’m working with (I now have projects live on three: Travis, Circle CI and GitHub Actions).
I’ve also been figuring out ways of supporting API tokens for making requests to authentication-protected Datasette instances. I shipped small releases of datasette-auth-github and datasette-auth-existing-cookies to support this.
In tinkering with Datasette Cloud I also shipped an upgrade to datasette-mask-columns, which now shows visible REDACTED text on redacted columns in table view.
Miscellaneous
- My covid-19.datasettes.com project now also imports data from the LA Times.
- I added .rows_where(..., order_by="column") in release 2.6 of sqlite-utils.
- I shipped a new release of paginate-json, a tool I built primarily for paginating through the GitHub API and piping the results to sqlite-utils.
- I fixed a minor bug with Datasette’s --plugin-secret mechanism and added a CSS customization hook for the canned query page.
- I built a HEIC to JPEG converting proxy as part of my ongoing mission to eventually liberate my photos from Apple Photos and make them available to Dogsheep. In doing so I contributed usage documentation to the pyheif Python library.