Weeknotes: Tildes not dashes, and the big refactor
After last week’s shot-scraper distractions with Playwright, this week I finally managed to make some concrete progress on the path towards Datasette 1.0.
shot-scraper for scraping, and GitHub template repository hacks
I did invest some time in shot-scraper this week, which I wrote about on this blog in detail earlier:
Scraping web pages from the command line with shot-scraper describes the new feature that turns shot-scraper into more than just a screenshotting tool: it now doubles up as the command line scraping tool I’ve been wanting for years.
Instantly create a GitHub repository to take screenshots of a web page describes shot-scraper-template, a GitHub repository template I created that lets people easily create a new repository that takes screenshots of a web page using shot-scraper running in GitHub Actions.
That GitHub repository trick really took off. Searching for shot-scraper-template -user:simonw path:.github/workflows on GitHub now returns 50 repositories that people other than me have created using the template!
It also caused me to revisit my template repositories for generating Python Click apps, Python libraries and Datasette plugins: all three of those now use a new pattern which avoids the user having to manually rename a folder in order to enable the GitHub Actions workflows—details here.
The big refactor, finally under way
The remaining work to do for Datasette 1.0 concerns stability. I want to make sure the JSON API and the plugin hooks are in a state where I can keep them stable until Datasette 2.0—which with any luck won’t ever need to happen.
I intend 1.0 as a promise that Datasette is a rock-solid foundation on which people can build their own APIs, sites and plugins.
This also means I need to refactor some of the cruft. Work on the 1.0 API has been held up because it depends on some of the most complex code in the system, some of which has evolved in some pretty ugly ways over the past few years.
The two most convoluted aspects of Datasette’s codebase dealt with the following:
- Handling the difference between /database/t1.json where the table is called t1 and /database/t1.json where the table is called t1.json (a valid SQLite table name), and handling database tables with / as part of their name. I wrote more about that when I described dash encoding.
- Hashed URL mode—an optional performance optimization where Datasette rewrites the URLs to a database to incorporate part of the SHA-256 hash of the database contents, so it can use far-future cache expire headers.
This week, I solved both of these!
Tilde encoding, not dash encoding
In Why I invented “dash encoding”, a new encoding scheme for URL paths I confidently described a new approach to encoding values in a URL, based on the unfortunate fact that it turns out URL encoding in the path of a URL can’t be used reliably due to long-standing unfixable bugs in a large number of widely used reverse proxies.
My original dash encoding scheme worked like this:
Then I hit a huge problem with it: this encoding scheme does nothing special with the % character. But... it turns out the % character can’t be relied on in a URL since there’s a chance it may be mangled by one of the aforementioned misbehaving proxies. Escaping it as -% doesn’t help because that still has an unsafe percent character in it.
Then glyph suggested this:
Have you considered replacing % with some other character and then using percent-encoding?
So I invented dash encoding v2, which worked exactly the same as URL percent encoding but used the - character instead of the % character.
I was pretty confident this would work... until I started rolling out the new Datasette main branch to some of my deployed Datasette instances. That’s when I realized that - is a VERY common character in existing installations, and escaping it was actually pretty ugly.
The Datasette website uses a database called dogsheep-index. This got renamed to dogsheep-2Dindex, which broke the search page.
More importantly, replacing - in a name with -2D is just really ugly. Surely I can do better than that?
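To make the problem concrete, here is a minimal sketch of the dash encoding v2 idea. This is an illustration built on urllib.parse, not the code that was on Datasette main: the function names are hypothetical, and because quote() never escapes ".", "_" or "~", those characters pass through unchanged here.

```python
from urllib.parse import quote, unquote


def dash_encode_v2(value: str) -> str:
    """Percent-encode value, then swap "%" for "-" (hypothetical helper)."""
    percent_encoded = quote(value, safe="")
    # quote() leaves "-" alone, so every "-" at this point is a literal
    # hyphen from the input. It has to be escaped (0x2D) before "-"
    # takes over the role of the escape character.
    return percent_encoded.replace("-", "-2D").replace("%", "-")


def dash_decode_v2(value: str) -> str:
    """Reverse dash_encode_v2 by restoring "%" and percent-decoding."""
    return unquote(value.replace("-", "%"))


if __name__ == "__main__":
    for name in ["dogsheep-index", "table/with/slashes"]:
        print(name, "->", dash_encode_v2(name))
```

Every hyphen in an existing database or table name has to be escaped under this scheme, which is exactly what made the resulting URLs so unattractive.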
I consulted the URI RFC and was delighted to find this list of unreserved characters:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
And so Tilde encoding was born! Same exact idea: percent encoding, but use a different character. In this case, ~.
This works great. It doesn’t require encoding hyphens, so it results in prettier URLs. So it’s now implemented on Datasette main, ready for the next release.
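Here is a rough sketch of how tilde encoding could look in Python. It is not necessarily how Datasette implements it: the function names are hypothetical, and the set of characters left unescaped (ASCII letters, digits, "_" and "-") is an assumption chosen so that ".", "/", "%" and "~" itself all get escaped.

```python
from urllib.parse import unquote

# Assumed safe set: everything else is escaped as ~XX (uppercase hex).
_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_-"
)


def tilde_encode(value: str) -> str:
    """Escape each unsafe UTF-8 byte of value as ~XX (hypothetical helper)."""
    return "".join(
        chr(byte) if byte in _SAFE else "~{:02X}".format(byte)
        for byte in value.encode("utf-8")
    )


def tilde_decode(value: str) -> str:
    """Reverse tilde_encode by reusing the standard percent-decoder."""
    # Protect any stray literal "%" first, then turn ~XX into %XX.
    return unquote(value.replace("%", "%25").replace("~", "%"))


if __name__ == "__main__":
    for name in ["dogsheep-index", "t1.json", "table/with/slashes"]:
        encoded = tilde_encode(name)
        print(name, "->", encoded, "->", tilde_decode(encoded))
```

Under these assumptions dogsheep-index is left untouched, while a table called t1.json becomes t1~2Ejson, so a trailing .json in a URL can only ever mean the format.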
Removing Hashed URL mode
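Hashed URL mode meant Datasette could serve a database at a URL containing part of the hash of its contents, such as /database-c9e67c4, and send far-future cache expiry headers with every response, caching them in both browsers and CDNs.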
I liked the idea so much I made Datasette do it by default!
In Datasette 0.28 I changed my mind. Datasette grew the option to serve databases that could change while the server was running, at which point that optimization stopped making sense. I made it an option, controlled by a new setting.
A couple of years later, I realized I hadn’t chosen to use that option myself in any of my own projects. I’d also really started to grate at the additional complexity the feature brought to the codebase.
I set myself a task to reconsider it before 1.0. This week, I found what I think is the right solution: I extracted the functionality out into a separate plugin, datasette-hashed-urls.
The key to building the plugin was realizing that, if the database is immutable, I can handle the URL rewriting simply by renaming the database to include its hash when the server first starts up.
The rest of the plugin implementation then handles redirects, for when the database contents have changed and requests to the old hashed URLs need to be redirected to the new ones.
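The two ideas above can be sketched in a few lines of Python. This is not the datasette-hashed-urls source: the helper names, the seven-character hash prefix (matching the c9e67c4 style shown earlier) and the redirect rule are all assumptions made for illustration.

```python
import hashlib
import io

HASH_LENGTH = 7  # assumed prefix length


def hash_prefix(fileobj, length=HASH_LENGTH):
    """Return the first few hex digits of the SHA-256 of a binary file object."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(1024 * 1024), b""):
        digest.update(chunk)
    return digest.hexdigest()[:length]


def hashed_name(name, fileobj):
    """Name to register an immutable database under, e.g. fixtures-<hash>."""
    return "{}-{}".format(name, hash_prefix(fileobj))


def redirect_for(request_path, name, current_name):
    """Return the path to redirect to if request_path uses a stale hash.

    Returns None when no redirect is needed (hypothetical rule).
    """
    parts = request_path.split("/")  # "/db-abc1234/t" -> ["", "db-abc1234", "t"]
    first = parts[1] if len(parts) > 1 else ""
    suffix = first[len(name) + 1:]
    looks_hashed = (
        first.startswith(name + "-")
        and len(suffix) == HASH_LENGTH
        and all(c in "0123456789abcdef" for c in suffix)
    )
    if looks_hashed and first != current_name:
        parts[1] = current_name
        return "/".join(parts)
    return None


if __name__ == "__main__":
    # Stand-in for the bytes of an immutable SQLite file.
    current = hashed_name("fixtures", io.BytesIO(b"example database bytes"))
    print(current)
    print(redirect_for("/fixtures-aaaaaaa/facetable", "fixtures", current))
```

Because the hash is computed once at startup and baked into the database name, the rest of the application never needs to know that hashed URLs exist; only the redirect check has to understand the naming convention.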
Having built the plugin, I removed the implementation from core in issue 1661. This resulted in some sizable, satisfying code deletion, further convincing me that this was the right decision for the project.
Creeping closer to 1.0
If you want to follow the progress towards the first stable release, the Datasette 1.0 milestone is the place to look.
I’m determined to make significant progress this month, with the goal of shipping an alpha before March turns into April.
Releases this week
datasette-hashed-urls: 0.2—(2 releases total)—2022-03-16
Optimize Datasette performance behind a caching proxy
datasette-publish-vercel: 0.12.1—(19 releases total)—2022-03-15
Datasette plugin for publishing data using Vercel
shot-scraper: 0.9—(10 releases total)—2022-03-14
Tools for taking automated screenshots of websites
TIL this week