Weeknotes: Tildes not dashes, and the big refactor
19th March 2022
After last week’s shot-scraper distractions with Playwright, this week I finally managed to make some concrete progress on the path towards Datasette 1.0.
shot-scraper for scraping, and GitHub template repository hacks
I did invest some time in shot-scraper this week, which I wrote about on this blog in detail earlier:
- Scraping web pages from the command line with shot-scraper describes the new shot-scraper javascript URL script command, which lets you fire up a web page in a headless browser, execute some custom JavaScript against it to extract information and return that information as JSON to the command line. This is a really cool trick! It turns shot-scraper into more than just a screenshotting tool: it now doubles up as the command-line scraping tool I’ve been wanting for years.
- Instantly create a GitHub repository to take screenshots of a web page describes shot-scraper-template, a GitHub repository template I created that lets people easily create a new repository that takes screenshots of a web page using shot-scraper running in GitHub Actions.
That GitHub repository trick really took off. Searching for shot-scraper-template -user:simonw path:.github/workflows on GitHub now returns 50 repositories that people other-than-me have created using the template!
It also caused me to revisit my template repositories for generating Python Click apps, Python libraries and Datasette plugins: all three of those now use a new pattern which avoids the user having to manually rename a folder in order to enable the GitHub Actions workflows—details here.
The big refactor, finally under way
The remaining work to do for Datasette 1.0 concerns stability. I want to make sure the JSON API and the plugin hooks are in a state where I can keep them stable until Datasette 2.0—which with any luck won’t ever need to happen.
I intend 1.0 as a promise that Datasette is a rock-solid foundation on which people can build their own APIs, sites and plugins.
This also means I need to refactor some of the cruft. Work on the 1.0 API has been held up because it depends on some of the most complex code in the system, some of which has evolved in some pretty ugly ways over the past few years.
The two most convoluted aspects of Datasette’s codebase dealt with the following:
- Handling the difference between /database/t1.json where the table is called t1 and /database/t1.json where the table is called t1.json (a valid SQLite table name), and handling database tables with / as part of their name. I wrote more about that when I described dash encoding.
- Hashed URL mode: an optional performance optimization where Datasette rewrites the URLs to a database to incorporate part of the SHA-256 hash of the database contents, so it can use far-future cache expire headers.
This week, I solved both of these!
Tilde encoding, not dash encoding
In Why I invented “dash encoding”, a new encoding scheme for URL paths I confidently described a new approach to encoding values in a URL, based on the unfortunate fact that it turns out URL encoding in the path of a URL can’t be used reliably due to long-standing unfixable bugs in a large number of widely used reverse proxies.
My original dash encoding scheme worked like this:
- /foo/bar encoded to -/foo-/bar
- table.csv encoded to table-.csv
- foo-bar encoded to foo--bar
Then I hit a huge problem with it: this encoding scheme does nothing special with the % character. But... it turns out the % character can’t be relied on in a URL since there’s a chance it may be mangled by one of the aforementioned misbehaving proxies.
Using -% doesn’t help because that still has an unsafe percent character in it.
Then glyph suggested this:
Have you considered replacing % with some other character and then using percent-encoding?
So I invented dash encoding v2, which worked exactly the same as URL percent encoding but used the - character instead of the %.
- /foo/bar encodes to -2Ffoo-2Fbar
- -/db-/table.csv encodes to -2D-2Fdb-2D-2Ftable-2Ecsv
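A minimal sketch of that v2 scheme, to make the rule concrete. This is not the code that landed on Datasette main: the function name dash_encode_v2 is hypothetical, and treating only ASCII letters, digits and underscore as safe is an assumption chosen to match the two examples above.

```python
import string

# Assumption: only these bytes pass through unescaped. The hyphen is
# deliberately excluded because it is the escape character here.
_V2_SAFE = (string.ascii_letters + string.digits + "_").encode("ascii")


def dash_encode_v2(value: str) -> str:
    """Percent-encoding, but with '-' in place of '%' (hypothetical helper)."""
    out = []
    for byte in value.encode("utf-8"):
        if byte in _V2_SAFE:
            out.append(chr(byte))
        else:
            # Two uppercase hex digits, as in percent-encoding
            out.append(f"-{byte:02X}")
    return "".join(out)


print(dash_encode_v2("/foo/bar"))         # -2Ffoo-2Fbar
print(dash_encode_v2("-/db-/table.csv"))  # -2D-2Fdb-2D-2Ftable-2Ecsv
print(dash_encode_v2("my-table"))         # my-2Dtable
```

The last line shows the cost of this design: because the hyphen is the escape character, every literal hyphen has to be escaped as -2D.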
I was pretty confident this would work... until I started rolling out the new Datasette main branch to some of my deployed Datasette instances. That’s when I realized that - is a VERY common character in existing installations, and escaping it was actually pretty ugly.
The Datasette website uses a database called dogsheep-index, and this got renamed to dogsheep-2Dindex, which broke the search page.
More importantly, replacing - in a name with -2D is just really ugly. Surely I can do better than that?
I consulted the URI RFC and was delighted to find this list of unreserved characters:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
And so Tilde encoding was born! Same exact idea: percent encoding, but use a different character. In this case ~.
- dogsheep-beta encodes to dogsheep-beta
- /foo/bar encodes to ~2Ffoo~2Fbar
- -/db-/table.csv encodes to -~2Fdb-~2Ftable~2Ecsv
This works great. It doesn’t require encoding hyphens, so it results in prettier URLs. So it’s now implemented on Datasette main, ready for the next release.
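Here is a rough sketch of the idea with a matching decoder, so the round trip can be checked. It is an illustration under stated assumptions, not Datasette's implementation: the names tilde_encode_sketch and tilde_decode_sketch are hypothetical, and the safe set (ASCII letters, digits, underscore and hyphen) is inferred from the examples above. Datasette's real code may treat some characters differently.

```python
import string

# Assumption: these bytes pass through unchanged. '~' is not in the set,
# so a literal tilde is itself escaped (as ~7E) and decoding stays unambiguous.
_TILDE_SAFE = (string.ascii_letters + string.digits + "_-").encode("ascii")


def tilde_encode_sketch(value: str) -> str:
    """Percent-encoding with '~' as the escape character (hypothetical helper)."""
    return "".join(
        chr(byte) if byte in _TILDE_SAFE else f"~{byte:02X}"
        for byte in value.encode("utf-8")
    )


def tilde_decode_sketch(value: str) -> str:
    """Reverse tilde_encode_sketch: '~XX' becomes the byte with hex value XX."""
    raw = bytearray()
    i = 0
    while i < len(value):
        if value[i] == "~":
            raw.append(int(value[i + 1:i + 3], 16))
            i += 3
        else:
            raw.extend(value[i].encode("utf-8"))
            i += 1
    return raw.decode("utf-8")


for name in ["dogsheep-beta", "/foo/bar", "-/db-/table.csv"]:
    encoded = tilde_encode_sketch(name)
    assert tilde_decode_sketch(encoded) == name
    print(name, "->", encoded)
```

Because every emitted character is a letter, a digit, or one of - _ ~, the output stays within the RFC's unreserved set, so a proxy has no reason to rewrite it.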
Removing Hashed URL mode
I wrote about this three years ago as one of the interesting ideas in Datasette. The key idea was inspired by CSS and JavaScript asset rewriting: if your database never changes, you can rewrite the JSON URLs to /database-c9e67c4 and send far-future cache expiry headers with every response, caching them in both browsers and CDNs.
I liked the idea so much I made Datasette do it by default!
In Datasette 0.28 I changed my mind. Datasette grew the option to serve databases that could change while the server was running, at which point that optimization stopped making sense. I made it an option, controlled by a new hash_urls setting.
A couple of years later, I realized I hadn’t chosen to use that option myself in any of my own projects. I’d also really started to grate at the additional complexity the feature brought to the codebase.
I set myself a task to reconsider it before 1.0. This week, I found what I think is the right solution: I extracted the functionality out into a separate plugin, datasette-hashed-urls.
The key to building the plugin was realizing that, if the database is immutable, I can handle the URL rewriting simply by renaming the database to include its hash when the server first starts up.
The rest of the plugin implementation then handles redirects, for if the database has changed its contents and the old hash URLs need to be redirected to the new one.
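The two pieces described above can be sketched roughly as follows. This is not the datasette-hashed-urls source: the function names, the seven-character hash prefix (chosen to match the /database-c9e67c4 example) and the name-to-hashed-name mapping are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Optional


def hashed_name(name: str, contents: bytes, length: int = 7) -> str:
    """Build a database name that embeds a prefix of its SHA-256 hash.

    Hypothetical helper: the prefix length is an assumption.
    """
    digest = hashlib.sha256(contents).hexdigest()
    return f"{name}-{digest[:length]}"


def redirect_for(path: str, current: Dict[str, str]) -> Optional[str]:
    """Return a redirect target if `path` uses a stale hash, else None.

    `current` maps each base database name to its current hashed name.
    """
    first, _, rest = path.lstrip("/").partition("/")
    base, sep, _old_hash = first.rpartition("-")
    if not sep or base not in current:
        return None  # not a hashed database URL this sketch knows about
    if first == current[base]:
        return None  # hash is already current, serve as normal
    return "/" + current[base] + ("/" + rest if rest else "")


# Computed once at startup, since an immutable database cannot change later
names = {"fixtures": hashed_name("fixtures", b"hello")}
print(names["fixtures"])  # fixtures-2cf24db
print(redirect_for("/fixtures-aaaaaaa/table.json", names))
# /fixtures-2cf24db/table.json
```

Splitting on the last hyphen is a simplification: a database whose own name contains a hyphen and is served without a hash would need more careful handling than this sketch gives it.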
Having built the plugin, I removed the implementation from core in issue 1661. This resulted in some sizable, satisfying code deletion, further convincing me that this was the right decision for the project.
Creeping closer to 1.0
If you want to follow the progress towards the first stable release, the Datasette 1.0 milestone is the place to look.
I’m determined to make significant progress this month, with the goal of shipping an alpha before March turns into April.
Releases this week
- datasette-hashed-urls: 0.2 (2 releases total), 2022-03-16
  Optimize Datasette performance behind a caching proxy
- datasette-publish-vercel: 0.12.1 (19 releases total), 2022-03-15
  Datasette plugin for publishing data using Vercel
- shot-scraper: 0.9 (10 releases total), 2022-03-14
  Tools for taking automated screenshots of websites