Simon Willison’s Weblog

Weeknotes: Tildes not dashes, and the big refactor

19th March 2022

After last week’s shot-scraper distractions with Playwright, this week I finally managed to make some concrete progress on the path towards Datasette 1.0.

shot-scraper for scraping, and GitHub template repository hacks

I did invest some time in shot-scraper this week, which I wrote about on this blog in detail earlier:

  • Scraping web pages from the command line with shot-scraper describes the new shot-scraper javascript URL script command, which lets you fire up a web page in a headless browser, execute some custom JavaScript against it to extract information and return that information as JSON to the command-line. This is a really cool trick! It turns shot-scraper into more than just a screenshotting tool: it now doubles up as the command line scraping tool I’ve been wanting for years.
  • Instantly create a GitHub repository to take screenshots of a web page describes shot-scraper-template—a GitHub repository template I created that lets people easily create a new repository that takes screenshots of a web page using shot-scraper running in GitHub Actions.

That GitHub repository trick really took off. Searching for shot-scraper-template -user:simonw path:.github/workflows on GitHub now returns 50 repositories that people other-than-me have created using the template!

It also caused me to revisit my template repositories for generating Python Click apps, Python libraries and Datasette plugins: all three of those now use a new pattern which avoids the user having to manually rename a folder in order to enable the GitHub Actions workflows—details here.

The big refactor, finally under way

The remaining work to do for Datasette 1.0 concerns stability. I want to make sure the JSON API and the plugin hooks are in a state where I can keep them stable until Datasette 2.0—which with any luck won’t ever need to happen.

I intend 1.0 as a promise that Datasette is a rock-solid foundation on which people can build their own APIs, sites and plugins.

This also means I need to refactor some of the cruft. Work on the 1.0 API has been held up because it depends on some of the most complex code in the system, some of which has evolved in some pretty ugly ways over the past few years.

The two most convoluted aspects of Datasette’s codebase dealt with the following:

  • Handling the difference between /database/t1.json where the table is called t1 and /database/t1.json where the table is called t1.json (a valid SQLite table name)—and handling database tables with / as part of their name. I wrote more about that when I described dash encoding.
  • Hashed URL mode—an optional performance optimization where Datasette rewrites the URLs to a database to incorporate part of the SHA-256 hash of the database contents, so it can use far-future cache expire headers.

This week, I solved both of these!

Tilde encoding, not dash encoding

In Why I invented “dash encoding”, a new encoding scheme for URL paths, I confidently described a new approach to encoding values in a URL. It was based on the unfortunate fact that URL encoding in the path of a URL can’t be used reliably, due to long-standing unfixable bugs in a large number of widely used reverse proxies.

My original dash encoding scheme worked like this:

  • /foo/bar encoded to -/foo-/bar
  • table.csv encoded to table-.csv
  • foo-bar encoded to foo--bar
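Here is a minimal sketch of that first scheme in Python. The function names are mine, and the set of escaped characters (`-`, `/` and `.`) is inferred from the three examples above rather than copied from Datasette, so treat it as an illustration only.

```python
def dash_encode_v1(value: str) -> str:
    """Prefix each '-', '/' and '.' with a dash; leave everything else alone."""
    # Assumption: only these three characters are escaped.
    return "".join("-" + ch if ch in "-/." else ch for ch in value)


def dash_decode_v1(value: str) -> str:
    """Reverse dash_encode_v1: a dash escapes whatever character follows it."""
    out = []
    chars = iter(value)
    for ch in chars:
        # After a dash, take the next character literally.
        out.append(next(chars, "") if ch == "-" else ch)
    return "".join(out)


print(dash_encode_v1("/foo/bar"))   # -/foo-/bar
print(dash_encode_v1("table.csv"))  # table-.csv
print(dash_encode_v1("foo-bar"))    # foo--bar
print(dash_encode_v1("50%"))        # 50% (the percent sign passes straight through)
```

The last line shows the weakness: a `%` in a table name comes out of the encoder untouched.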

Then I hit a huge problem with it: this encoding scheme does nothing special with the % character. But... it turns out the % character can’t be relied on in a URL since there’s a chance it may be mangled by one of the aforementioned misbehaving proxies.

Using -% doesn’t help because that still has an unsafe percent character in it.

Then glyph suggested this:

Have you considered replacing % with some other character and then using percent-encoding?

So I invented dash encoding v2, which worked exactly the same as URL percent encoding but used the - character instead of the %.

  • /foo/bar encodes to -2Ffoo-2Fbar
  • -/db-/table.csv encodes to -2D-2Fdb-2D-2Ftable-2Ecsv

I was pretty confident this would work... until I started rolling out the new Datasette main branch to some of my deployed Datasette instances. That’s when I realized that - is a VERY common character in existing installations, and escaping it was actually pretty ugly.

The Datasette website uses a database called dogsheep-index, which got renamed to dogsheep-2Dindex and broke the search page.

More importantly, replacing - in a name with -2D is just really ugly. Surely I can do better than that?

I consulted the URI RFC and was delighted to find this list of unreserved characters:

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

And so Tilde encoding was born! Same exact idea: percent encoding, but use a different character. In this case ~.

  • dogsheep-beta encodes to dogsheep-beta
  • /foo/bar encodes to ~2Ffoo~2Fbar
  • -/db-/table.csv encodes to -~2Fdb-~2Ftable~2Ecsv
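As a rough sketch, the idea fits in a few lines of Python. This is not the code that landed on Datasette main: the function names are hypothetical, and the safe set (ASCII letters, digits, underscore and hyphen) is an assumption chosen to reproduce the examples above.

```python
from urllib.parse import unquote

# Assumption: these bytes pass through unescaped; everything else becomes ~XX.
SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-"
)


def encode_tilde(value: str) -> str:
    """Percent-style encoding of the UTF-8 bytes, using ~ in place of %."""
    return "".join(
        chr(b) if b in SAFE else "~{:02X}".format(b)
        for b in value.encode("utf-8")
    )


def decode_tilde(value: str) -> str:
    """Reverse encode_tilde by swapping ~ back to % and reusing unquote()."""
    # Protect any literal % first, so only ~XX sequences get decoded.
    protected = value.replace("%", "%25")
    return unquote(protected.replace("~", "%"))


print(encode_tilde("dogsheep-beta"))    # dogsheep-beta
print(encode_tilde("/foo/bar"))         # ~2Ffoo~2Fbar
print(encode_tilde("-/db-/table.csv"))  # -~2Fdb-~2Ftable~2Ecsv
print(encode_tilde("t1.json"))          # t1~2Ejson
```

Under this sketch a table called t1.json shows up in the path as t1~2Ejson, so a trailing .json can only be a format extension. That takes care of the t1 versus t1.json ambiguity described earlier.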

This works great. It doesn’t require encoding hyphens, so it results in prettier URLs. So it’s now implemented on Datasette main, ready for the next release.

Removing Hashed URL mode

I wrote about this three years ago as one of the interesting ideas in Datasette. The key idea was inspired by CSS and JavaScript asset rewriting: if your database never changes, you can rewrite the JSON URLs to /database-c9e67c4 and send far-future cache expiry headers with every response, caching them in both browsers and CDNs.
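To make that concrete, here is a small sketch of how such a path could be derived. The function name, the seven-character truncation (matching the length of c9e67c4) and the one-year `Cache-Control` value are my assumptions for illustration, not necessarily what Datasette used.

```python
import hashlib


def hashed_database_path(name: str, contents: bytes, length: int = 7) -> str:
    """Build a URL path that embeds part of the SHA-256 hash of the database."""
    digest = hashlib.sha256(contents).hexdigest()
    return "/{}-{}".format(name, digest[:length])


# Assumption: a one-year max-age stands in for "far-future" here.
FAR_FUTURE_HEADERS = {"Cache-Control": "max-age=31536000"}

# In practice the bytes would come from reading the database file.
path = hashed_database_path("database", b"example database bytes")
print(path, FAR_FUTURE_HEADERS)
```

Any change to the contents produces a different path, which is what makes it safe for browsers and CDNs to cache the old one indefinitely.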

I liked the idea so much I made Datasette do it by default!

In Datasette 0.28 I changed my mind. Datasette grew the option to serve databases that could change while the server was running, at which point that optimization stopped making sense. I made it an option, controlled by a new hash_urls setting.

A couple of years later, I realized I hadn’t chosen to use that option myself in any of my own projects. I’d also really started to grate at the additional complexity the feature brought to the codebase.

I set myself a task to reconsider it before 1.0. This week, I found what I think is the right solution: I extracted the functionality out into a separate plugin, datasette-hashed-urls.

The key to building the plugin was realizing that, if the database is immutable, I can handle the URL rewriting simply by renaming the database to include its hash when the server first starts up.

The rest of the plugin implementation then handles redirects, for when the database contents have changed and the old hashed URLs need to be redirected to the new ones.
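The redirect logic can be sketched as a pure function. This is a hypothetical illustration rather than the plugin's actual code: `redirect_target`, the `current_hashes` mapping and the seven-character hex hash are all assumptions.

```python
import re
from typing import Dict, Optional


def redirect_target(path: str, current_hashes: Dict[str, str]) -> Optional[str]:
    """Return the path to redirect to, or None if no redirect is needed.

    current_hashes maps each database's base name to its current short hash.
    """
    for name, current in current_hashes.items():
        # Assumption: the hash is seven lowercase hex characters.
        pattern = "^/" + re.escape(name) + "-([0-9a-f]{7})(/.*)?$"
        match = re.match(pattern, path)
        if match and match.group(1) != current:
            # Stale hash: keep the rest of the path, swap in the new hash.
            return "/" + name + "-" + current + (match.group(2) or "")
    return None


hashes = {"fixtures": "def5678"}
print(redirect_target("/fixtures-abc1234/facetable.json", hashes))
print(redirect_target("/fixtures-def5678/facetable.json", hashes))
```

The first call returns the same path with the current hash swapped in. The second returns `None`, because that URL already points at the current contents.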

Having built the plugin, I removed the implementation from core in issue 1661. This resulted in some sizable, satisfying code deletion, further convincing me that this was the right decision for the project.

Creeping closer to 1.0

If you want to follow the progress towards the first stable release, the Datasette 1.0 milestone is the place to look.

I’m determined to make significant progress this month, with the goal of shipping an alpha before March turns into April.

Releases this week

TIL this week