Simon Willison’s Weblog

Weeknotes: Page caching and custom templates for Datasette Cloud

7th January 2024

My main development focus this week has been adding public page caching to Datasette Cloud, and exploring what custom template support might look like for that service.

Datasette Cloud primarily provides private “spaces” for teams to collaborate on data. A team can invite additional members, upload CSV files, use the API to ingest data, run enrichments, share private comments and browse and query the data together.

The overall goal is to help teams find stories in their data.

Originally I planned Datasette Cloud as an exclusively private collaboration space, but with hindsight this was a mistake. Datasette has been a tool for publishing data right from the start, and Datasette Cloud users quickly started asking for ways to share their data with the world.

I started with a plugin for this, datasette-public, allowing tables to be selectively made visible to unauthenticated users.

This raised a couple of challenges though. First, I worry about sudden spikes of traffic. Each Datasette Cloud user gets their own dedicated Fly container to ensure performance issues are isolated and don’t affect other users, but I still don’t like the idea of a big public traffic spike taking down a user’s site.

Secondly, some users expressed interest in customizing the display of their public Datasette instance. The open source Datasette application has extensive support for this, but allowing users to run arbitrary HTML and JavaScript on a hosted service is a major risk for XSS holes.

This week I’ve been exploring a way to address both of these issues.

Full page caching for unauthenticated users

I’ve used this trick multiple times through my career—at Lanyrd, at Eventbrite and even for my own personal blog. If a user is signed out, serve them pages through a simple full-page cache—something like Varnish. Set a short TTL on that cache—maybe as short as 15s—such that cached content doesn’t have time to go stale.
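A minimal sketch of how a backend might tell a cache such as Varnish which pages are safe to store. The function name is hypothetical, and treating a cookie called ds_actor as the sign of a signed-in user is an assumption made for this example, not a description of how Datasette Cloud does it.

```python
def cache_control_header(cookies: dict[str, str], ttl_seconds: int = 15) -> str:
    """Return a Cache-Control value for a response.

    Assumption: a request counts as signed in if it carries a cookie
    named "ds_actor".
    """
    if "ds_actor" in cookies:
        # Signed-in pages may contain private data, so shared caches
        # must never store them.
        return "private, no-store"
    # Signed-out pages may be held by a shared cache for a short time.
    return f"public, s-maxage={ttl_seconds}"
```

The s-maxage directive applies only to shared caches, so browsers are not asked to hold on to the page themselves.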

Good caches include support for dog-pile prevention, also known as request coalescing. If 10 requests come in for the same page at exactly the same moment, the cache bundles them together and makes just a single request to the backend, then serves the result to all 10 waiting clients.
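Request coalescing can be illustrated with a small in-memory cache built on asyncio. This is a toy sketch of the idea rather than how Varnish or Cloudflare implement it; the class and function names are made up for the example.

```python
import asyncio
import time


class CoalescingCache:
    """Tiny in-memory full-page cache with a TTL and request coalescing."""

    def __init__(self, fetch, ttl=15.0):
        self.fetch = fetch      # async function: path -> page body
        self.ttl = ttl          # seconds a cached page stays fresh
        self._pages = {}        # path -> (expires_at, body)
        self._in_flight = {}    # path -> Task for a backend request in progress

    async def get(self, path):
        cached = self._pages.get(path)
        if cached and cached[0] > time.monotonic():
            return cached[1]
        task = self._in_flight.get(path)
        if task is None:
            # First request for this path: start the single backend fetch.
            task = asyncio.create_task(self._fill(path))
            self._in_flight[path] = task
        # Concurrent requests for the same path wait on the same task.
        return await task

    async def _fill(self, path):
        try:
            body = await self.fetch(path)
            self._pages[path] = (time.monotonic() + self.ttl, body)
            return body
        finally:
            self._in_flight.pop(path, None)


async def demo():
    calls = 0

    async def slow_backend(path):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)  # simulate a slow page render
        return f"<h1>{path}</h1>"

    cache = CoalescingCache(slow_backend)
    bodies = await asyncio.gather(*(cache.get("/data") for _ in range(10)))
    return calls, bodies
```

Running asyncio.run(demo()) sends ten simultaneous requests for the same path through the cache, and the backend function is called only once.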

How to implement this for Datasette Cloud? My current plan is to use a separate domain, *.datasette.site, for the publicly visible pages of each site. So simon.datasette.cloud (my personal Datasette Cloud space) would have simon.datasette.site as its public domain.
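The mapping between the two domains is mechanical, as this small sketch shows. The helper name and the validation rules are assumptions for illustration only.

```python
PRIVATE_SUFFIX = ".datasette.cloud"
PUBLIC_SUFFIX = ".datasette.site"


def public_host_for(private_host: str) -> str:
    """Map a private space hostname to its public counterpart."""
    host = private_host.lower().rstrip(".")
    if not host.endswith(PRIVATE_SUFFIX):
        raise ValueError(f"Not a private space host: {private_host!r}")
    space = host[: -len(PRIVATE_SUFFIX)]
    # Assumption: a space name is a single DNS label, e.g. "simon".
    if not space or "." in space:
        raise ValueError(f"Unexpected space name in {private_host!r}")
    return space + PUBLIC_SUFFIX
```

For example, public_host_for("simon.datasette.cloud") returns "simon.datasette.site".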

I got this working as a proof-of-concept this week. I actually got it working twice: I figured out how to run a dedicated Varnish instance on Fly, and then I realized that Cloudflare also now offer wildcard DNS support so I tried that out too.

I have both mechanisms up and running at the moment, on two separate domains. I’ll likely go with the Cloudflare option to reduce the number of moving parts I’m responsible for myself, but having both means I can compare them to see which one is likely to work best.

Custom templates based on host

The other reason I decided to explore *.datasette.site was the security issue I mentioned earlier.

XSS attacks, where malicious JavaScript executes on a trusted domain, are a major security risk.

I plan to explore additional layers of protection against these such as CSP headers, but my general rule is to NEVER allow even a chance of untrusted JavaScript executing on a domain where authenticated users are able to perform privileged actions.

My current plan is to have *.datasette.site work as an entirely cookie-free domain. Any functionality that requires authentication will be handled by the privileged *.datasette.cloud domain instead.
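One way to picture the cookie-free rule is a filter that drops any Set-Cookie header before a response leaves the public domain. This is a hypothetical sketch, not the mechanism Datasette Cloud uses; the function name and the header representation (a list of name/value pairs) are assumptions.

```python
PUBLIC_SUFFIX = ".datasette.site"


def public_safe_headers(host: str, headers: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop Set-Cookie headers from responses served on the public domain."""
    if not host.lower().endswith(PUBLIC_SUFFIX):
        # Privileged *.datasette.cloud responses keep their cookies.
        return headers
    return [(name, value) for name, value in headers if name.lower() != "set-cookie"]
```

With no cookies ever set on the public domain, a script running there has no session credentials to steal or abuse.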

This means I can allow users to provide their own custom templates for their public Datasette instance, without worrying that any mistakes in those templates could lead to a security breach elsewhere within the service.

There was just one catch: this meant I needed Datasette to be able to use different templates depending on the host that the content was being served on.

After wasting a bunch of time trying to get this to work through monkey-patching, I realized the solution was to add a new plugin hook. jinja2_environment_from_request(datasette, request, env) is now implemented on main and should be out in a new alpha release pretty soon. The documentation for that hook includes an example that hints at how I’m using it for Datasette Cloud.
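The core of a host-based template switch can be sketched with Jinja alone. The helper below is hypothetical and leaves out the Datasette plugin wiring so that it runs on its own; inside a real plugin, the hook would presumably call something like it with request.host. The in-memory DictLoader templates stand in for template directories on disk.

```python
from jinja2 import ChoiceLoader, DictLoader, Environment


def environment_for_host(env, host, loaders_by_host):
    """Return an environment whose templates depend on the request host."""
    custom_loader = loaders_by_host.get(host)
    if custom_loader is None:
        return env
    # overlay() builds a derived environment, so the default environment
    # stays untouched for every other host. ChoiceLoader tries the custom
    # templates first and falls back to the defaults.
    return env.overlay(loader=ChoiceLoader([custom_loader, env.loader]))


# Stand-in templates for the example.
default_env = Environment(
    loader=DictLoader({"index.html": "Default index", "table.html": "Default table"})
)
loaders_by_host = {
    "simon.datasette.site": DictLoader({"index.html": "Simon's custom index"}),
}
```

A request for simon.datasette.site then renders the custom index.html while still falling back to the default table.html, and any other host gets the unmodified environment.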

Fun further applications of this pattern

I’m wary of adding features to Datasette that only serve Datasette Cloud. In this case, I realized that the new plugin hook opens up some interesting possibilities for other users of Datasette.

I run a bunch of projects on top of Datasette myself—til.simonwillison.net and www.niche-museums.com are two examples of my sites that are actually templated Datasette instances.

Currently, those sites are hosted separately—which means I’m paying to run Datasette multiple times.

With the ability to serve different templates based on host, I’ve realized I could instead serve a single Datasette instance for multiple sites, each with their own custom templates.

Taking advantage of CNAMEs—or even wildcard DNS—means I could run a whole family of weird personal projects on a single instance without any incremental cost for each new project!
