datasette-atom: Define an Atom feed using a custom SQL query

3rd December 2019

I’ve been having a ton of fun iterating on www.niche-museums.com. I put together some notes on how the site works last week, and I’ve been taking advantage of the Thanksgiving break to continue exploring ways in which Datasette can be used to quickly build database-backed static websites.

I post a new museum to the site every day, so it was inevitable that someone would ask for a feed. And here it is: an Atom feed for Niche Museums.

This means Niche Museums is effectively a blog now, which is fitting: it’s a universal truth that any sufficiently advanced backend technology will evolve to the point where it can power a blog with an Atom feed.

datasette-atom

I built the feed by wrapping up work on the first version of a new Datasette plugin: datasette-atom. It takes advantage of the register_output_renderer plugin hook, which was contributed by Russ Garrett back in May.

The idea with the plugin is to make it possible to construct an Atom feed from an arbitrary SQL query.

This is a really powerful ability. It means that a user with sufficent knowledge of SQL can subscribe to an arbitrary feed of data from any Datasette instance that is running the plugin.

Defining an Atom feed with a SQL query

The plugin works by requiring you to provide a SQL query that produces the following columns in its output:

atom_id, atom_title and atom_updated.

These correspond to the required entry elements defined by the Atom specification.

The plugin can then render the results of the query as an Atom feed.

You can also produce an atom_link column, which will become a link.

And finally, you can produce either an atom_content column which will be treated as text and used as the feed entry body, or an atom_content_html column which will be treated as HTML.

(The HTML from atom_content_html is sanitized through Mozilla’s Bleach library to ensure the plugin doesn’t act as an XSS vector.)

This means we can define a custom Atom feed by crafting a SQL query! Here’s the query I’m using on the Niche Museums website:

select
  'tag:niche-museums.com,' || substr(created, 0, 11) || ':' || id as atom_id,
  name as atom_title,
  created as atom_updated,
  'https://www.niche-museums.com/browse/museums/' || id as atom_link,
  coalesce(
    '<img src="' || photo_url || '?w=800&amp;h=400&amp;fit=crop&amp;auto=compress">',
    ''
  ) || '<p>' || description || '</p>' as atom_content_html
from
  museums
order by
  created desc
limit
  15

I’m using a couple of extra tricks here.

The atom_id is defined as a tag:uri following this advice from Mark Pilgrim—since created is an ISO 8601 timestamp substr(created, 0, 11) returns the YYYY-MM-DD component.

Not every museum has a photo, which means photo_url is sometimes null. In SQL, concatenating a null value to something else (using the || concatenation operator) produces another null. So this entire expression evaluates to null if photo_url is null:

'<img src="' || photo_url ||
'?w=800&amp;h=400&amp;fit=crop&amp;auto=compress">',

The coalesce(x...) function returns the first argument passed to it. So coalesce('<img ...>' || photo_url || '...', '') returns the empty string if the photo is not available.

Deriving created/updated fields from Git history

For the atom feed to work, I need an atom_updated value. This should be a timestamp representing “the last time the entry was modified in a significant way”—so it’s actually more like a created timestamp for my museums website.

My museum data is defined in a YAML file—museums.yaml—which doesn’t include created and updated timestamps. So where can I get them from?

Since the YAML file is stored in the site’s GitHub repository, I’m deriving those timestamps from the git history. I repurposed code I wrote for my PG&E outages project for this—the full script is annotate _timestamps.py in the museums repo.

It works by looping through the entire history of the museums.yaml file comparing the list of museums in each version to the previous iteration. If a museum is new (it has an ID not seen before) we use the commit date as its created date. If the JSON serialization of the museum differs from the previous version we reset its updated date.

Then at the end the script uses sqlite-utils to update each record with the derived timestamps:

# updated/created are dicts of {"id": "timestamp"}
db = sqlite_utils.Database("browse.db")
for id, ts in created.items():
    db["museums"].update(id, {
        "created": ts,
        "updated": updated[id]
    }, alter=True)

The alter=True parameter to the .update() method causes sqlite-utils to automatically add any missing columns that are referenced in the update.

Defining a feed as a canned query

Now that we’ve defined the feed as a SQL query, we can assign it a more pleasing URL using Datasette’s canned queries feature.

I encoded the query as a JSON string using JSON Escape Text, then added it to the metadata.json configuration file for Niche Museums. I named the query feed, resulting in a URL of www.niche-museums.com/browse/feed.

There’s just one catch: Atom feeds should have a name. As a quick and nasty hack I allow the name to be set using ?_feed_name=Niche+Museums. I have an open issue to come up with a less nasty way of defining this.

Also this week

I added a simple search engine to Niche Museums! Here’s an example search for “model”: www.niche-museums.com/browse/search?q=model.

I used sqlite-utils to configure SQLite FTS (here’s the line that calls it in the CI build script), defined another canned query in metadata.json that executes the query and built a custom template to render the results page.

I added press coverage to Niche Museums. Many of the listings now link to articles in the local media about them.

Museum pages now link to other nearby museums. Here’s the commit that implemented that feature—it works by embedding a new SQL query in the template page, using datasette-template-sql.

I identified a bug in yaml-to-sqlite and shipped a fix in version 0.3.

I added the following seven museums to the site:

The Comic Rock Star’s Toilet Seat Museum in San Francisco
Fabergé Museum, Baden-Baden in Baden-Baden, Germany
jAdis in Santa Monica
Lundy in the Bristol Channel
Neon Works in Oakland
Maison des Johnnies et de l’Oignon de Roscoff in Roscoff
Bourton-on-the-Water Model Village in the Cotswolds

Goal for the next week: make some progress on projects that aren’t related to niche museums!

Posted 3rd December 2019 at 12:20 am · Follow me on Mastodon, Bluesky, Twitter or subscribe to my newsletter

Simon Willison’s Weblog