Simon Willison’s Weblog

Building a sitemap.xml with a one-off Datasette plugin

One of the fun things about launching a new website is re-learning what it takes to promote a website from scratch on the modern web. I’ve been thoroughly enjoying using Niche Museums as an excuse to explore 2020-era SEO.

I used to use Google Webmaster Tools for this, but apparently that got rebranded as Google Search Console back in May 2015. It’s really useful. It shows which search terms got impressions, which ones got clicks and lets you review which of your pages are indexed and which have returned errors.

Niche Museums has been live since October 24th, but it was a SPA for the first month. I switched it to server-side rendering (with separate pages for each museum) on November 25th. The Google Search Console shows it first appeared in search results on 2nd December.

So far, I’ve had 35 clicks! Not exactly earth-shattering, but every site has to start somewhere.

Screenshot of the Google Search Console.

In a bid to increase the number of indexed pages, I decided to build a sitemap.xml. This probably isn’t necessary—Google advise that you might not need one if your site is “small”, defined as 500 pages or less (Niche Museums lists 88 museums, though it’s still increasing by one every day). It’s nice to be able to view that sitemap and confirm that those pages have all been indexed inside the Search Console though.

Since Niche Museums is entirely powered by a customized Datasette instance, I needed to figure out how best to build that sitemap.

One-off plugins

Datasette’s most powerful customization options are provided by the plugins mechanism. Back in June I ported Datasette to ASGI, and the subsequent release of Datasette 0.29 introduced a new asgi_wrapper plugin hook. This hook makes it possible to intercept requests and implement an entirely custom response—ideal for serving up a /sitemap.xml page.

I considered building and releasing a generic datasette-sitemap plugin that could be used anywhere, but that felt like overkill for this particular problem. Instead, I decided to take advantage of the --plugins-dir= Datasette option to build a one-off custom plugin for the site.

The Datasette instance that runs Niche Museums starts up like this:

$ datasette browse.db about.db \
    --template-dir=templates/ \
    --plugins-dir=plugins/ \
    --static css:static/ \
    -m metadata.json

This serves the two SQLite database files, loads custom templates from the templates/ directory, mounts the static/ directory at /css/ (so www.niche-museums.com/css/museums.css is served from static/museums.css) and loads metadata settings from metadata.json. All of these files are on GitHub.

It also tells Datasette to look for any Python files in the plugins/ directory and load those up as plugins.
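As a rough illustration of the idea (this is not Datasette's actual loader), importing every Python file in a directory as a module can be sketched with the standard library's importlib. The load_modules_from_dir function and the throwaway sitemap.py file it loads are made up for this example, and the sketch skips the step of registering plugin hooks.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_modules_from_dir(plugins_dir):
    """Import every *.py file in plugins_dir and return {name: module}.

    Hypothetical helper for illustration only. A real plugin system
    would also need to register the hooks each module defines.
    """
    modules = {}
    for path in sorted(Path(plugins_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        modules[path.stem] = module
    return modules


# Demo: write a throwaway "plugin" file into a temp directory and load it
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sitemap.py").write_text("PLUGIN_NAME = 'sitemap'\n")
    loaded = load_modules_from_dir(tmp)
    plugin_names = sorted(loaded)
    plugin_attr = loaded["sitemap"].PLUGIN_NAME
```

The appeal of a plugins directory is exactly this: dropping a single file in place is enough, with no packaging step.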

I currently have four Python files in that directory—you can see them here. The sitemap.xml is implemented using the new sitemap.py plugin file.

Here’s the first part of that file, which wraps the Datasette ASGI app with middleware that checks for the URL /robots.txt or /sitemap.xml and returns custom content for either of them:

from datasette import hookimpl
from datasette.utils.asgi import asgi_send


@hookimpl
def asgi_wrapper(datasette):
    # Datasette calls the returned function with its own ASGI app
    def wrap_with_robots_and_sitemap(app):
        async def robots_and_sitemap(scope, receive, send):
            if scope["path"] == "/robots.txt":
                # Point crawlers at the sitemap
                await asgi_send(
                    send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
                )
            elif scope["path"] == "/sitemap.xml":
                await send_sitemap(send, datasette)
            else:
                # Everything else is handled by Datasette as usual
                await app(scope, receive, send)

        return robots_and_sitemap

    return wrap_with_robots_and_sitemap

The boilerplate here is a little convoluted, but this does the job. I’m considering adding alternative plugin hooks for custom pages that could simplify this in the future.

The asgi_wrapper(datasette) plugin function is expected to return a function which will be used to wrap the Datasette ASGI application. In this case that wrapper function is called wrap_with_robots_and_sitemap(app). Here’s the Datasette core code that builds the ASGI app and applies the wrappers:

asgi = AsgiLifespan(
    AsgiTracer(DatasetteRouter(self, routes)), on_startup=setup_db
)
for wrapper in pm.hook.asgi_wrapper(datasette=self):
    asgi = wrapper(asgi)

So this plugin will be executed as:

asgi = wrap_with_robots_and_sitemap(asgi)
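To make the wrapping order concrete, here is a small stand-alone sketch of the same loop using plain functions instead of ASGI apps. The names base_app and make_wrapper are invented for this example. The point is only that each wrapper receives the result of the previous one, so the last wrapper applied ends up outermost.

```python
def base_app(request):
    # Stand-in for the Datasette ASGI app
    return "datasette({})".format(request)


def make_wrapper(label):
    # Returns a wrapper function, similar in shape to what an
    # asgi_wrapper hook implementation returns
    def wrapper(app):
        def wrapped(request):
            return "{}[{}]".format(label, app(request))

        return wrapped

    return wrapper


wrappers = [make_wrapper("first"), make_wrapper("second")]

app = base_app
for wrapper in wrappers:
    app = wrapper(app)

result = app("/sitemap.xml")
print(result)  # second[first[datasette(/sitemap.xml)]]
```

With a single plugin implementing the hook, as on this site, the ordering question does not come up, but it matters once several plugins wrap the same app.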

The wrap_with_robots_and_sitemap(app) function then returns another, asynchronous function. This function follows the ASGI protocol specification, and has the following signature and body:

async def robots_and_sitemap(scope, receive, send):
    if scope["path"] == "/robots.txt":
        await asgi_send(
            send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
        )
    elif scope["path"] == "/sitemap.xml":
        await send_sitemap(send, datasette)
    else:
        await app(scope, receive, send)

If the incoming URL path is /robots.txt, the function directly returns a reference to the sitemap, as seen at www.niche-museums.com/robots.txt.

If the path is /sitemap.xml, it calls the send_sitemap(...) function.

For any other path, it proxies the call to the original ASGI app function that was passed to the wrapper function: await app(scope, receive, send).
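This kind of middleware can be exercised without running a server, since an ASGI app is just an async function. The sketch below is a stand-alone approximation and not the code from the site: send_text is a hypothetical stand-in for Datasette's asgi_send helper, inner_app stands in for Datasette itself, and call is a tiny made-up test harness. It also checks scope["type"] before looking at the path, on the assumption that non-HTTP scopes may not carry a path at all.

```python
import asyncio
from urllib.robotparser import RobotFileParser

ROBOTS_BODY = "Sitemap: https://www.niche-museums.com/sitemap.xml"


async def send_text(send, content, status=200, content_type="text/plain"):
    # Hypothetical stand-in for Datasette's asgi_send helper
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [(b"content-type", content_type.encode("utf-8"))],
    })
    await send({"type": "http.response.body", "body": content.encode("utf-8")})


def wrap_with_robots(app):
    async def robots(scope, receive, send):
        if scope["type"] == "http" and scope["path"] == "/robots.txt":
            await send_text(send, ROBOTS_BODY)
        else:
            # Anything else goes to the wrapped application
            await app(scope, receive, send)

    return robots


async def inner_app(scope, receive, send):
    # Stand-in for the wrapped Datasette application
    await send_text(send, "from inner app")


async def call(app, path):
    """Call an ASGI app for a GET request and return (status, body text)."""
    messages = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    status = messages[0]["status"]
    body = b"".join(m.get("body", b"") for m in messages[1:])
    return status, body.decode("utf-8")


app = wrap_with_robots(inner_app)
robots_status, robots_body = asyncio.run(call(app, "/robots.txt"))
other_status, other_body = asyncio.run(call(app, "/about"))

# Confirm the robots.txt body is something a standard parser understands
parser = RobotFileParser()
parser.parse(robots_body.splitlines())
sitemaps = parser.site_maps()  # site_maps() needs Python 3.8 or later
```

Running the robots.txt body through urllib.robotparser is a cheap sanity check that the Sitemap line is in a form crawlers should recognize.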

The most interesting part of the implementation is that send_sitemap() function. This is the function which constructs the sitemap.xml returned by www.niche-museums.com/sitemap.xml.

Here’s what that function looks like:

async def send_sitemap(send, datasette):
    content = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for db in datasette.databases.values():
        hidden = await db.hidden_table_names()
        tables = await db.table_names()
        for table in tables:
            if table in hidden:
                continue
            for row in await db.execute("select id from [{}]".format(table)):
                content.append(
                    "<url><loc>https://www.niche-museums.com/browse/{}/{}</loc></url>".format(
                        table, row["id"]
                    )
                )
    content.append("</urlset>")
    await asgi_send(send, "\n".join(content), 200, content_type="application/xml")

The key trick here is to use the datasette instance object which was passed to the asgi_wrapper() plugin hook.

The code uses that instance to introspect the attached SQLite databases. It loops through them listing all of their tables, and filtering out any hidden tables (which in this case are tables used by the SQLite FTS indexing mechanism). Then for each of those tables it runs select id from [tablename] and uses the results to build the URLs that are listed in the sitemap.

Finally, the resulting XML is concatenated together and sent back to the client with an application/xml content type.

For the moment, Niche Museums only has one table that needs including in the sitemap—the museums table.
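One thing send_sitemap() does not do is escape the values it interpolates into the XML. That is fine for integer IDs and simple table names, but as a precaution the same loop can be sketched with the standard library's sqlite3 module and xml.sax.saxutils.escape. This is a stand-alone sketch, not the code running on the site: build_sitemap, the in-memory database and the example rows are all assumptions made for illustration.

```python
import sqlite3
from xml.sax.saxutils import escape

BASE_URL = "https://www.niche-museums.com/browse"


def build_sitemap(conn, tables):
    """Return sitemap XML listing one URL per row id in each table."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for table in tables:
        # Table names cannot be bound as SQL parameters, so quote the identifier
        quoted = '"{}"'.format(table.replace('"', '""'))
        for (row_id,) in conn.execute(
            "select id from {} order by id".format(quoted)
        ):
            url = "{}/{}/{}".format(BASE_URL, table, row_id)
            # escape() handles &, < and > so the URL is safe inside <loc>
            lines.append("<url><loc>{}</loc></url>".format(escape(url)))
    lines.append("</urlset>")
    return "\n".join(lines)


# Demo with made-up rows in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("create table museums (id integer primary key, name text)")
conn.executemany(
    "insert into museums (id, name) values (?, ?)",
    [(1, "Example Museum A"), (2, "Example Museum B")],
)
sitemap_xml = build_sitemap(conn, ["museums"])
print(sitemap_xml)
```

Note that escape() only deals with XML special characters. Table names containing spaces or other unusual characters would also need URL-encoding, which this sketch leaves out.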

I have a longer-term goal to provide detailed documentation for the datasette object here: since it’s exposed to plugins, it has effectively become part of Datasette’s public API. I want to stabilize it before I release Datasette 1.0.

This week’s new museums

I had a lot of fun writing up the Griffith Observatory: it turns out founding donor Griffith J. Griffith was a truly terrible individual.

This is Building a sitemap.xml with a one-off Datasette plugin by Simon Willison, posted on 6th January 2020.
