Simon Willison’s Weblog

Weeknotes: airtable-export, generating screenshots in GitHub Actions, Dogsheep!

3rd September 2020

This week I figured out how to populate Datasette from Airtable, wrote code to generate social media preview card page screenshots using Puppeteer, and made a big breakthrough with my Dogsheep project.

airtable-export

I wrote about Rocky Beaches in my weeknotes two weeks ago. It’s a new website built by Natalie Downe that showcases great places to go rockpooling (tidepooling in American English), mixing in tide data from NOAA and species sighting data from iNaturalist.

Rocky Beaches is powered by Datasette, using a GitHub Actions workflow that builds the site’s underlying SQLite database using API calls and YAML data stored in the GitHub repository.

Natalie wanted to use Airtable to maintain the structured data for the site, rather than hand-editing a YAML file. So I built airtable-export, a command-line script for sucking down all of the data from an Airtable instance and writing it to disk as YAML or JSON.

You run it like this:

airtable-export out/ mybaseid table1 table2 --key=key

This will create a folder called out/ with a .yml file for each of the tables.

Sadly the Airtable API doesn’t yet provide a mechanism to list all of the tables in a database (a long-running feature request) so you have to list the tables yourself.
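
Under the hood this boils down to paging through the Airtable REST API for each table and writing the accumulated records to disk. Here's a minimal sketch of that loop, not the actual airtable-export implementation, using the placeholder base ID, table name and key from the example above:

import os

import requests
import yaml

def fetch_records(base_id, table, api_key):
    # Page through the Airtable REST API; each response may include an
    # "offset" token that points at the next page of records
    url = f"https://api.airtable.com/v0/{base_id}/{table}"
    headers = {"Authorization": f"Bearer {api_key}"}
    records, params = [], {}
    while True:
        data = requests.get(url, headers=headers, params=params).json()
        records.extend(data["records"])
        if "offset" not in data:
            return records
        params = {"offset": data["offset"]}

os.makedirs("out", exist_ok=True)
# Write one .yml file per table, similar to what airtable-export produces
with open("out/table1.yml", "w") as fp:
    yaml.dump(fetch_records("mybaseid", "table1", "key"), fp)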

We’re now running that command as part of the Rocky Beaches build script, and committing the latest version of the YAML file back to the GitHub repo (thus gaining a full change history for that data).

Social media cards for my TILs

I really like social media cards—og:image HTML meta attributes for Facebook and twitter:image for Twitter. I wanted them for articles on my TIL website since I often share those via Twitter.

One catch: my TILs aren’t very image heavy. So I decided to generate screenshots of the pages and use those as the 2x1 social media card images.

The best way I know of to programmatically generate screenshots is to use Puppeteer, a Node.js library maintained by the Chrome DevTools team for automating a headless instance of the Chrome browser.

My first attempt was to run Puppeteer in an AWS Lambda function on Vercel. I remembered seeing an example of how to do this in the Vercel documentation a few years ago. The example isn’t there any more, but I found the original pull request that introduced it.

Since the example was MIT licensed I created my own fork at simonw/puppeteer-screenshot and updated it to work with the latest Chrome.

It’s pretty resource intensive, so I also added a secret ?key= mechanism so only my own automation code could call my instance running on Vercel.
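
The idea is simply to compare the ?key= query string parameter against a shared secret and refuse to render anything otherwise. The real check lives in the Node function running on Vercel; here's a minimal Python sketch of the same idea, with a hypothetical environment variable holding the secret:

import os
from urllib.parse import parse_qs

def is_authorized(query_string):
    # Only render a screenshot for callers that know the shared secret
    secret = os.environ["SCREENSHOT_KEY"]  # hypothetical variable name
    supplied = parse_qs(query_string).get("key", [""])[0]
    return supplied == secret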

I needed to store the generated screenshots somewhere. They’re pretty small—on the order of 60KB each—so I decided to store them in my SQLite database itself and use my datasette-media plugin (see Fun with binary data and SQLite) to serve them up.
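
Writing binary values like this into SQLite is easy with sqlite-utils: Python bytes are stored as BLOBs. A quick sketch of the idea, with hypothetical database, table and column names:

import sqlite_utils

db = sqlite_utils.Database("til.db")
png_bytes = open("screenshot.png", "rb").read()

# bytes values become SQLite BLOBs; replace=True keeps one row per page
db["screenshots"].insert(
    {"path": "/til/example-page", "png": png_bytes},
    pk="path",
    replace=True,
)

datasette-media can then be configured with a SQL query that returns those stored bytes, so the screenshots are served straight out of the database.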

This worked! Until it didn’t... I ran into a showstopper bug when I realized that the screenshot process relies on the page being live on the site... but when a new article is added it isn’t live yet at the point the build process runs, so the generated screenshot is of the 404 page.

So I reworked it to generate the screenshots inside the GitHub Action as part of the build script, using puppeteer-cli.

My generate_screenshots.py script handles this by first shelling out to datasette --get to render the HTML for the page, then running puppeteer to generate the screenshot. Relevant code:

def png_for_path(path):
    # Path is e.g. /til/til/python_debug-click-with-pdb.md
    page_html = str(TMP_PATH / "generate-screenshots-page.html")
    # Use datasette to generate HTML
    proc = subprocess.run(["datasette", ".", "--get", path], capture_output=True)
    open(page_html, "wb").write(proc.stdout)
    # Now use puppeteer screenshot to generate a PNG
    proc2 = subprocess.run(
        [
            "puppeteer",
            "screenshot",
            page_html,
            "--viewport",
            "800x400",
            "--full-page=false",
        ],
        capture_output=True,
    )
    png_bytes = proc2.stdout
    return png_bytes

This worked great! Except for one thing... the site is hosted on Vercel, and Vercel has a 5MB response size limit.

Every time my GitHub build script runs it downloads the previous SQLite database file, so it can avoid regenerating screenshots and HTML for pages that haven’t changed.

The addition of the binary screenshots drove the size of the SQLite database over 5MB, so the part of my script that retrieved the previous database no longer worked.

I needed a reliable way to store that 5MB (and probably eventually 10-50MB) database file in between runs of my action.

The best place to put this would be an S3 bucket, but I find the process of setting up IAM permissions for access to a new bucket so infuriating that I couldn’t bring myself to do it.

So... I created a new dedicated GitHub repository, simonw/til-db, and updated my action to store the binary file in that repo—using a force push so the repo doesn’t need to maintain unnecessary version history of the binary asset.

This is an abomination of a hack, and it made me cackle a lot. I tweeted about it and got the suggestion to try Git LFS instead, which would definitely be a more appropriate way to solve this problem.

Rendering Markdown

I write my blog entries in Markdown and transform them into HTML before I post them on my blog. Some day I’ll teach my blog to render Markdown itself, but so far I’ve got by with copying and pasting into Markdown conversion tools.

My favourite Markdown flavour is GitHub’s, which adds a bunch of useful capabilities—most notably the ability to apply syntax highlighting. GitHub exposes an API that applies its Markdown formatter and returns the resulting HTML.

I built myself a quick and scrappy tool in JavaScript that sends Markdown through their API and then applies a few DOM manipulations to clean up what comes back. It was a nice opportunity to write some modern vanilla JavaScript using fetch():

async function render(markdown) {
    return (await fetch('https://api.github.com/markdown', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({'mode': 'markdown', 'text': markdown})
    })).text();
}

const input = document.getElementById('input');  // assumed id for the textarea holding the Markdown source
const button = document.getElementsByTagName('button')[0];
const output = document.getElementById('output');
const preview = document.getElementById('preview');

button.addEventListener('click', async function() {
    const rendered = await render(input.value);
    output.value = rendered;
    preview.innerHTML = rendered;
});

Dogsheep Beta

My most exciting project this week was getting out the first working version of Dogsheep Beta—the search engine that ties together results from my Dogsheep family of tools for personal analytics.

I’m giving a talk about this tonight at PyCon Australia: Build your own data warehouse for personal analytics with SQLite and Datasette. I’ll be writing up detailed notes in the next few days, so watch this space.

TIL this week

Releases this week