Simon Willison’s Weblog

On cookiecutter, machinelearning, documentation, maxwoolf, apple, ...

 

Recent entries

Weeknotes: Installing Datasette with Homebrew, more GraphQL, WAL in SQLite nine hours ago

This week I’ve been working on making Datasette easier to install, plus wide-ranging improvements to the Datasette GraphQL plugin.

Datasette and Homebrew

Datasette is now part of the GitHub Discussions beta—which means the GitHub repository for the project now has a Discussions area. I’ve been wanting to set up somewhere to talk about the project free of pressure to file issues or bug reports for a while, so I’m really excited to have this as a new community space.

One of the first threads there was about Making Datasette easier to install. This inspired me to finally take a look at issue #335 from July 2018—“Package datasette for installation using homebrew”.

I used the VisiData Homebrew Tap as a starting point, along with Homebrew’s Python for Formula Authors documentation. To cut a long story short, brew install datasette now works!

I wrote up some detailed notes on Packaging a Python CLI tool for Homebrew. I’ve also had my sqlite-utils CLI tool accepted into Homebrew, so you can now install that using brew install sqlite-utils as well.

datasette install, datasette uninstall

The updated Datasette installation instructions now feature a range of different options: Homebrew, pip, pipx and Docker.

Datasette plugins need to be installed into the same Python environment as Datasette itself. If you installed Datasette using pipx or Homebrew, figuring out which environment that is isn’t particularly straightforward.

So I added two new commands to Datasette (released in Datasette 0.47): datasette install name-of-plugin and datasette uninstall name-of-plugin. These are very thin wrappers around the underlying pip, but with the crucial improvement that they are guaranteed to run pip in the correct environment. I derived another TIL from these on How to call pip programmatically from Python.
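
The trick that makes the “correct environment” guarantee work is easy to sketch (this is an illustration of the idea, not the exact code Datasette uses): run pip via the Python interpreter that is currently executing, so the install lands in the same environment as the running application.

import subprocess
import sys

def install_plugin(package):
    # Run pip with the interpreter that is running this code, so the plugin
    # is installed into the same environment as the running application
    subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)

install_plugin("datasette-graphql")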

datasette --get "/-/versions.json"

Part of writing a Homebrew package is defining a test block that confirms that the packaged tool is working correctly.

I didn’t want that test to have to start a Datasette web server just so it could execute an HTTP request and shut the server down again, so I added a new feature: datasette --get.

This is a mechanism that lets you execute a fake HTTP GET request against Datasette without starting the server, and outputs the result to the terminal.

This means that anything you can do with the Datasette JSON API is now available on the command-line as well!

I like piping the output to jq to get pretty-printed JSON:

% datasette github.db --get \
    '/github/recent_releases.json?_shape=array&_size=1' | jq
[
  {
    "rowid": 140912432,
    "repo": "https://github.com/simonw/sqlite-utils",
    "release": "https://github.com/simonw/sqlite-utils/releases/tag/2.15",
    "date": "2020-08-10"
  }
]

datasette-graphql improvements

I introduced datasette-graphql last week. I’ve shipped five new releases since then, incorporating feedback from GraphQL advocates on Twitter.

The most significant improvement: I’ve redesigned the filtering mechanism to be much more in line with GraphQL conventions. The old syntax looked like this:

{
  repos(filters: ["license=apache-2.0", "stargazers_count__gt=10"]) {
    edges {
      node {
        full_name
      }
    }
  }
}

This mirrored how Datasette’s table page works (e.g. repos?license=apache-2.0&stargazers_count__gt=10), but it’s a pretty ugly hack.

The new syntax is much, much nicer:

{
  repos(filter: {license: {eq: "apache-2.0"}, stargazers_count: {gt: 10}}) {
    edges {
      node {
        full_name
      }
    }
  }
}

Execute this query.

The best part of this syntax is that the columns and operations are part of the GraphQL schema, which means tools like GraphiQL can provide auto-completion for them interactively as you type a query.

Another new feature: tablename_row can be used to return an individual row (actually the first matching item for its arguments). This is a convenient way to access rows by their primary key, since the primary key columns automatically become GraphQL arguments:

{
  users_row(id: 9599) {
    id
    name
    contributors_list(first: 5) {
      totalCount
      nodes {
        repo_id {
          full_name
        }
        contributions
      }
    }
  } 
}

Try that query here.

There are plenty more improvements to the plugin detailed in the datasette-graphql changelog.

Write-ahead logging in SQLite

SQLite’s Write-Ahead Logging feature improves concurrency by preventing writes from blocking reads. I was seeing the occasional “database is locked” error with my personal Dogsheep, so I decided to finally figure out how to turn this on for a database.

The breakthrough realization for me (thanks to a question I asked on the SQLite forum) was that WAL mode is a characteristic of the database file itself. Once you’ve turned it on for the file, all future connections to that file will take advantage of it.

I wrote about this in a TIL: Enabling WAL mode for SQLite database files. I also embedded what I learned in sqlite-utils 2.15, which now includes sqlite-utils enable-wal file.db and sqlite-utils disable-wal file.db commands (and accompanying Python API methods).
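
Under the hood this is just a PRAGMA, and because the setting lives in the database file it only needs to be run once. Here’s a minimal sketch of what the enable-wal command does, using just the Python standard library:

import sqlite3

conn = sqlite3.connect("file.db")
# journal_mode=wal is persisted in the database file itself, so every
# future connection will use WAL without any further setup
print(conn.execute("PRAGMA journal_mode=wal;").fetchone()[0])  # prints "wal"
conn.close()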

Datasette 0.46, with a security fix

Earlier this week I also released Datasette 0.46, with the key feature being a security fix relating to canned queries and CSRF protection.

I used GitHub’s security advisory mechanism for this one: CSRF tokens leaked in URL by canned query form. I’ve also included detailed information on the exploit (and the fix) in issue #918.

Also new in 0.46: the /-/allow-debug tool, which can be used to experiment with Datasette’s allow blocks permissions mechanism.

Releases this week

TIL this week

GraphQL in Datasette with the new datasette-graphql plugin seven days ago

This week I’ve mostly been building datasette-graphql, a plugin that adds GraphQL query support to Datasette.

I’ve been mulling this over for a couple of years now. I wasn’t at all sure if it would be a good idea, but it’s hard to overstate how liberating Datasette’s plugin system has proven to be: plugins provide a mechanism for exploring big new ideas without any risk of taking the core project in a direction that I later regret.

Now that I’ve built it, I think I like it.

A GraphQL refresher

GraphQL is a query language for APIs, first promoted by Facebook in 2015.

(Surprisingly it has nothing to do with the Facebook Graph API, which predates it by several years and is more similar to traditional REST. A third of respondents to my recent poll were understandably confused by this.)

GraphQL is best illustrated by an example. The following query (a real example that works with datasette-graphql) does a whole bunch of work:

  • Retrieves the first 10 repos that match a search for “datasette”, sorted by most stargazers first
  • Shows the total count of search results, along with how to retrieve the next page
  • For each repo, retrieves an explicit list of columns
  • owner is a foreign key to the users table—this query retrieves the name and html_url for the user that owns each repo
  • A repo has issues (via an incoming foreign key relationship). The query retrieves the first three issues, a total count of all issues and for each of those three gets the title and created_at.

That’s a lot of stuff! Here’s the query:

{
  repos(first:10, search: "datasette", sort_desc: stargazers_count) {
    totalCount
    pageInfo {
      endCursor
      hasNextPage
    }
    nodes {
      full_name
      description
      stargazers_count
      created_at
      owner {
        name
        html_url
      }
      issues_list(first: 3) {
        totalCount
        nodes {
          title
          created_at
        }
      }
    }
  }
}

You can run this query against the live demo. I’m seeing it return results in 511ms. Considering how much it’s getting done that’s pretty good!

datasette-graphql

The datasette-graphql plugin adds a /graphql page to any Datasette instance. It exposes a GraphQL field for every table and view. Those fields can be used to select, filter, search and paginate through rows in the corresponding table.

The plugin detects foreign key relationships—both incoming and outgoing—and turns those into further nested fields on the rows.

It does this by using table introspection (powered by sqlite-utils) to dynamically define a schema using the Graphene Python GraphQL library.

Most of the work happens in the schema_for_datasette() function in datasette_graphql/utils.py. The code is a little fiddly because Graphene usually expects you to define your GraphQL schema using classes (similar to Django’s ORM), but in this case the schema needs to be generated dynamically based on introspecting the tables and columns.
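
To give a flavour of what that introspection step looks like (a rough sketch, not the plugin’s actual code), here’s how sqlite-utils exposes the tables and columns that the schema gets built from:

import sqlite_utils

# Print a GraphQL-style type definition for every table in a database;
# the type mapping here is deliberately simplified
TYPE_MAP = {"TEXT": "String", "INTEGER": "Int", "FLOAT": "Float", "REAL": "Float"}

db = sqlite_utils.Database("github.db")
for table in db.tables:
    print("type {} {{".format(table.name))
    for column in table.columns:
        print("  {}: {}".format(column.name, TYPE_MAP.get(column.type, "String")))
    print("}")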

It has a solid set of unit tests, including some test examples written in Markdown which double as further documentation (see test_graphql_examples()).

GraphiQL for interactively exploring APIs

GraphiQL is the best thing about GraphQL. It’s a JavaScript interface for trying out GraphQL queries which pulls in a copy of the API schema and uses it to implement really comprehensive autocomplete.

datasette-graphql includes GraphiQL (inspired by Starlette’s implementation). Here’s an animated gif showing quite how useful it is for exploring an API:

Animated demo

A couple of tips: On macOS option+space brings up the full completion list for the current context, and command+enter executes the current query (equivalent to clicking the play button).

Performance notes

The most convenient thing about GraphQL from a client-side development point of view is also the most nerve-wracking from the server-side: a single GraphQL query can end up executing a LOT of SQL.

The example above executes at least 32 separate SQL queries:

  • 1 select against repos (plus 1 count query)
  • 10 against issues (plus 10 counts)
  • 10 against users (for the owner field)

There are some optimization tricks I’m not using yet (in particular the DataLoader pattern) but it’s still cause for concern.

Interestingly, SQLite may be the best possible database backend for GraphQL due to the characteristics explained in the essay Many Small Queries Are Efficient In SQLite.

Since SQLite is an in-process database, it doesn’t have to deal with network overhead for each SQL query that it executes. A SQL query is essentially a C function call. So the flurry of queries that’s characteristic of GraphQL really plays to SQLite’s unique strengths.

Datasette has always featured arbitrary SQL execution as a core feature, which it protects using query time limits. I have an open issue to further extend the concept of Datasette’s time limits to the overall execution of a GraphQL query.

More demos

Enabling GraphQL for a Datasette instance is as simple as pip install datasette-graphql, so I’ve deployed the new plugin in a few other places:

Future improvements

I have a bunch of open issues for the plugin describing what I want to do with it next. The most notable planned improvement is adding support for Datasette’s canned queries.

Andy Ingram shared the following interesting note on Twitter:

The GraphQL creators are (I think) unanimous in their skepticism of tools that bring GraphQL directly to your database or ORM, because they just provide carte blanche access to your entire data model, without actually giving API design proper consideration.

My plugin does exactly that. Datasette is a tool for publishing raw data, so exposing everything is very much in line with the philosophy of the project. But it’s still smart to put some design thought into your APIs.

Canned queries are pre-baked SQL queries, optionally with parameters that can be populated by the user.

These could map directly to GraphQL fields. Users could even use plugin configuration to turn off the automatic table fields and just expose their canned queries.

In this way, canned queries can allow users to explicitly design the fields they expose via GraphQL. I expect this to become an extremely productive way of prototyping new GraphQL APIs, even if the final API is built on a backend other than Datasette.

Also this week

A couple of years ago I wrote a piece about Exploring the UK Register of Members Interests with SQL and Datasette. I finally got around to automating this using GitHub Actions, so register-of-members-interests.datasettes.com now updates with the latest data every 24 hours.

I renamed datasette-publish-now to datasette-publish-vercel, reflecting Vercel’s name change from Zeit Now. Here’s how I did that.

datasette-insert, which provides a JSON API for inserting data, defaulted to working unauthenticated. MongoDB and Elasticsearch have taught us that insecure-by-default inevitably leads to insecure deployments. I fixed that: the plugin now requires authentication, and if you don’t want to set that up and you know what you’re doing, you can install the deliberately named datasette-insert-unsafe plugin to allow unauthenticated access.

Releases this week

TIL this week

Fun with binary data and SQLite 14 days ago

This week I’ve been mainly experimenting with binary data storage in SQLite. sqlite-utils can now insert data from binary files, and datasette-media can serve content over HTTP that originated as binary BLOBs in a database file.

Paul Ford piqued my interest in this when he tweeted about loading thousands of PDF documents into a SQLite database:

The SQLite documentation claims that serving smaller binary files from BLOB columns can be 35% faster than the filesystem. I’ve done a little bit of work with binary files in SQLite—the datasette-render-binary and datasette-render-images plugins both help display BLOB data—but I’d never really dug into it in much detail.

sqlite-utils insert-files

The first step was to make it easier to build database files that include binary data.

sqlite-utils is my combination Python library and CLI tool for building SQLite databases. I’ve been steadily evolving it for a couple of years now, and it’s the engine behind my Dogsheep collection of tools for personal analytics.

The new insert-files command can be used to insert content from binary files into a SQLite database, along with file metadata.

The most basic usage looks like this:

sqlite-utils insert-files gifs.db images *.gif

By default, this creates a database table like so:

CREATE TABLE [images] (
    [path] TEXT PRIMARY KEY,
    [content] BLOB,
    [size] INTEGER
);

You can customize this table to include other file metadata using the -c (short for --column) option:

sqlite-utils insert-files gifs.db images *.gif \
    -c path -c md5 -c last_modified:mtime -c size --pk=path

This creates a table with the following schema:

CREATE TABLE [images] (
    [path] TEXT PRIMARY KEY,
    [md5] TEXT,
    [last_modified] FLOAT,
    [size] INTEGER
);

If you pass a directory instead of a file name the command will recursively add every file in that directory.

I also improved sqlite-utils with respect to outputting binary data. The new --raw option dumps the binary contents of a column directly to standard out, so you can read an image back out of one of the above tables like so:

sqlite-utils photos.db \
    "select content from images where path=:path" \
    -p path 'myphoto.jpg' \
    --raw > myphoto.jpg

This example also demonstrates the new support for :parameters passed using the new -p option, see #124.

sqlite-utils usually communicates in JSON, but JSON doesn’t have the ability to represent binary values. Datasette outputs binary values like so:

"data": {
  "$base64": true,
  "encoded": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAY..."
}

I added support for the same format to sqlite-utils—so you can now query binary columns and get out that nested object, or pipe JSON with that nested structure in to sqlite-utils insert and have it stored as a binary BLOB in the database.
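
The wrapper format itself is simple. Here’s a sketch of how those values can be encoded and decoded (my own illustration, not code from either tool):

import base64

def encode_binary(value):
    # Wrap raw bytes in the {"$base64": ...} object used in JSON output
    return {"$base64": True, "encoded": base64.b64encode(value).decode("ascii")}

def decode_binary(obj):
    # Turn the wrapped object back into raw bytes
    return base64.b64decode(obj["encoded"])

wrapped = encode_binary(b"\x89PNG\r\n\x1a\n")
assert decode_binary(wrapped) == b"\x89PNG\r\n\x1a\n"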

datasette-media

datasette-media is a plugin for serving binary content directly from Datasette on a special URL. I originally built it while working on Dogsheep Photos—given a SQLite file full of Apple Photos metadata I wanted to be able to serve thumbnails of the actual images via my Datasette web server.

Those photos were still stored on disk—the plugin lets you configure a SQL query like this, which will cause hits to /-/media/photo/$UUID to serve that file from disk:

{
    "plugins": {
        "datasette-media": {
            "photo": {
                "sql": "select filepath from apple_photos where uuid=:key"
            }
        }
    }
}

Issue #14 added support for BLOB columns as well. You can now configure the plugin like this to serve binary content that was stored in the database:

{
    "plugins": {
        "datasette-media": {
            "thumb": {
                "sql": "select content from thumbnails where uuid=:key"
            }
        }
    }
}

This would serve content from a BLOB column in a thumbnails table from the URL /-/media/thumb/$UUID.

I really like this pattern of configuring plugins using SQL queries, where the returned column names have special meaning that is interpreted by the plugin. datasette-atom and datasette-ics use a similar trick.

I expanded datasette-media with a few other related features:

  • Return a content_url column and it will proxy content from that URL
  • Set "enable_transform": true for a media bucket to enable ?w= and ?h= and ?format= parameters for transforming the image before it is served to the user
  • Return a content_filename column to set a download file name (in a content-disposition HTTP header) prompting the user’s browser to download the file

See the README or release notes for more details.

Also this week

I renamed datasette-insert-api to just datasette-insert, reflecting my plans to add non-API features to that plugin in the future.

In doing so I had to figure out how to rename a PyPI package such that dependent projects would continue to work. I ended up building a pypi-rename cookiecutter template encoding what I learned.

I enabled PostgreSQL full-text search for my blog’s Django Admin interface, and wrote a TIL on how I did it.

I added compound primary key support to db-to-sqlite, so now it can convert PostgreSQL or MySQL databases to SQLite if they use compound primary keys.

TIL this week

Releases this week

Weeknotes: datasette-copyable, datasette-insert-api 21 days ago

Two new Datasette plugins this week: datasette-copyable, helping users copy-and-paste data from Datasette into other places, and datasette-insert-api, providing a JSON API for inserting and updating data and creating tables.

datasette-copyable

Datasette has long provided CSV and JSON export, but when you’re working with smaller amounts of data it’s often most convenient to move that data around using copy and paste.

datasette-copyable uses Datasette’s register_output_renderer() plugin hook to add a whole plethora of new export options, powered by the excellent tabulate Python library (which I’m also using in sqlite-utils).

The default output format is TSV, because if you copy tab-separated values to your clipboard and then hit paste on a cell in Google Sheets, Excel or Numbers the data will be imported directly into your spreadsheet, preserving the columns and rows!

The other supported formats are everything that tabulate supports—LaTeX, GitHub Markdown, MediaWiki, JIRA and more.

You can try the plugin out on some New York Times Covid-19 data (for Harris county, Texas) over here.

Copy as tsv

Clicking the “copy to clipboard” button selects the content in the textarea and copies it to your clipboard. I wrote about how this works in a TIL.

datasette-insert-api

Ever since Datasette plugins gained the ability to write to the database back in February I’ve been looking forward to building something that really put it through its paces.

datasette-insert-api is a relatively thin wrapper around my sqlite-utils Python library which unlocks a powerful set of functionality:

  • POST a JSON array to Datasette to insert those objects as rows in a database table
  • If the table does not yet exist, Datasette can create it with the corresponding columns
  • If the table exists and has a primary key, records with matching primary keys can replace existing rows (a simple form of update)
  • If unknown columns are present in the data, the plugin can optionally alter the table to add the necessary columns

The README has the details, but the simplest example of it in action (using curl) looks like this:

curl --request POST \
  --data '[
    {
        "id": 1,
        "name": "Cleopaws",
        "age": 5
    },
    {
        "id": 2,
        "name": "Pancakes",
        "age": 4
    }
    ]' \
  'http://localhost:8001/-/insert/data/dogs?pk=id'

The plugin works without authentication—useful for piping JSON into a database on your own laptop—or you can couple it with a plugin such as datasette-auth-tokens to provide authenticated access over the public internet.

Speaking at Boston Python

I gave a talk about Datasette and Dogsheep for Boston Python on Wednesday. Ned turned around the video really quickly—I start talking at 8 minutes and 52 seconds in.

I used a collaborative Google Doc for the Q&A which worked really well, as the questions accumulated while I was presenting and I then got to address them at the end. I spent some time after the talk fleshing out my answers with extra links—here’s the resulting document.

Upgrading this blog to Django 3.1

The first Django 3.1 release candidate came out earlier this week, and I’ve upgraded my blog to this (a one-line change I made through the GitHub web interface thanks to Heroku continuous deployment).

I also upgraded my Heroku PostgreSQL database to v11 so I could take advantage of the new search_type="websearch" option for PostgreSQL full-text search added in Django 3.1—James Turk contributed that improvement and wrote about it on his blog.

This means my blog search engine now supports advanced operators, like “phrase searches” and -subtractions—try that out here: “signed cookies” -django.
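
In Django ORM terms the new option looks something like this (a sketch assuming a hypothetical Entry model with title and body fields):

from django.contrib.postgres.search import SearchQuery, SearchVector

from blog.models import Entry  # hypothetical model with title and body fields

# "websearch" search_type (new in Django 3.1) supports quoted phrases and -exclusions
query = SearchQuery('"signed cookies" -django', search_type="websearch")
results = Entry.objects.annotate(
    search=SearchVector("title", "body")
).filter(search=query)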

content-length and datasette-clone progress bars

When Datasette runs against an immutable SQLite database it lets you download that database file directly. The Datasette development version now includes a content-length header specifying the size of that download.

... which means my datasette-clone tool can now show you a progress bar!

$ datasette-clone https://fivethirtyeight.datasettes.com -v
Fetching fivethirtyeight.db, current hash 7ba632a14f29375c289d96400947e2d5fcdad3a4b5a90883d6286bd4e83ede78 != None
268.79 MB  [##########--------------------------]    28%  00:02:14
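
A progress bar like that only needs the content-length header plus a streamed download. Here’s a rough sketch of the pattern (not datasette-clone’s actual code, and the URL is just an example):

import requests

def download_with_progress(url, path, chunk_size=1024 * 1024):
    response = requests.get(url, stream=True)
    response.raise_for_status()
    # The content-length header is what makes the percentage possible
    total = int(response.headers.get("content-length", 0))
    done = 0
    with open(path, "wb") as fp:
        for chunk in response.iter_content(chunk_size=chunk_size):
            fp.write(chunk)
            done += len(chunk)
            if total:
                print("\r{:.0%} of {:.2f} MB".format(done / total, total / 1024 / 1024), end="")
    print()

download_with_progress(
    "https://fivethirtyeight.datasettes.com/fivethirtyeight.db", "fivethirtyeight.db"
)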

github-to-sqlite tags

I wanted to build myself a SQL query to show me my GitHub repositories that have commits which have not yet been included in a release.

It turned out I needed to build a new import command for this: github-to-sqlite tags github.db simonw/datasette—which fetches all of the tags for the specified repository.

Here’s the relevant issue and here’s a monstrous SQL query that shows me repositories with commits that are ready to be shipped in a release:

with most_recent_releases as (
  with ranked as (
    select
      repo,
      tag_name,
      published_at,
      row_number() OVER (
        partition BY repo
        ORDER BY
          published_at DESC
      ) rank
    FROM
      releases
  )
  select
    *
  from
    ranked
  where
    rank = 1
)
select
  repos.full_name as repo,
  most_recent_releases.tag_name as release,
  commits.committer_date as release_commit_date,
  (
    select
      count(*)
    from
      commits c2
    where
      c2.repo = repos.id
      and c2.committer_date > commits.committer_date
  ) as commits_since_release,
  'https://github.com/' || repos.full_name || '/compare/' || most_recent_releases.tag_name || '...' || repos.default_branch as view_commits
from
  most_recent_releases
  join repos on most_recent_releases.repo = repos.id
  join tags on tags.repo = repos.id
  and tags.name = most_recent_releases.tag_name
  join commits on tags.sha = commits.sha
order by
  commits_since_release desc

Try it here.

TIL this week

Releases this week

Weeknotes: datasette-auth-passwords, a Datasette logo and a whole lot more 28 days ago

All sorts of project updates this week.

datasette-auth-passwords

Datasette 0.44 added authentication support as a core concept, but left the actual implementation details up to the plugins.

I released datasette-auth-passwords on Monday. It’s an implementation of the most obvious form of authentication (as opposed to GitHub SSO or bearer tokens or existing domain cookies): usernames and passwords, typed into a form.

Implementing passwords responsibly is actually pretty tricky, due to the need to effectively hash them. After some research I ended up mostly copying how Django does it (never a bad approach): I’m using 260,000 salted pbkdf2_hmac iterations, taking advantage of the Python standard library. I wrote this up in a TIL.
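
The core of that approach fits in a few lines of standard library code. Here’s a simplified sketch (the plugin follows Django’s password hash format; this isn’t its exact implementation):

import base64
import hashlib
import secrets

ITERATIONS = 260_000

def hash_password(password, salt=None):
    # Derive a salted hash using only the standard library
    salt = salt or secrets.token_hex(16)
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt.encode("utf-8"), ITERATIONS
    )
    return "pbkdf2_sha256${}${}${}".format(
        ITERATIONS, salt, base64.b64encode(derived).decode("ascii")
    )

def verify_password(password, stored):
    # Re-derive with the stored salt and compare in constant time
    algorithm, iterations, salt, _ = stored.split("$", 3)
    return secrets.compare_digest(hash_password(password, salt), stored)

hashed = hash_password("hunter2")
assert verify_password("hunter2", hashed)
assert not verify_password("wrong", hashed)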

The plugin currently only supports hard-coded password hashes that are fed to Datasette via an environment variable—enough to set up a password-protected Datasette instance with a couple of users, but not really good for anything more complex than that. I have an open issue for implementing database-backed password accounts, although again the big challenge is figuring out how to responsibly store those password hashes.

I’ve set up a live demo of the password plugin at datasette-auth-passwords-demo.datasette.io—you can sign into it to reveal a private database that’s only available to authenticated users.

Datasette website and logo

I’m finally making good progress on a website for Datasette. As part of that I’ve been learning to use Figma, which I used to create a Datasette logo.

Datasette

Figma is really neat: it’s an entirely web-based vector image editor, aimed at supporting the kind of design work that goes into websites and apps. It has full collaborative editing for teams but it’s free for single users. Most importantly it has extremely competent SVG exports.

I’ve added the logo to the latest version of the Datasette docs, and I have an open pull request to sphinx_rtd_theme to add support for setting a custom link target on the logo so I can link back to the rest of the official site, when it goes live.

TIL search snippet highlighting

My TIL site has a search engine, but it didn’t do snippet highlighting. I reused the pattern I described in Fast Autocomplete Search for Your Website—implemented server-side rather than client-side this time—to add that functionality. The implementation is here—here’s a demo of it in action.

SRCCON schedule

I’m attending (virtually) the SRCCON 2020 journalism conference this week, and Datasette is part of the Projects, Products, & Research track.

As a demo, I set up a Datasette powered copy of the conference schedule at srccon-2020.datasette.io—it’s running the datasette-ics plugin which means it can provide a URL that can be subscribed to in Google or Apple Calendar.

The site runs out of the simonw/srccon-2020-datasette repository, which uses a GitHub Action to download the schedule JSON, modify it a little (mainly to turn the start and end dates into ISO datestamps), save it to a SQLite database with sqlite-utils and publish it to Vercel.
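
The sqlite-utils step of that pipeline looks roughly like this (a sketch: the file name, field names and primary key are assumptions, not the real SRCCON feed format):

import json

import sqlite_utils

# The real workflow also rewrites the start and end values into ISO
# datestamps before inserting; schedule.json and pk="id" are assumptions
sessions = json.load(open("schedule.json"))

db = sqlite_utils.Database("srccon-2020.db")
db["sessions"].insert_all(sessions, pk="id")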

Covid 19 population data

My Covid-19 tracker publishes updated numbers of cases and deaths from the New York Times, the LA Times and Johns Hopkins university on an hourly basis.

One thing that was missing was county population data. US counties are identified in the data by their FIPS codes, which offers a mechanism for joining against population estimates pulled from the US Census.

Thanks to Aaron King I’ve now incorporated that data into the site, as a new us_census_county_populations_2019 table.

I used that data to define a SQL view—latest_ny_times_counties_with_populations—which shows the latest New York Times county data with new derived cases_per_million and deaths_per_million columns.

Tweaks to this blog

For many years this blog’s main content has sat on the left of the page—which looks increasingly strange as screens get wider and wider. As of this commit the main layout is centered, which I think looks much nicer.

I also ran a data migration to fix some old internal links.

Miscellaneous

I gave a (virtual) talk at Django London on Monday about Datasette. I’ve taken to sharing a Google Doc for this kind of talk, which I prepare before the talk with notes and then update afterwards to reflect additional material from the Q&A. Here’s the document from Monday’s talk.

San Francisco Public Works maintain a page of tree removal notifications showing trees that are scheduled for removal. I like those trees. They don’t provide an archive of notifications from that page, so I’ve set up a git scraping GitHub repository that scrapes the page daily and maintains a history of its contents in the commit log.

I updated datasette-publish-fly for compatibility with Datasette 0.44 and Python 3.6.

I made a few tweaks to my GitHub profile README, which is now Apache 2 licensed so people know they can adapt it for their own purposes.

I released github-to-sqlite 2.3 with a new option for fetching information for just specific repositories.

The Develomentor podcast published an interview with me about my career, and how it’s been mostly defined by side-projects.

TIL this week

Building a self-updating profile README for GitHub one month ago

GitHub quietly released a new feature at some point in the past few days: profile READMEs. Create a repository with the same name as your GitHub account (in my case that’s github.com/simonw/simonw), add a README.md to it and GitHub will render the contents at the top of your personal profile page—for me that’s github.com/simonw.

I couldn’t resist re-using the trick from this blog post and implementing a GitHub Action to automatically keep my profile README up-to-date.

Visit github.com/simonw and you’ll see a three-column README showing my latest GitHub project releases, my latest blog entries and my latest TILs.

My GitHub profile

I’m doing this with a GitHub Action in build.yml. It’s configured to run on every push to the repo, on a schedule at 32 minutes past the hour and on the new workflow_dispatch event which means I get a manual button I can click to trigger it on demand.

The Action runs a Python script called build_readme.py which does the following:

  • Hits the GitHub GraphQL API to retrieve the latest release for every one of my 300+ repositories
  • Hits my blog’s full entries Atom feed to retrieve the most recent posts (using the feedparser Python library)
  • Hits my TILs website’s Datasette API running this SQL query to return the latest TIL links

It then turns the results from those various sources into a markdown list of links and replaces commented blocks in the README that look like this:

<!-- recent_releases starts -->
...
<!-- recent_releases ends -->

The whole script is less than 150 lines of Python.
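
The marker replacement is the heart of it. Roughly (a simplified sketch, not the script verbatim):

import re

def replace_chunk(content, marker, new_chunk):
    # Replace everything between <!-- marker starts --> and <!-- marker ends -->,
    # keeping the comment markers in place
    pattern = re.compile(
        r"<!-- {} starts -->.*<!-- {} ends -->".format(marker, marker),
        re.DOTALL,
    )
    replacement = "<!-- {} starts -->\n{}\n<!-- {} ends -->".format(
        marker, new_chunk, marker
    )
    return pattern.sub(replacement, content)

readme = open("README.md").read()
readme = replace_chunk(readme, "recent_releases", "* release one\n* release two")
open("README.md", "w").write(readme)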

GitHub GraphQL

I have a bunch of experience working with GitHub’s regular REST APIs, but for this project I decided to go with their newer GraphQL API.

I wanted to show the most recent “releases” for all of my projects. I have over 300 GitHub repositories now, and only a portion of them use the releases feature.

Using REST, I would have to make over 300 API calls to figure out which ones have releases.

With GraphQL, I can do this instead:

query {
  viewer {
    repositories(first: 100, privacy: PUBLIC) {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        name
        releases(last:1) {
          totalCount
          nodes {
            name
            publishedAt
            url
          }
        }
      }
    }
  }
}

This query returns the most recent release (last:1) for each of the first 100 of my public repositories.

You can paste it into the GitHub GraphQL explorer to run it against your own profile.

There’s just one catch: pagination. I have more than 100 repos but their GraphQL API can only return 100 nodes at a time.

To paginate, you need to request the endCursor and then pass that as the after: parameter for the next request. I wrote up how to do this in this TIL.
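
The pagination loop ends up looking something like this (a sketch using the requests library, not the script’s exact code; you supply your own personal access token):

import requests

QUERY = """
query ($after: String) {
  viewer {
    repositories(first: 100, privacy: PUBLIC, after: $after) {
      pageInfo { hasNextPage endCursor }
      nodes {
        name
        releases(last: 1) { totalCount nodes { name publishedAt url } }
      }
    }
  }
}
"""

def fetch_all_repos(token):
    repos = []
    cursor = None
    while True:
        response = requests.post(
            "https://api.github.com/graphql",
            json={"query": QUERY, "variables": {"after": cursor}},
            headers={"Authorization": "bearer {}".format(token)},
        )
        response.raise_for_status()
        repositories = response.json()["data"]["viewer"]["repositories"]
        repos.extend(repositories["nodes"])
        if not repositories["pageInfo"]["hasNextPage"]:
            return repos
        # Pass endCursor back as the after: variable to fetch the next page
        cursor = repositories["pageInfo"]["endCursor"]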

Next steps

I’m pretty happy with this as a first attempt at automating my profile. There’s something extremely satisfying about having a GitHub profile that updates itself using GitHub Actions—it feels appropriate.

There’s so much more stuff I could add to this: my tweets, my sidebar blog links, maybe even download statistics from PyPI. I’ll see what takes my fancy in the future.

I’m not sure if there’s a size limit on the README that is displayed on the profile page, so deciding how much information is appropriate appears to be mainly a case of personal taste.

Building these automated profile pages is pretty easy, so I’m looking forward to seeing what kind of things other nerds come up with!

Elsewhere

Yesterday

  • We’re generally only impressed by things we can’t do—things that are beyond our own skill set. So, by definition, we aren’t going to be that impressed by the things we create. The end user, however, is perfectly able to find your work impressive.

    @gamemakerstk #

9th August 2020

  • Datasette 0.46 (via) I just released Datasette 0.46 with a security fix for an issue involving CSRF tokens on canned query pages, plus a new debugging tool, improved file downloads and a bunch of other smaller improvements. #

8th August 2020

  • COVID-19 attacks our physical bodies, but also the cultural foundations of our lives, the toolbox of community and connectivity that is for the human what claws and teeth represent to the tiger.

    Wade Davis #

7th August 2020

  • Pysa: An open source static analysis tool to detect and prevent security issues in Python code (via) Interesting new static analysis tool for auditing Python for security vulnerabilities—things like SQL injection and os.execute() calls. Built by Facebook and tested extensively on Instagram, a multi-million line Django application. #
  • Design Docs at Google. Useful description of the format used for software design docs at Google—informal documents of between 3 and 20 pages that outline the proposed design of a new project, discuss trade-offs that were considered and solicit feedback before the code starts to be written. #

5th August 2020

  • Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website (via) I remain fascinated by techniques for zero downtime deployment—once you have it working it makes shipping changes to your software so much less stressful, which means you can iterate faster and generally be much more confident in shipping code. Facebook have invested vast amounts of effort into getting this right, and their new paper for the ACM SIGCOMM conference goes into detail about how it all works. #

2nd August 2020

  • When you talk with cheese aficionados, it doesn’t usually take long for the conversation to veer this way: away from curds, whey, and mold, and toward matters of life and death. With the zeal of nineteenth-century naturalists, they discuss great lineages and endangered species, painstakingly cataloguing those cheeses that are thriving and those that are lost to history.

    Ruby Tandoh #

  • How a Cheese Goes Extinct (via) Ruby Tandoh writes for the New Yorker about the culture, history and anthropology of cheesemaking through the lens of the British cheese industry. I learned that two of my favourite British cheeses—Tymsboro and Innes Log—have sadly ceased production. Beautifully written. #

1st August 2020

  • sqlite-utils 2.14 (via) I finally figured out porter stemming with SQLite full-text search today—it turns out it’s as easy as adding tokenize='porter' to the CREATE VIRTUAL TABLE statement. So I just shipped sqlite-utils 2.14 with a tokenize= option (plus the ability to insert binary file data from stdin). #
  • The impact of crab mentality on performance was quantified by a New Zealand study in 2015 which demonstrated up to an 18% average exam result improvement for students when their grades were reported in a way that prevented others from knowing their position in published rankings.

    Crab mentality on Wikipedia #

  • James Bennett on why Django should not support JWT in core (via) The topic of adding JWT support to Django core comes up occasionally—here’s James Bennett’s detailed argument for not doing that. The short version is that the JWT specification isn’t just difficult to implement securely: it’s fundamentally flawed, which results in things like five implementations in three different languages all manifesting the same vulnerability. Third party modules exist that add JWT support to Django, but baking it into core would act as a form of endorsement and Django’s philosophy has always been to encourage people towards best practices. #

30th July 2020

  • Sandboxing and Workload Isolation (via) Fly.io run other people’s code in containers, so workload isolation is a Big Deal for them. This blog post goes deep into the history of isolation and the various different approaches you can take, and fills me with confidence that the team at Fly.io know their stuff. I got to the bottom and found it had been written by Thomas Ptacek, which didn’t surprise me in the slightest. #
  • How GPT3 Works - Visualizations and Animations. Nice essay full of custom animations illustrating how GPT-3 actually works. #

29th July 2020

  • Some SQL Tricks of an Application DBA (via) This post taught me so many PostgreSQL tricks that I hadn’t seen before. Did you know you can start a transaction, drop an index, run explain and then rollback the transaction (cancelling the index drop) to see what explain would look like without that index? Among other things I also learned what the “correlation” database statistic does: it’s a measure of how close-to-sorted the values in a specific column are, which helps PostgreSQL decide if it should do an index scan or a bitmap scan when making use of an index. #