Help scraping: track changes to CLI tools by recording their --help using Git

2nd February 2022

I’ve been experimenting with a new variant of Git scraping this week which I’m calling Help scraping. The key idea is to track changes made to CLI tools over time by recording the output of their --help commands in a Git repository.

My new help-scraper GitHub repository is my first implementation of this pattern.

It uses this GitHub Actions workflow to record the --help output for the Amazon Web Services aws CLI tool, and also for the flyctl tool maintained by the Fly.io hosting platform.

The workflow runs once a day. It loops through every available AWS command (using this script) and records the output of that command’s CLI help option to a .txt file in the repository—then commits the result at the end.
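
Here is a minimal sketch of that record-then-commit loop in Python. It is not the actual script from the repository: the function names, the hyphen-joined file naming scheme and the injectable `run` parameter are all assumptions made for illustration, and it assumes the tool being scraped accepts a `--help` flag.

```python
import subprocess
from pathlib import Path


def record_help(command, out_dir, run=subprocess.run):
    """Run `command --help` and save its output to a .txt file.

    `command` is a list such as ["flyctl", "deploy"]. The file name is the
    command parts joined with hyphens (a scheme invented for this sketch).
    """
    result = run(command + ["--help"], capture_output=True, text=True)
    # Some tools print help to stderr, so keep both streams.
    text = result.stdout + result.stderr
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / ("-".join(command) + ".txt")
    path.write_text(text)
    return path


def commit_if_changed(repo_dir, message, run=subprocess.run):
    """Stage everything and commit only when something actually changed."""
    run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # `git diff --cached --quiet` exits with 1 when staged changes exist.
    staged = run(["git", "diff", "--cached", "--quiet"], cwd=repo_dir)
    if staged.returncode != 0:
        run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
        return True
    return False
```

Skipping the commit when nothing is staged keeps the history meaningful: every commit then corresponds to a real change in some tool's help output.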

The result is a version history of changes made to those help files. It’s essentially a much more detailed version of a changelog—capturing all sorts of details that might not be reflected in the official release notes for the tool.

Here’s an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.

Here are the official release notes—12 bullet points, spanning 12 different AWS services.

My help scraper caught the details of the release in this commit—89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what’s changed in a whole lot more detail.

The AWS CLI tool is enormous. Running find aws -name '*.txt' | wc -l in that repository counts help pages for 11,401 individual commands, or 11,390 if you check out the previous version, so this morning’s release alone added a net 11 commands.
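
Comparing file counts only gives the net change. To see which help pages were added or removed, the same comparison can be sketched in Python, assuming the two versions are checked out into separate directories. The function names here are hypothetical, not part of the repository.

```python
from pathlib import Path


def help_pages(root):
    """Return the set of .txt help files under root, as relative paths."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*.txt")}


def compare_versions(old_root, new_root):
    """Return (added, removed) help pages between two checkouts."""
    old, new = help_pages(old_root), help_pages(new_root)
    return sorted(new - old), sorted(old - new)
```

`len(help_pages("aws"))` plays the role of the find and wc pipeline, while `compare_versions` lists the individual files behind the difference in counts.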

There are plenty of other ways of tracking changes made to AWS. I’ve previously kept an eye on the botocore GitHub history, which exposes changes to the underlying JSON—and there are projects like awschanges.info which try to turn those sources of data into something more readable.

But I think there’s something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes with the detail I like from them!

I implemented this for flyctl first, because I wanted to see what changes were being made that might impact my datasette-publish-fly plugin which shells out to that tool. Then I realized it could be applied to AWS as well.

Help scraping my own projects

I got the initial idea for this technique from a change I made to my Datasette and sqlite-utils projects a few weeks ago.

Both tools offer CLI commands with --help output—but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.

So, I added documentation pages that list the output of --help for each of the CLI commands, generated using the Cog file generation tool:

sqlite-utils CLI reference (39 commands!)
datasette CLI reference

Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the --help output—here’s that history for sqlite-utils.

It was a short jump from that to the idea of combining it with Git scraping to generate history for other tools.

Bonus trick: GraphQL schema scraping

I’ve started making selective use of the Fly.io GraphQL API as part of my plugin for publishing Datasette instances to that platform.

Their GraphQL API is openly available, but it’s not extensively documented—presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: Using the undocumented Fly GraphQL API.

This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?

It turns out I can! There’s an NPM package called get-graphql-schema which can extract the GraphQL schema from any GraphQL server and write it out to disk:

npx get-graphql-schema https://api.fly.io/graphql > /tmp/fly.graphql

I’ve added that to my help-scraper repository too, so now I have a commit history of the changes they are making to their schema as well. Here’s an example from this morning.
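
For a rough idea of what schema extraction involves, here is a much-simplified Python sketch that sends a standard GraphQL introspection query and snapshots only the type names. It is not how get-graphql-schema is implemented, it captures far less than the full schema that tool writes out, and it assumes the server permits introspection. The function names and the injectable `urlopen` parameter are hypothetical.

```python
import json
import urllib.request

# A deliberately small introspection query: type names and kinds only.
INTROSPECTION_QUERY = "{ __schema { types { name kind } } }"


def fetch_type_names(url, urlopen=urllib.request.urlopen):
    """POST the introspection query to `url` and return sorted type names."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        payload = json.load(response)
    return sorted(t["name"] for t in payload["data"]["__schema"]["types"])


def write_snapshot(names, path):
    """Write one type name per line so that Git diffs stay readable."""
    with open(path, "w") as f:
        f.write("\n".join(names) + "\n")
```

Sorting the names before writing them keeps the snapshot stable between runs, so a diff only appears when the schema itself changes.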

Other weeknotes

I’ve decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I’m trying to make at least one commit every day that takes me closer to that milestone.

This week I did a bunch of work adding a Link: https://...; rel="alternate"; type="application/datasette+json" HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.
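
From a client's point of view, discovery could look something like the following Python sketch, which pulls the JSON URL out of a Link header value. The function name is hypothetical and the parsing is simplified: it assumes URLs contain no commas and that parameters are written as key="value". It accepts the URL with or without the angle brackets that the Link header format normally uses.

```python
import re


def find_alternate_json(link_header, wanted_type="application/datasette+json"):
    """Return the URL of the rel="alternate" link with the wanted type, or None."""
    for part in link_header.split(","):
        # Split each link into its URL and the raw parameter string.
        match = re.match(r'\s*<?([^>;\s]+)>?\s*;(.*)', part)
        if not match:
            continue
        url, raw_params = match.groups()
        params = dict(re.findall(r'(\w+)="([^"]*)"', raw_params))
        if params.get("rel") == "alternate" and params.get("type") == wanted_type:
            return url
    return None
```

A client would fetch the human-readable page, pass the value of its Link header to this function, and then request the returned URL to get the JSON version.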

(I had originally planned to also support Accept: application/json request headers for this, but I’ve been put off that idea by the discovery that Cloudflare deliberately ignores the Vary: Accept header.)

Unrelated to Datasette: I also started a new Twitter thread, gathering behind-the-scenes material from the movie The Mitchells vs. the Machines. There’s been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season, and I’ve been enjoying trying to tie it all together in a thread.

The last time I did this was for Into the Spider-Verse (from the same studio) and that thread ended up running for more than a year!


Posted 2nd February 2022 at 11:46 pm

Simon Willison’s Weblog