Help scraping: track changes to CLI tools by recording their --help using Git
2nd February 2022
I’ve been experimenting with a new variant of Git scraping this week which I’m calling Help scraping. The key idea is to track changes made to CLI tools over time by recording the output of their --help commands in a Git repository.
My new help-scraper GitHub repository is my first implementation of this pattern.
The workflow runs once a day. It loops through every available AWS command (using this script) and records the output of that command’s CLI help option to a .txt file in the repository, then commits the result at the end.
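The core of that loop can be sketched in a few lines of Python. This is a minimal illustration, not the script the repository actually uses: the function name save_help, the file naming scheme, and the injectable runner argument are all assumptions made for the example.

```python
import subprocess
from pathlib import Path


def save_help(command, out_dir, help_args=("--help",), runner=subprocess.run):
    """Run a command with its help option and save the output as a .txt file.

    `command` is a list such as ["flyctl", "deploy"]. The file name
    (parts joined with "-") is a hypothetical convention for this sketch.
    """
    result = runner(list(command) + list(help_args), capture_output=True, text=True)
    # Some tools print help to stderr, so fall back to it when stdout is empty
    text = result.stdout or result.stderr
    path = Path(out_dir) / ("-".join(command) + ".txt")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path
```

Calling this for every command and then running git add and git commit in the output directory would give one commit per scrape, with the diff showing exactly which help pages changed.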
The result is a version history of changes made to those help files. It’s essentially a much more detailed version of a changelog—capturing all sorts of details that might not be reflected in the official release notes for the tool.
Here’s an example. This morning, AWS released version 1.22.47 of their CLI helper tool. They release new versions on an almost daily basis.
Here are the official release notes—12 bullet points, spanning 12 different AWS services.
My help scraper caught the details of the release in this commit—89 changed files with 3,543 additions and 1,324 deletions. It tells the story of what’s changed in a whole lot more detail.
The AWS CLI tool is enormous. Running find aws -name '*.txt' | wc -l in that repository counts help pages for 11,401 individual commands, or 11,390 if you check out the previous version, showing that there were 11 commands added just in this morning’s new release.
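To see which commands were added rather than just how many, the same comparison can be done on the file lists themselves. The sketch below assumes two checkouts of the repository sit side by side in separate directories; the function names are made up for this example.

```python
from pathlib import Path


def help_pages(root):
    """Return the set of .txt help pages under root, as paths relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*.txt")}


def added_pages(old_root, new_root):
    """List help pages present in new_root but missing from old_root."""
    return sorted(help_pages(new_root) - help_pages(old_root))
```

Here len(help_pages("aws")) plays the role of the find and wc pipeline, and added_pages names the new commands.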
There are plenty of other ways of tracking changes made to AWS. I’ve previously kept an eye on the botocore GitHub history, which exposes changes to the underlying JSON—and there are projects like awschanges.info which try to turn those sources of data into something more readable.
But I think there’s something pretty neat about being able to track changes in detail for any CLI tool that offers help output, independent of the official release notes for that tool. Not everyone writes release notes with the detail I like from them!
I implemented this for flyctl first, because I wanted to see what changes were being made that might impact my datasette-publish-fly plugin which shells out to that tool. Then I realized it could be applied to AWS as well.
Help scraping my own projects
Both tools offer CLI commands with --help output, but I kept on forgetting to update the help, partly because there was no easy way to see its output online without running the tools themselves.
So, I added documentation pages that list the output of --help for each of the CLI commands, generated using the Cog file generation tool.
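The general idea of writing help text into a documentation file can be illustrated with the standard library alone. This is not the Cog-based setup described here: it swaps in argparse and a plain file write, and the parser, the mytool name, and the function names are all hypothetical.

```python
import argparse
from pathlib import Path


def build_parser():
    # Hypothetical CLI standing in for a real tool
    parser = argparse.ArgumentParser(prog="mytool", description="Example tool")
    parser.add_argument("--verbose", action="store_true", help="Show more output")
    return parser


def write_help_page(parser, path):
    """Write the parser's help text to path; return True if the file changed."""
    path = Path(path)
    new_text = parser.format_help()
    if path.exists() and path.read_text() == new_text:
        return False  # Nothing to commit
    path.write_text(new_text)
    return True
```

Because the file is only rewritten when the help text differs, each commit that touches it corresponds to a real change in the help output.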
Having added these pages, I realized that the Git commit history of those generated documentation pages could double up as a history of changes I made to the --help output. Here’s that history for sqlite-utils.
It was a short jump from that to the idea of combining it with Git scraping to generate history for other tools.
Bonus trick: GraphQL schema scraping
Fly’s GraphQL API is openly available, but it’s not extensively documented, presumably because they reserve the right to make breaking changes to it at any time. I collected some notes on it in this TIL: Using the undocumented Fly GraphQL API.
This gave me an idea: could I track changes made to their GraphQL schema using the same scraping trick?
It turns out I can! There’s an NPM package called get-graphql-schema which can extract the GraphQL schema from any GraphQL server and write it out to disk:
npx get-graphql-schema https://api.fly.io/graphql > /tmp/fly.graphql
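Tools like this work by sending a GraphQL introspection query to the server. The sketch below shows a heavily reduced version of that idea using only the Python standard library: it asks for type names and kinds rather than the full schema, and it assumes the server permits unauthenticated introspection, which may not hold for any particular API. The function names are invented for this example.

```python
import json
import urllib.request

# A deliberately small introspection query; real tools request far more detail
INTROSPECTION_QUERY = "{ __schema { types { name kind } } }"


def build_introspection_request(url):
    """Build a POST request carrying the introspection query as JSON."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_type_names(url):
    """Return the sorted type names reported by the server (needs network access)."""
    with urllib.request.urlopen(build_introspection_request(url)) as response:
        data = json.load(response)
    return sorted(t["name"] for t in data["data"]["__schema"]["types"])
```

Writing the sorted names to a file, one per line, and committing it on a schedule would give a coarse history of types being added or removed.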
I’ve decided to start setting goals on a monthly basis. My goal for February is to finally ship Datasette 1.0! I’m trying to make at least one commit every day that takes me closer to that milestone.
This week I did a bunch of work adding a Link: https://...; rel="alternate"; type="application/datasette+json" HTTP header to a bunch of different pages in the Datasette interface, to support discovery of the JSON version of a page based on a URL to the human-readable version.
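As a rough sketch of how a client might use such a header, the helpers below build and read a Link value. They follow the general Link header syntax, in which the target URL is wrapped in angle brackets; the function names are hypothetical and this is not Datasette's own code. Splitting on commas is a simplification that would break for URLs containing commas.

```python
import re

JSON_TYPE = 'type="application/datasette+json"'


def alternate_json_link(json_url):
    """Build a Link header value advertising the JSON version of a page."""
    return '<{}>; rel="alternate"; {}'.format(json_url, JSON_TYPE)


def find_alternate_json(link_header):
    """Return the URL of the JSON alternate in a Link header, or None."""
    for part in link_header.split(","):
        match = re.match(r"\s*<([^>]*)>(.*)", part)
        if not match:
            continue
        params = match.group(2)
        if 'rel="alternate"' in params and JSON_TYPE in params:
            return match.group(1)
    return None
```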
(I had originally planned to also support Accept: application/json request headers for this, but I’ve been put off that idea by the discovery that Cloudflare deliberately ignores the Vary: Accept header.)
Unrelated to Datasette: I also started a new Twitter thread, gathering behind-the-scenes material from the movie The Mitchells vs the Machines. There’s been a flurry of great material shared recently by the creative team, presumably as part of the run-up to awards season, and I’ve been enjoying trying to tie it all together in a thread.
The last time I did this was for Into the Spider-Verse (from the same studio) and that thread ended up running for more than a year!