Simon Willison’s Weblog

54 posts tagged “github-actions”

GitHub's Actions tool for repository automation.

2025

Building and deploying a custom site using GitHub Actions and GitHub Pages. I figured out a minimal example of how to use GitHub Actions to run custom scripts to build a website and then publish that static site to GitHub Pages. I turned the example into a template repository, which should make getting started for a new project extremely quick.
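
The workflow behind it is short. Here's a minimal sketch of the shape of it, assuming a custom ./build.sh script that writes the finished site to a dist/ directory - the template's actual scripts differ in the details:

# Minimal sketch: build a static site with a custom script, then publish to
# GitHub Pages. The build.sh script and dist/ directory are placeholders.
name: Build and deploy to GitHub Pages

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Any custom build step works here, as long as it writes the site to dist/
      - run: ./build.sh
      - uses: actions/upload-pages-artifact@v3
        with:
          path: dist/
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4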

I've needed this for various projects over the years, but today I finally put these notes together while setting up a system for scraping the iNaturalist API for recent sightings of the California Brown Pelican and converting those into an Atom feed that I can subscribe to in NetNewsWire:

Screenshot of a Brown Pelican sighting Atom feed in NetNewsWire showing a list of entries on the left sidebar and detailed view of "Brown Pelican at Art Museum, Isla Vista, CA 93117, USA" on the right with date "MAR 13, 2025 AT 10:40 AM", coordinates "34.4115542997, -119.8500448", and a photo of three brown pelicans in water near a dock with copyright text "(c) Ery, all rights reserved"

I got Claude to write me the script that converts the scraped JSON to Atom.

Update: I just found out iNaturalist has its own Atom feeds! Here's their feed of recent Pelican observations.

# 18th March 2025, 8:17 pm / github-actions, ai-assisted-programming, github, git-scraping, atom, netnewswire, inaturalist

OpenTimes (via) Spectacular new open geospatial project by Dan Snow:

OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time data for free and with no limits.

Here's what I get for travel times by car from El Granada, California:

Isochrone map showing driving times from the El Granada census tract to other places in the San Francisco Bay Area

The technical details are fascinating:

  • The entire OpenTimes backend is just static Parquet files on Cloudflare's R2. There's no RDBMS or running service, just files and a CDN. The whole thing costs about $10/month to host and costs nothing to serve. In my opinion, this is a great way to serve infrequently updated, large public datasets at low cost (as long as you partition the files correctly).

Sure enough, R2 pricing charges "based on the total volume of data stored" - $0.015 / GB-month for standard storage, then $0.36 / million requests for "Class B" operations which include reads. They charge nothing for outbound bandwidth.

  • All travel times were calculated by pre-building the inputs (OSM, OSRM networks) and then distributing the compute over hundreds of GitHub Actions jobs. This worked shockingly well for this specific workload (and was also completely free).

Here's a GitHub Actions run of the calculate-times.yaml workflow which uses a matrix to run 255 jobs!

GitHub Actions run: calculate-times.yaml run by workflow_dispatch taking 1h49m to execute 255 jobs with names like run-job (2020-01)

Relevant YAML:

  matrix:
    year: ${{ fromJSON(needs.setup-jobs.outputs.years) }}
    state: ${{ fromJSON(needs.setup-jobs.outputs.states) }}

Where those JSON files were created by the previous step, which reads in the year and state values from this params.yaml file.
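
The overall fan-out pattern looks something like this - a simplified sketch rather than the actual OpenTimes workflow, assuming yq is used to turn params.yaml into those JSON arrays:

# Sketch of the fan-out pattern: a setup job emits JSON arrays as outputs,
# a downstream job expands them into a matrix (up to the 256-job limit).
jobs:
  setup-jobs:
    runs-on: ubuntu-latest
    outputs:
      years: ${{ steps.params.outputs.years }}
      states: ${{ steps.params.outputs.states }}
    steps:
      - uses: actions/checkout@v4
      # Emit compact JSON arrays, e.g. years=["2020","2021"], as step outputs
      - id: params
        run: |
          echo "years=$(yq -o=json -I=0 '.years' params.yaml)" >> "$GITHUB_OUTPUT"
          echo "states=$(yq -o=json -I=0 '.states' params.yaml)" >> "$GITHUB_OUTPUT"
  run-job:
    needs: setup-jobs
    runs-on: ubuntu-latest
    strategy:
      matrix:
        year: ${{ fromJSON(needs.setup-jobs.outputs.years) }}
        state: ${{ fromJSON(needs.setup-jobs.outputs.states) }}
    steps:
      # One job per year/state combination
      - run: echo "Calculating travel times for ${{ matrix.year }} / ${{ matrix.state }}"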

  • The query layer uses a single DuckDB database file with views that point to static Parquet files via HTTP. This lets you query a table with hundreds of billions of records after downloading just the ~5MB pointer file.

This is a really creative use of DuckDB's feature that lets you run queries against large data from a laptop using HTTP range queries to avoid downloading the whole thing.

The README shows how to use that from R and Python - I got this working in the duckdb client (brew install duckdb):

INSTALL httpfs;
LOAD httpfs;
ATTACH 'https://data.opentimes.org/databases/0.0.1.duckdb' AS opentimes;

SELECT origin_id, destination_id, duration_sec
  FROM opentimes.public.times
  WHERE version = '0.0.1'
      AND mode = 'car'
      AND year = '2024'
      AND geography = 'tract'
      AND state = '17'
      AND origin_id LIKE '17031%' LIMIT 10;

In answer to a question about adding public transit times Dan said:

In the next year or so maybe. The biggest obstacles to adding public transit are:

  • Collecting all the necessary scheduling data (e.g. GTFS feeds) for every transit system in the country. Not insurmountable since there are services that do this currently.
  • Finding a routing engine that can compute nation-scale travel time matrices quickly. Currently, the two fastest open-source engines I've tried (OSRM and Valhalla) don't support public transit for matrix calculations and the engines that do support public transit (R5, OpenTripPlanner, etc.) are too slow.

GTFS is a popular CSV-based format for sharing transit schedules - here's an official list of available feed directories.

# 17th March 2025, 10:49 pm / open-data, github-actions, openstreetmap, duckdb, gis, cloudflare, parquet

Here’s how I use LLMs to help me write code

Online discussions about using Large Language Models to help write code inevitably produce comments from developers whose experiences have been disappointing. They often ask what they’re doing wrong—how come some people are reporting such great results when their own experiments have proved lacking?

[... 5,179 words]

simonw/git-scraper-template. I built this new GitHub template repository in preparation for a workshop I'm giving at NICAR (the data journalism conference) next week on Cutting-edge web scraping techniques.

One of the topics I'll be covering is Git scraping - creating a GitHub repository that uses scheduled GitHub Actions workflows to grab copies of websites and data feeds and store their changes over time using Git.

This template repository is designed to be the fastest possible way to get started with a new Git scraper: simply create a new repository from the template and paste the URL you want to scrape into the description field, and the repository will be initialized with a custom script that scrapes and stores that URL.

It's modeled after my earlier shot-scraper-template tool which I described in detail in Instantly create a GitHub repository to take screenshots of a web page.

The new git-scraper-template repo took some help from Claude to figure out. It uses a custom script to download the provided URL and derive a filename to use based on the URL and the content type, detected using file --mime-type -b "$file_path" against the downloaded file.

It also detects if the downloaded content is JSON and, if it is, pretty-prints it using jq - I find this is a quick way to generate much more useful diffs when the content changes.
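
Here's a hedged sketch of the kind of scheduled scrape-and-commit loop involved - illustrative rather than the template's exact script, with a placeholder URL and a fixed filename in place of the derived one:

# Illustrative Git scraping workflow - not the template's actual script.
on:
  schedule:
    - cron: "0 * * * *"   # hourly
  workflow_dispatch:

permissions:
  contents: write

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch the URL and store it
        run: |
          curl -sL "$SCRAPE_URL" -o data.raw
          mime=$(file --mime-type -b data.raw)
          if [ "$mime" = "application/json" ]; then
            # Pretty-print JSON so future diffs are easier to read
            jq . data.raw > data.json && rm data.raw
          fi
        env:
          SCRAPE_URL: https://example.com/feed.json   # placeholder
      - name: Commit if anything changed
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git diff --staged --quiet || (git commit -m "Scraped at $(date -u)" && git push)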

# 26th February 2025, 5:34 am / github-actions, nicar, projects, git-scraping, data-journalism, git, github, scraping

Using a Tailscale exit node with GitHub Actions. New TIL. I started running a git scraper against doge.gov to track changes made to that website over time. The DOGE site runs behind Cloudflare which was blocking requests from the GitHub Actions IP range, but I figured out how to run a Tailscale exit node on my Apple TV and use that to proxy my shot-scraper requests.

The scraper is running in simonw/scrape-doge-gov. It uses the new shot-scraper har command I added in shot-scraper 1.6 (and improved in shot-scraper 1.7).
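
The rough shape of the workflow steps involved, assuming Tailscale's official GitHub Action - the secret names and exit node IP here are placeholders:

# Join the tailnet, then route the runner's traffic through an exit node
# (in my case one running on an Apple TV). Placeholders throughout.
- uses: tailscale/github-action@v2
  with:
    oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
    oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
    tags: tag:ci
- name: Use the exit node
  run: sudo tailscale set --exit-node=100.64.0.1   # placeholder Tailscale IP
- name: Scrape from the exit node's IP address
  run: |
    pip install shot-scraper
    shot-scraper install
    shot-scraper har https://www.doge.gov/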

# 23rd February 2025, 2:49 am / scraping, github-actions, tailscale, til, git-scraping, github, shot-scraper

Run LLMs on macOS using llm-mlx and Apple’s MLX framework

llm-mlx is a brand new plugin for my LLM Python Library and CLI utility which builds on top of Apple’s excellent MLX array framework library and mlx-lm package. If you’re a terminal user or Python developer with a Mac this may be the new easiest way to start exploring local Large Language Models.

[... 1,524 words]

Using pip to install a Large Language Model that’s under 100MB

I just released llm-smollm2, a new plugin for LLM that bundles a quantized copy of the SmolLM2-135M-Instruct LLM inside of the Python package.

[... 1,553 words]

2024

PyPI now supports digital attestations (via) Dustin Ingram:

PyPI package maintainers can now publish signed digital attestations when publishing, in order to further increase trust in the supply-chain security of their projects. Additionally, a new API is available for consumers and installers to verify published attestations.

This has been in the works for a while, and is another component of PyPI's approach to supply chain security for Python packaging - see PEP 740 – Index support for digital attestations for all of the underlying details.

A key problem this solves is cryptographically linking packages published on PyPI to the exact source code that was used to build those packages. In the absence of this feature there are no guarantees that the .tar.gz or .whl file you download from PyPI hasn't been tampered with (to add malware, for example) in a way that's not visible in the published source code.

These new attestations provide a mechanism for proving that a known, trustworthy build system was used to generate and publish the package, starting with its source code on GitHub.

The good news is that if you're using the PyPI Trusted Publishers mechanism in GitHub Actions to publish packages, you're already using this new system. I wrote about that system in January: Publish Python packages to PyPI with a python-lib cookiecutter template and GitHub Actions - and hundreds of my own PyPI packages are already using that system, thanks to my various cookiecutter templates.
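
For reference, the general shape of such a publishing workflow is something like this (a simplified sketch rather than the exact template output):

# Simplified Trusted Publishers release workflow sketch.
name: Publish Python Package

on:
  release:
    types: [created]

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write   # required for Trusted Publishing via OIDC
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: pip install build
      - run: python -m build
      # No API token needed: the action exchanges the OIDC token for a
      # short-lived PyPI credential, and attestations are generated by default
      - uses: pypa/gh-action-pypi-publish@release/v1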

Trail of Bits helped build this feature, and provide extra background about it on their own blog in Attestations: A new generation of signatures on PyPI:

As of October 29, attestations are the default for anyone using Trusted Publishing via the PyPA publishing action for GitHub. That means roughly 20,000 packages can now attest to their provenance by default, with no changes needed.

They also built Are we PEP 740 yet? (key implementation here) to track the rollout of attestations across the 360 most downloaded packages from PyPI. It works by hitting URLs such as https://pypi.org/simple/pydantic/ with an Accept: application/vnd.pypi.simple.v1+json header - here's the JSON that returns.

I published an alpha package using Trusted Publishers last night and the files for that release are showing the new provenance information already:

Provenance. The following attestation bundles were made for llm-0.18a0-py3-none-any.whl:
Publisher: publish.yml on simonw/llm
Attestations:
  Statement type: https://in-toto.io/Statement/v1
  Predicate type: https://docs.pypi.org/attestations/publish/v1
  Subject name: llm-0.18a0-py3-none-any.whl
  Subject digest: dde9899583172e6434971d8cddeb106bb535ae4ee3589cb4e2d525a4526976da
  Sigstore transparency entry: 148798240
  Sigstore integration time: about 18 hours ago

Which links to this Sigstore log entry with more details, including the Git hash that was used to build the package:

X509v3 extensions:
  Key Usage (critical):
  - Digital Signature
  Extended Key Usage:
  - Code Signing
  Subject Key Identifier:
  - 4E:D8:B4:DB:C1:28:D5:20:1A:A0:14:41:2F:21:07:B4:4E:EF:0B:F1
  Authority Key Identifier:
    keyid: DF:D3:E9:CF:56:24:11:96:F9:A8:D8:E9:28:55:A2:C6:2E:18:64:3F
  Subject Alternative Name (critical):
    url:
    - https://github.com/simonw/llm/.github/workflows/publish.yml@refs/tags/0.18a0
  OIDC Issuer: https://token.actions.githubusercontent.com
  GitHub Workflow Trigger: release
  GitHub Workflow SHA: 041730d8b2bc12f62cfe41c44b62a03ef4790117
  GitHub Workflow Name: Publish Python Package
  GitHub Workflow Repository: simonw/llm
  GitHub Workflow Ref: refs/tags/0.18a0
  OIDC Issuer (v2): https://token.actions.githubusercontent.com
  Build Signer URI: https://github.com/simonw/llm/.github/workflows/publish.yml@refs/tags/0.18a0
  Build Signer Digest: 041730d8b2bc12f62cfe41c44b62a03ef4790117

Sigstore is a transparency log maintained by the Open Source Security Foundation (OpenSSF), a sub-project of the Linux Foundation.

# 14th November 2024, 7:56 pm / packaging, pypi, python, supply-chain, github, dustin-ingram, github-actions, psf

Generating Descriptive Weather Reports with LLMs. Drew Breunig produces the first example I've seen in the wild of the new LLM attachments Python API. Drew's Downtown San Francisco Weather Vibes project combines output from a JSON weather API with the latest image from a webcam pointed at downtown San Francisco to produce a weather report "with a style somewhere between Jack Kerouac and J. Peterman".

Here's the Python code that constructs and executes the prompt. The code runs in GitHub Actions.

# 29th October 2024, 11:12 pm / vision-llms, drew-breunig, llm, generative-ai, ai, llms, github-actions, prompt-engineering

UV with GitHub Actions to run an RSS to README project. Jeff Triplett demonstrates a very neat pattern for using uv to run Python scripts with their dependencies inside of GitHub Actions. First, add uv to the workflow using the setup-uv action:

- uses: astral-sh/setup-uv@v3
  with:
    enable-cache: true
    cache-dependency-glob: "*.py"

This enables the caching feature, which stores uv's own cache of downloads from PyPI between runs. The cache-dependency-glob key ensures that this cache will be invalidated if any .py file in the repository is updated.

Now you can run Python scripts using steps that look like this:

- run: uv run fetch-rss.py

If that Python script begins with some dependency definitions (PEP 723) they will be automatically installed by uv run on the first run and reused from the cache in the future. From the start of fetch-rss.py:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "feedparser",
#     "typer",
# ]
# ///

uv will download the required Python version and cache that as well.
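
Putting those pieces together, a complete scheduled workflow might look something like this (the schedule and file names are illustrative, not Jeff's exact configuration):

# Illustrative composite of the pattern described above.
name: Fetch RSS and update README

on:
  schedule:
    - cron: "0 6 * * *"   # daily
  workflow_dispatch:

jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
        with:
          enable-cache: true
          cache-dependency-glob: "*.py"
      # uv reads the PEP 723 header, installs feedparser and typer on the
      # first run, then reuses the cache on subsequent runs
      - run: uv run fetch-rss.py
      # (a commit-and-push step for the updated README would follow here)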

# 5th October 2024, 11:39 pm / uv, jeff-triplett, github-actions, python

New improved commit messages for scrape-hacker-news-by-domain. My simonw/scrape-hacker-news-by-domain repo has a very specific purpose. Once an hour it scrapes the Hacker News /from?site=simonwillison.net page (and the equivalent for datasette.io) using my shot-scraper tool and stashes the parsed links, scores and comment counts in JSON files in that repo.

It does this mainly so I can subscribe to GitHub's Atom feed of the commit log - visit simonw/scrape-hacker-news-by-domain/commits/main and add .atom to the URL to get that.

NetNewsWire will inform me within about an hour if any of my content has made it to Hacker News, and the repo will track the score and comment count for me over time. I wrote more about how this works in Scraping web pages from the command line with shot-scraper back in March 2022.

Prior to the latest improvement, the commit messages themselves were pretty uninformative. The message had the date, and to actually see which Hacker News post it was referring to, I had to click through to the commit and look at the diff.

I built my csv-diff tool a while back to help address this problem: it can produce a slightly more human-readable version of a diff between two CSV or JSON files, ideally suited for including in a commit message attached to a git scraping repo like this one.

I got that working, but there was still room for improvement. I recently learned that any Hacker News thread has an undocumented URL at /latest?id=x which displays the most recently added comments at the top.

I wanted that in my commit messages, so I could quickly click a link to see the most recent comments on a thread.

So... I added one more feature to csv-diff: a new --extra option lets you specify a Python format string to be used to add extra fields to the displayed difference.

My GitHub Actions workflow now runs this command:

csv-diff simonwillison-net.json simonwillison-net-new.json \
  --key id --format json \
  --extra latest 'https://news.ycombinator.com/latest?id={id}' \
  >> /tmp/commit.txt

This generates the diff between the two versions, using the id property in the JSON to tie records together. It adds a latest field linking to that URL.

The commits now look like this:

Fri Sep 6 05:22:32 UTC 2024. 1 row changed. id: 41459472 points: "25" => "27" numComments: "7" => "8" extras: latest: https://news.ycombinator.com/latest?id=41459472
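
The workflow then uses that file as the commit message. Here's a hedged sketch of the commit step (the actual script in the repo differs in the details):

- name: Commit and push if anything changed
  run: |
    git config user.name "Automated"
    git config user.email "actions@users.noreply.github.com"
    git add -A
    # /tmp/commit.txt holds the date plus the csv-diff output shown above
    git diff --staged --quiet || (git commit -F /tmp/commit.txt && git push)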

# 6th September 2024, 5:40 am / shot-scraper, github-actions, projects, hacker-news, git-scraping, json

GitHub Actions: Faster Python runs with cached virtual environments (via) Adam Johnson shares his improved pattern for caching Python environments in GitHub Actions.

I've been using the pattern where you add cache: pip to the actions/setup-python block, but it has two disadvantages: if the tests fail the cache won't be saved at the end, and it still spends time installing the packages despite not needing to download them fresh since the wheels are in the cache.

Adam's pattern works differently: he caches the entire .venv/ folder between runs, avoiding the overhead of installing all of those packages. He also wraps the block that installs the packages between explicit actions/cache/restore and actions/cache/save steps to avoid the case where failed tests skip the cache persistence.
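
The general shape of that pattern is something like this - a simplified sketch, so see Adam's post for the exact keys and paths he recommends:

- uses: actions/setup-python@v5
  id: setup-python
  with:
    python-version: "3.13"
- uses: actions/cache/restore@v4
  id: cache-venv
  with:
    path: .venv
    key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}
- name: Install dependencies on a cache miss
  if: steps.cache-venv.outputs.cache-hit != 'true'
  run: |
    python -m venv .venv
    .venv/bin/pip install -r requirements.txt
# Save the cache before the tests run, so a failing test suite
# doesn't prevent the environment from being persisted
- uses: actions/cache/save@v4
  if: steps.cache-venv.outputs.cache-hit != 'true'
  with:
    path: .venv
    key: ${{ steps.cache-venv.outputs.cache-primary-key }}
- name: Run tests
  run: .venv/bin/pytest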

# 19th July 2024, 2:14 pm / adam-johnson, github-actions, python

qrank (via) Interesting and very niche project by Colin Dellow.

Wikidata has pages for huge numbers of concepts, people, places and things.

One of the many pieces of data they publish is QRank—“ranking Wikidata entities by aggregating page views on Wikipedia, Wikispecies, Wikibooks, Wikiquote, and other Wikimedia projects”. Every item gets a score and these scores can be used to answer questions like “which island nations get the most interest across Wikipedia”—potentially useful for things like deciding which labels to display on a highly compressed map of the world.

QRank is published as a gzipped CSV file.

Colin’s hikeratlas/qrank GitHub repository runs weekly, fetches the latest qrank.csv.gz file and loads it into a SQLite database using SQLite’s “.import” mechanism. Then it publishes the resulting SQLite database as an asset attached to the “latest” GitHub release on that repo—currently a 307MB file.

The database itself has just a single table mapping the Wikidata ID (a primary key integer) to the latest QRank—another integer. You’d need your own set of data with Wikidata IDs to join against this to do anything useful.

I’d never thought of using GitHub Releases for this kind of thing. I think it’s a really interesting pattern.
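
The overall recipe looks something like this - a rough sketch rather than Colin's actual workflow, with the download URL and release tag treated as placeholders:

# Rough sketch: fetch the ranking, load it into SQLite, attach it to a release.
on:
  schedule:
    - cron: "0 3 * * 1"   # weekly
  workflow_dispatch:

permissions:
  contents: write   # needed to upload release assets with GITHUB_TOKEN

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch QRank and load it into SQLite
        run: |
          curl -sL "$QRANK_URL" | gunzip > qrank.csv
          printf '.mode csv\n.import qrank.csv qrank\n' | sqlite3 qrank.db
        env:
          QRANK_URL: https://example.org/qrank.csv.gz   # placeholder for the published file
      - name: Attach the database to the latest release
        run: gh release upload latest qrank.db --clobber   # "latest" stands in for the release tag
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}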

# 21st April 2024, 10:28 pm / wikipedia, github-actions, sqlite, colin-dellow

GitHub Actions: Introducing the new M1 macOS runner available to open source! Set “runs-on: macos-14” to run a GitHub Actions workflow on an ARM M1 runner with 7GB of RAM. I have been looking forward to this for ages: it should make it much easier to build releases of both Electron apps and Python binary wheels for Apple Silicon.
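
A build matrix that targets the new runner alongside Intel macOS and Linux might look something like this (illustrative, not from a specific project):

jobs:
  build-wheels:
    strategy:
      matrix:
        # macos-13 runners are Intel; macos-14 is the new Apple Silicon M1 runner
        os: [ubuntu-latest, macos-13, macos-14]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install build
      - run: python -m build --wheel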

# 31st January 2024, 2:04 am / github-actions, macosx

Publish Python packages to PyPI with a python-lib cookiecutter template and GitHub Actions

I use cookiecutter to start almost all of my Python projects. It helps me quickly generate a skeleton of a project with my preferred directory structure and configured tools.

[... 686 words]

2022

Tracking Mastodon user numbers over time with a bucket of tricks

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.

[... 1,534 words]

Leveraging ’shot-scraper’ and creating image diffs. Üllar Seerme has a neat recipe for using shot-scraper and ImageMagick to create differential animations showing how a scraped web page has visually changed.

# 24th October 2022, 9:34 pm / imagemagick, shot-scraper, github-actions

How to create a Python package in 2022 (via) Fantastic tutorial on modern Python packaging by Rodrigo Girão Serrão. I’ve been meaning to figure out Poetry for a while now and this gave me exactly the information I needed to start figuring it out. Great coverage of GitHub Actions, Tox and pre-commit as well.

# 15th October 2022, 10:10 pm / packaging, python, github-actions

Automating screenshots for the Datasette documentation using shot-scraper

I released shot-scraper back in March as a tool for keeping screenshots in documentation up-to-date.

[... 1,810 words]

Software engineering practices

Gergely Orosz started a Twitter conversation asking about recommended “software engineering practices” for development teams.

[... 1,557 words]

A tool to run caption extraction against online videos using Whisper and GitHub Issues/Actions

I released a new project this weekend, built during the Bellingcat Hackathon (I came second!). It’s called Action Transcription and it’s a tool for capturing captions and transcripts from online videos.

[... 1,362 words]

upptime (via) “Open-source uptime monitor and status page, powered entirely by GitHub Actions, Issues, and Pages.” This is a very creative (ab)use of GitHub Actions: it runs a scheduled action to check the availability of sites that you specify, records the results in a YAML file (with the commit history tracking them over time) and can automatically open a GitHub issue for you if it detects a new incident.
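
The general trick - a scheduled workflow that checks a URL, commits the result and files an issue when the check fails - looks something like this. This is a heavily simplified illustration, not upptime's actual implementation:

# Heavily simplified illustration of the pattern - upptime itself does much more.
name: Uptime check

on:
  schedule:
    - cron: "*/5 * * * *"   # every five minutes

permissions:
  contents: write
  issues: write

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check the site and record the result
        run: |
          mkdir -p history
          status=$(curl -s -o /dev/null -w "%{http_code}" https://example.com)
          echo "- $(date -u +%FT%TZ): $status" >> history/example-com.yml
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add history
          git commit -m "Status check: $status"
          git push
          # Open an issue for errors (a real tool would de-duplicate open incidents)
          if [ "$status" -ge 400 ] || [ "$status" -eq 0 ]; then
            gh issue create --title "example.com is down (HTTP $status)" --body "Automated check failed at $(date -u)."
          fi
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}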

# 26th May 2022, 3:53 am / github-actions

simonw/datasette-screenshots (via) I started a new GitHub repository to automate taking screenshots of Datasette for marketing purposes, using my shot-scraper browser automation tool.

# 17th May 2022, 5:56 pm / projects, shot-scraper, github-actions, datasette

Supercharging GitHub Actions with Job Summaries (via) GitHub Actions workflows can now generate a rendered Markdown summary of, well, anything that you can think to generate as part of the workflow execution. I particularly like the way this is designed: they provide a filename in a $GITHUB_STEP_SUMMARY environment variable which you can then append data to from each of your steps.
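
Here's roughly what that looks like in a workflow step (the summary content is just an example):

- name: Write a job summary
  run: |
    {
      echo "## Test results"
      echo ""
      echo "| Suite | Passed | Failed |"
      echo "| ----- | ------ | ------ |"
      echo "| unit  | 124    | 0      |"
    } >> "$GITHUB_STEP_SUMMARY"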

# 16th May 2022, 11:02 pm / github-actions

Automatically opening issues when tracked file content changes

I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.

[... 1,211 words]

Building a Covid sewage Twitter bot (and other weeknotes)

I built a new Twitter bot today: @covidsewage. It tweets a daily screenshot of the latest Covid sewage monitoring data published by Santa Clara county.

[... 1,079 words]

Instantly create a GitHub repository to take screenshots of a web page

I just released shot-scraper-template, a GitHub repository template that helps you start taking automated screenshots of a web page by filling out a form.

[... 1,177 words]

Scraping web pages from the command line with shot-scraper

I’ve added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.

[... 1,277 words]

@newshomepages (via) Ben Welsh used my shot-scraper tool and GitHub Actions to launch a Twitter bot which tweets screenshots of newspaper homepages on a scheduled basis. Ben says: “The tech is so easy, I was able to pull it off in a couple hours at zero cost. A decade ago I ran a similar project using the cloud resources of the day. [...] It costs thousands of dollars and the screenshots were of much lower quality. Incredible progress!”

# 12th March 2022, 7:21 pm / twitter, shot-scraper, github-actions, playwright, ben-welsh

shot-scraper: automated screenshots for documentation, built on Playwright

shot-scraper is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool—hence the name—which I picked as a complement to my git scraping and help scraping techniques.

[... 1,802 words]