Simon Willison’s Weblog

Git scraping, the five minute lightning talk

5th March 2021

I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC’s vaccination data using the GitHub web interface. Here’s the video.

Notes from the talk

Here’s the PG&E outage map that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.
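That "sort by size" trick can also be approximated offline. The sketch below is not from the talk: it assumes you have exported the network tab as a HAR file (most browser developer tools offer a "save as HAR" option) and loaded it with `json.load`. The function name `largest_json_resources` is hypothetical; the `log.entries[].response.content` fields it reads are part of the HAR format.

```python
def largest_json_resources(har: dict, top: int = 5) -> list:
    """Return (size_in_bytes, url) pairs for the biggest JSON responses in a HAR export."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime = content.get("mimeType", "")
        # Keep only responses whose MIME type mentions JSON, e.g. application/json.
        if "json" not in mime.lower():
            continue
        found.append((content.get("size", 0), entry["request"]["url"]))
    # Largest first, mirroring "order resources by size" in the network tab.
    found.sort(key=lambda pair: pair[0], reverse=True)
    return found[:top]
```

The first few URLs it returns are the ones worth opening by hand to see whether they hold the data behind the page.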

I scraped that outage data into simonw/pge-outages—here’s the commit history (over 40,000 commits now!)

The scraper code itself is here. I wrote about the project in detail in Tracking PG&E outages by scraping to a git repo. My database of outages is at pge-outages.simonwillison.net and the animation I made of outages over time is attached to this tweet.

The much simpler scraper for the www.fire.ca.gov/incidents website is at simonw/ca-fires-history.

In the video I used that as the template to create a new scraper for CDC vaccination data—their website is https://covid.cdc.gov/covid-data-tracker/#vaccinations and the API I found using the browser developer tools is https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data.
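For readers who want to try the core step locally, here is a minimal Python sketch of what a Git scraper does on each run: fetch the JSON, rewrite it in a stable, pretty-printed form so that Git diffs show real changes, and only touch the file when something changed. This is an illustration under my own assumptions, not the code from the talk. The helper names (`fetch_bytes`, `stable_json`, `save_if_changed`) are hypothetical, and sorting keys is a choice I made to keep diffs small.

```python
import json
import urllib.request
from pathlib import Path


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def stable_json(raw: bytes) -> str:
    """Re-serialize JSON with fixed indentation and sorted keys for readable diffs."""
    data = json.loads(raw)
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


def save_if_changed(path: Path, text: str) -> bool:
    """Write text to path only if it differs from what is already there.

    Returns True when the file was created or updated, False otherwise.
    """
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False
    path.write_text(text, encoding="utf-8")
    return True
```

A scheduled job could call `save_if_changed(Path("vaccination_data.json"), stable_json(fetch_bytes(url)))` with the API URL above (the file name is a placeholder) and run `git commit` only when it returns `True`, so the commit history records just the moments the data changed.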

The new CDC scraper and the data it has scraped live in simonw/cdc-vaccination-history.

You can find more examples of Git scraping in the git-scraping GitHub topic.