Git scraping, the five minute lightning talk
I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC’s vaccination data using the GitHub web interface. Here’s the video.
Notes from the talk
Here’s the PG&E outage map that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.
The scraper code itself is here. I wrote about the project in detail in Tracking PG&E outages by scraping to a git repo—my database of outages database is at pge-outages.simonwillison.net and the animation I made of outages over time is attached to this tweet.
Here’s a video animation of PG&E’s outages from October 5th up until just a few minutes ago pic.twitter.com/50K3BrROZR- Simon Willison (@simonw) October 28, 2019
In the video I used that as the template to create a new scraper for CDC vaccination data—their website is https://covid.cdc.gov/covid-data-tracker/#vaccinations and the API I found using the browser developer tools is https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data.
The new CDC scraper and the data it has scraped lives in simonw/cdc-vaccination-history.
You can find more examples of Git scraping in the git-scraping GitHub topic.