Git scraping, the five minute lightning talk
5th March 2021
I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC’s vaccination data using the GitHub web interface. Here’s the video.
Notes from the talk
Here’s the PG&E outage map that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.
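The same trick can be applied offline. Most browsers let you export the network tab as a HAR file, which is plain JSON, and a short script can then rank the JSON responses by size. This is a minimal sketch, assuming such an export exists; the helper names `largest_json_resources` and `load_har` are made up for illustration and are not part of the talk.

```python
import json


def largest_json_resources(har, limit=5):
    """Return (size_in_bytes, url) pairs for JSON responses, largest first."""
    found = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        # HAR records the response MIME type, e.g. "application/json; charset=utf-8"
        if "json" in content.get("mimeType", "").lower():
            found.append((content.get("size", 0), entry["request"]["url"]))
    return sorted(found, reverse=True)[:limit]


def load_har(path):
    """Load a HAR export from disk (hypothetical helper; HAR files are JSON)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `largest_json_resources(load_har("network.har"))` on an export saved under that (assumed) file name would list the biggest JSON responses first, which is usually where the interesting data is.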
I scraped that outage data into simonw/pge-outages—here’s the commit history (over 40,000 commits now!)
The scraper code itself is here. I wrote about the project in detail in Tracking PG&E outages by scraping to a git repo. My database of outages is at pge-outages.simonwillison.net and the animation I made of outages over time is attached to this tweet.
Here’s a video animation of PG&E’s outages from October 5th up until just a few minutes ago pic.twitter.com/50K3BrROZR
- Simon Willison (@simonw) October 28, 2019
The much simpler scraper for the www.fire.ca.gov/incidents website is at simonw/ca-fires-history.
In the video I used that as the template to create a new scraper for CDC vaccination data—their website is https://covid.cdc.gov/covid-data-tracker/#vaccinations and the API I found using the browser developer tools is https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data.
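For readers who prefer a script to the web interface, the fetch step might look like the following Python sketch. It is not necessarily how the scraper in the video is written. The URL is the one quoted above and may no longer respond; the function names are assumptions for illustration. Writing the JSON with sorted keys and fixed indentation keeps the diffs between snapshots limited to real data changes.

```python
import json
import urllib.request

# URL taken from the notes above; it may have changed since the talk.
CDC_URL = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data"


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def save_snapshot(data, path):
    """Write JSON with stable formatting so Git diffs show only real changes."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, sort_keys=True)
        f.write("\n")
```

A scheduled job would then call something like `save_snapshot(fetch_json(CDC_URL), "vaccination_data.json")`, where the output file name is again only an example.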
The new CDC scraper and the data it has scraped live in simonw/cdc-vaccination-history.
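What turns a scheduled fetch into a history is committing only when the data has actually changed. The sketch below shows that step in Python, assuming `git` is installed and a committer name and email are configured. The function names and the injectable `run` parameter are illustrative choices, not taken from the repository, and pushing is left out.

```python
import subprocess


def run_git(args, cwd):
    """Run a git subcommand in cwd and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout


def commit_if_changed(repo_dir, message, run=run_git):
    """Commit everything in repo_dir if the working tree differs from HEAD.

    Returns True when a commit was made, False when nothing had changed.
    """
    # --porcelain prints one line per modified or untracked file, or nothing.
    status = run(["status", "--porcelain"], repo_dir)
    if not status.strip():
        return False
    run(["add", "-A"], repo_dir)
    run(["commit", "-m", message], repo_dir)
    return True
```

Passing `run` as a parameter is only there so the logic can be exercised without a real repository; a scheduled job would call `commit_if_changed(".", "Latest data")` after each fetch.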
You can find more examples of Git scraping in the git-scraping GitHub topic.