Git scraping, the five minute lightning talk
5th March 2021
I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC’s vaccination data using the GitHub web interface. Here’s the video.
Notes from the talk
Here’s the PG&E outage map that I scraped. The trick here is to open the browser developer tools network tab, then order resources by size and see if you can find the JSON resource that contains the most interesting data.
I scraped that outage data into simonw/pge-outages—here’s the commit history (over 40,000 commits now!)
The scraper code itself is here. I wrote about the project in detail in Tracking PG&E outages by scraping to a git repo—my database of outages database is at pge-outages.simonwillison.net and the animation I made of outages over time is attached to this tweet.
Here’s a video animation of PG&E’s outages from October 5th up until just a few minutes ago pic.twitter.com/50K3BrROZR
- Simon Willison (@simonw) October 28, 2019
The much simpler scraper for the www.fire.ca.gov/incidents website is at simonw/ca-fires-history.
In the video I used that as the template to create a new scraper for CDC vaccination data—their website is https://covid.cdc.gov/covid-data-tracker/#vaccinations and the API I found using the browser developer tools is https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data.
The new CDC scraper and the data it has scraped lives in simonw/cdc-vaccination-history.
You can find more examples of Git scraping in the git-scraping GitHub topic.
More recent articles
- llamafile is the new best way to run a LLM on your own computer - 29th November 2023
- Prompt injection explained, November 2023 edition - 27th November 2023
- I'm on the Newsroom Robots podcast, with thoughts on the OpenAI board - 25th November 2023
- Weeknotes: DevDay, GitHub Universe, OpenAI chaos - 22nd November 2023
- Deciphering clues in a news article to understand how it was reported - 22nd November 2023
- Exploring GPTs: ChatGPT in a trench coat? - 15th November 2023
- Financial sustainability for open source projects at GitHub Universe - 10th November 2023
- ospeak: a CLI tool for speaking text in the terminal via OpenAI - 7th November 2023
- DALL-E 3, GPT4All, PMTiles, sqlite-migrate, datasette-edit-schema - 30th October 2023
- Now add a walrus: Prompt engineering in DALL-E 3 - 26th October 2023