Simon Willison on git-scraping

28 items tagged “git-scraping”

Git scraping is a technique where data is scraped from an external source into a Git repository in order to record changes to that data over time.
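A minimal sketch of that loop, assuming the scraped data has already been fetched as a string and that the target directory is an existing Git working tree with a committer identity configured. The function names `write_if_changed` and `commit_snapshot` are hypothetical, chosen for illustration only.

```python
import subprocess
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content to path; return True only if the file is new or differs."""
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True


def commit_snapshot(repo: Path, relative_name: str, content: str) -> bool:
    """Record one scraped snapshot, committing only when the data changed.

    Assumption: `repo` is an existing Git working tree.
    """
    if not write_if_changed(repo / relative_name, content):
        return False  # nothing new, so no commit is created
    subprocess.run(["git", "add", relative_name], cwd=repo, check=True)
    subprocess.run(
        ["git", "commit", "-m", f"Latest data: {relative_name}"],
        cwd=repo,
        check=True,
    )
    return True
```

Run on a schedule, each call either does nothing or adds one commit, so the commit log becomes the record of when the data changed.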

2024

interactive-feed (via) Sam Morris maintains this project which gathers interactive, graphic and data visualization stories from various newsrooms around the world and publishes them on Twitter, Mastodon and Bluesky.

It runs automatically using GitHub Actions, and gathers data using a number of different techniques - XML feeds, custom API integrations (for the NYT, Guardian and Washington Post) and in some cases by scraping index pages on news websites using CSS selectors and cheerio.

The data it collects is archived as JSON in the data/ directory of the repository.

# 5th July 2024, 11:39 pm / data-journalism, git-scraping

Figure out who’s leaving the company: dump, diff, repeat (via) Rachel Kroll describes a neat hack for companies with an internal LDAP server or similar machine-readable employee directory: run a cron somewhere internal that grabs the latest version and diffs it against the previous to figure out who has joined or left the company.

I suggest using Git for this—a form of Git scraping—as then you get a detailed commit log of changes over time effectively for free.
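The "diff" step can be sketched in a few lines, assuming each dump has been reduced to a set of unique identifiers such as usernames. The function name `diff_directory` and the sample names are hypothetical.

```python
def diff_directory(previous: set, latest: set) -> tuple:
    """Compare two directory dumps and return (joined, left) as sorted lists."""
    joined = sorted(latest - previous)  # present now, absent before
    left = sorted(previous - latest)    # present before, absent now
    return joined, left


# Hypothetical sample data for illustration.
yesterday = {"alice", "bob"}
today = {"bob", "carol"}
joined, left = diff_directory(yesterday, today)
```

Storing each dump as a sorted text file in Git gives the same information through `git diff`, with the history kept alongside it.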

I really enjoyed Rachel’s closing thought: “Incidentally, if someone gets mad about you running this sort of thing, you probably don’t want to work there anyway. On the other hand, if you’re able to build such tools without IT or similar getting ‘threatened’ by it, then you might be somewhere that actually enjoys creating interesting and useful stuff. Treasure such places. They don’t tend to last.”

# 9th February 2024, 5:44 am / git, git-scraping

2022

Tracking Mastodon user numbers over time with a bucket of tricks

Mastodon is definitely having a moment. User growth is skyrocketing as more and more people migrate over from Twitter.

[... 1,534 words]

7 am / 20th November 2022 / github, projects, datasette, observable, github-actions, git-scraping, git-history, s3-credentials, datasette-lite, mastodon

Measuring traffic during the Half Moon Bay Pumpkin Festival

This weekend was the 50th annual Half Moon Bay Pumpkin Festival.

[... 2,693 words]

3:41 pm / 19th October 2022 / data-journalism, projects, sqlite, datasette, git-scraping, git-history, datasette-lite, half-moon-bay

Half Moon Bay Pumpkin Festival traffic on Saturday 15th October 2022 (via) It’s the Half Moon Bay Pumpkin Festival this weekend... and its impact on the traffic between our little town of El Granada and Half Moon Bay—8 minutes drive away—is notorious. So I built a git scraper that archives estimated driving times from the Google Maps Navigation API, and used git-history to turn that scraped data into a SQLite database and visualize it on a chart.

# 16th October 2022, 3:56 am / projects, git-scraping, git-history, half-moon-bay

Automatically opening issues when tracked file content changes

I figured out a GitHub Actions pattern to keep track of a file published somewhere on the internet and automatically open a new repository issue any time the contents of that file changes.

[... 1,211 words]

5:18 pm / 28th April 2022 / github, gzip, projects, python, datasette, asgi, github-actions, git-scraping

Scraping web pages from the command line with shot-scraper

I’ve added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.

[... 1,276 words]

1:29 am / 14th March 2022 / github, hacker-news, scraping, github-actions, git-scraping, shot-scraper

shot-scraper: automated screenshots for documentation, built on Playwright

shot-scraper is a new tool that I’ve built to help automate the process of keeping screenshots up-to-date in my documentation. It also doubles as a scraping tool—hence the name—which I picked as a complement to my git scraping and help scraping techniques.

[... 1,802 words]

12:13 am / 10th March 2022 / documentation, projects, scraping, github-actions, git-scraping, puppeteer, playwright, shot-scraper

Help scraping: track changes to CLI tools by recording their --help using Git

I’ve been experimenting with a new variant of Git scraping this week which I’m calling Help scraping. The key idea is to track changes made to CLI tools over time by recording the output of their --help commands in a Git repository.
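A rough sketch of the capture step, assuming the tool accepts a `--help` flag and that the output is written into a directory tracked by Git. The helper names `capture_help` and `record_help` are hypothetical, and committing the resulting file is left to a separate step.

```python
import subprocess
import sys
from pathlib import Path


def capture_help(command: list) -> str:
    """Run `command --help` and return everything it printed."""
    result = subprocess.run(
        command + ["--help"], capture_output=True, text=True, check=False
    )
    # Some tools print help to stderr, so keep both streams.
    return result.stdout + result.stderr


def record_help(command: list, out_dir: Path) -> Path:
    """Save the help text to out_dir/<tool name>.txt and return that path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (Path(command[0]).name + ".txt")
    target.write_text(capture_help(command), encoding="utf-8")
    return target
```

For example, `record_help([sys.executable], Path("help"))` would store the Python interpreter's own help text; any change in a later run then shows up as a diff.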

[... 978 words]

11:46 pm / 2nd February 2022 / git, github, projects, scraping, graphql, weeknotes, github-actions, git-scraping, fly

2021

Weeknotes: Shaving some beautiful yaks

I’ve been mostly shaving yaks this week—two in particular: the Datasette table refactor and the next release of git-history. I also built and released my first Web Component!

[... 1,307 words]

3:43 am / 1st December 2021 / projects, npm, datasette, webcomponents, weeknotes, git-scraping

Weeknotes: Apache proxies in Docker containers, refactoring Datasette

Updates to six major projects this week, plus finally some concrete progress towards Datasette 1.0.

[... 1,630 words]

5:43 am / 22nd November 2021 / apache, proxies, refactoring, supervisord, docker, datasette, weeknotes, git-scraping, sqlite-utils

Weeknotes: git-history, created for a Git scraping workshop

My main project this week was a 90 minute workshop I delivered about Git scraping at Coda.Br 2021, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, git-history, plus smaller improvements to a range of other projects.

[... 1,239 words]

4:10 am / 15th November 2021 / aws, projects, s3, talks, teaching, datasette, weeknotes, git-scraping, sqlite-utils, git-history, s3-credentials

Weeknotes: CDC vaccination history fixes, developing in GitHub Codespaces

I spent the last week mostly surrounded by boxes: we’re completing our move to the new place and life is mostly unpacking now. I did find some time to fix some issues with my CDC vaccination history Datasette instance though.

[... 514 words]

1:53 am / 28th September 2021 / github, projects, weeknotes, covid19, git-scraping, github-codespaces

Flat Data. New project from the GitHub OCTO (the Office of the CTO, love that backronym) somewhat inspired by my work on Git scraping: I’m really excited to see GitHub embracing git for CSV/JSON data in this way. Flat incorporates a reusable Action for scraping and storing data (using Deno), a VS Code extension for setting up those workflows and a very nicely designed Flat Viewer web app for browsing CSV and JSON data hosted on GitHub.

# 19th May 2021, 1:05 am / github, git-scraping

Weeknotes: SpatiaLite 5, Datasette on Azure, more CDC vaccination history

This week I got SpatiaLite 5 working in the Datasette Docker image, improved the CDC vaccination history git scraper, figured out Datasette on Azure and we closed on a new home!

[... 986 words]

5:19 am / 28th March 2021 / aws, azure, datasette, weeknotes, git-scraping

Weeknotes: Datasette and Git scraping at NICAR, VaccinateCA

This week I virtually attended the NICAR data journalism conference and made a ton of progress on the Django backend for VaccinateCA (see last week).

[... 773 words]

7:29 am / 7th March 2021 / data-journalism, youtube, datasette, weeknotes, git-scraping, vaccinate-ca, nicar

Git scraping, the five minute lightning talk

I prepared a lightning talk about Git scraping for the NICAR 2021 data journalism conference. In the talk I explain the idea of running scheduled scrapers in GitHub Actions, show some examples and then live code a new scraper for the CDC’s vaccination data using the GitHub web interface. Here’s the video.

[... 289 words]

12:44 am / 5th March 2021 / data-journalism, scraping, talks, github-actions, git-scraping, annotated-talks, nicar

2020

Weeknotes: sqlite-utils 3.0 alpha, Git scraping in the zeitgeist

Natalie and I decided to escape San Francisco for election week, and have been holed up in Fort Bragg on the Northern California coast. I’ve mostly been on vacation, but I did find time to make some significant changes to sqlite-utils. Plus notes on an exciting Git scraping project.

[... 603 words]

2:17 am / 7th November 2020 / alex-gaynor, elections, weeknotes, git-scraping, sqlite-utils

nyt-2020-election-scraper. Brilliant application of git scraping by Alex Gaynor and a growing team of contributors. Takes a JSON snapshot of the NYT’s latest election poll figures every five minutes, then runs a Python script to iterate through the history and build an HTML page showing the trends, including what percentage of the remaining votes each candidate needs to win each state. This is the perfect case study in why it can be useful to take a “snapshot of the world right now” data source and turn it into a git revision history over time.
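The "percentage of the remaining votes needed" figure can be illustrated with a simplified model. This is not the project's actual code: it assumes a two-candidate race in which every remaining vote goes to one of the two, and the function name `share_needed` is hypothetical.

```python
from typing import Optional


def share_needed(
    trailing_votes: int, leading_votes: int, remaining: int
) -> Optional[float]:
    """Fraction of remaining votes the trailing candidate needs to draw level.

    Assumption: all remaining votes are split between these two candidates.
    Returns None when no votes remain; a value above 1.0 means the gap
    cannot be closed.
    """
    if remaining <= 0:
        return None
    deficit = leading_votes - trailing_votes
    # Solve s * remaining - (1 - s) * remaining = deficit for s.
    return (1 + deficit / remaining) / 2
```

With 100 votes against 120 and 100 votes outstanding, the result is 0.6: winning 60 of the remaining 100 leaves both candidates on 160, so anything above 60% wins.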

# 6th November 2020, 2:24 pm / alex-gaynor, data-journalism, elections, git, new-york-times, git-scraping

Datasette Weekly: Datasette 0.50, git scraping, extracting columns (via) The first edition of the new Datasette Weekly newsletter—covering Datasette 0.50, Git scraping, extracting columns with sqlite-utils and featuring datasette-graphql as the first “plugin of the week”

# 10th October 2020, 9 pm / email, projects, sqlite, graphql, datasette, git-scraping, sqlite-utils

Git scraping: track changes over time by scraping to a Git repository

Git scraping is the name I’ve given a scraping technique that I’ve been experimenting with for a few years now. It’s really effective, and more people should use it.

[... 963 words]

6:27 pm / 9th October 2020 / git, github, projects, scraping, github-actions, git-scraping

Weeknotes: datasette-auth-passwords, a Datasette logo and a whole lot more

All sorts of project updates this week.

[... 913 words]

3:41 am / 17th July 2020 / design, passwords, projects, datasette, weeknotes, covid19, git-scraping

Weeknotes: Archiving coronavirus.data.gov.uk, custom pages and directory configuration in Datasette, photos-to-sqlite

I mainly made progress on three projects this week: Datasette, photos-to-sqlite and a cleaner way of archiving data to a git repository.

[... 1,132 words]

7:41 pm / 29th April 2020 / git, http, matthew-somerville, photos, projects, datasette, weeknotes, covid19, git-scraping

Tracking FARA by deploying a data API using GitHub Actions and Cloud Run

I’m using the combination of GitHub Actions and Google Cloud Run to retrieve data from the U.S. Department of Justice FARA website and deploy it as a queryable API using Datasette.

[... 1,599 words]

7:51 am / 21st January 2020 / continuous-deployment, continuous-integration, data-journalism, github, projects, datasette, cloudrun, github-actions, git-scraping

2019

Tracking PG&E outages by scraping to a git repo

PG&E have cut off power to several million people in northern California, supposedly as a precaution against wildfires.

[... 868 words]

11:32 pm / 10th October 2019 / data-journalism, projects, scraping, sqlite, datasette, git-scraping, digitalocean, sqlite-utils

Generating a commit log for San Francisco’s official list of trees

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

[... 1,051 words]

2:49 pm / 13th March 2019 / csv, data-journalism, git, projects, san-francisco, git-scraping

2017

Changelogs to help understand the fires in the North Bay

The situation in the counties north of San Francisco is horrifying right now. I’ve repurposed some of the tools I built for the Irma Response project last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.

[... 383 words]

6:48 am / 10th October 2017 / screenscraping, crisishacking, git-scraping

Scraping hurricane Irma

The Irma Response project is a team of volunteers working together to make information available during and after the storm. There is a huge amount of information out there, on many different websites. The Irma API is an attempt to gather key information in one place, verify it and publish it in a reusable way. It currently powers the irmashelters.org website.

[... 438 words]

6:21 am / 10th September 2017 / screenscraping, crisishacking, git-scraping