Simon Willison’s Weblog

Subscribe

Generating a commit log for San Francisco’s official list of trees

13th March 2019

San Francisco has a neat open data portal (as do an increasingly large number of cities these days). For a few years my favourite file on there has been Street Tree List, a list of all 190,000 trees in the city maintained by the Department of Public Works.

I’ve been using that file for Datasette demos for a while now, but last week I noticed something intriguing: the file had been recently updated. On closer inspection it turns out it’s updated on a regular basis! I had assumed it was a static snapshot of trees at a certain point in time, but I was wrong: Street_Tree_List.csv is a living document.

Back in September 2017 I built a scraping project relating to hurricane Irma. The idea was to take data sources like FEMA’s list of open shelters and track them over time, by scraping them into a git repository and committing after every fetch.

I’ve been meaning to spend more time with this idea, and building a commit log for San Francisco’s trees looked like an ideal opportunity to do so.

sf-tree-history

Here’s the result: sf-tree-history, a git repository dedicated to recording the history of changes made to the official list of San Francisco’s trees. The repo contains three things: the latest copy of Street_Tree_List.csv, a README, and a Circle CI configuration that grabs a new copy of the file every night and, if it has changed, commits it to git and pushes the result to GitHub.

The most interesting part of the repo is the commit history itself. I’ve only been running the script for just over a week, but I already have some useful illustrative commits:

  • 7ab432cdcb8d7914cfea4a5b59803f38cade532b from March 6th records three new trees added to the file: two Monterey Pines and a Blackwood Acacia.
  • d6b258959af9546909b2eee836f0156ed88cd45d from March 12th shows four changes made to existing records. Of particular interest: TreeID 235981 (a Cherry Plum) had its address updated from 412 Webster St to 410 Webster St and its latitude and longitude tweaked a little bit as well.
  • ca66d9a5fdd632549301d249c487004a5b68abf2 lists 2151 rows changed, 1280 rows added! I found an old copy of Street_Tree_List.csv on my laptop from April 2018, so for fun I loaded it into the repository and used git commit amend to back-date the commit to almost a year ago. I generated a commit message between that file and the version from 9 days ago which came in at around 10,000 lines of text. Git handled that just fine, but GitHub’s web view sadly truncates it.

csv-diff

One of the things I learned from my hurricane Irma project was the importance of human-readable commit messages that summarize the detected changes. I initially wrote some code to generate those by hand, but then realized that this could be extracted into a reusable tool.

The result is csv-diff, a tiny Python CLI tool which can generate a human (or machine) readable version of the differences between two CSV files.

Using it looks like this:

$ csv-diff one.csv two.csv --key=id
1 row added, 1 row removed, 1 row changed

1 row added

  {"id": "3", "name": "Bailey", "age": "1"}

1 row removed

  {"id": "2", "name": "Pancakes", "age": "2"}

1 row changed

  Row 1
    age: "4" => "5"

The csv-diff README has further details on the tool.

Circle CI

My favourite thing about the sf-tree-history project is that it costs me nothing to run—either in hosting costs or (hopefully) in terms of ongoing maintenance.

The git repository is hosted for free on GitHub. Because it’s a public project, Circle CI will run tasks against it for free.

My .circleci/config.yml does the rest. It uses Circle’s cron syntax to schedule a task that runs every night. The task then runs this script (embedded in the YAML configuration):

cp Street_Tree_List.csv Street_Tree_List-old.csv
curl -o Street_Tree_List.csv "https://data.sfgov.org/api/views/tkzw-k3nq/rows.csv?accessType=DOWNLOAD"
git add Street_Tree_List.csv
git config --global user.email "treebot@example.com"
git config --global user.name "Treebot"
sudo pip install csv-diff
csv-diff Street_Tree_List-old.csv Street_Tree_List.csv --key=TreeID > message.txt
git commit -F message.txt && \
  git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master \
  || true

This script does all of the work.

  1. First it backs up the existing Street_Tree_list.csv as Street_Tree_List-old.csv, in order to be able to run a comparison later.
  2. It downloads the latest copy of Street_Tree_List.csv from the San Francisco data portal
  3. It adds the file to the git index and sets itself an identity for use in the commit
  4. It installs my csv-diff utility from PyPI
  5. It uses csv-diff to create a diff of the two files, and writes that diff to a new file called message.txt
  6. Finally, it attempts to create a new commit using message.txt as the commit message, then pushes the result to GitHub

The last line is the most complex. Circle CI will mark a build as failed if any of the commands in the run block return a non-0 exit code. git commit returns a non-0 exit code if you attempt to run it but none of the files have changed.

git commit ... && git push ... || true ensures that if git commit succeeds the git push command will be run, BUT if it fails the || true will still return a 0 exit code for the overall line—so Circle CI will not mark the build as failed.

There’s one last trick here: I’m using git push -q https://${GITHUB_PERSONAL_TOKEN}@github.com/simonw/sf-tree-history.git master to push my changes to GitHub. This takes advantage of Circle CI environment variables, which are the recommended way to configure secrets such that they cannot be viewed by anyone browsing your Circle CI builds. I created a personal GitHub auth token for this project, which I’m using to allow Circle CI to push commits to GitHub on my behalf.

Next steps

I’m really excited about this pattern of using GitHub in combination with Circle CI to track changes to any file that is being posted on the internet. I’m opening up the code (and my csv-diff utility) in the hope that other people will use them to set up their own tracking projects. Who knows, maybe there’s a file out there that’s even more exciting than San Francisco’s official list of trees!

This is Generating a commit log for San Francisco’s official list of trees by Simon Willison, posted on 13th March 2019.

Part of series Git scraping

  1. Scraping hurricane Irma - Sept. 10, 2017, 6:21 a.m.
  2. Changelogs to help understand the fires in the North Bay - Oct. 10, 2017, 6:48 a.m.
  3. Generating a commit log for San Francisco's official list of trees - March 13, 2019, 2:49 p.m.
  4. Tracking PG&E outages by scraping to a git repo - Oct. 10, 2019, 11:32 p.m.
  5. Git scraping: track changes over time by scraping to a Git repository - Oct. 9, 2020, 6:27 p.m.
  6. Git scraping, the five minute lightning talk - March 5, 2021, 12:44 a.m.
  7. git-history: a tool for analyzing scraped data collected using Git and SQLite - Dec. 7, 2021, 10:32 p.m.
  8. … more

Next: Running Datasette on Glitch

Previous: I commissioned an oil painting of Barbra Streisand’s cloned dogs