Simon Willison’s Weblog

A CSV export, JSON import workflow for bulk updating our data

28th April 2021

Originally posted to my internal blog at VaccinateCA

I just added missing counties to around 1200 of our locations using a combination of tricks, and I thought they’d make a good blog post.

County is an optional field on our location model, and we sometimes import new locations without attaching a county to them.
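
Our location model lives in a Django application (VIAL), so "optional" here just means the county foreign key is nullable. Here's a simplified, hypothetical sketch of what that kind of field looks like; the names and options are illustrative, not VIAL's actual code:

from django.db import models

class Location(models.Model):
    # Hypothetical, simplified sketch; the real VIAL model has many more
    # fields and may differ in the details.
    name = models.CharField(max_length=255)
    state = models.ForeignKey("State", on_delete=models.PROTECT)
    # null=True / blank=True is what makes county optional, which is how
    # locations can end up imported without one.
    county = models.ForeignKey(
        "County", null=True, blank=True, on_delete=models.PROTECT
    )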

How big is the problem?

I ran this SQL query in our dashboard to figure out how bad the problem was:

select
  state.name,
  count(*) as number_of_locations_with_no_county
from
  location join state on state.id = location.state_id
where
  location.county_id is null
group by
  state.name

I got back this:

name        number_of_locations_with_no_county
Arkansas    1
Arizona     90
California  1
Oregon      1113

I decided to start with Oregon, since I knew that was my fault (I imported a bunch of data for Oregon over a month ago and clearly didn’t bother with counties).

Exporting the bad data as CSV

I used another dashboard query to export the location public ID, latitude and longitude for every location in Oregon that was missing a county:

select
  public_id, latitude, longitude
from
  location
where
  state_id = (
    select id from state where name = 'Oregon'
  )
  and county_id is null

I exported those results as CSV, then switched into a Jupyter notebook running on my laptop.

The full script

Here’s the script I ran there:

import csv
import httpx
import json

rows = list(csv.DictReader(open("select-public-id--latitude--lof05de6.csv")))

def county_for_point(latitude, longitude):
    return httpx.get("https://us-counties.datasette.io/counties/county_for_latitude_longitude.json?_shape=array", params={
        "latitude": latitude,
        "longitude": longitude,
    }).json()[0]['county_name']

counties = {}

# Progress bar:
from tqdm.notebook import tqdm

for row in tqdm(rows):
    counties[row["public_id"]] = county_for_point(
        row["latitude"], row["longitude"]
    )

update = {
  "update": {
    key: {
      "state": "OR",
      "county": value
    } for key, value in counties.items()
  },
  "revision_comment": "Added missing counties"
}

print(json.dumps(update))

Finding the county for a latitude/longitude point

The county_for_point() function uses my us-counties.datasette.io API, which looks up the county containing a given latitude/longitude point using US Census shapefiles. The source code for that is at simonw/us-counties-datasette.

You can try the API out here: https://us-counties.datasette.io/counties/county_for_latitude_longitude?longitude=-122.676968&latitude=45.266289

Add .json and ?_shape=array to get the result as a JSON array:

https://us-counties.datasette.io/counties/county_for_latitude_longitude.json?longitude=-122.676968&latitude=45.266289&_shape=array

[
  {
    "state_fips": "41",
    "state": "OR",
    "county_fips": "41005",
    "county_name": "Clackamas",
    "COUNTYNS": "01155127",
    "AFFGEOID": "0500000US41005",
    "GEOID": "41005",
    "LSAD": "06",
    "ALAND": 4845034547,
    "AWATER": 31873103
  }
]
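
One thing to watch for: county_for_point() indexes straight into the first element of that array, so a point that doesn't fall inside any US county (which presumably comes back as an empty array) would raise an IndexError. A slightly more defensive sketch, returning None for unmatched points, might look like this:

import httpx

def county_for_point_or_none(latitude, longitude):
    # Same API call as county_for_point(), but tolerant of points that
    # fall outside every US county.
    results = httpx.get(
        "https://us-counties.datasette.io/counties/county_for_latitude_longitude.json?_shape=array",
        params={"latitude": latitude, "longitude": longitude},
    ).json()
    if not results:
        return None
    return results[0]["county_name"]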

Progress bars in Jupyter

I used tqdm to display a progress bar inside my Jupyter notebook. It makes it easy to wrap an iterable Python object in a way that shows a progress bar as it processes the list:

from tqdm.notebook import tqdm

for row in tqdm(rows):
    counties[row["public_id"]] = county_for_point(
        row["latitude"], row["longitude"]
    )

[Screenshot: tqdm progress bar running in the notebook]
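
The tqdm.notebook import renders a widget-based bar inside Jupyter. If you run the same loop as a plain script instead, the top-level tqdm import gives you a text progress bar in the terminal with no other changes:

from tqdm import tqdm  # text-mode progress bar for terminals

for row in tqdm(rows):
    counties[row["public_id"]] = county_for_point(
        row["latitude"], row["longitude"]
    )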

Building JSON to send to our API

The /api/updateLocations API is documented here. It’s a bulk update API which accepts a JSON body looking something like this:

{
  "update": {
    "rec9Zc6A08cEWyNpR": {
      "name": "Berkeley Clinic II"
    },
    "lgzgq": {
      "phone_number": "(555) 555-5551"
    }
  },
  "revision_comment": "New details"
}

It lets you apply a partial update to a whole list of locations at once.

In our case we only want to update the county (though we need to send the state too in order to correctly validate that the county name belongs to that state). My Python code to generate the update document looks like this:

update = {
  "update": {
    key: {
      "state": "OR",
      "county": value
    } for key, value in counties.items()
  },
  "revision_comment": "Added missing counties"
}

print(json.dumps(update))

A pretty-printed subset of the output looks like this:

{
    "update": {
        "lrzkd": {
            "state": "AZ",
            "county": "Maricopa"
        },
        "lrzmg": {
            "state": "AZ",
            "county": "Pima"
        },
        "lrzmr": {
            "state": "AZ",
            "county": "Cochise"
        }
    },
    "revision_comment": "Added missing counties"
}

Pasting that into our API explorer

Having generated this API update document, I used our interactive API explorer tool to execute the update:

https://vial.calltheshots.us/api/updateLocations/debug

Almost all of our API endpoints have an interface like this, which is mainly intended for trying things out but also works for making ad-hoc requests like this one.

[Screenshot: the interactive form for /api/updateLocations in the API explorer]
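
The explorer is the route I took, but the same JSON document could be POSTed from the notebook itself. Here's a minimal sketch using httpx; the endpoint path is the one documented above, while the bearer-token Authorization header is an assumption about how VIAL API keys are passed, so check the API docs before relying on it:

import httpx

API_TOKEN = "..."  # a VIAL API key would go here (hypothetical placeholder)

response = httpx.post(
    "https://vial.calltheshots.us/api/updateLocations",
    json=update,  # the update document built in the notebook above
    headers={"Authorization": f"Bearer {API_TOKEN}"},  # assumed auth scheme
)
response.raise_for_status()
print(response.json())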

Tying it together with a saved dashboard

I built a saved dashboard (using the VIAL admin) called locations-with-no-county to tie all of this together.

Initially I thought the dashboard could be used by our web bankers to find locations missing a county—so I had it return ten random locations without a county, so multiple web bankers could use it at the same time with a low risk of overlapping efforts.

Then I realized I could automate it myself, so that aspect of the dashboard turned out not to be useful—though we may well use it for smaller fixing jobs in the future.

https://vial.calltheshots.us/dashboard/locations-with-no-county/

[Screenshot: the dashboard showing ten random locations with no county]

Doing this by generating JSON entirely inside PostgreSQL

I used another variant of this trick last week to add concordance identifiers to our locations, using the /api/updateLocationConcordances API.

I wanted to add a walgreens:12134 style concordance identifier (using the store number from each location's name) to every one of our locations that was a Walgreens. You can follow my progress in this issue. I used this SQL query to identify all of our Walgreens locations by running a PostgreSQL regular expression against the name column, and then to generate the API update JSON using PostgreSQL's JSON functions:

with extracted_idrefs as (
  select
    public_id, name,
    (regexp_match(name, 'Walgreens.*#0*([0-9]{1,8})', 'i')) [1] as idref
  from
    location
  where
    regexp_match(name, 'Walgreens.*#0*([0-9]{1,8})', 'i') is not null
)
select
  json_build_object(
    'update',
    json_object_agg(
      public_id,
      json_build_object('add', json_build_array('walgreens:' || idref), 'comment', name)
    )
  )
from
  extracted_idrefs
where
  public_id not in (
    select
      public_id
    from
      location
    where
      id in (
        select
          location_id
        from
          concordance_location
        where
          concordanceidentifier_id in (
            select
              id
            from
              concordance_identifier
            where
              authority = 'walgreens'
          )
      )
  )

That last where clause avoids generating JSON for locations that already have a walgreens:X concordance identifier.
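
As a side note, that regular expression behaves the same way in Python's re module, which makes it easy to sanity-check against a sample name or two before running the full query:

import re

pattern = re.compile(r"Walgreens.*#0*([0-9]{1,8})", re.IGNORECASE)

# The first name is from the query output below; the second is a made-up
# example showing the case-insensitive flag and the leading-zero stripping.
for name in ["Walgreens Co. #10076", "WALGREENS PHARMACY #05862"]:
    match = pattern.search(name)
    print(name, "->", match.group(1) if match else None)
# Walgreens Co. #10076 -> 10076
# WALGREENS PHARMACY #05862 -> 5862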

The output of the query looks something like this:

{
  "update": {
    "lrxzc": {
      "add": [
        "walgreens:10076"
      ],
      "comment": "Walgreens Co. #10076"
    },
    "lrxzd": {
      "add": [
        "walgreens:10082"
      ],
      "comment": "Walgreens Co. #10082"
    },
    "lrxzf": {
      "add": [
        "walgreens:10160"
      ],
      "comment": "Walgreens Co. #10160"
    }
  }
}

As before, this can be executed directly using this API debugging tool: https://vial.calltheshots.us/api/updateLocationConcordances/debug
