Simon Willison’s Weblog

On postgresql, projects, gis, management, sql, ...

 

Recent entries

Weeknotes: Python 3.7 on Glitch, datasette-render-markdown three days ago

Streaks is really working well for me. I’m at 12 days of commits to Datasette, 16 posting a daily Niche Museum, 19 of actually reviewing my email inbox and 14 of guitar practice. I rewarded myself for that last one by purchasing an actual classical (as opposed to acoustic) guitar.

Datasette

One downside: since my aim is to land a commit to Datasette master every day, I’m incentivised to land small changes. I have a bunch of much larger Datasette projects in the works—I think my goal for the next week should be to land one of those. Contenders include:

I’m going to redefine my daily goal to include pushing in-progress work to Datasette branches in an attempt to escape that false incentive.

New datasette-csvs using Python 3.7 on Glitch

The main reason I’ve been strict about keeping Datasette compatible with Python 3.5 is that it was the only version supported by Glitch, and Glitch has become my favourite tool for getting people up and running with Datasette quickly.

There’s been a long running Glitch support thread requesting an upgrade, and last week it finally bore fruit. Projects on Glitch now get python3 pointing to Python 3.7.5 instead!

This actually broke my datasette-csvs project at first, because for some reason under Python 3.7 the Pandas dependency used by csvs-to-sqlite started taking up too much space from the 200MB Glitch instance quota. I ended up working around this by switching over to using my sqlite-utils CLI tool instead, which has much lighter dependencies.

I’ve shared the new code for my Glitch project in the datasette-csvs repo on GitHub.

The one thing missing from sqlite-utils insert my.db mytable myfile.csv --csv right now is the ability to run it against multiple files at once—something csvs-to-sqlite handles really well. I ended up finally learning how to write loops in bash and came up with the following install.sh shell script:

pip3 install -U -r requirements.txt --user && \
  mkdir -p .data && \
  rm .data/data.db || true && \
  for f in *.csv
    do
        sqlite-utils insert .data/data.db "${f%.*}" "$f" --csv
    done

${f%.*} is the bash incantation for stripping off the file extension—so the above evaluates to this for each of the CSV files it finds in the root directory:

$ sqlite-utils insert .data/data.db trees trees.csv --csv
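The same loop translates pretty directly to the sqlite-utils Python API. Here’s a rough sketch of that approach (not the code the Glitch project actually runs):

import csv
import glob
import pathlib

import sqlite_utils

db = sqlite_utils.Database(".data/data.db")

for path in glob.glob("*.csv"):
    table = pathlib.Path(path).stem  # same trick as ${f%.*} in bash
    with open(path, newline="") as f:
        db[table].insert_all(csv.DictReader(f), alter=True)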

github-to-sqlite releases

I released github-to-sqlite 0.6 with a new sub-command:

$ github-to-sqlite releases github.db simonw/datasette

It grabs all of the releases for a repository using the GitHub releases API.
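Under the hood that’s the /repos/{owner}/{repo}/releases endpoint. A minimal sketch of the same idea in Python—not the real github-to-sqlite implementation, which does quite a bit more:

import requests
import sqlite_utils

def fetch_releases(db_path, repo, token=None):
    headers = {"Authorization": "token {}".format(token)} if token else {}
    url = "https://api.github.com/repos/{}/releases".format(repo)
    releases = requests.get(url, headers=headers).json()
    # GitHub gives every release a numeric id, which makes a natural primary key;
    # nested objects (author, assets) end up stored as JSON strings
    db = sqlite_utils.Database(db_path)
    db["releases"].upsert_all(releases, pk="id", alter=True)

fetch_releases("github.db", "simonw/datasette")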

I’m using this for my personal Dogsheep instance, but I’m also planning to use this for the forthcoming Datasette website—I want to pull together all of the releases of all of the Datasette Ecosystem of projects in one place.

I decided to exercise my new bash while skills and write a script, run by cron once an hour, that fetches all of my repos (from both my simonw account and my dogsheep GitHub organization) and then fetches their releases.

Since I don’t want to fetch releases for all 257 of my personal GitHub repos—just the repos which relate to Datasette—I started applying a new datasette-io topic (for datasette.io, my planned website domain) to the repos that I want to pull releases from.

Then I came up with this shell script monstrosity:

#!/bin/bash
# Fetch repos for simonw and dogsheep
github-to-sqlite repos github.db simonw dogsheep -a auth.json

# Fetch releases for the repos tagged 'datasette-io'
sqlite-utils github.db "
select full_name from repos where rowid in (
    select repos.rowid from repos, json_each(repos.topics) j
    where j.value = 'datasette-io'
)" --csv --no-headers | while read repo;
    do github-to-sqlite releases \
            github.db $(echo $repo | tr -d '\r') \
            -a auth.json;
        sleep 2;
    done;

Here’s an example of the database this produces, running on Cloud Run: https://github-to-sqlite-releases-j7hipcg4aq-uc.a.run.app

I’m using the ability of sqlite-utils to run a SQL query and return the results as CSV, but without the header row. Then I pipe the results through a while loop and use them to call the github-to-sqlite releases command against each repo.

I ran into a weird bug which turned out to be caused by the CSV output using \r\n which was fed into github-to-sqlite releases as simonw/datasette\r—I fixed that using $(echo $repo | tr -d '\r').

datasette-render-markdown

Now that I have a releases database table with all of the releases of my various packages I want to be able to browse them in one place. I fired up Datasette and realized that the most interesting information is in the body column, which contains markdown.

So I built a plugin for the render_cell plugin hook which safely renders markdown data as HTML. Here’s the full implementation of the plugin:

import bleach
import markdown
from datasette import hookimpl
import jinja2

ALLOWED_TAGS = [
    "a", "abbr", "acronym", "b", "blockquote", "code", "em",
    "i", "li", "ol", "strong", "ul", "pre", "p", "h1","h2",
    "h3", "h4", "h5", "h6",
]

@hookimpl()
def render_cell(value, column):
    if not isinstance(value, str):
        return None
    # Only convert to markdown if the column name ends in _markdown
    if not column.endswith("_markdown"):
        return None
    # Render it!
    html = bleach.linkify(
        bleach.clean(
            markdown.markdown(value, output_format="html5"),
            tags=ALLOWED_TAGS,
        )
    )
    return jinja2.Markup(html)

This first release of the plugin just looks for column names that end in _markdown and renders those. So the following SQL query does what I need:

select
  json_object("label", repos.full_name, "href", repos.html_url) as repo,
  json_object(
    "href",
    releases.html_url,
    "label",
    releases.name
  ) as release,
  substr(releases.published_at, 0, 11) as date,
  releases.body as body_markdown,
  releases.published_at
from
  releases
  join repos on repos.id = releases.repo
order by
  releases.published_at desc

It aliases releases.body to body_markdown to trigger the markdown rendering, and uses json_object(...) to cause datasette-json-html to render some links.

You can see the results here.

Releases SQL results

More museums

I added another 7 museums to www.niche-museums.com.

  • Dingles Fairground Heritage Centre
  • Ilfracombe Museum
  • Barometer World
  • La Galcante
  • Musée des Arts et Métiers
  • International Women’s Air & Space Museum
  • West Kern Oil Museum

Weeknotes: More releases, more museums 10 days ago

Lots of small releases this week.

Datasette

I released two bug fix releases for Datasette—0.30.1 and 0.30.2. Changelog here. My Dogsheep personal analytics project means I’m using Datasette for my own data analysis every day, which inspires me to fix small but annoying bugs much more aggressively.

I’ve also set myself a Streak goal to land a commit to Datasette every day.

I landed a tiny new feature to master yesterday: a ?column__notin=x,y,z filter, working as an inverse of the existing ?column__in=x,y,z filter. See issue #614 for details.
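So, for example, a hypothetical /mydb/mytable?state__notin=CA,MI URL would return every row whose state column contains anything other than CA or MI.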

More Niche Museums

I’ve been keeping up my streak of adding at least one new museum to www.niche-museums.com every day. This week I added the Pirates Museum in Antananarivo, Madagascar, the David Rumsey Map Center at Stanford, Galerie de Paléontologie et d’Anatomie comparée in Paris, DEVIL-ish Little Things in Vancouver, Washington, Mardi Gras World in New Orleans, Environmental Volunteers EcoCenter in Palo Alto, the Evergreen Aviation & Space Museum (home of the Spruce Goose!) in McMinnville Oregon and Autoservicio Condorito in Mendoza.

Here’s that list of new museums with my photos of them (images rendered using datasette-json-html).

sqlite-transform

I released a new tiny CLI tool for manipulating SQLite databases. sqlite-transform lets you run transformation functions against the values in a specific column of a database. It currently has three sub-commands:

  • parsedate parses strings that look like dates and turns them into YYYY-MM-DD, so they can be sorted.
  • parsedatetime turns them into ISO YYYY-MM-DDTHH:mm:ss timestamps.
  • lambda is the most fun: it lets you provide a snippet of Python code which will be executed against each value to perform a custom transformation. More details in issue #2.

Here’s how to use it to wrap the values in a specific column, including importing the textwrap module from the Python standard library:

$ sqlite-transform lambda my.db mytable mycolumn \
    --code='"\n".join(textwrap.wrap(value, 10))' \
    --import=textwrap

Other releases

Dogsheep

I’ve been having a lot of fun creating new features for my personal Dogsheep analytics site. Many of these take the form of simple HTML added to the private homepage. Most recently I added the ability to search through the people who follow me on Twitter (an evolution of this technique from last year). That feature is entirely implemented as the following HTML form:

  <form action="/twitter/users" method="GET">
    <p>
      <input type="hidden" name="_where" value="id in (select follower_id from following where followed_id = 12497)">
      <input name="_search" type="search" placeholder="Search my Twitter followers"> <input type="submit" value="Search">
    </p>
  </form>

More tree data

I exported all 3.85 million 311 calls from the San Francisco data portal into a database, then extracted out the 80,000 calls that mention trees and loaded them into a separate Datasette instance. You can play with that here—it was the inspiration for creating the sqlite-transform tool because I needed a way to clean up the datetime columns.
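The extraction step itself is easy to sketch with sqlite-utils—though note that the cases table and service_details column below are my guesses at the 311 export’s layout, not the real names:

import sqlite_utils

source = sqlite_utils.Database("sf-311.db")
trees = sqlite_utils.Database("sf-311-trees.db")

# Copy across just the rows whose free-text description mentions trees
cursor = source.execute(
    "select * from cases where service_details like '%tree%'"
)
columns = [col[0] for col in cursor.description]
trees["tree_calls"].insert_all(
    dict(zip(columns, row)) for row in cursor
)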

Weeknotes: Niche Museums, Kepler, Trees and Streaks 17 days ago

Niche Museums

Every now and then someone will ask “so when are you going to build Museums Near Me then?”, based on my obsession with niche museums and websites like www.owlsnearme.com.

For my Strategic Communications course at Stanford last week I had to perform a midterm presentation—a six minute talk to convince my audience of something, accompanied by slides and a handout.

I chose “you should seek out and explore tiny museums” as my topic, and used it as an excuse to finally start the website!

www.niche-museums.com is the result. It’s a small but growing collection of niche museums (17 so far, mostly in the San Francisco Bay Area) complete with the all important blue “Use my location” button to see museums near you.

Naturally I built it on Datasette. I’ll be writing more about the implementation (and releasing the underlying code) soon. I also built a new plugin for it, datasette-haversine.
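datasette-haversine adds a haversine() SQL function for calculating the distance between two latitude/longitude points, which is presumably what powers the “museums near you” ordering. For reference, here’s the underlying formula in plain Python (a sketch, not the plugin’s source):

from math import asin, cos, radians, sin, sqrt

def haversine(lat1, lon1, lat2, lon2, radius=6371):
    # Great-circle distance in km between two latitude/longitude points
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius * asin(sqrt(a))

# e.g. distance from downtown San Francisco to Pier 39
print(haversine(37.7749, -122.4194, 37.8087, -122.4098))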

Mapping museums against Starbucks

I needed a way to emphasize quite how many tiny museums there are in the USA. I decided to do this with a visualization.

It turns out there are 15,891 branches of Starbucks in the USA… and at least 30,132 museums!

15,891 Starbucks

At least 30,132 museums!

I made these maps using a couple of sources.

All The Places is a crowdsourced scraper project which aims to build scrapers for every company that has a “store locator” area of their website. Starbucks has a store locator and All The Places have a scraper for it, so you can download GeoJSON of every Starbucks. I wrote a quick script to import that GeoJSON into Datasette using sqlite-utils.
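That import script is simple enough to sketch: load the GeoJSON, flatten each feature into a row and insert it with sqlite-utils. The property layout here is an assumption about the All The Places output, and the real script may differ:

import json

import sqlite_utils

with open("starbucks.geojson") as f:
    features = json.load(f)["features"]

db = sqlite_utils.Database("starbucks.db")
db["starbucks"].insert_all(
    (
        {
            **feature["properties"],
            # assumes Point geometries, which is what a store locator produces
            "longitude": feature["geometry"]["coordinates"][0],
            "latitude": feature["geometry"]["coordinates"][1],
        }
        for feature in features
    ),
    alter=True,
)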

The Institute of Museum and Library Services is an independent agency of the federal government that supports museums and libraries across the country. They publish a dataset of Museums in the USA as a set of CSV files. I used csvs-to-sqlite to load those into Datasette, then ran a union query to combine the three files together.

So I have Datasette instances (with a CSV export feature) for both Starbucks and USA museums, with latitudes and longitudes for each.

Now how to turn that into a map?

I turned to my new favourite GIS tool, Kepler. Kepler is an open source GIS visualization tool released by Uber, based on WebGL. It’s astonishingly powerful and can be used directly in your browser: click the “Get Started” button on their website (which I assumed would take you to installation instructions, but no, it loads up the entire tool right there).

You can import millions of points of data into Kepler and it will visualize them for you directly. I used a Datasette query to export the CSVs, then loaded in my Starbucks CSV, exported an image, loaded in the Museums CSV as a separate colour and exported a second image. The whole project ended up taking about 15 minutes. Kepler is a great addition to the toolbelt!

Animating the PG&E outages

My PG&E outages scraper continues to record a snapshot of the PG&E outage map JSON every ten minutes. I’m posting updates to a thread on Twitter, but discovering Kepler inspired me to look at more sophisticated visualization options.

This tutorial by Giuseppe Macrì tipped me off to the fact that you can use Kepler to animate points against timestamps!

Here’s the result: a video animation showing how PG&E’s outages have evolved since the 5th of October:

Hayes Valley Trees

The city announced plans to cut down 27 ficus trees in our neighborhood in San Francisco. I’ve been working with Natalie to help a small group of citizens organize an appeal, and this weekend I helped run a survey of the affected trees (recording their exact locations in a CSV file) and then built www.hayes-valley-trees.com (source on GitHub) to link to from fliers attached to each affected tree.

It started out as a Datasette (running on Glitch) but since it’s only 27 data points I ended up freezing the data in a static JSON file to avoid having to tolerate any cold start times. The site is deployed as static assets on Zeit Now using their handy GitHub continuous deployment tool.

Streaks

It turns out I’m very motivated by streaks: I’m at 342 days for Duolingo Spanish and 603 days for an Apple Watch move streak. Could I apply this to other things in my life?

I asked on Twitter and was recommended the Streaks iOS app. It’s beautiful! I’m now tracking streaks for guitar practice, Duolingo, checking email, checking Slack, reading some books and adding a new museum to www.niche-museums.com (if I add one a day I can get from 17 museums today to 382 in a year!)

It seems to be working pretty well so far. I particularly like their iPhone widget.

Streaks widget

Weeknotes: The Squirrel Census, Genome SQL query 24 days ago

This week was mostly about incremental improvements. And squirrels.

The Squirrel Census

Last October a team of 323 volunteer Squirrel Sighters got together to attempt the first ever comprehensive census of the squirrels in New York’s Central Park.

The result was the Central Park Squirrel Census 2019 Report, an absolutely delightful package you can order from their site that includes two beautiful five foot long maps, a written supplemental report and (not kidding) a 45-RPM vinyl audio report.

I bought a copy and everyone I have shown it to has fallen in love with it.

Last week they released the underlying data through the NYC open data portal! Naturally I loaded it into a Datasette instance (running on Glitch, which is also made in New York). You can facet by fur color.

Analyzing my genome with SQL

Natalie and I attended the first half of the last San Francisco Science Hack Day. I made a 44 second pitch asking for help using SQL to analyze my genome, and Dr. Laura Cantino answered my call and helped me figure out how to run some fun queries.

Here’s my favourite example so far:

select rsid, genotype, case genotype
  when 'AA' then 'brown eye color, 80% of the time'
  when 'AG' then 'brown eye color'
  when 'GG' then 'blue eye color, 99% of the time'
end as interpretation from genome where rsid = 'rs12913832'

It works! If I run it against my genome it gets GG (blue eyes), and if I run it against Natalie’s it gets AG (brown).

Running SQL against my genome and Natalie's genome

twitter-to-sqlite crons for Dogsheep

I upgraded my personal Dogsheep instance to start fetching fresh data via cron. This meant improving twitter-to-sqlite to be better at incremental data fetching. Here’s the relevant section of my crontab:

# Fetch new tweets I have tweeted every 10 minutes
1,11,21,31,41,51 * * * * /home/ubuntu/datasette-venv/bin/twitter-to-sqlite user-timeline /home/ubuntu/twitter.db -a /home/ubuntu/auth.json --since

# Fetch most recent 50 tweets I have favourited every 10 minutes
6,16,26,36,46,56 * * * * /home/ubuntu/datasette-venv/bin/twitter-to-sqlite favorites /home/ubuntu/twitter.db -a /home/ubuntu/auth.json --stop_after=50

# Fetch new tweets from my home timeline every 5 minutes
2,7,12,17,22,27,32,37,42,47,52,57 * * * * /home/ubuntu/datasette-venv/bin/twitter-to-sqlite home-timeline /home/ubuntu/timeline.db -a /home/ubuntu/auth.json --since

That last line is quite fun: I’m now hitting the Twitter home-timeline API every five minutes and storing the resulting tweets in a separate timeline.db database. This equates to creating a searchable permanent database of anything tweeted by any of the 3,600+ accounts I follow.

The main thing I’ve learned from this exercise is that I’m really grateful for Twitter’s algorithmic timeline! Accounts I follow tweeted between 13,000 and 22,000 times a day over the past week (/timeline/tweets?_where=id+in+(select+tweet+from+timeline_tweets)&_facet_date=created_at). Trusting the algorithm to show me the highlights frees me to be liberal in following new accounts.

Playing with lit-html

I have a couple of projects brewing where I’ll be building out a JavaScript UI. I used React for both datasette-vega and owlsnearme and it works OK, but it’s pretty weighty (~40KB gzipped) and requires a complex build step.

For my current projects, I’m willing to sacrifice compatibility with older browsers in exchange for less code and no build step.

I considered going completely library free, but having some kind of templating mechanism for constructing complex DOM elements is useful.

I love JavaScript tagged template literals—they’re like Python’s f-strings but even better, because you can provide your own function that executes against every interpolated value—ideal for implementing Django-style autoescaping.

I’ve used them for my own tiny autoescaping implementation in the past, but I’m getting excited about exploring lit-html, which extends the same technique to cover a lot more functionality while still weighing in at less than 4KB gzipped.

Thanks to Laurence Rowe on Twitter I now have a tidy one-liner for running lit-html in the Firefox developer console, loaded from a module:

let {html, render} = await import('https://unpkg.com/lit-html?module');

// Try it out like so:
render(html`<h1>Hello world</h1>`, document.body);

And here’s a full proof-of-concept script showing data loaded from Datasette via fetch() which is then rendered using lit-html:

<div></div>
<script type="module">
import {html, render} from 'https://unpkg.com/lit-html?module';
const myTemplate = (data) => html`<p>Hello ${JSON.stringify(data)}</p>`;
async function go() {
    let div = document.getElementsByTagName("div")[0];
    var data = await (await fetch("https://latest.datasette.io/fixtures/complex_foreign_keys.json?_labels=on&_shape=array")).json();
    render(myTemplate(data), div);
}
go();
</script>

Everything else

I shipped a new release of Datasette (version 0.30) incorporating some bug fixes and the external contributions I talked about last week.

I started work on a new project which has been brewing for a while. Not ready to talk about it yet but I did buy a domain name (through Google Domains this time, but I got some great advice from Twitter on domain registrars with non-terrible UIs in 2019).

Weeknotes: PG&E outages, and Open Source works! one month ago

My big focus this week was the PG&E outages project. I’m really pleased with how this turned out: the San Francisco Chronicle used data from it for their excellent PG&E outage interactive (mixing in data on wind conditions) and it earned a bunch of interest on Twitter and some discussion on Hacker News.

I gave a talk about the project on Thursday for the Online News Association Silicon Valley meetup and made connections with a bunch of interesting journalists from around the Bay Area.

Open Source works!

A highlight of last week was that a bunch of my projects gained new functionality through open source contributions!

Dogsheep

I started a very basic website for my Dogsheep personal analytics project.

I also started running various Dogsheep tools via cron on my personal Dogsheep server, to keep that set of databases automatically updated with my latest activity on various services.

Most excitingly, Tobias Kunze built the first independent Dogsheep-style tool: goodreads-to-sqlite!

As the name suggests, it imports your data from Goodreads into a SQLite database. It inspired me to create a Goodreads account which I will be using to track my book reading activity from now on.

Tobias wrote a fantastic blog post introducing the tool which includes some neat example queries and graphs.

In other Dogsheep news, I added an issue-comments command to github-to-sqlite for fetching all issue comments in a repo. My goal is to evolve that tool to the point where it can import all relevant data from all of my repositories and give me a single Datasette-powered dashboard for keeping track of everything in one place.

Tracking PG&E outages by scraping to a git repo one month ago

PG&E have cut off power to several million people in northern California, supposedly as a precaution against wildfires.

As it happens, I’ve been scraping and recording PG&E’s outage data every 10 minutes for the past 4+ months. This data got really interesting over the past two days!

The original data lives in a GitHub repo (more importantly in the commit history of that repo).

Reading JSON in a Git repo isn’t particularly productive, so this afternoon I figured out how to transform that data into a SQLite database and publish it with Datasette.

The result is pge-outages.simonwillison.net

The data model: outages and snapshots

The three key tables to understand are outages, snapshots and outage_snapshots.

PG&E assign an outage ID to every outage—where an outage is usually something that affects a few dozen customers. I store these in the outages table.

Every 10 minutes I grab a snapshot of their full JSON file, which reports every single outage that is currently ongoing. I store a record of when I grabbed that snapshot in the snapshots table.

The most interesting table is outage_snapshots. Every time I see an outage in the JSON feed, I record a new copy of its data as an outage_snapshot row. This allows me to reconstruct the full history of any outage, in 10 minute increments.
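Recreating that structure with sqlite-utils looks something like this—a minimal sketch, with the columns guessed from the fields mentioned in this post rather than copied from the real schema:

import sqlite_utils

db = sqlite_utils.Database("pge-outages.db")

# One row per PG&E outage ID
db["outages"].create({"id": str}, pk="id")

# One row per ten-minute scrape of the JSON feed
db["snapshots"].create({"id": int, "captured_at": str}, pk="id")

# One row per outage per snapshot it appeared in
db["outage_snapshots"].create(
    {
        "id": int,
        "snapshot": int,
        "outage": str,
        "estCustAffected": int,
        "latitude": float,
        "longitude": float,
    },
    pk="id",
    foreign_keys=[("snapshot", "snapshots"), ("outage", "outages")],
)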

Here are all of the outages that were represented in snapshot 1269—captured at 4:10pm Pacific Time today.

I can run select sum(estCustAffected) from outage_snapshots where snapshot = 1269 (try it here) to count up the total PG&E estimate of the number of affected customers—it’s 545,706!

I’ve installed datasette-vega which means I can render graphs. Here’s my first attempt at a graph showing the number of estimated customers affected over time.

(I don’t know why there’s a dip towards the end of the graph).

I also defined a SQL view which shows all of the outages from the most recently captured snapshot (usually within the past 10 minutes if the PG&E website hasn’t gone down) and renders them using datasette-cluster-map.

Things to be aware of

There are a huge number of unanswered questions about this data. I’ve just been looking at PG&E’s JSON and making guesses about what fields like estCustAffected mean. Without official documentation we can only guess as to how accurate this data is, or how it should be interpreted.

Some things to question:

  • What’s the quality of this data? Does it accurately reflect what’s actually going on out there?
  • What’s the exact meaning of the different columns—estCustAffected, currentEtor, autoEtor, hazardFlag etc?
  • Various columns (lastUpdateTime, currentEtor, autoEtor) appear to be integer unix timestamps. What timezone were they recorded in? Do they include DST etc?

How it works

I originally wrote the scraper back in October 2017 during the North Bay fires, and moved it to run on Circle CI based on my work building a commit history of San Francisco’s trees.

It’s pretty simple: every 10 minutes a Circle CI job runs which scrapes the JSON feed that powers the PG&E website’s outage map.

The JSON is then committed to my pge-outages GitHub repository, over-writing the existing pge-outages.json file. There’s some code that attempts to generate a human-readable commit message, but the historic data itself is saved in the commit history of that single file.
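Stripped down, the job looks something like this—a sketch rather than the real scraper, and the feed URL is just a placeholder, not PG&E’s actual endpoint:

import subprocess

import requests

OUTAGE_FEED_URL = "https://example.com/outages.json"  # placeholder

def scrape_and_commit():
    data = requests.get(OUTAGE_FEED_URL).text
    with open("pge-outages.json", "w") as f:
        f.write(data)
    subprocess.run(["git", "add", "pge-outages.json"], check=True)
    # A real version would skip the commit when nothing has changed
    subprocess.run(
        ["git", "commit", "-m", "Latest PG&E outage snapshot"], check=True
    )
    subprocess.run(["git", "push"], check=True)

if __name__ == "__main__":
    scrape_and_commit()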

Building the Datasette

The hardest part of this project was figuring out how to turn a GitHub commit history of changes to a JSON file into a SQLite database for use with Datasette.

After a bunch of prototyping in a Jupyter notebook, I ended up with the schema described above.

The code that generates the database can be found in build_database.py. I used GitPython to read data from the git repository and my sqlite-utils library to create and update the database.
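The shape of that code is roughly this: GitPython to walk the history of the JSON file, sqlite-utils to write the rows. It’s a sketch—see build_database.py for the real thing—and it assumes the PG&E JSON is a list of outage objects:

import json

import git
import sqlite_utils

repo = git.Repo("pge-outages")
db = sqlite_utils.Database("pge-outages.db")

# Walk the commits that touched the JSON file, oldest first
for commit in reversed(list(repo.iter_commits(paths="pge-outages.json"))):
    blob = commit.tree / "pge-outages.json"
    outages = json.loads(blob.data_stream.read())
    snapshot_id = db["snapshots"].insert(
        {"captured_at": commit.committed_datetime.isoformat()}
    ).last_pk
    for outage in outages:
        db["outage_snapshots"].insert(
            {"snapshot": snapshot_id, **outage}, alter=True
        )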

Deployment

Since this is a large database that changes every ten minutes, I couldn’t use the usual datasette publish trick of packaging it up and re-deploying it to a serverless host (Cloud Run or Heroku or Zeit Now) every time it updates.

Instead, I’m running it on a VPS instance. I ended up trying out Digital Ocean for this, after an enjoyable Twitter conversation about good options for stateful (as opposed to stateless) hosting.

Next steps

I’m putting this out there and sharing it with the California News Nerd community in the hope that people can find interesting stories in there and help firm up my methodology—or take what I’ve done and spin up much more interesting forks of it.

If you build something interesting with this please let me know, via email (swillison is my Gmail) or on Twitter.

Elsewhere

Yesterday

  • I have sometimes wondered how I would fare with a problem where the solution really isn’t in sight. I decided that I should give it a try before I get too old. I’m going to work on artificial general intelligence (AGI). I think it is possible, enormously valuable, and that I have a non-negligible chance of making a difference there, so by a Pascal’s Mugging sort of logic, I should be working on it.

    John Carmack #

12th November 2019

  • Datasette 0.31. Released today: this version adds compatibility with Python 3.8 and breaks compatibility with Python 3.5. Since Glitch support Python 3.7.3 now I decided I could finally give up on 3.5. This means Datasette can use f-strings now, but more importantly it opens up the opportunity to start taking advantage of Starlette, which makes all kinds of interesting new ASGI-based plugins much easier to build. #
  • My Python Development Environment, 2020 Edition (via) Jacob Kaplan-Moss shares what works for him as a Python environment coming into 2020: pyenv, poetry, and pipx. I’m not a frequent user of any of those tools—it definitely looks like I should be. #

7th November 2019

  • pinboard-to-sqlite (via) Jacob Kaplan-Moss just released the second Dogsheep tool that wasn’t written by me (after goodreads-to-sqlite by Tobias Kunze)—this one imports your Pinboard bookmarks. The repo includes a really clean minimal example of how to use GitHub actions to run tests and release packages to PyPI. #

6th November 2019

  • The first ever commit to Sentry (via) This is fascinating: the first 70 lines of code that started the Sentry error tracking project. It’s a straight-forward Django process_exception() middleware method that collects the traceback and the exception class and saves them to a database. The trick of using the md5 hash of the traceback message to de-dupe errors has been there from the start, and remains one of my favourite things about the design of Sentry. #
  • Automate the Boring Stuff with Python: Working with PDF and Word Documents. I stumbled across this while trying to extract some data from a PDF file (the kind of file with actual text in it as opposed to dodgy scanned images) and it worked perfectly: PyPDF2.PdfFileReader(open("file.pdf", "rb")).getPage(0).extractText() #

4th November 2019

  • selenium-demoscraper (via) Really useful minimal example of a Binder project. Click the button to launch a Jupyter notebook in Binder that can take screenshots of URLs using Selenium-controlled headless Firefox. The binder/ folder uses an apt.txt file to install Firefox, requirements.txt to get some Python dependencies and a postBuild Python script to download the Gecko Selenium driver. #
  • Cloud Run Button: Click-to-deploy your git repos to Google Cloud (via) Google Cloud Run now has its own version of the Heroku deploy button: you can add a button to a GitHub repository which, when clicked, will provide an interface for deploying your repo to the user’s own Google Cloud account using Cloud Run. #
  • sqlite-transform. I released a new CLI tool today: sqlite-transform, which lets you run “transformations” against a SQLite database. I built it out of frustration at constantly running into CSV files that use horrible American date formatting—the “sqlite-transform parsedatetime my.db mytable col1” command runs dateutil’s parser against those columns and replaces them with a nice, sortable ISO formatted timestamp. I’ve also added a “sqlite-transform lambda” command that lets you specify Python code directly on the command-line that should be used to transform every value in a specified column. #

2nd November 2019

  • Why you should use `python -m pip` (via) Brett Cannon explains why he prefers “python -m pip install...” to “pip install...”—it ensures you always know exactly which Python interpreter environment you are installing packages for. He also makes the case for always installing into a virtual environment, created using “python -m venv”. #

27th October 2019

  • Calling C functions from BigQuery with web assembly (via) Google BigQuery lets you define custom SQL functions in JavaScript, and it turns out they expose the WebAssembly.instantiate family of APIs. Which means you can write your UDF in C or Rust, compile it to WebAssembly and run it as part of your query! #

26th October 2019

  • Azure Readiness Checklist (via) I love a good comprehensive checklist. This one is focused on large projects running on Azure but it’s still fun to browse through if you are hosting elsewhere, mainly as a reminder of quite how much still goes into deploying large web services into production. #

25th October 2019

  • kepler.gl. Uber built this open source geospatial analysis tool for large-scale data sets, and they offer it as a free hosted online tool—just click Get Started on the site. I uploaded two CSV files with 30,000+ latitude/longitude points in them just now and used Kepler to render them as images. #

21st October 2019

  • Thematic map - GIS Wiki. This is a really useful wiki full of GIS information, and the coverage of different types of thematic maps is particularly thorough. #
  • Setting up Datasette, step by step (via) Tobias describes how he runs Datasette on his own server/VPS, using nginx and systemd. I’m doing something similar for some projects and systemd really does feel like the solution to the “ensure a Python process keeps running” problem I’ve been fighting for over a decade. I really like how Tobias creates a dedicated Linux user for each of his deployed Python projects. #