Simon Willison’s Weblog

Items in 2019

Filters: Year: 2019 ×


I have sometimes wondered how I would fare with a problem where the solution really isn’t in sight. I decided that I should give it a try before I get too old. I’m going to work on artificial general intelligence (AGI). I think it is possible, enormously valuable, and that I have a non-negligible chance of making a difference there, so by a Pascal’s Mugging sort of logic, I should be working on it.

John Carmack # 14th November 2019, 1:18 am

Datasette 0.31. Released today: this version adds compatibility with Python 3.8 and breaks compatibility with Python 3.5. Since Glitch support Python 3.7.3 now I decided I could finally give up on 3.5. This means Datasette can use f-strings now, but more importantly it opens up the opportunity to start taking advantage of Starlette, which makes all kinds of interesting new ASGI-based plugins much easier to build. # 12th November 2019, 6:11 am

My Python Development Environment, 2020 Edition (via) Jacob Kaplan-Moss shares what works for him as a Python environment coming into 2020: pyenv, poetry, and pipx. I’m not a frequent user of any of those tools—it definitely looks like I should be. # 12th November 2019, 1:30 am

Weeknotes: Python 3.7 on Glitch, datasette-render-markdown

Streaks is really working well for me. I’m at 12 days of commits to Datasette, 16 posting a daily Niche Museum, 19 of actually reviewing my email inbox and 14 of guitar practice. I rewarded myself for that last one by purchasing an actual classical (as opposed to acoustic) guitar.

[... 1141 words]

pinboard-to-sqlite (via) Jacob Kaplan-Moss just released the second Dogsheep tool that wasn’t written by me (after goodreads-to-sqlite by Tobias Kunze)—this one imports your Pinterest bookmarks. The repo includes a really clean minimal example of how to use GitHub actions to run tests and release packages to PyPI. # 7th November 2019, 8:46 pm

The first ever commit to Sentry (via) This is fascinating: the first 70 lines of code that started the Sentry error tracking project. It’s a straight-forward Django process_exception() middleware method that collects the traceback and the exception class and saves them to a database. The trick of using the md5 hash of the traceback message to de-dupe errors has been there from the start, and remains one of my favourite things about the design of Sentry. # 6th November 2019, 11:08 pm

Automate the Boring Stuff with Python: Working with PDF and Word Documents. I stumbled across this while trying to extract some data from a PDF file (the kind of file with actual text in it as opposed to dodgy scanned images) and it worked perfectly: PyPDF2.PdfFileReader(open(“file.pdf”, “rb”)).getPage(0).extractText() # 6th November 2019, 4:17 pm

Weeknotes: More releases, more museums

Lots of small releases this week.

[... 538 words]

selenium-demoscraper (via) Really useful minimal example of a Binder project. Click the button to launch a Jupyter notebook in Binder that can take screenshots of URLs using Selenium-controlled headless Firefox. The binder/ folder uses an apt.txt file to install Firefox, requirements.txt to get some Python dependencies and a postBuild Python script to download the Gecko Selenium driver. # 4th November 2019, 3:05 pm

Cloud Run Button: Click-to-deploy your git repos to Google Cloud (via) Google Cloud Run now has its own version of the Heroku deploy button: you can add a button to a GitHub repository which, when clicked, will provide an interface for deploying your repo to the user’s own Google Cloud account using Cloud Run. # 4th November 2019, 4:57 am

sqlite-transform. I released a new CLI tool today: sqlite-transform, which lets you run “transformations” against a SQLite database. I built it out of frustration of constantly running into CSV files that use horrible American date formatting—the “sqlite-transform parsedatetime my.db mytable col1” command runs dateutil’s parser against those columns and replaces them with a nice, sortable ISO formatted timestamp. I’ve also added a “sqlite-transform lambda” command that lets you specify Python code directly on the command-line that should be used to transform every value in a specified column. # 4th November 2019, 2:41 am

Why you should use `python -m pip` (via) Brett Cannon explains why he prefers “python -m pip install...” to “pip install...”—it ensures you always know exactly which Python interpreter environment you are installing packages for. He also makes the case for always installing into a virtual environment, created using “python -m venv”. # 2nd November 2019, 4:41 pm

Weeknotes: Niche Museums, Kepler, Trees and Streaks

Every now and then someone will ask “so when are you going to build Museums Near Me then?”, based on my obsession with niche museums and websites like www.owlsnearme.com.

[... 872 words]

Calling C functions from BigQuery with web assembly (via) Google BigQuery lets you define custom SQL functions in JavaScript, and it turns out they expose the WebAssembly.instantiate family of APIs. Which means you can write your UDD in C or Rust, compile it to WebAssembly and run it as part of your query! # 27th October 2019, 5:55 am

Azure Readiness Checklist (via) I love a good comprehensive checklist. This one is focused on large projects running on Azure but it’s still fun to browse through if you are hosting elsewhere, mainly as a reminder of quite how much still goes into deploying large web services into production. # 26th October 2019, 8:32 pm

kepler.gl. Uber built this open source geospatial analysis tool for large-scale data sets, and they offer it as a free hosted online tool—just click Get Started on the site. I uploaded two CSV files with 30,000+ latitude/longitude points in them just now and used Kepler to render them as images. # 25th October 2019, 4:16 am

Weeknotes: The Squirrel Census, Genome SQL query

This week was mostly about incremental improvements. And squirrels.

[... 911 words]

Thematic map—GIS Wiki. This is a really useful wiki full of GIS information, and the coverage of different types of thematic maps is particularly thorough. # 21st October 2019, 2:25 am

Setting up Datasette, step by step (via) Tobias describes how he runs Datasette on his own server/VPS, using nginx and systemd. I’m doing something similar for some projects and systemd really does feel like the solution to the “ensure a Python process keeps running” problem I’ve been fighting for over a decade. I really like how Tobias creates a dedicated Linux user for each of his deployed Python projects. # 21st October 2019, 2:20 am

2018 Central Park Squirrel Census in Datasette (via) The Squirrel Census project released their data! 3,000 squirrel observations in Central Park, each with fur color and latitude and longitude and behavioral observations. I love this data so much. I’ve loaded it into a Datasette running on Glitch. # 16th October 2019, 6:01 pm

μPlot (via) “An exceptionally fast, tiny time series chart. [...] from a cold start it can create an interactive chart containing 150,000 data points in 40ms. [...] at < 10 KB, it’s likely the smallest and fastest time series plotter that doesn’t make use of WebGL shaders or WASM” # 14th October 2019, 11:03 pm

Weeknotes: PG&E outages, and Open Source works!

My big focus this week was the PG&E outages project. I’m really pleased with how this turned out: the San Francisco Chronicle used data from it for their excellent PG&E outage interactive (mixing in data on wind conditions) and it earned a bunch of interest on Twitter and some discussion on Hacker News.

[... 452 words]

goodreads-to-sqlite (via) This is so cool! Tobias Kunze built a Python CLI tool to import your Goodreads data into a SQLite database, inspired by github-to-sqlite and my various other Dogsheep tools. It’s the first Dogsheep style tool I’ve seen that wasn’t built by me—and Tobias’ write-up includes some neat examples of queries you can run against your Goodreads data. I’ve now started using Goodreads and I’m importing my books into my own private Dogsheep Datasette instance. # 14th October 2019, 4:07 am

Tracking PG&E outages by scraping to a git repo

PG&E have cut off power to several million people in northern California, supposedly as a precaution against wildfires.

[... 833 words]

SQL Murder Mystery in Datasette (via) “A crime has taken place and the detective needs your help. The detective gave you the  crime scene report, but you somehow lost it. You vaguely remember that the crime  was a murder that occurred sometime on ​Jan.15, 2018 and that it took place in SQL  City. Start by retrieving the corresponding crime scene report from the police  department’s database.”—Really fun game to help exercise your skills with SQL by the NU Knight Lab. I loaded their SQLite database into Datasette so you can play in your browser. # 7th October 2019, 11:37 pm

Weeknotes: Dogsheep

Having figured out my Stanford schedule, this week I started getting back into the habit of writing some code.

[... 1367 words]

twitter-to-sqlite 0.6, with track and follow. I shipped a new release of my twitter-to-sqlite command-line tool this evening. It now includes experimental features for subscribing to the Twitter streaming API: you can track keywords or follow users and matching Tweets will be written to a SQLite database in real-time as they come in through the API. Since Datasette supports mutable databases now you can run Datasette against the database and run queries against the tweets as they are inserted into the tables. # 6th October 2019, 4:54 am

Streamlit: Turn Python Scripts into Beautiful ML Tools (via) A really interesting new tool / application development framework. Streamlit is designed to help machine learning engineers build usable web frontends for their work. It does this by providing a simple, productive Python environment which lets you declaratively build up a sort-of Notebook style interface for your code. It includes the ability to insert a DataFrame, geospatial map rendering, chart or image into the application with a single Python function call. It’s hard to describe how it works, but the tutorial and demo worked really well for me: “pip install streamlit” and then “streamlit hello” to get a full-featured demo in a browser, then you can run through the tutorial to start building a real interactive application in a few dozen lines of code. # 6th October 2019, 3:52 am

Get your own Pocket OAuth token (via) I hate it when APIs make you jump through extensive hoops just to get an access token for pulling data directly from your own personal account. I’ve been playing with the Pocket API today and it has a pretty complex OAuth flow, so I built a tiny Flask app on Glitch which helps go through the steps to get an API token for your own personal Pocket account. # 5th October 2019, 9:56 pm

Client-Side Certificate Authentication with nginx. I’m intrigued by client-side browser certificates, which allow you to lock down a website such that only browsers with a specific certificate installed can access them. They work on both laptops and mobile phones. I followed the steps in this tutorial and managed to get an nginx instance running which only allows connections from my personal laptop and iPhone. # 5th October 2019, 5:26 pm