Simon Willison’s Weblog


When data is messy. I love this story: a neural network trained on images was asked what the most significant pixels in pictures of tench (a kind of fish) were: it returned pictures of fingers on a green background, because most of the tench photos it had seen were fisherfolk showing off their catch. # 7th July 2020, 7:03 pm

GitHub Actions: Manual triggers with workflow_dispatch (via) New GitHub Actions feature which fills a big gap in the offering: you can now create “workflow dispatch” events which provide a button for manually triggering an action—and you can specify extra UI form fields that can customize how that action runs. This turns Actions into an interactive automation engine for any code that can be wrapped in a Docker container. # 7th July 2020, 4:33 am

sba-loans-covid-19-datasette (via) The treasury department released a bunch of data on the Covid-19 SBA Paycheck Protection Program Loan recipients today—I’ve loaded the most interesting data (the $150,000+ loans) into a Datasette instance. # 7th July 2020, 2:42 am

How to find what you want in the Django documentation (via) Useful guide by Matthew Segal to navigating the Django documentation, and tips for reading documentation in general. The Django docs have a great reputation so it’s easy to forget how intimidating they can be for newcomers. Matthew emphasizes that docs are rarely meant to be read in full: the trick is learning how to quickly search them for the things you need to understand right now. # 3rd July 2020, 3:04 pm

Better Python Decorators with wrapt (via) Adam Johnson explains the intricacies of decorating a Python function without breaking the ability to correctly introspect it, and discusses how Scout use the wrapt library by Graham Dumpleton to implement their instrumentation library. # 2nd July 2020, 9:48 pm
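
A minimal sketch of the introspection problem, using only the standard library. It uses functools.wraps rather than wrapt itself, so it illustrates the symptom and not wrapt's approach; the decorator and function names are made up for the example.

```python
import functools
import inspect


def naive_passthrough(func):
    """Decorator that hides the wrapped function's name and signature."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def careful_passthrough(func):
    """Same decorator, but functools.wraps copies the metadata across."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@naive_passthrough
def add_naive(a, b=2):
    """Add two numbers."""
    return a + b


@careful_passthrough
def add_careful(a, b=2):
    """Add two numbers."""
    return a + b


if __name__ == "__main__":
    # The naive version reports the wrapper's name and a generic signature.
    print(add_naive.__name__, inspect.signature(add_naive))
    # The careful version reports the original function's details.
    print(add_careful.__name__, inspect.signature(add_careful))
```

functools.wraps covers the common cases; the article goes into the edge cases it does not handle.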

entr: rerun your build when files change. “WHY DID NOBODY TELL ME ABOUT THIS BEFORE?!?!” is one of my favourite genres of blog post. # 1st July 2020, 3:58 pm

Unlocking value with durable teams (via) Anna Shipman describes the FT’s experience switching from project-based teams to “durable” teams—teams which own a specific area of the product. Lots of really smart organizational design thinking in this. I’ve seen how much of a difference it makes to have every inch of a complex system “owned” by a specific team. I also like how Anna uses the term “technical estate” to describe the entirety of the FT’s systems. # 29th June 2020, 9:33 pm

Reducing search indexing latency to one second. Really detailed dive into the nuts and bolts of Twitter’s latest iteration of search indexing technology, including a great explanation of skip lists. # 26th June 2020, 5:06 pm
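
For readers who have not met skip lists before, here is a generic textbook-style sketch in Python. It is not Twitter's implementation; the class names, the maximum level of 16 and the 0.5 promotion probability are all assumptions chosen for illustration.

```python
import random


class Node:
    def __init__(self, key, level):
        self.key = key
        # forward[i] is the next node at level i
        self.forward = [None] * level


class SkipList:
    MAX_LEVEL = 16  # assumed cap on tower height

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = Node(None, self.MAX_LEVEL)
        self._level = 1  # number of levels currently in use

    def _random_level(self):
        # Each extra level is granted with probability 0.5.
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        # update[i] is the last node at level i whose key is below `key`.
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        level = self._random_level()
        self._level = max(self._level, level)
        new_node = Node(key, level)
        for i in range(level):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

    def __contains__(self, key):
        # Start at the sparsest level and drop down as the search narrows.
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def __iter__(self):
        # Level 0 is an ordinary sorted linked list of every key.
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]
```

The higher levels act as express lanes over the sorted list at level 0, which is what keeps expected search and insert cost logarithmic.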

How CDNs Generate Certificates. Thomas Ptacek (now at Fly) describes in intricate detail the challenges faced by large-scale hosting providers that want to securely issue LetsEncrypt certificates for customer domains. Lots of detail here on the different ACME challenges supported by LetsEncrypt and why the new tls-alpn-01 challenge is the right option for operating at scale. # 26th June 2020, 12:03 am

datasette-block-robots. Another little Datasette plugin: this one adds a /robots.txt page with “Disallow: /” to block all indexing of a Datasette instance from respectable search engine crawlers. I built this in less than ten minutes from idea to deploy to PyPI thanks to the datasette-plugin cookiecutter template. # 23rd June 2020, 3:28 am
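
A quick way to see the effect of that rule is Python's standard-library robots.txt parser. This is only a sketch: the "User-agent: *" line, the crawler name and the example.com URL are assumptions, not output copied from the plugin.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents: only "Disallow: /" is quoted in the entry above.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical crawler name and URL, purely for illustration.
allowed = parser.can_fetch("ExampleBot", "https://example.com/some/table")

if __name__ == "__main__":
    print(allowed)
```

A crawler that honours robots.txt would see every path on the instance as off limits.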

click-app. While working on sqlite-generate today I built a cookiecutter template for building the skeleton for Click command-line utilities. It’s based on datasette-plugin so it automatically sets up GitHub Actions for running tests and deploying packages to PyPI. # 23rd June 2020, 2:21 am

sqlite-generate (via) I wrote this tool today to generate arbitrarily large SQLite databases, for testing purposes. You tell it how many tables, columns and rows you want and it will use the Faker Python library to generate random data and populate the tables with it. # 23rd June 2020, 2:19 am
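
The general idea can be sketched with the standard library alone. This is not the tool's code: it swaps Faker for random lowercase strings, and the function names, table names and column names are invented for the example.

```python
import random
import sqlite3
import string


def random_word(rng, length=8):
    """Stand-in for Faker: a random lowercase string."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def generate(conn, tables=2, columns=3, rows=10, seed=0):
    """Create `tables` tables, each with `columns` text columns and `rows` rows."""
    rng = random.Random(seed)
    for t in range(tables):
        table = f"table_{t}"
        col_names = [f"col_{c}" for c in range(columns)]
        col_defs = ", ".join(f"{name} TEXT" for name in col_names)
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {col_defs})")
        col_list = ", ".join(col_names)
        placeholders = ", ".join("?" for _ in col_names)
        conn.executemany(
            f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})",
            ([random_word(rng) for _ in col_names] for _ in range(rows)),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    generate(conn, tables=3, columns=2, rows=5)
    print(conn.execute("SELECT * FROM table_0 LIMIT 2").fetchall())
```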

Datasette: A Developer, a Shower and a Data-Inspired Moment (via) Matt Asay interviewed me over Zoom last month. This captures a lot of my thinking around open source really well: “Datasette is aggressively open source for a bunch of reasons. Most of them are very selfish reasons.” # 18th June 2020, 11:32 pm

Refactoring optional chaining into a large codebase: lessons learned (via) JavaScript now supports foo?.bar?.baz?.() optional chaining syntax across all major browsers. Lea Verou provides the definitive guide to using it to refactor code. # 18th June 2020, 3:23 pm

Happy Birthday Sea Lions! (via) Today, June 15th, is Sea Lion birthday—half of all California Sea Lions are born today thanks to clever co-ordinated delayed implantation by Sea Lion females. Natalie has started making nature videos and I’ve been tagging along as her camera-person—this three minute video, shot at Pier 39 in San Francisco, celebrates Sea Lion birthday and explains how it works. # 15th June 2020, 7:08 pm

Tip for changing cookie subdomains: change the cookie name too. This is a really useful tip I hadn’t encountered before. If you make a change to the way cookies are configured—changing the cookie domain or path for example—it’s a good idea to change the name of the cookie as well. If you don’t change the cookie name you’ll see weird behaviour for users who have previously had the cookie set using the older configuration. This definitely explains bugs I’ve seen in the past. Filing this tip away for future cookie-related development work. # 9th June 2020, 6:41 pm
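
A sketch of why this happens, using Python's http.cookies. Browsers key stored cookies by name, domain and path, so after a domain change a returning user can hold two cookies with the same name and send both. The header strings and cookie names below are hypothetical.

```python
from http.cookies import SimpleCookie

# Hypothetical Cookie header from a browser that still holds the cookie set
# under the old configuration alongside the one set under the new one.
ambiguous = SimpleCookie()
ambiguous.load("session=old; session=new")

# The same situation after renaming the cookie as part of the change.
renamed = SimpleCookie()
renamed.load("session=old; session_v2=new")

if __name__ == "__main__":
    # Two values were sent under one name, but only one survives parsing.
    print(len(ambiguous))
    # With a new name the server can tell the two apart.
    print(renamed["session_v2"].value)
```

With the renamed cookie the application reads only session_v2 and can ignore whatever stale value arrives under the old name.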

Apple password-manager-resources (via) Apple maintain an open source repository full of heuristics for implementing smart password managers. It lists password rules for different sites (e.g. min/max length, special characters required), change password URLs for different services and sites that share credential backends. They accept pull requests! # 9th June 2020, 4:21 am

A List of Hacker News’s Undocumented Features and Behaviors (via) If you’re interested in community software design this is a neat insight into the many undocumented features of Hacker News, collated by Max Woolf. # 6th June 2020, 5:36 pm

Working Backwards: A New Version Of Amazon’s “Press Release” Approach To Plan Customer-Centric Projects (via) I’ve long wanted to give the Amazon “future press release” trick a go: start a project by writing the imaginary press release that would announce that project to the world, in order to focus on understanding what the project is for and how it will deliver value. Jeff Gothelf has put a lot of thought into this and constructed a thorough-looking template for writing one of these that covers a number of different important project aspects. # 2nd June 2020, 3:54 pm

Get Started—Materialize. Materialize is a really interesting new database—“a streaming SQL materialized view engine”. It builds materialized views on top of streaming data sources (such as Kafka)—you define the view using a SQL query, then it figures out how to keep that view up-to-date automatically as new data streams in. It speaks the PostgreSQL protocol so you can talk to it using the psql tool or any PostgreSQL client library. The “get started” guide is particularly impressive: it uses a curl stream of the Wikipedia recent changes API, parsed using a regular expression. And it’s written in Rust, so installing it is as easy as downloading and executing a single binary (though I used Homebrew). # 1st June 2020, 10:11 pm

Practical Python Programming (via) David Beazley has been developing and presenting this three day Python course (aimed at people with some prior programming experience) for over thirteen years, and he’s just released the course materials under a Creative Commons license for the first time. # 29th May 2020, 1:15 pm

Deno is a Browser for Code (via) One of the most interesting ideas in Deno is that code imports are loaded directly from URLs—which can themselves depend on other URL-based packages. On first encounter it feels wrong—obviously insecure. Deno contributor Kitson Kelly provides a deeper exploration of the idea, and explains how the combination of caching and lock files makes it no less secure than code installed from npm or PyPI. # 29th May 2020, 2:36 am
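
The lock file half of that argument boils down to an integrity check: record a hash of each remote module once, then refuse anything that no longer matches. A rough Python sketch of the idea follows; the URL-to-SHA-256 mapping, the function names and the example URL are assumptions for illustration, not Deno's actual format or API.

```python
import hashlib


def checksum(source: bytes) -> str:
    """Hex SHA-256 digest of a module's source."""
    return hashlib.sha256(source).hexdigest()


def verify(url: str, source: bytes, lock: dict) -> bool:
    """Return True if `source` matches the hash recorded for `url`."""
    expected = lock.get(url)
    if expected is None:
        raise KeyError(f"{url} is not in the lock file")
    return checksum(source) == expected


if __name__ == "__main__":
    # Hypothetical module URL and contents.
    url = "https://example.com/mod.ts"
    original = b"export const x = 1;"
    lock = {url: checksum(original)}  # written when the module is first fetched

    print(verify(url, original, lock))
    print(verify(url, b"export const x = 2;", lock))
```

If the content behind a URL changes after the hash was recorded, the check fails instead of silently running the new code.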

Advice on specifying more granular permissions with Google Cloud IAM (via) My single biggest frustration working with both Google Cloud and AWS is permissions: more specifically, figuring out what the smallest set of permissions is that I need to assign in order to achieve different goals. Katie McLaughlin’s new series aims to address exactly that problem. I learned a ton from this that I’ve previously missed, and there’s plenty of actionable advice on tooling that can be used to help figure this stuff out. # 28th May 2020, 10:44 pm

Why we use homework to recruit engineers. Ad Hoc run a remote-first team, and use detailed homework assignments as part of their interview process in place of in-person technical interviews. The homework assignments are really interesting to browse through: “Containerize”, for example, involves building a Docker container to run a Python app with nginx and a modern cipher suite. I’m nervous about the extra burden this places on candidates, but Ad Hoc address that: “We recognize that we’re asking folks to invest time into our process, but we feel like our homework compares favorably to extensive on-site interviews or other evaluation techniques, especially for candidates who have responsibilities outside of their work life.” # 27th May 2020, 6:04 pm

AWS services explained in one line each (via) Impressive effort to summarize all 163(!) AWS services. This helped clarify a whole bunch that I haven’t figured out yet. Only a few defeated the author, with a single question mark for the description. I enjoyed Amazon Braket: “Some quantum thing. It’s in preview so I have no idea what it is.” # 26th May 2020, 4:41 pm

Serving photos locally with datasette-media. datasette-media is a new Datasette plugin which can serve static files from disk in response to a configured SQL query that maps incoming URL parameters to a path to a file. I built it so I could run dogsheep-photos locally on my laptop and serve up thumbnails of images that match particular queries. I’ve added documentation to the dogsheep-photos README explaining how to use datasette-media, datasette-json-html and datasette-template-sql to create custom interfaces onto Apple Photos data on your machine. # 26th May 2020, 3:53 pm

Waiting in asyncio. Handy cheatsheet explaining the differences between asyncio.gather(), asyncio.wait_for(), asyncio.as_completed() and asyncio.wait() by Hynek Schlawack. # 26th May 2020, 3:28 pm
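
A small sketch of the four calls side by side. The work() coroutine and the delays are invented for the example, and the comments are a loose summary, not a substitute for the cheatsheet.

```python
import asyncio


async def work(name, delay):
    """Pretend task: sleep for `delay` seconds, then return `name`."""
    await asyncio.sleep(delay)
    return name


async def main():
    # gather(): run awaitables concurrently, results come back in argument order.
    gathered = await asyncio.gather(work("a", 0.02), work("b", 0.01))

    # wait_for(): a single awaitable with a timeout; it is cancelled on expiry.
    try:
        await asyncio.wait_for(work("slow", 5), timeout=0.01)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True

    # as_completed(): iterate over results in the order they finish.
    finished_order = []
    for future in asyncio.as_completed([work("a", 0.2), work("b", 0.01)]):
        finished_order.append(await future)

    # wait(): takes tasks and returns (done, pending) sets; nothing is
    # cancelled for you, so pending tasks need explicit handling.
    tasks = [
        asyncio.create_task(work("a", 0.01)),
        asyncio.create_task(work("b", 5)),
    ]
    done, pending = await asyncio.wait(tasks, timeout=0.2)
    for task in pending:
        task.cancel()

    return gathered, timed_out, finished_order, len(done), len(pending)


if __name__ == "__main__":
    print(asyncio.run(main()))
```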

Using SQL to Look Through All of Your iMessage Text Messages (via) Dan Kelch shows how to access the iMessage SQLite database at ~/Library/Messages/chat.db—it’s protected under macOS Catalina so you have to enable Full Disk Access in the privacy settings first. I usually use the macOS terminal app but I installed iTerm for this because I’d rather enable full disk access to a separate terminal program than let anything I’m running in my regular terminal take advantage of it. It worked! Now I can run “datasette ~/Library/Messages/chat.db” to browse my messages. # 22nd May 2020, 4:45 pm

Doordash and Pizza Arbitrage (via) In which a pizza restaurant owner notices that Doordash, uninvited, have started offering their $24 pizzas for $16, and starts ordering their own pizzas and keeping the difference. # 18th May 2020, 2:32 pm

Deno 1.0. Deno is a new take on server-side JavaScript from a team led by Ryan Dahl, who originally created Node.js. It’s built using Rust and crammed with fascinating ideas, like the ability to import code directly from a URL. # 13th May 2020, 11:38 pm