Simon Willison’s Weblog

Items in 2021

Filters: Year: 2021 ×


Weeknotes: Velma, more Django SQL Dashboard

Matching locations for Vaccinate The States, fun with GeoJSON and more improvements to Django SQL Dashboard.

[... 555 words]

No feigning surprise (via) Don’t feign surprise if someone doesn’t know something that you think they should know. Even better: even if you are surprised, don’t let them know! “When people feign surprise, it’s usually to make them feel better about themselves and others feel worse.” # 17th May 2021, 4:30 pm

geocode-sqlite. Neat command-line Python utility by Chris Amico: point it at a SQLite database file and it will add latitude and longitude columns and populate them by geocoding one or more of the other fields, using your choice from four currently supported geocoders. # 17th May 2021, 1:15 am

Powering the Python Package Index in 2021. PyPI now serves “nearly 900 terabytes over more than 2 billion requests per day”. Bandwidth is donated by Fastly, a value estimated at 1.8 million dollars per month! Lots more detail about how PyPI has evolved over the past years in this post by Dustin Ingram. # 14th May 2021, 4:50 am

Folks think s3 is static assets hosting but really it’s a consistent and highly available key value store with first class blob support

Brian LeRoux # 13th May 2021, 3:01 pm

Async functions require an event loop to run. Flask, as a WSGI application, uses one worker to handle one request/response cycle. When a request comes in to an async view, Flask will start an event loop in a thread, run the view function there, then return the result. Each request still ties up one worker, even for async views. The upside is that you can run async code within a view, for example to make multiple concurrent database queries, HTTP requests to an external API, etc. However, the number of requests your application can handle at one time will remain the same.

Using async and await in Flask 2.0 # 12th May 2021, 5:59 pm

New Major Versions Released! Flask 2.0, Werkzeug 2.0, Jinja 3.0, Click 8.0, ItsDangerous 2.0, and MarkupSafe 2.0. Huge set of releases from the Pallets team. Python 3.6+ required and comprehensive type annotations. Flask now supports async views, Jinja async templates (used extensively by Datasette) “no longer requires patching”, Click has a bunch of new code around shell tab completion, ItsDangerous supports key rotation and so much more. # 12th May 2021, 5:37 pm

Django SQL Dashboard

I’ve released the first non-alpha version of Django SQL Dashboard, which provides an interface for running arbitrary read-only SQL queries directly against a PostgreSQL database, protected by the Django authentication scheme. It can also be used to create saved dashboards that can be published or shared internally.

[... 2171 words]

A museum bot (via) Shawn Graham built a Twitter bot, using R, which tweets out random items from the collection at the Canadian Science and Technology Museum—using a Datasette instance that he’s running based on a CSV export of their collections data. # 5th May 2021, 7:09 pm

cinder: Instagram’s performance oriented fork of CPython (via) Instagram forked CPython to add some performance-oriented features they wanted, including a method-at-a-time JIT compiler and a mechanism for eagerly evaluating coroutines (avoiding the overhead of creating a coroutine if an awaited function returns a value without itself needing to await). They’re open sourcing the code to help start conversations about implementing some of these features in CPython itself. I particularly enjoyed the warning that accompanies the repo: this is not intended to be a supported release, and if you decide to run it in production you are on your own! # 4th May 2021, 10:13 pm

Plot & Vega-Lite. Useful documentation comparing the brand new Observable Plot to Vega-Lite, complete with examples of how to achieve the same thing in both libraries. # 4th May 2021, 4:32 pm

Observable Plot (via) This is huge: a brand new high-level JavaScript visualization library from Mike Bostock, the author of D3—partially inspired by Vega-Lite which I’ve used enthusiastically in the past. First impressions are that this is a big step forward for quickly building high-quality visualizations. It’s released under the ISC license which is “functionally equivalent to the BSD 2-Clause and MIT licenses”. # 4th May 2021, 4:28 pm

Practical SQL for Data Analysis (via) This is a really great SQL tutorial: it starts with the basics, but quickly moves on to a whole array of advanced PostgreSQL techniques—CTEs, window functions, efficient sampling, rollups, pivot tables and even linear regressions executed directly in the database using regr_slope(), regr_intercept() and regr_r2(). I picked up a whole bunch of tips for things I didn’t know you could do with PostgreSQL here. # 4th May 2021, 3:11 am

Adding GeoDjango to an existing Django project

Work on VIAL for Vaccinate The States continues.

[... 1503 words]

Hosting SQLite databases on Github Pages (via) I’ve seen the trick of running SQLite compiled to WASM in the browser before, but this comes with an incredibly clever bonus trick: it uses SQLite’s page structure to fetch subsets of the database file via HTTP range requests, which means you can run indexed SQL queries against a 600MB database file while only fetching a few MBs of data over the wire. Absolutely brilliant. Tucked away at the end of the post is another neat trick: making the browser DOM available to SQLite as a virtual table, so you can query and update the DOM of the current page using SQL! # 2nd May 2021, 6:55 pm

One year of TILs

Just over a year ago I started tracking TILs, inspired by Josh Branchaud’s collection. I’ve since published 148 TILs across 43 different topics. It’s a great format!

[... 224 words]

Query Engines: Push vs. Pull (via) Justin Jaffray (who has worked on Materialize) explains the difference between push and pull query execution engines using some really clear examples built around JavaScript generators. # 2nd May 2021, 2:49 am

Over the past several months, everyone in the industry who provides any kind of free CPU resources has been dealing with a massive outbreak of abuse for cryptocurrency mining. The industry has been setting up informal working groups to pool knowledge of mitigations, communicate when our platforms are being leveraged against one another, and cumulatively wasting thousands of hours of engineering time implementing measures to deal with this abuse, and responding as attackers find new ways to circumvent them.

Drew DeVault, SourceHut # 26th April 2021, 11:52 pm

Weeknotes: Vaccinate The States, and how I learned that returning dozens of MB of JSON works just fine these days

On Friday VaccinateCA grew in scope, a lot: we launched a new website called Vaccinate The States. Patrick McKenzie wrote more about the project here—the short version is that we’re building the most comprehensive possible dataset of vaccine availability in the USA, using a combination of data collation, online research and continuing to make a huge number of phone calls.

[... 1109 words]

Weeknotes: The Aftermath

Some tweets that effectively illustrate my week:

[... 208 words]

country-coder (via) Given a latitude and longitude, how can you tell what country that point sits within? One way is to do a point-in-polygon lookup against a set of country polygons, but this can be tricky: some countries such as New Zealand have extremely complex outlines, even though for this use-case you don’t need the exact shape of the coastline. country-coder solves this with a custom designed 595KB GeoJSON file with detailed land borders but loosely defined ocean borders. It also comes with a wrapper JavaScript library that provides an API for resolving points, plus useful properties on each country with details like telepohen calling codes and emoji flags. # 18th April 2021, 7:37 pm

Why you shouldn’t use ENV variables for secret data (via) I do this all the time, but this article provides a good set of reasons that secrets in environment variables are a bad pattern—even when you know there’s no multi-user access to the host you are deploying to. The biggest problem is that they often get captured by error handling scripts, which may not have the right code in place to redact them. This article suggests using Docker secrets instead, but I’d love to see a comprehensive write-up of other recommended patterns for this that go beyond applications running in Docker. # 14th April 2021, 6:22 pm

Porting VaccinateCA to Django

As I mentioned back in February, I’ve been working with the VaccinateCA project to try to bring the pandemic to an end a little earlier by helping gather as accurate a model as possible of where the Covid vaccine is available in California and how people can get it.

[... 2156 words]

In general, relying only on natural keys is a nightmare. Double nightmare if it’s PII. Natural keys only work if you are flawlessly omniscient about the domain. And you aren’t.

Jacques Chester # 11th April 2021, 10:48 pm

Behind GitHub’s new authentication token formats (via) This is a really smart design. GitHub’s new tokens use a type prefix of “ghp_” or “gho_” or a few others depending on the type of token, to help support mechanisms that scan for accidental token publication. A further twist is that the last six characters of the tokens are a checksum, which means token scanners can reliably distinguish a real token from a coincidental string without needing to check back with the GitHub database. “One other neat thing about _ is it will reliably select the whole token when you double click on it”—what a useful detail! # 5th April 2021, 9:28 pm

Render single selected county on a map (via) Another experiment at the intersection of Datasette and Observable notebooks. This one imports a full Datasette table (3,200 US counties) using streaming CSV and loads that into Observable’s new Search and Table filter widgets. Once you select a single county a second Datasette SQL query (this time retuning JSON) fetches a GeoJSON representation of that county which is then rendered as SVG using D3. # 5th April 2021, 4:48 am

Spatialite Speed Test. Part of an excellent series of posts about SpatiaLite from 2012—here John C. Zastrow reports on running polygon intersection queries against a 1.9GB database file in 40 seconds without an index and 0.186 seconds using the SpatialIndex virtual table mechanism. # 4th April 2021, 4:28 pm

Animated choropleth of vaccinations by US county

Last week I mentioned that I’ve recently started scraping and storing the CDC’s per-county vaccination numbers in my cdc-vaccination-history GitHub repository. This week I used an Observable notebook and d3’s TopoJSON support to render those numbers on an animated choropleth map.

[... 1138 words]

If you measure things by foot traffic we [the SFO Museum] are one of the busiest museums in the world. If that is the case we are also one of the busiest museums in the world that no one knows about. Nothing in modern life really prepares you for the idea that a museum should be part of an airport. San Francisco, as I’ve mentioned, is funny that way.

Aaron Straup Cope # 1st April 2021, 10:40 pm

This teaches us that—when it’s a big enough deal—Amazon will lie to us. And coming from the company that runs the production infrastructure for our companies, stores our data, and has been granted an outsized position of trust based upon having earned it over 15 years, this is a nightmare.

Corey Quinn # 31st March 2021, 4:47 pm