Simon Willison’s Weblog


Filters: Type: blogmark ×

Powering the Python Package Index in 2021. PyPI now serves “nearly 900 terabytes over more than 2 billion requests per day”. Bandwidth is donated by Fastly, a value estimated at 1.8 million dollars per month! Lots more detail about how PyPI has evolved over the past years in this post by Dustin Ingram. # 14th May 2021, 4:50 am

New Major Versions Released! Flask 2.0, Werkzeug 2.0, Jinja 3.0, Click 8.0, ItsDangerous 2.0, and MarkupSafe 2.0. Huge set of releases from the Pallets team. Python 3.6+ required and comprehensive type annotations. Flask now supports async views, Jinja async templates (used extensively by Datasette) “no longer requires patching”, Click has a bunch of new code around shell tab completion, ItsDangerous supports key rotation and so much more. # 12th May 2021, 5:37 pm

A museum bot (via) Shawn Graham built a Twitter bot, using R, which tweets out random items from the collection at the Canadian Science and Technology Museum—using a Datasette instance that he’s running based on a CSV export of their collections data. # 5th May 2021, 7:09 pm

cinder: Instagram’s performance oriented fork of CPython (via) Instagram forked CPython to add some performance-oriented features they wanted, including a method-at-a-time JIT compiler and a mechanism for eagerly evaluating coroutines (avoiding the overhead of creating a coroutine if an awaited function returns a value without itself needing to await). They’re open sourcing the code to help start conversations about implementing some of these features in CPython itself. I particularly enjoyed the warning that accompanies the repo: this is not intended to be a supported release, and if you decide to run it in production you are on your own! # 4th May 2021, 10:13 pm

Plot & Vega-Lite. Useful documentation comparing the brand new Observable Plot to Vega-Lite, complete with examples of how to achieve the same thing in both libraries. # 4th May 2021, 4:32 pm

Observable Plot (via) This is huge: a brand new high-level JavaScript visualization library from Mike Bostock, the author of D3—partially inspired by Vega-Lite which I’ve used enthusiastically in the past. First impressions are that this is a big step forward for quickly building high-quality visualizations. It’s released under the ISC license which is “functionally equivalent to the BSD 2-Clause and MIT licenses”. # 4th May 2021, 4:28 pm

Practical SQL for Data Analysis (via) This is a really great SQL tutorial: it starts with the basics, but quickly moves on to a whole array of advanced PostgreSQL techniques—CTEs, window functions, efficient sampling, rollups, pivot tables and even linear regressions executed directly in the database using regr_slope(), regr_intercept() and regr_r2(). I picked up a whole bunch of tips for things I didn’t know you could do with PostgreSQL here. # 4th May 2021, 3:11 am

Hosting SQLite databases on Github Pages (via) I’ve seen the trick of running SQLite compiled to WASM in the browser before, but this comes with an incredibly clever bonus trick: it uses SQLite’s page structure to fetch subsets of the database file via HTTP range requests, which means you can run indexed SQL queries against a 600MB database file while only fetching a few MBs of data over the wire. Absolutely brilliant. Tucked away at the end of the post is another neat trick: making the browser DOM available to SQLite as a virtual table, so you can query and update the DOM of the current page using SQL! # 2nd May 2021, 6:55 pm

Query Engines: Push vs. Pull (via) Justin Jaffray (who has worked on Materialize) explains the difference between push and pull query execution engines using some really clear examples built around JavaScript generators. # 2nd May 2021, 2:49 am

country-coder (via) Given a latitude and longitude, how can you tell what country that point sits within? One way is to do a point-in-polygon lookup against a set of country polygons, but this can be tricky: some countries such as New Zealand have extremely complex outlines, even though for this use-case you don’t need the exact shape of the coastline. country-coder solves this with a custom designed 595KB GeoJSON file with detailed land borders but loosely defined ocean borders. It also comes with a wrapper JavaScript library that provides an API for resolving points, plus useful properties on each country with details like telepohen calling codes and emoji flags. # 18th April 2021, 7:37 pm

Why you shouldn’t use ENV variables for secret data (via) I do this all the time, but this article provides a good set of reasons that secrets in environment variables are a bad pattern—even when you know there’s no multi-user access to the host you are deploying to. The biggest problem is that they often get captured by error handling scripts, which may not have the right code in place to redact them. This article suggests using Docker secrets instead, but I’d love to see a comprehensive write-up of other recommended patterns for this that go beyond applications running in Docker. # 14th April 2021, 6:22 pm

Behind GitHub’s new authentication token formats (via) This is a really smart design. GitHub’s new tokens use a type prefix of “ghp_” or “gho_” or a few others depending on the type of token, to help support mechanisms that scan for accidental token publication. A further twist is that the last six characters of the tokens are a checksum, which means token scanners can reliably distinguish a real token from a coincidental string without needing to check back with the GitHub database. “One other neat thing about _ is it will reliably select the whole token when you double click on it”—what a useful detail! # 5th April 2021, 9:28 pm

Render single selected county on a map (via) Another experiment at the intersection of Datasette and Observable notebooks. This one imports a full Datasette table (3,200 US counties) using streaming CSV and loads that into Observable’s new Search and Table filter widgets. Once you select a single county a second Datasette SQL query (this time retuning JSON) fetches a GeoJSON representation of that county which is then rendered as SVG using D3. # 5th April 2021, 4:48 am

Spatialite Speed Test. Part of an excellent series of posts about SpatiaLite from 2012—here John C. Zastrow reports on running polygon intersection queries against a 1.9GB database file in 40 seconds without an index and 0.186 seconds using the SpatialIndex virtual table mechanism. # 4th April 2021, 4:28 pm (via) I really like this: “curl” gives you your IP address as plain text, “curl” tells you your city according to MaxMind GeoLite2, “curl” gives you all sorts of useful extra data. Suggested rate limit is one per minute, but the code is open source Go that you can run yourself. # 30th March 2021, 7:53 pm

Hello, HPy (via) HPy provides a new way to write C extensions for Python in a way that is compatible with multiple Python implementations at once, including PyPy. # 29th March 2021, 2:40 pm

sqlite-plus (via) Anton Zhiyanov bundled together a bunch of useful SQLite C extensions for things like statistical functions, unicode string normalization and handling CSV files as virtual tables. The GitHub Actions workflow here is a particularly useful example of compiling SQLite extensions for three different platforms. # 25th March 2021, 9:13 pm

Why all my servers have an 8GB empty file (via) This is such a good idea: keep a 8GB empty spacer.img file on any servers you operate, purely so that if the server runs out of space you can delete the file and get some breathing room for getting everything else working again. I’ve had servers run out of space in the past and it’s an absolute pain to sort out—this trick would have really helped. # 25th March 2021, 8:35 pm

Homebrew Python Is Not For You. If you’ve been running into frustrations with your Homebrew Python environments breaking over the past few months (the dreaded “Reason: image not found” error) Justin Mayer has a good explanation. Python in a Homebrew is designed to work as a dependency for their other packages, and recent policy changes that they made to support smoother upgrades have had catastrophic problems effects on those of us who try to use it for development environments. # 25th March 2021, 3:14 pm

Understanding JSON Schema (via) Useful, comprehensive short book guide to JSON Schema, which finally helped me feel like I fully understand the specification. # 24th March 2021, 2:57 am

A Complete Guide To Accessible Front-End Components. I’m so excited about this article: it brings together an absolute wealth of resources on accessible front-end components, including many existing component implementations that are accessible out of the box. Date pickers, autocomplete widgets, modals, menus—all sorts of things that I’ve been dragging my heels on implementing because I didn’t fully understand their accessibility implications. # 23rd March 2021, 1:06 am

The Accountability Project Datasettes. The Accountability Project “curates, standardizes and indexes public data to give journalists, researchers and others a simple way to search across otherwise siloed records”—they have a wide range of useful data, and they’ve started experimenting with Datasette to provide SQL access to a subset of the information that they have collected. # 22nd March 2021, 12:07 am

How we found and fixed a rare race condition in our session handling. GitHub had a terrifying bug this month where a user reported suddenly being signed in as another user. This is a particularly great example of a security incident report, explaining how GitHub identified the underlying bug, what caused it and the steps they are taking to ensure bugs like that never happen in the future. The root cause was a convoluted sequence of events which could cause a Ruby Hash to be accidentally shared between two requests, caused as a result of a new background thread that was introduced as a performance optimization. # 18th March 2021, 11:06 pm

logpaste (via) Useful example of how to use the Litestream SQLite replication tool in a Dockerized application: S3 credentials are passed to the container on startup, it then attempts to restore the SQLite database from S3 and starts a Litestream process in the same container to periodically synchronize changes back up to the S3 bucket. # 17th March 2021, 3:48 pm

sqlite-uuid (via) Another Python package that wraps a SQLite module written in C: this one provides access to UUID functions as SQLite functions. # 15th March 2021, 2:55 am

sqlite-spellfix (via) I really like this pattern: “pip install sqlite-spellfix” gets you a Python module which includes a compiled (on your system when pip install ran) copy of the SQLite spellfix1 module, plus a utility variable containing its path so you can easily load it into a SQLite connection. # 15th March 2021, 2:52 am

The SOC2 Starting Seven (via) "So, you plan to sell your startup’s product to big companies one day. Congratu-dolences! [...] Here’s how we’ll try to help: with Seven Things you can do now that will simplify SOC2 for you down the road while making your life, or at least your security posture, materially better in the immediacy. # 5th March 2021, 7:50 pm

google-cloud-4-words. This is really useful: every Google Cloud service (all 250 of them) with a four word description explaining what it does. I’d love to see the same thing for AWS. UPDATE: Turns out I had—I can’t link to other posts from blogmark descriptions yet, so search “aws explained” on this site to find it. # 4th March 2021, 12:40 am

How I cut GTA Online loading times by 70% (via) Incredible debugging war story: t0st was fed up of waiting six minutes (!) for GTA Online to load on their PC, so they used a host of devious debugging tricks to try and figure out what was going on. It turned out the game was loading a 10MB JSON file detailing all of the available in-game purchases, but inefficient JSON parsing meant it was pegging an entire CPU for 4 minutes mainly running the strlen() C function. Despite not having access to source code or debugging symbols t0st figured out the problem and managed to inject a custom DLL that hooked some internal functions and dropped load times down from 6m down to to 1m50s! # 1st March 2021, 7:12 pm

unasync (via) Today I started wondering out loud if one could write code that takes an asyncio Python library and transforms it into the synchronous equivalent by using some regular expressions to strip out the “await ...” keywords and suchlike. Turns out that can indeed work, and Ratan Kulshreshtha built it! unasync uses the standard library tokenize module to run some transformations against an async library and spit out the sync version automatically. I’m now considering using this for sqlite-utils. # 27th February 2021, 10:20 pm