Simon Willison’s Weblog

Subscribe

Items in 2021

Filters: Year: 2021 × Sorted by date


Many of you here today are toolbuilders who help people work with data. Rather than presuming that those using your tools are clear-eyed about their data, how can you build features and methods that ensure people know the limits of their data and work with them responsibly? Your tools are not neutral. Neither is the data that your tools help analyze. How can you build tools that invite responsible data use and make visible when data is being manipulated? How can you help build tools for responsible governance?

danah boyd # 24th December 2021, 11:41 pm

The Asymmetry of Open Source (via) Caddy creator Matt Holt provides “a comprehensive guide to funding open source software projects”. This is really useful—it describes a whole range of funding models that have been demonstrated to work, including sponsorship, consulting, private support channels and more. # 24th December 2021, 9:11 pm

Weeknotes: datasette-tiddlywiki, filters_from_request

I made some good progress on the big refactor this week, including extracting some core logic out into a new Datasette plugin hook. I also got distracted by TiddlyWiki and released a new Datasette plugin that lets you run TiddlyWiki inside Datasette.

[... 1197 words]

Annotated explanation of David Beazley’s dataklasses (via) David Beazley released a self-described “deliciously evil spin on dataclasses” that uses some deep Python trickery to implement a dataclass style decorator which creates classes that import 15-20 times faster than the original. I put together a heavily annotated version of his code while trying to figure out how all of the different Python tricks in it work. # 20th December 2021, 5:05 am

Transactionally Staged Job Drains in Postgres. Any time I see people argue that relational databases shouldn’t be used to implement job queues I think of this post by Brandur from 2017. If you write to a queue before committing a transaction you run the risk of a queue consumer trying to read from the database before the new row becomes visible. If you write to the queue after the transaction there’s a risk an error might result in your message never being written. So: write to a relational staging table as part of the transaction, then have a separate process read from that table and write to the queue. # 18th December 2021, 1:34 am

TypeScript for Pythonistas (via) Really useful explanation of how TypeScript differs from Python with mypy. I hadn’t realized TypeScript leans so far into structural typing, to the point that two types with different names but the same “shape” are identified as being the same type as each other. # 17th December 2021, 7:43 pm

Weeknotes: Trapped in an eternal refactor

I’m still working on refactoring Datasette’s table view. In doing so I spun out a new plugin, datasette-pretty-traces, which improves Datasette’s tooling for seeing the SQL that was executed to build a specific page.

[... 544 words]

A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution (via) Fascinating and terrifying description of an extremely sophisticated attack against iMessage. iMessage was passing incoming image bytes through to a bunch of different libraries to figure out which image format should be decoded, including a PDF renderer that supported the old JBIG2 compression format. JBIG2 includes a mechanism for programatically swapping the values of individual black and white pixels... which turns out to be Turing complete, and means that a sufficiently cunning “image” can include a full computer architecture defined in terms of logical bit operations. Combine this with an integer overflow and you can perform arbitrary memory operations that break out of the iOS sandbox. # 16th December 2021, 8:33 pm

servefolder.dev (via) Absurdly clever application of service workers and the file system API: you can select a folder from your computer and the contents of that folder will be served (just to you) from a path on this website—all without uploading any content. The code is on GitHub and offers a useful, succinct introduction to how to use those APIs. # 12th December 2021, 6:32 pm

wheel.yml for Pyjion using cibuildwheel (via) cibuildwheel, maintained by the Python Packaging Authority, builds and tests Python wheels across multiple platforms. I hadn’t realized quite how minimal a configuration using their GitHub Actions action was until I looked at how Pyjion was using it. # 10th December 2021, 3:05 am

Introducing stack graphs (via) GitHub launched “precise code navigation” for Python today—the first language to get support for this feature. Click on any Python symbol in GitHub’s code browsing views and a box will show you exactly where that symbol was defined—all based on static analysis by a custom parser written in Rust as opposed to executing any Python code directly. The underlying computer science uses a technique called stack graphs, based on scope graphs research from Eelco Visser’s research group at TU Delft. # 9th December 2021, 11:07 pm

Notes on Notes.app. Apple’s Notes app keeps its data in a SQLite database at ~/Library/Group\ Containers/group.com.apple.notes/NoteStore.sqlite—but it’s pretty difficult to extract data from. It turns out the note text is stored as a gzipped protocol buffers object in the ZICNOTEDATA.ZDATA column. Steve Dunham did the hard work of figuring out how it all works—the complexity stems from Apple’s use of CRDT’s to support seamless multiple edits from different devices. # 9th December 2021, 10:39 pm

Weeknotes: git-history, bug magnets and s3-credentials --public

I’ve stopped considering my projects “shipped” until I’ve written a proper blog entry about them, so yesterday I finally shipped git-history, coinciding with the release of version 0.6—a full 27 days after the first 0.1.

[... 1013 words]

git-history: a tool for analyzing scraped data collected using Git and SQLite

I described Git scraping last year: a technique for writing scrapers where you periodically snapshot a source of data to a Git repository in order to record changes to that source over time.

[... 2002 words]

One popular way of making money through cryptocurrency is to start a new currency, while retaining a large chunk of it for yourself. As a result, there are now thousands of competing cryptocurrencies in operation, with relatively little technical difference between them. In order to succeed, currency founders must convince people that their currency is new and different, and crucially, that the buyer understands this while other less savvy investors do not. Wild claims, fanciful economic ideas and rampant technobabble are the order of the day. This is a field that thrives on mystique, and particularly preys on participants’ fear of missing out on the next big thing.

Martin O'Leary # 7th December 2021, 8:41 am

s3-credentials 0.8. The latest release of my s3-credentials CLI tool for creating S3 buckets with credentials to access them (with read-write, read-only or write-only policies) adds a new --public option for creating buckets that allow public access, such that anyone who knows a filename can download a file. The s3-credentials put-object command also now sets the appropriate Content-Type heading on the uploaded object. # 7th December 2021, 7:04 am

Google Public DNS Flush Cache (via) Google Public DNS (8.8.8.8) have a flush cache page too. # 6th December 2021, 11:17 pm

1.1.1.1/purge-cache (via) Cloudflare’s 1.1.1.1 DNS service has a tool that anyone can use to flush a specific DNS entry from their cache—could be useful for assisting rollouts of new DNS configurations. # 6th December 2021, 11:15 pm

Centrifuge: a reliable system for delivering billions of events per day (via) From 2018, a write-up from Segment explaining how they solved the problem of delivering webhooks from thousands of different producers to hundreds of potentially unreliable endpoints. They started with Kafka and ended up on a custom system written in Go against RDS MySQL that was specifically tuned to their write-heavy requirements. # 6th December 2021, 1:41 am

jc (via) This is such a great idea: jc is a CLI tool which knows how to convert the output of dozens of different classic Unix utilities to JSON, so you can more easily process it programmatically, pipe it through jq and suchlike. “pipx install jc” to install, then “dig example.com | jc --dig” to try it out. # 5th December 2021, 11:05 pm

100 years of whatever this will be. This piece by apenwarr defies summarization but I enjoyed reading it a lot. # 2nd December 2021, 8:37 pm

Weeknotes: Shaving some beautiful yaks

I’ve been mostly shaving yaks this week—two in particular: the Datasette table refactor and the next release of git-history. I also built and released my first Web Component!

[... 1307 words]

PHP 8.1 release notes (via) PHP is gaining “Fibers” for lightweight cooperative concurrency—very similar to Python asyncio. Interestingly you don’t need to use separate syntax like “await fn()” to call them—calls to non-blocking functions are visually indistinguishable from calls to blocking functions. Considering how much additional library complexity has emerged in Python world from having these two different colours of functions it’s noteworthy that PHP has chosen to go in a different direction here. # 25th November 2021, 7:53 pm

htmlspecialchars was a very early function. Back when PHP had less than 100 functions and the function hashing mechanism was strlen(). In order to get a nice hash distribution of function names across the various function name lengths names were picked specifically to make them fit into a specific length bucket. This was circa late 1994 when PHP was a tool just for my own personal use and I wasn’t too worried about not being able to remember the few function names.

Rasmus Lerdorf # 22nd November 2021, 7:23 pm

Introduction to heredocs in Dockerfiles (via) This is a fantastic upgrade to Dockerfile syntax, enabled by BuildKit and a new frontend for executing the Dockerfile that can be specified with a #syntax= directive. I often like to create a standalone Dockerfile that works without needing other files from a directory, so being able to use <<EOF syntax to populate configure files from inline blocks of code is really handy. # 22nd November 2021, 5:01 pm

Weeknotes: Apache proxies in Docker containers, refactoring Datasette

Updates to six major projects this week, plus finally some concrete progress towards Datasette 1.0.

[... 1630 words]

Hurl (via) Hurl is “a command line tool that runs HTTP requests defined in a simple plain text format”—written in Rust on top of curl, it lets you run HTTP requests and then execute assertions against the response, defined using JSONPath or XPath for HTML. It can even assert that responses were returned within a specified duration. # 22nd November 2021, 3:32 am

Many Web3 boost­ers see them­selves as disruptors, but “tokenize all the things” is noth­ing if not an obe­di­ent con­tin­u­a­tion of “market-ize all the things”, the cam­paign started in the 1970s, hugely suc­cessful, ongoing. I think the World Wide Web was the real rupture — “Where … is the money?”—which Web 2.0 smoothed over and Web3 now attempts to seal totally.

Robin Sloan # 18th November 2021, 9:55 pm

Cookiecutter Data Science (via) Some really solid thinking in this documentation for the DrivenData cookiecutter template. They emphasize designing data science projects for repeatability, such that just the src/ and data/ folders can be used to recreate all of the other analysis from scratch. I like the suggestion to give each project a dedicated S3 bucket for keeping immutable copies of the original raw data that might be too large for GitHub. # 18th November 2021, 3:21 pm

Weeknotes: git-history, created for a Git scraping workshop

My main project this week was a 90 minute workshop I delivered about Git scraping at Coda.Br 2021, a Brazilian data journalism conference, on Friday. This inspired the creation of a brand new tool, git-history, plus smaller improvements to a range of other projects.

[... 1239 words]