Simon Willison’s Weblog

Blogmarks in Jul

Filters: Type: blogmark × Month: Jul ×


Why Your Christian Friends and Family Members Are So Easily Fooled by Conspiracy Theories (via) I think the title undersells this: this is a really great piece of writing on conspiracy theories, why people fall for them and why it’s so hard to dig people back out again—regardless of any particular religion, despite being written for a Christian audience. # 31st July 2020, 6:28 pm

Sandboxing and Workload Isolation (via) Fly.io run other people’s code in containers, so workload isolation is a Big Deal for them. This blog post goes deep into the history of isolation and the various different approaches you can take, and fills me with confidence that the team at Fly.io know their stuff. I got to the bottom and found it had been written by Thomas Ptacek, which didn’t surprise me in the slightest. # 30th July 2020, 10:19 pm

How GPT3 Works—Visualizations and Animations. Nice essay full of custom animations illustrating how GPT-3 actually works. # 30th July 2020, 12:58 am

Some SQL Tricks of an Application DBA (via) This post taught me so many PostgreSQL tricks that I hadn’t seen before. Did you know you can start a transaction, drop an index, run explain and then rollback the transaction (cancelling the index drop) to see what explain would look like without that index? Among other things I also learned what the “correlation” database statistic does: it’s a measure of how close-to-sorted the values in a specific column are, which helps PostgreSQL decide if it should do an index scan or a bitmap scan when making use of an index. # 29th July 2020, 7:04 pm

datasette-media 0.4. datasette-media is my Datasette plugin for serving media (e.g. images) directly from Datasette. The first version used file paths saved in a column and served the data from disk—this new version adds the ability to serve content from BLOB columns, such as those created by the new “sqlite-utils insert-files” command. It also adds configurable support for resizing images based on querystring parameters like ?w=100. # 28th July 2020, 2:22 am

sqlite-utils 2.12 (via) I’ve been experimenting with ways of improving BLOB support in Datasette and sqlite-utils. This new version of sqlite-utils includes a “sqlite-utils insert-files” command, which can recursively crawl directories for files and add their contents to SQLite with configurable columns containing their metadata. I was inspired by Paul Ford who has been creating multi-GB SQLite databases of images and PDFs. It turns out that when disk space is cheap this is a pretty effective way of working with interesting corpuses of documents and images. # 27th July 2020, 7:36 am

pypi-rename. I wanted to rename a PyPI package (renaming datasette-insert-api to datasette-insert as it’s about to grow some non-API features). PyPI recommend uploading a final release under the old name which points to (and depends on) the new name. I’ve built a cookiecutter template to codify that pattern. # 25th July 2020, 11:07 pm

PostgreSQL full-text search in the Django Admin. Today I figured out how to use PostgreSQL full-text search in the Django admin for my blog, using the get_search_results method on a subclass of ModelAdmin. # 25th July 2020, 11:05 pm

Doing Stupid Stuff with GitHub Actions (via) I love the idea here of running a scheduled action once a year that deliberately fails, causing GitHub to send you a “Happy New Year” failure email! # 25th July 2020, 9:19 pm

The unofficial Google Cloud Run FAQ. This is really useful: a no-fluff, content rich explanation of Google Cloud Run hosted as a GitHub repo that actively accepts pull requests from the community. It’s maintained by Ahmet Alp Balkan, a Cloud Run engineer who states “Googlers: If you find this repo useful, you should recognize the work internally, as I actively fight for alternative forms of content like this”. One of the hardest parts of working with AWS and GCP is digging through the marketing materials to figure out what the product actually does, so the more alternative forms of documentation like this the better. # 22nd July 2020, 5:20 pm

22 Principles for Great Product Managers (via) By Alex Reeve, a PM at LinkedIn. These are really strong—I particularly liked the “leading your team” section which emphasizes ensuring your team understand the goal and the path to reach it, and that you know what winning will look like and how to tell. # 20th July 2020, 8:17 pm

Tempering Expectations for GPT-3 and OpenAI’s API. Insightful commentary on GPT-3 (which is producing some ridiculously cool demos at the moment thanks to the invite-only OpenAI API) from Max Woolf. # 18th July 2020, 7:29 pm

Develomentor podcast: Simon Willison – Data Journalism, The Importance of Side Projects (via) Grant Ingersoll interviewed me for the Develomentor podcast. We talked about my career so far, and how much of it was driven by side-projects that I’ve worked on individually or with Natalie. # 17th July 2020, 1:33 am

datasette-auth-passwords. My latest plugin: datasette-auth-passwords provides a mechanism for signing into Datasette using a username and password (which is verified in order to set a ds_actor authentication cookie). So far it only supports passwords that are hard-coded into Datasette’s configuration via environment variables, but I plan to add database-backed user accounts in the future. # 13th July 2020, 11:39 pm

zhiiiyang/zhiiiyang profile README (via) This is a brilliant hack: a GitHub profile README that uses an action to retrieve the author’s latest tweet (using R), render it as a PNG screenshot in headless Chrome via rstudio/webshot2 and embed that image in their profile. # 11th July 2020, 5:47 pm

When data is messy. I love this story: a neural network trained on images was asked what the most significant pixels in pictures of tench (a kind of fish) were: it returned pictures of fingers on a green background, because most of the tench photos it had seen were fisherfolk showing off their catch. # 7th July 2020, 7:03 pm

GitHub Actions: Manual triggers with workflow_dispatch (via) New GitHub Actions feature which fills a big gap in the offering: you can now create “workflow dispatch” events which provide a button for manually triggering an action—and you can specify extra UI form fields that can customize how that action runs. This turns Actions into an interactive automation engine for any code that can be wrapped in a Docker container. # 7th July 2020, 4:33 am

sba-loans-covid-19-datasette (via) The treasury department released a bunch of data on the Covid-19 SBA Paycheck Protection Program Loan recipients today—I’ve loaded the most interesting data (the $150,000+ loans) into a Datasette instance. # 7th July 2020, 2:42 am

How to find what you want in the Django documentation (via) Useful guide by Matthew Segal to navigating the Django documentation, and tips for reading documentation in general. The Django docs have a great reputation so it’s easy to forget how intimidating they can be for newcomers: Matthew emphasizes that docs are rarely meant to be read in full: the trick is learning how to quickly search them for the things you need to understand right now. # 3rd July 2020, 3:04 pm

Better Python Decorators with wrapt (via) Adam Johnson explains the intricacies of decorating a Python function without breaking the ability to correctly introspect it, and dicsusses how Scout use the wrapt library by Graham Dumpleton to implement their instrumentation library. # 2nd July 2020, 9:48 pm

entr: rerun your build when files change. “WHY DID NOBODY TELL ME ABOUT THIS BEFORE?!?!” is one of my favourite genres of blog post. # 1st July 2020, 3:58 pm

Repository driven development (via) I’m already a big fan of keeping documentation and code in the same repo so you can update them both from within the same code review, but this takes it even further: in repository driven development every aspect of the code and configuration needed to define, document, test and ship a service live in the service repository—all the way down to the configurations for reporting dashboards. This sounds like heaven. # 24th July 2019, 8:41 am

Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite. This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser. # 24th July 2019, 8:25 am

Targeted diagnostic logging in production (via) Will Sargent defines diagnostic logging as “debug logging statements with an audience”, and proposes controlling this style if logging via a feature flat system to allow detailed logging to be turned on in production against a selected subset if users in order to help debug difficult problems. Lots of great background material in the topic of observability here too. # 24th July 2019, 5:44 am

When a rewrite isn’t: rebuilding Slack on the desktop. Slack appear to have pulled off the almost impossible: finishing a complete, incremental rewrite of their core product. They moved from jQuery to React over the course of two years, constantly shipping new features as they went along. The biggest gain was in rewriting their code to support multiple workspaces, which means desktop client users no longer have to run a separate copy of Electron for every workspace they are signed into. # 22nd July 2019, 6:30 pm

healthkit-to-sqlite. Ever since I got an Apple Watch I’ve been itching to get my hands on the step tracking and health data that it’s been collecting for me. I know it’s there in a SQLite database on my wrist, but I couldn’t figure out how to get it! A few days ago I stumbled across the “Export Health Data” button in the iOS Health app, and it turns out it creates a zip file containing XML with a full dump of the data collected by Apple Health. healthkit-to-sqlite is the tool I’ve built that can read that export and use it to create a SQLite database ready to be queried and explored with Datasette. It’s a pretty basic implementation but it’s already giving me access to over 3 million rows of data. Lots of potential here for interesting work with personal analytics. # 22nd July 2019, 3:34 am

Unlocking the Department of State’s foreign military training data for good this time (via) I’m so excited about this: Security Force Monitor used Datasette to publish a 200,000 row database of training engagements between the US military and foreign military units, based on their own massive efforts to clean up the official data (from thousands of PDF files). This is pretty much my dream use-case for Datasette, and their future goals are inspiring: “Our hope is that when the next report arrives in a short few months, we will be able to turn it into machine readable data and pass it around the sector in minutes, rather than months.” # 20th July 2019, 4:42 am

Details of the Cloudflare outage on July 2, 2019 (via) Best retrospective I’ve read in a long time. The outage was caused by a backtracking regex rule that was added to the Web Application Firewall project, which rolls out globally and skips most of Cloudflare’s regular graduar rollout process (delightfully animal themed, named DOG for the dogfooding PoP that their employees use, PIG for the Guinea Pig PoPs reserved for free customers, then Canary for the final step) so that they can deploy counter-measures to newly discovered vulnerabilities as quickly as possible—but the real value in the retro is that it provides an extremely deep insight into how Cloudflare organize, test and manage their changes. Really interesting stuff. # 12th July 2019, 5:36 pm

datasette-cors (via) My other Datasette ASGI plugin: this one wraps my asgi-cors project and lets you configure CORS access from a list of domains (or a set of domain wildcards) so you can make JavaScript calls to a Datasette instance from a specific set of other hosts. # 8th July 2019, 4:30 am

datasette-auth-github (via) My first big ASGI plugin for Datasette: datasette-auth-github adds the ability to require users to authenticate against the GitHub OAuth API. You can whitelist specific users, or you can restrict access to members of specific GitHub organizations or teams. While it’s structured as a Datasette plugin it also includes ASGI middleware which can be applied to any ASGI application. # 8th July 2019, 4:28 am