Blogmarks
Datasette is four years old today. I marked the occasion with a short Twitter thread about the project so far.
Deno Deploy Beta 3 (via) I missed Deno Deploy when it first came out back in June: it’s a really interesting new hosting environment for scripts written in Deno, Node.js creator Ryan Dahl’s re-imagining of Node.js. Deno Deploy runs your code using V8 isolates running in 28 regions worldwide, with a clever BroadcastChannel mechanism (inspired by the browser API of the same name) that allows instances of the server-side code running in different regions to send each other messages. See the “via” link for my annotated version of a demo by Ondřej Žára that got me excited about what it can do.
AWS IAM definitions in Datasette (via) As part of my ongoing quest to conquer IAM permissions, I built myself a Datasette instance that lets me run queries against all 10,441 permissions across 280 AWS services. It’s deployed by a build script running in GitHub Actions which downloads an 8.9MB JSON file from the Salesforce policy_sentry repository; policy_sentry itself creates that JSON file by running an HTML scraper against the official AWS documentation!
A half-hour to learn Rust. I haven’t tried to write any Rust yet but I occasionally find myself wanting to read it, and I find some of the syntax really difficult to get my head around. This article helped a lot: it provides a quick but thorough introduction to most of Rust’s syntax, with clearly explained snippet examples for each one.
An oral history of Bank Python (via) Fascinating description of a very custom Python environment inside a large investment bank—where all of the code lives inside the Python environment itself, everything can be imported into the same process and a directed acyclic graph engine implements Excel-style reactive dependencies. Plenty of extra flavour from people who’ve worked with this (and related) Python systems in the Hacker News comments.
DuckDB-Wasm: Efficient Analytical SQL in the Browser (via) First SQLite, now DuckDB: options for running database engines in the browser using WebAssembly keep on growing. DuckDB means browsers now have a fast, intuitive mechanism for querying Parquet files too. This also supports the same HTTP Range header trick as the SQLite demo from a while back, meaning it can query large databases loaded over HTTP without downloading the whole file.
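The HTTP Range trick mentioned above comes down to asking the server for a slice of the file rather than the whole thing. Here is a minimal sketch in Python of how a client might request a single fixed-size page of a remote database file. The function name, the example URL and the 4096-byte page size are all assumptions for illustration, and this is not how DuckDB-Wasm or the SQLite demo are actually implemented.

```python
import urllib.request


def page_request(url: str, page_number: int, page_size: int = 4096) -> urllib.request.Request:
    """Build (but do not send) a request for one fixed-size page of a remote file.

    page_request, the URL and the page size are hypothetical names and values.
    """
    if page_number < 1:
        raise ValueError("pages are numbered from 1")
    start = (page_number - 1) * page_size
    end = start + page_size - 1  # Range end offsets are inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


req = page_request("https://example.com/data.db", page_number=3)
print(req.get_header("Range"))  # bytes=8192-12287
```

Passing the request to urllib.request.urlopen() would send it; a server that supports range requests answers with 206 Partial Content and only the requested bytes.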
aws-lambda-adapter. AWS Lambda added support for Docker containers last year, but with a very weird shape: you can run anything on Lambda that fits in a Docker container, but unlike Google Cloud Run your application doesn’t get to speak HTTP: it needs to run code that listens for proprietary AWS Lambda events instead. The obvious way to fix this is to run some kind of custom proxy inside the container which turns AWS runtime events into HTTP calls to a regular web application. Serverlessish and re:Web are two open source projects that implemented this, and now AWS have their own implementation of that pattern, written in Rust.
Tonic (via) Really interesting library for building Web Components: it’s tiny (just 350 lines of code), works directly in browsers without any compile or build step and makes very creative use of modern JavaScript features such as async generators.
New HTTP standards for caching on the modern web (via) Cache-Status is a new HTTP header (RFC from August 2021) designed to provide better debugging information about which caches were involved in serving a request—“Cache-Status: Nginx; hit, Cloudflare; fwd=stale; fwd-status=304; collapsed; ttl=300” for example indicates that Nginx served a cache hit, then Cloudflare had a stale cached version so it revalidated from Nginx, got a 304 not modified, collapsed multiple requests (dogpile prevention) and plans to serve the new cached value for the next five minutes. Also described is $Target-Cache-Control: which allows different CDNs to respond to different headers and is already supported by Cloudflare and Akamai (Cloudflare-CDN-Cache-Control: and Akamai-Cache-Control:).
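To make the header example above easier to read, here is a rough Python sketch that splits a Cache-Status value into one dictionary per cache. It is a simplification, not a full structured-fields parser: it assumes no commas or semicolons appear inside quoted strings, it leaves every value as a string, and the function name parse_cache_status is made up for this example.

```python
def parse_cache_status(value: str) -> list[dict]:
    """Parse a Cache-Status header value into one dict per cache, in header order.

    Simplified sketch: assumes no commas or semicolons inside quoted strings.
    """
    caches = []
    for member in value.split(","):
        name, *params = [part.strip() for part in member.split(";")]
        entry = {"cache": name}
        for param in params:
            if not param:
                continue  # tolerate a stray trailing semicolon
            key, sep, val = param.partition("=")
            # A bare parameter such as "hit" is treated as boolean true.
            entry[key] = val if sep else True
        caches.append(entry)
    return caches


header = "Nginx; hit, Cloudflare; fwd=stale; fwd-status=304; collapsed; ttl=300"
for entry in parse_cache_status(header):
    print(entry)
```

The first entry is the cache closest to the origin, so reading the list in order follows the response on its way back to the client.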
Why you shouldn’t invoke setup.py directly (via) Paul Ganssle explains why you shouldn’t use “python setup.py command” any more. I’ve mostly switched to pip and pytest and twine but I was still using “python setup.py sdist”—apparently the new replacement recipe for that is “python -m build”.
Where does all the effort go? Looking at Python core developer activity (via) Łukasz Langa used Datasette to explore 28,780 pull requests made to the CPython GitHub repository, using some custom Python scripts (and sqlite-utils) to load in the data.
Tests aren’t enough: Case study after adding type hints to urllib3. Very thorough write-up by Seth Michael Larson describing what it took for the urllib3 Python library to fully embrace mypy and optional typing and what they learned along the way.
Web Browser Engineering (via) In progress free online book by Pavel Panchekha and Chris Harrelson that demonstrates how a web browser works by writing one from scratch using Python, tkinter and the DukPy wrapper around the Duktape JavaScript interpreter.
How to win at CORS (via) Jake Archibald’s definitive guide to CORS, including a handy CORS playground interactive tool. Also includes a useful history explaining why we need CORS in the first place.
Abusing Terraform to Upload Static Websites to S3 (via) I found this really interesting. Terraform is infrastructure as code software which mostly handles creating and updating infrastructure resources, so it’s a poor fit for uploading files to S3 and setting the correct Content-Type headers for them. But... in figuring out how to do that, this article taught me a ton about how Terraform works. I wonder if that’s a useful general pattern? Get a tool to do something that it’s poorly designed to handle and see how much you learn about that tool along the way.
Writing for distributed teams (via) Vicki Boykis describes how she only sent 11 emails during her first 12 months working at Automattic, because the company culture there revolves around asynchronous communication through durable writing using the P2 custom WordPress theme. “This is a completely different paradigm than I’ve ever worked in, which has been a world usually riddled with information lost to Slack, Confluence, and dozens of email re:re:res.”
The GIL and its effects on Python multithreading (via) Victor Skvortsov presents the most in-depth explanation of the Python Global Interpreter Lock I’ve seen anywhere. I learned a ton from reading this.
django-upgrade (via) Adam Johnson’s new CLI tool for upgrading Django projects by automatically applying changes to counter deprecations made in different versions of the framework. Uses the Python standard library tokenize module which gives it really quick performance in parsing and rewriting Python code. Exciting to see this kind of codemod approach becoming more common in the Python world; JavaScript developers use this kind of thing a lot.
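As a rough illustration of the token-based codemod idea, here is a small Python sketch that uses the standard library tokenize module to rename an identifier while leaving strings and comments alone. This is not django-upgrade’s actual implementation; the function rename_identifier is hypothetical, and the force_text to force_str rename is just a convenient example of a deprecated Django name.

```python
import io
import tokenize


def rename_identifier(source: str, old: str, new: str) -> str:
    """Replace NAME tokens equal to `old` with `new`, preserving everything else."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # Only NAME tokens match, so string literals and comments are never touched.
    hits = [tok for tok in tokens if tok.type == tokenize.NAME and tok.string == old]
    # Work backwards so earlier column offsets stay valid after each splice.
    for tok in reversed(hits):
        row, col = tok.start
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new + line[col + len(old):]
    return "".join(lines)


before = (
    "from django.utils.encoding import force_text\n"
    'x = force_text("force_text")  # force_text\n'
)
print(rename_identifier(before, "force_text", "force_str"))
```

Splicing by token position rather than regenerating the file keeps the original whitespace and formatting intact, which matters for a tool that rewrites other people’s code.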
New tool: an nginx playground. Julia Evans built a sandbox tool for interactively trying out an nginx configuration and executing test requests through it. I love this kind of tool, and Julia’s explanation of how they built it using a tiny fly.io instance and a network namespace to reduce the amount of damage any malicious usage could cause is really interesting.
File not found: A generation that grew up with Google is forcing professors to rethink their lesson plans (via) This is fascinating: as of 2017, university instructors have been increasingly encountering students who have absolutely no idea how files and folders on a computer work. The new generation has a completely different mental model of how applications work, where everything is found using search and data mostly lives inside the application that you use to manipulate it.
Gradually, Garland came to the same realization that many of her fellow educators have reached in the past four years: the concept of file folders and directories, essential to previous generations’ understanding of computers, is gibberish to many modern students.
Introducing Partytown 🎉: Run Third-Party Scripts From a Web Worker (via) This is just spectacularly clever. Partytown is a 6KB JavaScript library that helps you move gnarly, poorly performing third-party scripts out of your main page and into a web worker, so they won’t destroy your page performance. The really clever bit is in how it provides sandboxed access to the page DOM: it uses a devious trick where a proxy object provides getters and setters which then make blocking API calls to a separate service worker, using the mostly-forgotten xhr.open(..., false) parameter that turns off the async default for an XMLHttpRequest call.
egghead screencasting technical guide (via) Detailed guide to producing high quality screencasts—software to use, audio tips, editing workflow—from the egghead.io online instructor platform.
Datasette Desktop 0.1.0 (via) This is the first installable version of the new Datasette Desktop macOS application I’ve been building. Please try it out and leave feedback on Twitter or on the GitHub Discussions thread linked from the release notes.
Making world-class docs takes effort (via) Curl maintainer Daniel Stenberg writes about his principles for good documentation. I agree with all of these: he emphasizes keeping docs in the repo, avoiding the temptation to exclusively generate them from code, featuring examples and ensuring every API you provide has documentation. Daniel describes an approach similar to the documentation unit tests I’ve been using for my own projects: he has scripts which scan the curl documentation to ensure not only that everything is documented but that each documentation area contains the same sections in the same order.
Per-project PostgreSQL (via) Jamey Sharp describes an ingenious way of setting up PostgreSQL instances for each of your local development projects, without depending on an always-running shared localhost database server. The trick is a shell script which creates a PGDATA folder in the current folder and then instantiates a PostgreSQL server in --single single user mode which listens on a Unix domain socket in that folder, instead of listening on the network. Jamey then uses direnv to automatically configure that PostgreSQL, initializing the DB if necessary, for each of his project folders.
API Tokens: A Tedious Survey. Thomas Ptacek reviews different approaches to implementing secure API tokens, from simple random strings stored in a database through various categories of signed token to exotic formats like Macaroons and Biscuits, both new to me.
Macaroons carry a signed list of restrictions with them, but combine it with a mechanism where a client can add their own additional restrictions, sign the combination and pass the token on to someone else.
Biscuits are similar, but “embed Datalog programs to evaluate whether a token allows an operation”.
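The Macaroon mechanism described above can be sketched with chained HMACs: each added restriction is signed using the previous signature as the key, so anyone holding a token can narrow it but nobody can remove a restriction without the root key. This is a simplified Python illustration under my own assumptions (the dict layout, function names and caveat strings are all hypothetical), not a real Macaroons library or wire format.

```python
import hashlib
import hmac


def _chain(key: bytes, message: str) -> bytes:
    """One link in the signature chain: HMAC-SHA256 of message under key."""
    return hmac.new(key, message.encode(), hashlib.sha256).digest()


def mint(root_key: bytes, identifier: str) -> dict:
    """The server creates an unrestricted token signed with its secret root key."""
    return {"identifier": identifier, "caveats": [], "signature": _chain(root_key, identifier)}


def attenuate(token: dict, caveat: str) -> dict:
    """Anyone holding the token can add a restriction without knowing the root key."""
    return {
        "identifier": token["identifier"],
        "caveats": token["caveats"] + [caveat],
        "signature": _chain(token["signature"], caveat),
    }


def verify(root_key: bytes, token: dict, is_satisfied) -> bool:
    """Recompute the chain from the root key, then check every caveat holds."""
    signature = _chain(root_key, token["identifier"])
    for caveat in token["caveats"]:
        signature = _chain(signature, caveat)
    return hmac.compare_digest(signature, token["signature"]) and all(
        is_satisfied(caveat) for caveat in token["caveats"]
    )


root_key = b"server-secret"  # hypothetical secret, known only to the server
narrowed = attenuate(mint(root_key, "user-42"), "action = read")
print(verify(root_key, narrowed, lambda caveat: caveat == "action = read"))
```

Stripping the caveat list from the narrowed token makes verification fail, because the recomputed signature no longer matches the one the token carries.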
SQLModel. A new project by FastAPI creator Sebastián Ramírez: SQLModel builds on top of both SQLAlchemy and the Pydantic validation library to provide a new ORM that’s designed around Python 3’s optional typing. The real brilliance here is that a SQLModel subclass is simultaneously a valid SQLAlchemy ORM model AND a valid Pydantic validation model, saving on duplicate code by allowing the same class to be used both for form/API validation and for interacting with the database.
How Discord Stores Billions of Messages (via) Fascinating article from 2017 describing how Discord migrated their primary message store to Cassandra (from MongoDB, but I could easily see them making the same decision if they had started with PostgreSQL or MySQL).
The trick with scalable NoSQL databases like Cassandra is that you need to have a very deep understanding of the kinds of queries you will need to answer - and Discord had exactly that.
In the article they talk about their desire to eventually migrate to Scylla (a compatible Cassandra alternative written in C++) - in the Hacker News comments they confirm that in 2021 they are using Scylla for a few things but they still have their core messages in Cassandra.
MDN: Subdomain takeovers (via) MDN have a page about subdomain takeover attacks that focuses more on CNAME records: if you have a CNAME pointing to a common delegated hosting provider but haven’t yet provisioned your virtual host there, someone else might beat you to it and use it for an XSS attack.
“Preventing subdomain takeovers is a matter of order of operations in lifecycle management for virtual hosts and DNS.”
I now understand why Google Cloud make you “prove” your ownership of a domain before they’ll let you configure it to host e.g. a Cloud Run instance.
I stumbled across a nasty XSS hole involving DNS A records. Found out today that an old subdomain that I had assigned an IP address to via a DNS A record was serving unexpected content—turned out I’d shut down the associated VPS and the IP had been recycled to someone else, so their content was now appearing under my domain. It strikes me that if you got really unlucky this could turn into an XSS hole—and that new server could even use Let’s Encrypt to obtain an HTTPS certificate for your subdomain.
I’ve added “audit your A records” to my personal security checklist.
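One way such an audit could be automated, sketched in Python: resolve each hostname and flag any that point at an address you no longer control. The function name, the hostnames and the IP addresses below are all made up for illustration (the addresses come from the reserved documentation ranges), and a real audit would need an accurate inventory of the IPs you actually own.

```python
import socket


def find_dangling_records(hostnames, owned_ips, resolve=socket.gethostbyname):
    """Return {hostname: ip} for names that resolve to an address not in owned_ips."""
    dangling = {}
    for hostname in hostnames:
        try:
            ip = resolve(hostname)
        except OSError:
            continue  # name does not resolve, so nothing is being served under it
        if ip not in owned_ips:
            dangling[hostname] = ip
    return dangling


# Hypothetical zone contents, used as a stand-in resolver so no network is needed.
records = {"www.example.com": "192.0.2.10", "old.example.com": "203.0.113.7"}
print(find_dangling_records(records, {"192.0.2.10"}, resolve=records.__getitem__))
```

Leaving the resolve argument at its default performs real IPv4 lookups, which is what you would want when checking a live domain.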