Simon Willison’s Weblog

Items in Mar

datasette-jellyfish. I learned about a handy Python library called Jellyfish which implements approximate and phonetic matching of strings—soundex, metaphone, porter stemming, levenshtein distance and more. I’ve built a simple Datasette plugin which wraps the library and makes each of those algorithms available as a SQL function. # 9th March 2019, 6:29 pm
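
As a rough illustration of what "available as a SQL function" means, here is a minimal sketch using Python's built-in sqlite3 module. It registers a hand-written Levenshtein function rather than Jellyfish itself so that it stays self-contained, and the SQL name `levenshtein_distance` is an assumption for this example, not necessarily the name the plugin uses.

```python
import sqlite3


def levenshtein(a, b):
    """Dynamic-programming edit distance (a stand-in for the Jellyfish version)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under an assumed name, taking 2 arguments.
conn.create_function("levenshtein_distance", 2, levenshtein)
distance = conn.execute(
    "select levenshtein_distance('kitten', 'sitting')"
).fetchone()[0]
print(distance)
```

A Datasette plugin would do the same registration inside a connection-preparation hook, so that every query run through Datasette can call the function.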

Publish the data behind your stories with SQLite and Datasette. I presented a workshop on Datasette at the IRE and NICAR CAR 2019 data journalism conference yesterday. Here’s the worksheet I prepared for the tutorial. # 9th March 2019, 6:27 pm

I commissioned an oil painting of Barbra Streisand’s cloned dogs

Two dogs in a stroller looking at a gravestone, as an oil painting
Two identical puffs of white fur, gazing at the tombstone of the dog they are

[... 517 words]

MySQL: How to get the top N rows for each group. MySQL doesn’t support the row_number() window function that’s available in PostgreSQL (and recent SQLite), which means it can’t easily answer questions like “for each of these authors, give me the most recent three blog entries they have written” in a single query. Only it turns out it can, if you abuse MySQL session variables in a devious way. This isn’t a new feature: MySQL has had this for over a decade, and in my rough testing it works quickly even on tables with millions of rows. # 4th March 2019, 11:38 pm
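
For comparison, this is roughly what the window-function version of that query looks like. It is a sketch run against SQLite through Python's sqlite3 module, not the MySQL session-variable trick described above. The `entries` table and its columns are made up for the example, and it assumes the bundled SQLite is 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table entries (id integer primary key, author text, created text);
    insert into entries (author, created) values
        ('ann', '2019-01-01'), ('ann', '2019-01-05'), ('ann', '2019-02-01'),
        ('ann', '2019-03-01'), ('bob', '2019-01-10'), ('bob', '2019-02-20');
""")

# Number each author's entries from newest to oldest, then keep the first N.
TOP_N_SQL = """
    select author, created from (
        select author, created,
               row_number() over (
                   partition by author order by created desc
               ) as rn
        from entries
    )
    where rn <= ?
    order by author, created desc
"""
rows = conn.execute(TOP_N_SQL, (3,)).fetchall()
for row in rows:
    print(row)
```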

List of Physical Visualizations (via) “A chronological list of physical visualizations and related artifacts, maintained by Pierre Dragicevic and Yvonne Jansen”—327 and counting! # 4th March 2019, 2:45 am

import-pypi. A devious Python 3 hack which abuses importlib.machinery to add a hook such that any time you type “import modulename” it checks to see if the module is installed and runs “pip install modulename” first if it isn’t. Intended as a joke, but if you habitually fire up temporary virtual environments for exploratory programming like I do this could actually be a neat little time-saver. # 29th March 2018, 10:16 pm
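
The general shape of such a hook can be sketched as follows. This is not the import-pypi source: the class name and the `installer` callback are hypothetical, and the demo swaps in a fake installer that writes a stub module to a temporary directory rather than invoking pip.

```python
import importlib
import importlib.abc
import importlib.machinery
import pathlib
import sys
import tempfile


class InstallOnImportFinder(importlib.abc.MetaPathFinder):
    """Last-resort finder: it is only consulted after the normal finders fail."""

    def __init__(self, installer):
        # installer(name) should return True if it installed something.
        self.installer = installer

    def find_spec(self, fullname, path=None, target=None):
        if "." in fullname or not self.installer(fullname):
            return None
        importlib.invalidate_caches()
        # Ask the standard path-based finder directly so we never re-enter ourselves.
        return importlib.machinery.PathFinder.find_spec(fullname, path)


# Demo: a fake installer (an assumption for this sketch) instead of "pip install".
demo_dir = tempfile.mkdtemp()
sys.path.append(demo_dir)


def fake_installer(name):
    if not name.startswith("demo_pkg_"):
        return False
    stub = pathlib.Path(demo_dir, name + ".py")
    stub.write_text("installed_by = 'fake_installer'\n")
    return True


sys.meta_path.append(InstallOnImportFinder(fake_installer))
module = importlib.import_module("demo_pkg_example")
print(module.installed_by)
```

Appending to the end of `sys.meta_path` matters: modules that are already installed are resolved by the regular finders first, so the installer only runs for names that would otherwise raise ImportError.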

The original Reddit source code, written in Lisp in 2005 (via) “If anyone’s interested, I found a hard drive in my garage with the original Reddit Lisp code from 2005. Been looking for it for years. Enjoy.”—spez # 29th March 2018, 10:13 pm

Watching companies gradually realize “blockchain is just super expensive consensus and only makes sense for untrusted counterparties” is a wild, expensive trip

Kyle Kingsbury # 29th March 2018, 9:25 pm

Use The Index, Luke! Paging Through Results (via) The best explanation of keyset pagination I’ve seen. Keyset pagination is where instead of using OFFSET/LIMIT to return the next page of results you instead track the last seen value in the column you sort by and then return the next X results that follow it. This allows you to paginate to arbitrarily deep offsets within a table, whereas OFFSET/LIMIT requires first iterating across all preceding rows and tends to stop working well after the first few thousand results. # 29th March 2018, 5:30 pm
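
A minimal sketch of the idea, using SQLite through Python's sqlite3 module. The `items` table, the `fetch_page` helper and the page size are assumptions for the example; the point is the `where id > ?` clause in place of an OFFSET.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table items (id integer primary key, title text)")
conn.executemany(
    "insert into items (id, title) values (?, ?)",
    [(i, f"item {i}") for i in range(1, 11)],
)


def fetch_page(conn, after_id=0, page_size=4):
    """Return (rows, next_cursor); next_cursor is None on the last page."""
    # Fetch one extra row so we can tell whether another page exists.
    rows = conn.execute(
        "select id, title from items where id > ? order by id limit ?",
        (after_id, page_size + 1),
    ).fetchall()
    has_more = len(rows) > page_size
    rows = rows[:page_size]
    next_cursor = rows[-1][0] if has_more else None
    return rows, next_cursor


pages = []
cursor = 0
while cursor is not None:
    rows, cursor = fetch_page(conn, after_id=cursor)
    pages.append([row[0] for row in rows])
print(pages)
```

Because the filter is on an indexed column, each page costs about the same no matter how deep into the table it is.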

Vega-Lite. A “high-level grammar of interactive graphics”. Part of the Vega project, which provides a mechanism for creating declarative visualizations by defining them using JSON. Vega-Lite is particularly interesting to me because it makes extremely tasteful decisions about how data should be visualized—give it some records, tell it which properties to plot on an axis and it will default to a display that makes sense for that data. The more I play with this the more impressed I am at the quality of its default settings. # 28th March 2018, 5:22 pm
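
To give a feel for how little needs to be specified, here is a sketch of a Vega-Lite bar chart spec built as a Python dictionary and serialized to JSON. The records and field names are invented for illustration, and the schema version in the URL is an assumption; use whichever version matches your renderer.

```python
import json

# Made-up records; any list of flat dictionaries would do.
records = [
    {"department": "Parks", "employees": 120},
    {"department": "Transit", "employees": 340},
    {"department": "Libraries", "employees": 85},
]

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": records},
    "mark": "bar",
    "encoding": {
        # Only the field and its data type are given; scales, axes and
        # labels are left to Vega-Lite's defaults.
        "x": {"field": "department", "type": "nominal"},
        "y": {"field": "employees", "type": "quantitative"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```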

Baltimore Sun Public Salary Records (via) The Baltimore Sun have published an interactive search engine for public salaries of Maryland state employees, and it’s powered by Datasette! Since data journalism is one of my key use-cases for Datasette I’m incredibly excited to see this in the wild. They’ve also published the underlying source code (see the via link) which is a really nice example of how to use Datasette’s custom templates and canned query functionality. # 28th March 2018, 5:12 pm

Charles Proxy now available on iOS (via) I didn’t think this was possible, but the Charles debugging proxy is now available for iOS. It works by setting itself up as a VPN such that all app traffic runs through it. You can also optionally turn on SSL decryption for specific hosts by installing a special certificate (which involves jumping through several hoops). It won’t work for apps that implement SSL certificate pinning but from playing with it for a few minutes it looks like most apps haven’t done that, even apps from Google. Well worth $8.99. # 28th March 2018, 3:57 pm

Cloud-first: Rapid webapp deployment using containers (via) The Research Software Engineering group at ICL have written a tutorial on deploying web apps as Docker containers using Azure and they use Datasette as the example application. # 28th March 2018, 3:50 pm

Touring a Fast, Safe, and Complete(ish) Web Service in Rust. Brandur’s notes from building a high performance web service in Rust, using PostgreSQL via the Diesel ORM and the Rust actix-web framework which provides Erlang-style actors and promise-based async concurrency. # 28th March 2018, 3:47 pm

Describing events in code (via) Phil Gyford built an online directory of every play, movie, gig and exhibition he has been to in the past 38 years using a combination of digital archaeology and saved ticket stubs. He built it using Django and published this piece extensively describing the process he went through to design the data model. # 28th March 2018, 3:41 pm

Building a combined stream of recent additions using the Django ORM

I’m a big believer in the importance of a “recent additions” feed. Any time you’re building an application that involves users adding and editing records it’s useful to have a page somewhere that shows the most recent objects that have been created across multiple different types of data.

[... 1647 words]

Using flamegraphs. I really like flamegraphs as a profiling tool—we have support for them baked into our Tikibar debugging toolbar at Eventbrite—but interpreting them isn’t particularly intuitive on first glance. Julia Evans has put together a great explanation of how to read them as part of the documentation for her rbspy Ruby profiler. # 21st March 2018, 8:56 pm

User-defined Order in SQL (via) This is a fun intellectual exercise: how can one efficiently implement a user-defined order in a SQL table? The obvious initial approach is to have an integer position column, but this means every subsequent row must be updated when an item changes position. Joe “begriffs” Nelson explores some clever alternatives, including floating point or decimal positions (allowing new items to be inserted at a midpoint between existing positions) and a new custom rational number type he built as a PostgreSQL extension. # 21st March 2018, 2:07 pm
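
The midpoint idea can be sketched in a few lines of Python. This uses exact `Fraction` values and a plain average, which is a simplification and not the rational type from the article; the `position_between` helper and the sample items are hypothetical.

```python
from fractions import Fraction


def position_between(before=None, after=None):
    """Pick a sort key strictly between two neighbours (None means an open end)."""
    if before is None and after is None:
        return Fraction(1)
    if before is None:
        return after - 1
    if after is None:
        return before + 1
    return (before + after) / 2


items = {"a": Fraction(1), "b": Fraction(2), "c": Fraction(3)}

# Move "c" between "a" and "b": only its own position is rewritten.
items["c"] = position_between(items["a"], items["b"])
ordered = sorted(items, key=items.get)
print(ordered, items["c"])
```

Exact fractions never run out of room between two neighbours the way floats eventually do, at the cost of denominators that grow with repeated insertions at the same spot.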

Adhering to a plan Moon spelled out more than three decades ago in a series of sermons, members of his movement managed to integrate virtually every facet of the highly competitive seafood industry. The Moon followers’ seafood operation is driven by a commercial powerhouse, known as True World Group. It builds fleets of boats, runs dozens of distribution centers and, each day, supplies most of the nation’s estimated 9,000 sushi restaurants.

Sushi and Rev. Moon # 21st March 2018, 12:52 am

It seems as if you are never ‘hardcore’ enough for YouTube’s recommendation algorithm. It promotes, recommends and disseminates videos in a manner that appears to constantly up the stakes. Given its billion or so users, YouTube may be one of the most powerful radicalising instruments of the 21st century.

Zeynep Tufekci # 20th March 2018, 7:20 pm

Protecting Against HSTS Abuse (via) Any web feature that can be used to persist information will eventually be used to build super-cookies. In this case it’s HSTS—a web feature that allows sites to tell browsers “in the future always load this domain over HTTPS even if the request specified HTTP”. The WebKit team caught this being exploited in the wild, by encoding a user identifier in binary across 32 separate sub domains. They have a couple of mitigations in place now—I expect other browser vendors will follow suit. # 19th March 2018, 10:21 pm
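
To see why 32 subdomains are enough, here is a sketch of the encoding on its own: each subdomain stands for one bit, and whether the browser has remembered "always use HTTPS" for it is the bit's value. The host naming scheme and the `example.invalid` domain are invented for illustration and are not taken from the WebKit report.

```python
def hosts_for_id(user_id, bits=32, domain="example.invalid"):
    """Hosts that would receive an HSTS flag: one per 1-bit in the identifier."""
    return [
        f"bit{i:02d}.{domain}"
        for i in range(bits)
        if (user_id >> i) & 1
    ]


def id_from_hosts(upgraded_hosts):
    """Rebuild the identifier from the hosts the browser upgraded to HTTPS."""
    value = 0
    for host in upgraded_hosts:
        index = int(host.split(".")[0][3:])  # "bit05" -> 5
        value |= 1 << index
    return value


flagged = hosts_for_id(5)
print(flagged, id_from_hosts(flagged))
```

Thirty-two on/off flags give over four billion distinct values, which is why limiting how many hosts can set HSTS state through redirects or subresources is an effective mitigation.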

How to use HDF5 files in Python (via) HDF5: the missing manual. A detailed explanation of the HDF5 file format and how to work with it using the h5py module. HDF5 allows you to efficiently store multiple datasets (plus metadata about them) in a single file and then load data from them without pulling the entire file into memory, kind of like SQLite but without the SQL support and more optimized for working with arrays. # 19th March 2018, 2:55 pm

Trio Tutorial. Trio is a really nice async library for Python—a simpler alternative to asyncio, with some very clean API design. Best of all, the tutorial is fantastic—it provides a very clear explanation of async/await without diving into the intricacies of coroutines. # 17th March 2018, 3:55 pm

Everyone can now run JavaScript on Cloudflare with Workers. This is such a brilliant piece of software design: Cloudflare took the service workers spec and used it as the basis for their edge-executed JavaScript feature. This means you can run server-side JavaScript in hundreds of edge locations worldwide, applying custom dynamic logic (including additional async cached fetch() calls) with only around 1ms of additional overhead. The pricing model is a steal: $0.50 per million requests with a $5/month minimum. # 13th March 2018, 4:36 pm

Being fast and light: Using binary data to optimise libraries on the client and the server. (via) Ada Rose Cannon provides a detailed introduction to ArrayBuffers in JavaScript and describes how she used them for a custom binary protocol to sync the state of 170 Virtual Reality users in the same venue without bringing down the network. # 13th March 2018, 2:34 pm
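
The article works with JavaScript ArrayBuffers; as a rough Python analogue, the `struct` module shows why a fixed binary layout is so compact. The record layout below (a 16-bit user id plus three 32-bit floats) is a hypothetical one chosen for this sketch, not the protocol from the article.

```python
import struct

# Hypothetical record: user id (uint16) + x, y, z position (float32),
# little-endian with no padding, so 14 bytes per user.
USER_STATE = struct.Struct("<H3f")


def pack_states(states):
    """Serialize [(user_id, (x, y, z)), ...] into one bytes payload."""
    return b"".join(
        USER_STATE.pack(user_id, x, y, z) for user_id, (x, y, z) in states
    )


def unpack_states(payload):
    """Inverse of pack_states."""
    return [
        (user_id, (x, y, z))
        for user_id, x, y, z in USER_STATE.iter_unpack(payload)
    ]


states = [(uid, (uid * 0.5, 1.25, -2.0)) for uid in range(170)]
payload = pack_states(states)
print(len(payload), "bytes for", len(states), "users")
```

The same state sent as JSON would spend most of its bytes on key names and decimal digits, which adds up quickly when every user's position is broadcast many times a second.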

Consider Bitcoin a grand middle finger. It’s a prank, almost a parody of the global financial system, that turned into a bubble. “You plutocrats of Davos may think you control the global money supply,” the pranksters seem to say. “But humans will make an economy out of anything. Even this!”

Paul Ford # 10th March 2018, 11:34 am

BAD TRAFFIC: Sandvine’s PacketLogic Devices Used to Deploy Government Spyware in Turkey and Redirect Egyptian Users to Affiliate Ads? “Targeted users in Turkey and Syria who downloaded Windows applications from official vendor websites including Avast Antivirus, CCleaner, Opera, and 7-Zip were silently redirected to malicious versions by way of injected HTTP redirects. This redirection was possible because official websites for these programs, even though they might have supported HTTPS, directed users to non-HTTPS downloads by default.” # 10th March 2018, 10:40 am

Real-time photogrammetry with #ARKit. Astonishing photogrammetry demo by Tim Field using ARKit in iOS 11.3. # 10th March 2018, 10:32 am

I’m still a novice to the healthcare space, but if I walked away with a single insight, it’s that the problems of the US healthcare system are very tractable. The high cost and mixed results are unique to our system. There are incumbents fighting fiercely to maintain the status quo, but no more so than in other industries that technology has overturned. The regulatory environment is complex, but again not uniquely so. There are industries where one has to dig to find the problems that technology is well suited to solve, but US healthcare, an industry that communicates via fax, is not one of them.

Kellan Elliott-McCrea # 10th March 2018, 1:11 am

Upgrades to Facebook’s link security (via) Facebook have started scanning links shared on the site for HSTS headers, which are used to indicate that an HTTP page is also available over HTTPS and are intended to be cached by browsers such that future HTTP access is automatically retrieved over HTTPS instead. Facebook will now obey those headers itself and link directly to the HTTPS version. What a great idea: all sites with sophisticated link sharing (where links are fetched to retrieve extracts and images for example) should do this as well. # 5th March 2018, 3:32 pm
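
A link-sharing service could apply the same idea along these lines. This is a sketch, not Facebook's implementation: `parse_hsts` and `upgrade_link` are hypothetical helpers, the header value is assumed to have been fetched from the host's HTTPS response already, and explicit ports in the URL are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def parse_hsts(header_value):
    """Parse a Strict-Transport-Security value into a small policy dict."""
    policy = {"max_age": None, "include_subdomains": False}
    for directive in header_value.split(";"):
        name, _, value = directive.partition("=")
        name = name.strip().lower()
        if name == "max-age":
            policy["max_age"] = int(value.strip().strip('"'))
        elif name == "includesubdomains":
            policy["include_subdomains"] = True
    return policy


def upgrade_link(url, hsts_header):
    """Rewrite http:// to https:// when the host advertises an active policy."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not hsts_header:
        return url
    policy = parse_hsts(hsts_header)
    if not policy["max_age"]:  # missing or max-age=0 means no active policy
        return url
    return urlunsplit(parts._replace(scheme="https"))


print(upgrade_link(
    "http://example.com/page?x=1",
    "max-age=31536000; includeSubDomains",
))
```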