Simon Willison’s Weblog

Blogmarks

Bare columns in an aggregate query. This is a really nice SQL tweak implemented in SQLite: If you run a query like “SELECT a, b, max(c) FROM tab1 GROUP BY a” then, for each group, SQLite will find the row with the highest value for c and use the columns of that row as the returned values for the other columns mentioned in the query.
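
A minimal sketch of that behaviour using Python's built-in sqlite3 module. The table name and query come from the example above; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab1 (a TEXT, b TEXT, c INTEGER)")
# Illustrative rows: two in group "x", one in group "y"
conn.executemany(
    "INSERT INTO tab1 VALUES (?, ?, ?)",
    [
        ("x", "low", 1),
        ("x", "high", 9),
        ("y", "only", 5),
    ],
)

# b is a "bare" column: neither aggregated nor listed in GROUP BY.
# SQLite takes its value from the row that supplied max(c) in each group.
rows = conn.execute(
    "SELECT a, b, max(c) FROM tab1 GROUP BY a ORDER BY a"
).fetchall()
print(rows)  # [('x', 'high', 9), ('y', 'only', 5)]
```

Most other databases reject this query outright, because b is not part of the GROUP BY clause.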

# 10th August 2021, 1:29 am / sql, sqlite

Everything new in Datasette since January, plus Django SQL Dashboard. I sent out the first Datasette newsletter since late January, covering everything that’s new in Datasette and sqlite-utils this year and introducing Django SQL Dashboard.

# 10th August 2021, 1:28 am / datasette, sqlite-utils, django-sql-dashboard

The World of CSS Transforms. Comprehensive, clearly explained tutorial on CSS transforms by Josh W. Comeau, with some very neat interactive demos. I hadn’t understood how useful it is that the translate() transform treats percentages as applying to the dimensions of the element being transformed, not its parent. This means you can use expressions like transform: translateX(calc(100% + 4px)); to shift an element by its entire width plus a few more pixels.

# 9th August 2021, 2:30 pm / css, josh-comeau

Stanford School Enrollment Project (via) This is Project Pelican: I’ve been working with the Big Local News team at Stanford helping bundle up and release the data they’ve been collecting on school enrollment statistics around the USA. This Datasette instance has data from 33 states for every year since 2015—3.3m rows total. Be sure to check out the accompanying documentation!

# 8th August 2021, 12:23 am / data-journalism, journalism, datasette

Running GitHub on Rails 6.0. Back in 2019 Eileen M. Uchitelle explained how GitHub upgraded everything in production to Rails 6.0 within 1.5 weeks of the stable release. There’s a trick in here I really like: they have an automated weekly job which fetches the latest Rails main branch and runs the full GitHub test suite against it, giving them super-early warnings about anything that might break and letting them provide feedback to upstream about unintended regressions.

# 6th August 2021, 4:30 pm / continuous-integration, github, rails

Breaking Changes to the Web Platform (via) “Over the years there have been necessary changes to the web platform that caused legacy websites to break.”—this list is thankfully very short, only 11 items so far. Let’s hope it stays that way!

# 6th August 2021, 6:32 am / web

OkCupid had a CSRF vulnerability (via) Good write-up of a (now fixed) CSRF vulnerability on OkCupid. Their site worked by POSTing JSON objects to an API. JSON POSTs are usually protected against CSRF because they can only be sent using fetch() or XMLHttpRequest, which are protected by the same-origin policy. Yan Zhu notes that you can use the enctype="text/plain" attribute on a form (introduced in HTML5) and a crafty hidden input element with name='{"foo":"' value='bar"}' to construct JSON in an off-site form, which enabled CSRF attacks.
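
To see why the trick works, here is a rough simulation of how a browser serializes a form with enctype="text/plain": each field becomes a name=value line. The text_plain_body() helper is hypothetical and simplified, and is not taken from the write-up.

```python
import json


def text_plain_body(fields):
    """Simplified model of text/plain form encoding: one name=value line per field."""
    return "".join(f"{name}={value}\r\n" for name, value in fields)


# The hidden input from the attack: the name opens the JSON, the value closes it
fields = [('{"foo":"', 'bar"}')]
body = text_plain_body(fields)
print(body.strip())  # {"foo":"=bar"}

# The "=" separator ends up inside the string value, so the body is valid JSON
parsed = json.loads(body)
print(parsed)  # {'foo': '=bar'}
```

A server that parses the request body as JSON without checking the Content-Type header would accept this as a normal API call.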

# 2nd August 2021, 10:12 pm / csrf, security

Clickhouse on Cloud Run (via) Alex Reid figured out how to run Clickhouse against read-only baked data on Cloud Run last year, and wrote up some comprehensive notes.

# 29th July 2021, 6:07 am / cloudrun, baked-data, clickhouse

How the Python import system works (via) Remarkably detailed and thorough dissection of how exactly import, modules and packages work in Python—eventually digging right down into the C code. Part of Victor Skvortsov’s excellent “Python behind the scenes” series.

# 24th July 2021, 8:12 pm / python

The Tyranny of Spreadsheets (via) In discussing the notorious Excel incident last year when the UK lost track of 16,000 Covid cases due to a .xls row limit, Tim Harford presents a history of the spreadsheet, dating all the way back to Francesco di Marco Datini and double-entry bookkeeping in 1396. A delightful piece of writing.

# 23rd July 2021, 3:57 am / history, spreadsheets, covid19

Launch HN Instructions (via) The instructions for YC companies that are posting their launch announcement on Hacker News are really interesting to read. “As founders, you’re used to talking to users, customers, and investors. HN readers are not any of those—what they are is peers, and using any of those styles with peers feels clueless and entitled. [...] To interest HN, write in a factual, personal, and modest way about what problem you solve, why it matters, how you solve it, and how you got there.”

# 19th July 2021, 1:05 am / hacker-news, marketing, y-combinator

toyDB: references. toyDB is a “distributed SQL database in Rust, written as a learning project”, with its own implementations of SQL, raft, ACID transactions, B+trees and more. toyDB author Erik Grinaker has assembled a detailed set of references that he used to learn how to build a database—I’d love to see more projects do this, it’s really useful.

# 19th July 2021, 12:18 am / databases, rust

Inserting One Billion Rows in SQLite Under a Minute (via) Avinash Sajjanshetty experiments with accelerating writes to a test table in SQLite, using various SQLite pragmas to accelerate inserts followed by a rewrite of Python code to Rust. Also of note: running the exact same code in PyPy saw a 3.5x speed-up!
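
A small sketch of the general approach in Python: relax SQLite's durability settings with pragmas, then insert in a single transaction. The specific pragmas, the table schema and the fast_insert() name are assumptions for illustration and may not match the ones used in the article.

```python
import sqlite3


def fast_insert(path, n_rows):
    """Insert n_rows into a test table with durability traded away for speed."""
    conn = sqlite3.connect(path)
    # Assumed settings: these skip the rollback journal and fsync calls,
    # so a crash mid-write can corrupt the database file.
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA temp_store = MEMORY")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY, value INTEGER)"
    )
    # One transaction for the whole batch instead of one commit per row
    with conn:
        conn.executemany(
            "INSERT INTO test (value) VALUES (?)",
            ((i,) for i in range(n_rows)),
        )
    count = conn.execute("SELECT count(*) FROM test").fetchone()[0]
    conn.close()
    return count


inserted = fast_insert(":memory:", 100_000)
print(inserted)
```

These settings only make sense for data that can be regenerated from scratch if the write fails part-way through.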

# 19th July 2021, 12:13 am / pypy, sqlite, rust

Organize and Index Your Screenshots (OCR) on macOS (via) Alexandru Nedelcu has a very neat recipe for creating an archive of searchable screenshots on macOS: set the default save location for screenshots to a Dropbox folder, then create a launch agent that runs a script against new files in that folder to run tesseract OCR to convert them into a searchable PDF.

# 18th July 2021, 4:11 pm / macos, ocr

Datasette downloads per day (with Observable Plot) (via) I built an Observable notebook that imports PyPI package download data from datasette.io (itself scraped from pypistats.org using a scheduled GitHub Action) and plots it using Observable Plot. Datasette downloads from PyPI apparently jumped from ~800/day in May to ~4,000/day in July—would love to know why!

# 17th July 2021, 5:01 pm / pypi, datasette, observable, observable-plot

The Digital Antiquarian: Sam and Max Hit the Road. Delightful history and retrospective review of 1993’s Sam and Max Hit the Road. I didn’t know Sam and Max happened because the independent comic’s creator worked for LucasArts and the duo had embedded themselves in LucasArts culture through their use in the internal educational materials prepared for SCUMM University.

# 17th July 2021, 3:12 am / game-design, games, history

Last Mile Redis (via) Fly.io article about running a local Redis cache in each of their geographic regions: “Cache data overlaps a lot less than you assume it will. For the most part, people in Singapore will rely on a different subset of data than people in Amsterdam or São Paulo or New Jersey.” They then note that Redis has the ability to act as both a replica of a primary AND a writable server at the same time (“replica-read-only no”), which actually makes sense for a cache: it lets you cache local data but send out cluster-wide cache purges if necessary.

# 17th July 2021, 2:44 am / caching, redis, fly

The Untold Story of SQLite With Richard Hipp. This is a really interesting interview with SQLite creator D. Richard Hipp—it covers all sorts of aspects of the SQLite story I hadn’t heard before, from its inspiration by a software challenge on a battleship to the first income from clients such as AOL and Symbian to the formation of the SQLite Consortium (based on advice from Mozilla’s Mitchell Baker) and more.

# 16th July 2021, 8:12 pm / podcasts, sqlite, d-richard-hipp

Dropbox: Sharing our Engineering Career Framework with the world (via) Dropbox have published their engineering career framework, with detailed descriptions of the different levels of the engineering (as opposed to management) career track and what is expected for each one. I’m fascinated by how different companies handle the challenge of keeping career progression working for engineers without pushing them into people management, and this is a particularly detailed and well thought-out implementation of that.

# 13th July 2021, 11:31 pm / kellan-elliott-mccrea, careers, dropbox

RabbitMQ Streams Overview. New in RabbitMQ 3.9: streams are a persisted, replicated append-only log with non-destructive consuming semantics. Sounds like it fits the same hole as Kafka and Redis Streams, an extremely useful pattern.

# 13th July 2021, 11:29 pm / message-queues, rabbitmq, redis, kafka

Behind the scenes, AWS Lambda (via) Bruno Schaatsbergen pulled together details about how AWS Lambda works under the hood from a detailed review of the AWS documentation, the Firecracker paper and various talks at AWS re:Invent.

# 10th July 2021, 7:40 pm / aws, lambda, software-architecture, firecracker

The data team: a short story (via) Erik Bernhardsson’s fictional account (“I guess I should really call this a parable”) of a new data team leader successfully growing their team and building a data-first culture in a medium-sized technology company. His depiction of the initial state of the company (data in many different places, frustrated ML researchers who can’t get their research into production, confusion over what the data team is actually for) definitely rings true to me.

# 8th July 2021, 11:12 pm / data, data-science, leadership

Probably Are Gonna Need It: Application Security Edition (via) Jacob Kaplan-Moss shares his PAGNIs for application security: “basic security mitigations that are easy to do at the beginning, but get progressively harder the longer you put them off”. Plenty to think about in here—I particularly like Jacob’s recommendation to build a production-to-staging database mirroring solution that works from an allow-list of columns, to avoid the risk of accidentally exposing new private data as the product continues to evolve.
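
The allow-list idea can be sketched with SQLite standing in for both databases. The ALLOWED_COLUMNS mapping, the mirror() function and the users table are all hypothetical; a real implementation would target the production database engine and would likely scrub values too.

```python
import sqlite3

# Hypothetical allow-list: only these tables and columns are ever copied.
# A column added to production later stays private until someone lists it here.
ALLOWED_COLUMNS = {"users": ["id", "username"]}


def mirror(prod, staging, allowed):
    """Copy only allow-listed columns from prod into freshly created staging tables."""
    for table, columns in allowed.items():
        col_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        staging.execute(f'CREATE TABLE "{table}" ({col_list})')
        rows = prod.execute(f'SELECT {col_list} FROM "{table}"')
        staging.executemany(
            f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
        )
    staging.commit()


# Demo data: email and password_hash are not on the allow-list
prod = sqlite3.connect(":memory:")
prod.execute(
    "CREATE TABLE users (id INTEGER, username TEXT, email TEXT, password_hash TEXT)"
)
prod.execute("INSERT INTO users VALUES (1, 'alice', 'alice@example.com', 'x1y2')")
prod.commit()

staging = sqlite3.connect(":memory:")
mirror(prod, staging, ALLOWED_COLUMNS)

cursor = staging.execute("SELECT * FROM users")
staging_columns = [d[0] for d in cursor.description]
staging_rows = cursor.fetchall()
print(staging_columns, staging_rows)  # ['id', 'username'] [(1, 'alice')]
```

The alternative, a deny-list of sensitive columns, fails open: anything nobody remembered to exclude gets copied.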

# 8th July 2021, 6:31 pm / jacob-kaplan-moss, security, pagni

Temporal: getting started with JavaScript’s new date time API. Axel Rauschmayer explores the new proposed API for handling dates, times and timezones in JavaScript, which is under development by Ecma TC39 at the moment (and is available as a polyfill, which you are advised not to run in production since the API is still being figured out). This is a notoriously difficult problem so it’s always interesting to see the latest thinking on how to best address it.

# 7th July 2021, 10:29 pm / datetime, javascript, timezones

The art of asking nicely (via) CLIP+VQGAN is a GAN that generates images based on some text input. You can run it on Google Colab notebooks, and there are instructions linked at the bottom of this post. Janelle Shane of AI Weirdness explores tricks for getting the best results out of it for “a herd of sheep grazing on a lush green hillside”: various modifiers like “amazing awesome and epic” produce better images, but the one with the biggest impact, quite upsettingly, is “ultra high definition free desktop wallpaper”.

# 2nd July 2021, 3:02 pm / machine-learning, ai

Smooth sailing with Kubernetes (via) Scott McCloud (of Understanding Comics) authored this comic introduction to Kubernetes, and it’s a really good explanation of the core concepts. I’d love to have something like this for Datasette—I still feel like I’m a long way from being able to explain the project with anything like this amount of clarity.

# 1st July 2021, 11:30 pm / comics, kubernetes

YAGNI exceptions (via) Luke Plant provides his collection of things that you probably ARE going to need in a project, where adding them later is painful enough that it’s worth the up-front investment. I really like these as a concept, and I’m coining the term PAGNI—for Probably Are Gonna Need It—to describe them.

# 1st July 2021, 6:30 pm / luke-plant, software-engineering, yagni, pagni

Django SQL Dashboard 1.0 (via) As part of my ongoing attempt to be braver about 1.0 releases (crucial if you want to do semantic versioning properly) I’ve released version 1.0 of Django SQL Dashboard, my Datasette-inspired app for Django that adds an interface for running read-only, bookmarkable SQL queries against a PostgreSQL database. The new version adds a column cog menu providing shortcuts for changing the sort order, counting distinct values and performing a group-by/count against column values.

# 1st July 2021, 5:44 pm / django, projects, sql, django-sql-dashboard

Group thousands of similar spreadsheet text cells in seconds (via) Luke Whyte explains how to efficiently group similar text columns in a table (Walmart and Wal-mart for example) using a clever combination of TF/IDF, sparse matrices and cosine similarity. Includes the clearest explanation of cosine similarity for text I’ve seen—and Luke wrote a Python library, textpack, that implements the described pattern.
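
A standard-library-only sketch of the core idea: turn each string into TF/IDF-weighted character trigrams, then compare them with cosine similarity. This does not use textpack, swaps the sparse matrices for plain dicts, and the trigram size and IDF smoothing are assumptions, so treat it as an illustration rather than the article's implementation.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Split a lowercased string into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs):
    """Return one unit-length {ngram: weight} dict per document."""
    counts = [Counter(ngrams(d)) for d in docs]
    # Document frequency: in how many documents each n-gram appears
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors is their dot product."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ["Walmart", "Wal-mart", "Target"]
walmart, wal_mart, target = tfidf_vectors(names)

sim_close = cosine(walmart, wal_mart)  # shares the trigrams "wal", "mar", "art"
sim_far = cosine(walmart, target)      # shares no trigrams
print(round(sim_close, 3), sim_far)
```

Grouping then comes down to picking a similarity threshold and merging the strings whose score exceeds it.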

# 27th June 2021, 4:24 pm / python, data-science

A Datasette tutorial in Portuguese. Nicolás Linares put together this Datasette tutorial in Portuguese, including an explanation of the project, how to get it up and running on a laptop, how to use it to explore and facet data, how to use plugins (including datasette-vega and datasette-cluster-map) and how to publish data using Vercel. I ran this through Google Translate and I can confirm that it’s a really well constructed tutorial—fantastic to see material like this starting to emerge in languages other than English.

# 25th June 2021, 10:57 pm / datasette
