Simon Willison’s Weblog

June 2021

54 posts: 5 entries, 17 links, 3 quotes, 29 beats

June 13, 2021

Release markdown-to-sqlite 1.0 — CLI tool for loading markdown files into a SQLite database
Release yaml-to-sqlite 1.0 — Utility for converting YAML files to SQLite
Release dogsheep-beta 0.10.2 — Build a search index across content from multiple SQLite database tables and run faceted searches against it using Datasette

June 17, 2021

Multi-region PostgreSQL on Fly (via) Really interesting piece of architectural design from Fly here. Fly can run your application (as a Docker container run using Firecracker) in multiple regions around the world, and they’ve now quietly added PostgreSQL multi-region support. The way it works is that all-but-one region can have a read-only replica, and requests sent to application servers can perform read-only queries against their local region’s replica. If a request needs to execute a SQL update your application code can return a “fly-replay: region=scl” HTTP header and the Fly CDN will transparently replay the request against the region containing the leader database. This also means you can implement tricks like setting a 10s expiring cookie every time the user performs a write, such that their requests in the next 10s will go straight to the leader and avoid them experiencing any replication lag that hasn’t caught up with their latest update.
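
The routing decision described above can be sketched in a few lines of Python. This is a minimal illustration, not Fly's own code: the cookie name `recent_write`, the `route_request` helper and the rule that any non-GET/HEAD/OPTIONS request counts as a write are all assumptions made for the example. Only the `fly-replay: region=scl` header and the 10 second window come from the description.

```python
from http.cookies import SimpleCookie

LEADER_REGION = "scl"           # region used in the example header above
STICKY_SECONDS = 10             # cookie lifetime from the trick described above
STICKY_COOKIE = "recent_write"  # hypothetical cookie name


def route_request(method, current_region, cookie_header=""):
    """Decide whether to handle a request locally or ask for a replay.

    Returns (handle_locally, extra_response_headers).
    """
    cookies = SimpleCookie(cookie_header)
    # Assumption: treat anything other than a safe HTTP method as a write.
    is_write = method.upper() not in ("GET", "HEAD", "OPTIONS")
    wrote_recently = STICKY_COOKIE in cookies

    if current_region != LEADER_REGION and (is_write or wrote_recently):
        # Ask the proxy to replay this request in the leader's region.
        return False, [("fly-replay", f"region={LEADER_REGION}")]

    headers = []
    if is_write:
        # Pin this user's next few requests to the leader to hide replica lag.
        headers.append(
            ("Set-Cookie", f"{STICKY_COOKIE}=1; Max-Age={STICKY_SECONDS}; Path=/")
        )
    return True, headers
```

A read in a replica region is served locally, a write there is replayed, and a write handled in the leader region sets the short-lived cookie so that follow-up reads are replayed too.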

# 6:39 pm / postgresql, replication, scaling, fly

June 19, 2021

Release sqlite-utils 3.10 — Python CLI utility and library for manipulating SQLite databases
TIL Mouse support in vim — Today I learned that if you hit `Esc` in vim and then type `:set mouse=a` and hit enter... vim grows mouse support! In your terminal!

Joining CSV and JSON data with an in-memory SQLite database

The new sqlite-utils memory command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.
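
The underlying idea can be sketched with nothing but the Python standard library. This is not how `sqlite-utils memory` is implemented and it does not use its command-line options; the `load_rows` helper, the table names and the sample data are all made up for illustration.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for a CSV file and a JSON file.
CSV_TEXT = "id,name\n1,Cleo\n2,Pancakes\n"
JSON_TEXT = (
    '[{"dog_id": 1, "toy": "ball"},'
    ' {"dog_id": 1, "toy": "rope"},'
    ' {"dog_id": 2, "toy": "bone"}]'
)


def load_rows(conn, table, rows):
    """Create a table from a list of dicts and insert every row."""
    rows = list(rows)
    columns = list(rows[0].keys())
    column_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )


def join_example():
    conn = sqlite3.connect(":memory:")
    load_rows(conn, "dogs", csv.DictReader(io.StringIO(CSV_TEXT)))
    load_rows(conn, "toys", json.loads(JSON_TEXT))
    # CSV values arrive as text, so cast the id before comparing it
    # with the integer dog_id that came from JSON.
    sql = """
        SELECT dogs.name, COUNT(*) AS toy_count
        FROM dogs
        JOIN toys ON toys.dog_id = CAST(dogs.id AS INTEGER)
        GROUP BY dogs.name
        ORDER BY dogs.name
    """
    return conn.execute(sql).fetchall()
```

Calling `join_example()` returns one row per dog with the number of matching toys, computed entirely in memory.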

[... 1,507 words]

June 20, 2021

Release sqlite-utils 3.11 — Python CLI utility and library for manipulating SQLite databases

June 21, 2021

TIL Scraping Reddit via their JSON API — Reddit have long had an unofficial (I think) API where you can add `.json` to the end of any URL to get back the data for that page as JSON.
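
As a small sketch of that trick, here is a helper that turns a page URL into its `.json` equivalent. The function name and the subreddit used in the usage note are illustrative, and stripping a trailing slash before appending the suffix is a choice made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def reddit_json_url(page_url):
    """Append .json to the path of a URL, keeping any query string intact."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

For example, `reddit_json_url("https://www.reddit.com/r/python/")` gives a URL ending in `/r/python.json`, which can then be fetched with any HTTP client.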

June 22, 2021

GitLab Culture: The phases of remote adaptation. GitLab claim to be “the world’s largest all-remote company”—1300 employees across 65 countries, with not a single physical office. Lots of interesting thinking in this article about different phases a company can go through to become truly remote-first. “Maximally efficient remote environments will do as little work as possible synchronously, instead focusing the valuable moments where two or more people are online at the same time on informal communication and bonding.” They also expire their Slack messages after 90 days to force critical project information into documents and issue threads.

# 12:37 am / management, remote, gitlab

What I’ve learned about data recently (via) Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

# 5:09 pm / data, big-data, data-science, laurie-voss

A framework for building Open Graph images. GitHub’s new social preview images are generated by a Node.js script that fetches data from their GraphQL API, generates an HTML version of the card and then grabs a PNG snapshot of it using Puppeteer. It takes an average of 280ms to serve an image and generates around 2 million unique images a day. Interestingly, they found that bumping the available RAM from 512MB up to 513MB had a big effect on performance, because Chromium detects devices on 512MB or less and switches some processes from parallel to sequential.

# 9:25 pm / github, nodejs, puppeteer

June 23, 2021

Release asgi-csrf 0.9 — ASGI middleware for protecting against CSRF attacks

June 24, 2021

Release datasette 0.58a1 — An open source multi-tool for exploring and publishing data

Django for Startup Founders: A better software architecture for SaaS startups and consumer apps (via) The opening section of this article has very little to do with Django: it’s an insightful description of the technical challenges faced by a startup that is still seeking product-market fit. Alex then extends that into his own architectural recommendations for startups building with Django to help waste as little time as possible on problems that aren’t core to the product they are building.

# 8:43 pm / django, startups

June 25, 2021

Notes on streaming large API responses

I started a Twitter conversation last week about API endpoints that stream large amounts of data as an alternative to APIs that return 100 results at a time and require clients to paginate through all of the pages in order to retrieve all of the data:

[... 1,692 words]
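
The contrast in this entry can be sketched as two small functions: one returns a bounded page plus a pointer to the next one, the other yields every row as it goes. Newline-delimited JSON is an assumed output format here, and both function names are hypothetical.

```python
import json


def fetch_page(rows, offset, page_size=100):
    """Paginated style: return one slice and where the next slice starts."""
    page = rows[offset:offset + page_size]
    has_more = offset + page_size < len(rows)
    return {"results": page, "next": offset + page_size if has_more else None}


def stream_ndjson(rows):
    """Streaming style: yield one JSON line per row.

    The full response is never held in memory at once.
    """
    for row in rows:
        yield json.dumps(row) + "\n"
```

With 250 rows a paginating client has to make three requests, while a streaming client reads 250 lines from a single response. A web framework would typically wrap a generator like `stream_ndjson` in its own streaming response type.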

Hierarchical Structures in PostgreSQL (via) Two techniques I hadn’t seen before: the first is to define a materialized view using a CTE that offers efficient tree queries against a PostgreSQL array of path components (plus a trigger to update the materialized view), the second is with the PostgreSQL ltree extension which ships as part of PostgreSQL and hence should be widely available.

# 5:19 pm / postgresql, sql

Release sqlite-utils 3.12 — Python CLI utility and library for manipulating SQLite databases

PostgreSQL: nbtree/README (via) The PostgreSQL source tree includes beautifully written README files for different parts of PostgreSQL. Here’s the README for their btree implementation. It continues to be actively maintained (last change was in March) and “git blame” shows that parts of the file date back 25 years, to 1996!

# 6:09 pm / computer-science, databases, postgresql

Querying Parquet using DuckDB (via) DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files, e.g. `SELECT COUNT(*) FROM 'taxi_2019_04.parquet'`. Performance against large files is fantastic, and the whole thing can be installed just using `pip install duckdb`. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this?

# 10:40 pm / python, parquet, duckdb

A Datasette tutorial in Portuguese. Nicolás Linares put together this Datasette tutorial in Portuguese, including an explanation of the project, how to get it up and running on a laptop, how to use it to explore and facet data, how to use plugins (including datasette-vega and datasette-cluster-map) and how to publish data using Vercel. I ran this through Google Translate and I can confirm that it’s a really well constructed tutorial—fantastic to see material like this starting to emerge in languages other than English.

# 10:57 pm / datasette

June 27, 2021

Group thousands of similar spreadsheet text cells in seconds (via) Luke Whyte explains how to efficiently group similar text columns in a table (Walmart and Wal-mart for example) using a clever combination of TF/IDF, sparse matrices and cosine similarity. Includes the clearest explanation of cosine similarity for text I’ve seen—and Luke wrote a Python library, textpack, that implements the described pattern.
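
A rough pure-Python sketch of the general approach follows. It is not the textpack implementation: character trigrams, the smoothed IDF formula, the greedy grouping loop and the 0.8 threshold are all assumptions chosen for the example, and plain dictionaries stand in for sparse matrices.

```python
import math
import re
from collections import Counter


def ngrams(text, n=3):
    """Lowercase, drop non-alphanumerics, then split into character n-grams."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return [cleaned[i:i + n] for i in range(max(len(cleaned) - n + 1, 1))]


def tfidf_vectors(values, n=3):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per value."""
    docs = [Counter(ngrams(v, n)) for v in values]
    # Number of documents that contain each n-gram.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        vec = {
            gram: count * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, count in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


def group_similar(values, threshold=0.8):
    """Greedily attach each value to the first group it resembles."""
    vectors = tfidf_vectors(values)
    groups = []  # lists of indexes; the first index represents the group
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return [[values[i] for i in group] for group in groups]
```

Given `["Walmart", "Wal-mart", "Target", "walmart"]`, the three Walmart spellings end up in one group and `"Target"` in another. Comparing every value against every group representative will not scale to thousands of cells, which is where sparse matrix multiplication earns its place.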

# 4:24 pm / python, data-science

June 28, 2021

Weeknotes: sqlite-utils updates, Datasette and asgi-csrf, open-sourcing VIAL

Some work on sqlite-utils, asgi-csrf, a Datasette alpha and we open-sourced VIAL.

[... 662 words]

In 2015, the men controlling 80% of Bitcoin mining stood on stage together at a conference. Three or four entities have run Bitcoin mining since then. The only thing preventing miner misbehaviour is wanting to avoid spooking the suckers — it’s completely trust-based. Bitcoin now uses a country’s worth of electricity for no actual reason. You could do the transactions on a 2007 iPhone.

David Gerard

# 5:32 pm / bitcoin
