Simon Willison’s Weblog

Subscribe

Items in Jun, 2021

Filters: Year: 2021 × Month: Jun × Sorted by date


In 2015, the men controlling 80% of Bitcoin mining stood on stage together at a conference. Three or four entities have run Bitcoin mining since then. The only thing preventing miner misbehaviour is wanting to avoid spooking the suckers — it’s completely trust-based. Bitcoin now uses a country’s worth of electricity for no actual reason. You could do the transactions on a 2007 iPhone.

David Gerard # 28th June 2021, 5:32 pm

Weeknotes: sqlite-utils updates, Datasette and asgi-csrf, open-sourcing VIAL

Some work on sqlite-utils, asgi-csrf, a Datasette alpha and we open-sourced VIAL.

[... 662 words]

Group thousands of similar spreadsheet text cells in seconds (via) Luke Whyte explains how to efficiently group similar text columns in a table (Walmart and Wal-mart for example) using a clever combination of TF/IDF, sparse matrices and cosine similarity. Includes the clearest explanation of cosine similarity for text I’ve seen—and Luke wrote a Python library, textpack, that implements the described pattern. # 27th June 2021, 4:24 pm

A Datasette tutorial in Portuguese. Nicolás Linares put together this Datasette tutorial in Portuguese, including an explanation of the project, how to get it up and running on a laptop, how to use it to explore and facet data, how to use plugins (including datasette-vega and datasette-cluster-map) and how to publish data using Vercel. I ran this through Google Translate and I can confirm that it’s a really well constructed tutorial—fantastic to see material like this starting to emerge in languages other than English. # 25th June 2021, 10:57 pm

Querying Parquet using DuckDB (via) DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files—e.g. “SELECT COUNT(*) FROM ’taxi_2019_04.parquet’”. Performance against large files is fantastic, and the whole thing can be installed just using “pip install duckdb”. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this? # 25th June 2021, 10:40 pm

PostgreSQL: nbtree/README (via) The PostgreSQL source tree includes beatifully written README files for different parts of PostgreSQL. Here’s the README for their btree implementation—it continues to be actively maintained (last change was is March) and “git blame” shows that parts of the file date back 25 years, to 1996! # 25th June 2021, 6:09 pm

Hierarchical Structures in PostgreSQL (via) Two techniques I hadn’t seen before: the first is to define a materialized view using a CTE that offers efficient tree queries against a PostgreSQL array of path components (plus a trigger to update the materialized view), the second is with the PostgreSQL ltree extension which ships as part of PostgreSQL and hence should be widely available. # 25th June 2021, 5:19 pm

Notes on streaming large API responses

I started a Twitter conversation last week about API endpoints that stream large amounts of data as an alternative to APIs that return 100 results at a time and require clients to paginate through all of the pages in order to retrieve all of the data:

[... 1692 words]

Django for Startup Founders: A better software architecture for SaaS startups and consumer apps (via) The opening section of this article has very little to do with Django: it’s an insightful description of the technical challenges faced by a startup that is still seeking product-market fit. Alex then extends that into his own architectural recommendations for startups building with Django to help waste as little time as possible on problems that aren’t core to the product they are building. # 24th June 2021, 8:43 pm

A framework for building Open Graph images. GitHub’s new social preview images are generated by a Node.js script that fetches data from their GraphQL API, generates an HTML version of the card and then grabs a PNG snapshot of it using Puppeteer. It takes an average of 280ms to serve an image and generates around 2 million unique images a day. Interestingly, they found that bumping the available RAM from 512MB up to 513MB had a big effect on performance, because Chromium detects devices on 512MB or less and switches some processes from parallel to sequential. # 22nd June 2021, 9:25 pm

What I’ve learned about data recently (via) Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job. # 22nd June 2021, 5:09 pm

GitLab Culture: The phases of remote adaptation. GitLab claim to be “the world’s largest all-remote company”—1300 employees across 65 countries, with not a single physical office. Lots of interesting thinking in this article about different phases a company can go through to become truly remote-first. “Maximally efficient remote environments will do as little work as possible synchronously, instead focusing the valuable moments where two or more people are online at the same time on informal communication and bonding.” They also expire their Slack messages after 90 days to force critical project information into documents and issue threads. # 22nd June 2021, 12:37 am

Joining CSV and JSON data with an in-memory SQLite database

The new sqlite-utils memory command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.

[... 1507 words]

Multi-region PostgreSQL on Fly (via) Really interesting piece of architectural design from Fly here. Fly can run your application (as a Docker container run using Firecracker) in multiple regions around the world, and they’ve now quietly added PostgreSQL multi-region support. The way it works is that all-but-one region can have a read-only replica, and requests sent to application servers can perform read-only queries against their local region’s replica. If a request needs to execute a SQL update your application code can return a “fly-replay: region=scl” HTTP header and the Fly CDN will transparently replay the request against the region containing the leader database. This also means you can implement tricks like setting a 10s expiring cookie every time the user performs a write, such that their requests in the next 10s will go straight to the leader and avoid them experiencing any replication lag that hasn’t caught up with their latest update. # 17th June 2021, 6:39 pm

Best Practices Around Production Ready Web Apps with Docker Compose (via) I asked on Twitter for some tips on Docker Compose and was pointed to this article by Nick Janetakis, which has a whole host of useful tips and patterns I hadn’t encountered before. # 12th June 2021, 2:36 am

Weeknotes: New releases across nine different projects

A new release and security patch for Datasette, plus releases of sqlite-utils, datasette-auth-passwords, django-sql-dashboard, datasette-upload-csvs, xml-analyser, datasette-placekey, datasette-mask-columns and db-to-sqlite.

[... 861 words]

I saw millions compromise their Facebook accounts to fuel fake engagement. Sophie Zhang, ex-Facebook, describes how millions of Facebook users have signed up for “autolikers”—programs that promise likes and engagement for their posts, in exchange for access to their accounts which are then combined into the larger bot farm and used to provide likes to other posts. “Self-compromise was a widespread problem, and possibly the largest single source of existing inauthentic activity on Facebook during my time there. While actual fake accounts can be banned, Facebook is unwilling to disable the accounts of real users who share their accounts with a bot farm.” # 9th June 2021, 3:40 pm

When I was a performance consultant I’d show up to random companies who wanted me to fix their computer performance issues. If they trusted me with a login to their production servers, I could help them a lot quicker. To get that trust I knew which tools looked but didn’t touch: Which were observability tools and which were experimental tools. “I’ll start with observability tools only” is something I’d say at the start of every engagement.

Brendan Gregg # 8th June 2021, 7:33 pm

An incomplete list of skills senior engineers need, beyond coding. By Camille Fournier, author of my favourite book on engineering management “The Manager’s Path”. Number one is “How to run a meeting, and no, being the person who talks the most in the meeting is not the same thing as running it”. # 6th June 2021, 10:17 pm

Apple’s tightly controlled App Store is teeming with scams. I’m quoted in an article in the Washington Post today (linked at the top of the homepage!) explaining how I got scammed on the App Store and spent $19 on a TV remote app with a similar name to the official Samsung app. I mistakenly assumed that the App Store review process wouldn’t allow an app called “Smart Things” to show up in search when I was looking for SmartThings, the official name—and assumed that Samsung were nickel-and-diming their customers rather than expecting the App Store review process to have failed so obviously. # 6th June 2021, 10:13 pm

The humble hash aggregate (via) Today I learned that “hash aggregate” is the name for the algorithm where you split a list of tuples on a common key, run an aggregation against each resulting group and combine the results back together again—I’d previously thought if this in terms of map/reduce but hash aggregate is a much older term used widely by SQL engines—I’ve seen it come up in PostgreSQL explain query output (for GROUP BY) before but didn’t know what it meant. # 6th June 2021, 4:03 pm

Reflected cross-site scripting issue in Datasette (via) Here’s the GitHub security advisory I published for the XSS hole in Datasette. The fix is available in versions 0.57 and 0.56.1, both released today. # 5th June 2021, 11:14 pm

Datasette 0.57. Released today, Datasette 0.57 has new options for controlling which columns are visible on a table page, a way to show more than the default 30 facet results, a whole bunch of smaller improvements and a fix for a severe cross-site scripting security vulnerability. # 5th June 2021, 11:12 pm

Weeknotes: Docker architectures, sqlite-utils 3.7, nearly there with Datasette 0.57

This week I learned a whole bunch about using Docker to emulate different architectures, released sqlite-utils 3.7 and made a ton of progress towards the almost-ready-to-ship Datasette 0.57.

[... 1081 words]

I’m pretty convinced that the biggest single contributor to improved software in my lifetime wasn’t object-orientation or higher-level languages or functional programming or strong typing or MVC or anything else: It was the rise of testing culture.

Tim Bray # 1st June 2021, 2:35 pm