Simon Willison’s Weblog

Blogmarks

Probably Are Gonna Need It: Application Security Edition (via) Jacob Kaplan-Moss shares his PAGNIs for application security: “basic security mitigations that are easy to do at the beginning, but get progressively harder the longer you put them off”. Plenty to think about in here—I particularly like Jacob’s recommendation to build a production-to-staging database mirroring solution that works from an allow-list of columns, to avoid the risk of accidentally exposing new private data as the product continues to evolve.
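A minimal sketch of the allow-list idea, using SQLite from the Python standard library as a stand-in for the real production and staging databases. The `ALLOWED_COLUMNS` mapping, the `users` table and its columns are all hypothetical; the point is only that a column added to production later (here `email`) is not copied unless someone explicitly adds it to the list.

```python
import sqlite3

# Hypothetical allow-list: table name -> columns considered safe to copy.
ALLOWED_COLUMNS = {
    "users": ["id", "username", "created"],
}


def mirror_allowed_columns(source, target, allowed=ALLOWED_COLUMNS):
    """Copy only the allow-listed columns of each table from source to target."""
    for table, columns in allowed.items():
        column_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        target.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')
        # Columns that are not named here never leave the source database.
        rows = source.execute(f'SELECT {column_list} FROM "{table}"')
        target.executemany(
            f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
            rows,
        )
    target.commit()


def demo():
    """Mirror a tiny fake production database and list the staging columns."""
    production = sqlite3.connect(":memory:")
    production.execute(
        "CREATE TABLE users (id INTEGER, username TEXT, created TEXT, email TEXT)"
    )
    production.execute(
        "INSERT INTO users VALUES (1, 'alice', '2021-07-08', 'alice@example.com')"
    )
    staging = sqlite3.connect(":memory:")
    mirror_allowed_columns(production, staging)
    return [row[1] for row in staging.execute("PRAGMA table_info(users)")]
```

Calling `demo()` returns the column names that reached staging; the private `email` column is absent because it was never on the list.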

# 8th July 2021, 6:31 pm / jacob-kaplan-moss, security, pagni

Temporal: getting started with JavaScript’s new date time API. Axel Rauschmayer explores the proposed new API for handling dates, times and timezones in JavaScript, which is under development by Ecma TC39 at the moment (and is available as a polyfill, which you are advised not to run in production since the API is still being figured out). This is a notoriously difficult problem so it’s always interesting to see the latest thinking on how to best address it.

# 7th July 2021, 10:29 pm / datetime, javascript, timezones

The art of asking nicely (via) CLIP+VQGAN is a GAN that generates images based on some text input. You can run it on Google Colab notebooks; there are instructions linked at the bottom of this post. Janelle Shane of AI Weirdness explores tricks for getting the best results out of it for “a herd of sheep grazing on a lush green hillside”: various modifiers like “amazing awesome and epic” produce better images, but the one with the biggest impact, quite upsettingly, is “ultra high definition free desktop wallpaper”.

# 2nd July 2021, 3:02 pm / machine-learning, ai

Smooth sailing with Kubernetes (via) Scott McCloud (of Understanding Comics) authored this comic introduction to Kubernetes, and it’s a really good explanation of the core concepts. I’d love to have something like this for Datasette—I still feel like I’m a long way from being able to explain the project with anything like this amount of clarity.

# 1st July 2021, 11:30 pm / comics, kubernetes

YAGNI exceptions (via) Luke Plant provides his collection of things that you probably ARE going to need in a project, where adding them later is painful enough that it’s worth the up-front investment. I really like these as a concept, and I’m coining the term PAGNI—for Probably Are Gonna Need It—to describe them.

# 1st July 2021, 6:30 pm / luke-plant, software-engineering, yagni, pagni

Django SQL Dashboard 1.0 (via) As part of my ongoing attempt to be braver about 1.0 releases (crucial if you want to do semantic versioning properly) I’ve released version 1.0 of Django SQL Dashboard, my Datasette-inspired app for Django that adds an interface for running read-only, bookmarkable SQL queries against a PostgreSQL database. The new version adds a column cog menu providing shortcuts for changing the sort order, counting distinct values and performing a group-by/count against column values.

# 1st July 2021, 5:44 pm / django, projects, sql, django-sql-dashboard

Group thousands of similar spreadsheet text cells in seconds (via) Luke Whyte explains how to efficiently group similar text columns in a table (Walmart and Wal-mart for example) using a clever combination of TF/IDF, sparse matrices and cosine similarity. Includes the clearest explanation of cosine similarity for text I’ve seen—and Luke wrote a Python library, textpack, that implements the described pattern.
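A rough standard-library sketch of the TF/IDF plus cosine similarity combination. This is not textpack's implementation: the use of character trigrams, the smoothed IDF formula and the function names are all assumptions made for illustration, and a plain dict per string stands in for a real sparse matrix.

```python
import math
from collections import Counter


def trigrams(text):
    """Split a lowercased string into overlapping three-character chunks."""
    cleaned = text.lower()
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


def tfidf_vectors(values):
    """Return one sparse TF-IDF vector (a dict of trigram -> weight) per string."""
    counts = [Counter(trigrams(value)) for value in values]
    # Number of strings each trigram appears in (each string counted once).
    document_frequency = Counter(gram for c in counts for gram in c)
    total = len(values)
    vectors = []
    for c in counts:
        vectors.append({
            # Assumed smoothed IDF: rarer trigrams get a larger weight.
            gram: tf * (math.log((1 + total) / (1 + document_frequency[gram])) + 1)
            for gram, tf in c.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With `tfidf_vectors(["Walmart", "Wal-mart", "Target"])`, the first two vectors share several trigrams and score well above zero, while "Walmart" and "Target" share none and score exactly zero. Grouping then amounts to picking a similarity threshold.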

# 27th June 2021, 4:24 pm / python, data-science

A Datasette tutorial in Portuguese. Nicolás Linares put together this Datasette tutorial in Portuguese, including an explanation of the project, how to get it up and running on a laptop, how to use it to explore and facet data, how to use plugins (including datasette-vega and datasette-cluster-map) and how to publish data using Vercel. I ran this through Google Translate and I can confirm that it’s a really well constructed tutorial—fantastic to see material like this starting to emerge in languages other than English.

# 25th June 2021, 10:57 pm / datasette

Querying Parquet using DuckDB (via) DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files, e.g. “SELECT COUNT(*) FROM 'taxi_2019_04.parquet'”. Performance against large files is fantastic, and the whole thing can be installed just using “pip install duckdb”. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this?

# 25th June 2021, 10:40 pm / python, parquet, duckdb

PostgreSQL: nbtree/README (via) The PostgreSQL source tree includes beautifully written README files for different parts of PostgreSQL. Here’s the README for their btree implementation. It continues to be actively maintained (the last change was in March) and “git blame” shows that parts of the file date back 25 years, to 1996!

# 25th June 2021, 6:09 pm / computer-science, databases, postgresql

Hierarchical Structures in PostgreSQL (via) Two techniques I hadn’t seen before: the first is to define a materialized view using a CTE that offers efficient tree queries against a PostgreSQL array of path components (plus a trigger to update the materialized view), the second is with the PostgreSQL ltree extension which ships as part of PostgreSQL and hence should be widely available.

# 25th June 2021, 5:19 pm / postgresql, sql

Django for Startup Founders: A better software architecture for SaaS startups and consumer apps (via) The opening section of this article has very little to do with Django: it’s an insightful description of the technical challenges faced by a startup that is still seeking product-market fit. Alex then extends that into his own architectural recommendations for startups building with Django to help waste as little time as possible on problems that aren’t core to the product they are building.

# 24th June 2021, 8:43 pm / django, startups

A framework for building Open Graph images. GitHub’s new social preview images are generated by a Node.js script that fetches data from their GraphQL API, generates an HTML version of the card and then grabs a PNG snapshot of it using Puppeteer. It takes an average of 280ms to serve an image and generates around 2 million unique images a day. Interestingly, they found that bumping the available RAM from 512MB up to 513MB had a big effect on performance, because Chromium detects devices on 512MB or less and switches some processes from parallel to sequential.

# 22nd June 2021, 9:25 pm / github, nodejs, puppeteer

What I’ve learned about data recently (via) Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

# 22nd June 2021, 5:09 pm / data, big-data, data-science, laurie-voss

GitLab Culture: The phases of remote adaptation. GitLab claim to be “the world’s largest all-remote company”—1300 employees across 65 countries, with not a single physical office. Lots of interesting thinking in this article about different phases a company can go through to become truly remote-first. “Maximally efficient remote environments will do as little work as possible synchronously, instead focusing the valuable moments where two or more people are online at the same time on informal communication and bonding.” They also expire their Slack messages after 90 days to force critical project information into documents and issue threads.

# 22nd June 2021, 12:37 am / management, remote, gitlab

Multi-region PostgreSQL on Fly (via) Really interesting piece of architectural design from Fly here. Fly can run your application (as a Docker container run using Firecracker) in multiple regions around the world, and they’ve now quietly added PostgreSQL multi-region support. The way it works is that all-but-one region can have a read-only replica, and requests sent to application servers can perform read-only queries against their local region’s replica. If a request needs to execute a SQL update your application code can return a “fly-replay: region=scl” HTTP header and the Fly CDN will transparently replay the request against the region containing the leader database. This also means you can implement tricks like setting a 10s expiring cookie every time the user performs a write, such that their requests in the next 10s will go straight to the leader and avoid them experiencing any replication lag that hasn’t caught up with their latest update.
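The routing logic described above can be sketched as a small framework-independent function. Only the “fly-replay: region=scl” header comes from the description; the cookie name, the constants, and the shortcut of treating any non-GET/HEAD/OPTIONS request as a write are assumptions for illustration, and a real application would decide based on whether it actually needs to run a SQL update.

```python
import time

LEADER_REGION = "scl"        # assumption: region that hosts the leader database
STICKY_SECONDS = 10          # how long to pin a user to the leader after a write
COOKIE_NAME = "wrote_until"  # hypothetical cookie holding a Unix timestamp


def route_request(method, current_region, cookies, now=None):
    """Return (action, headers) for one request.

    action is "replay" when the request should be replayed in the leader
    region, and "handle" when this region can serve it.
    """
    now = time.time() if now is None else now
    # Assumption: the HTTP method is used as a stand-in for "needs a write".
    is_write = method.upper() not in ("GET", "HEAD", "OPTIONS")
    # A recent write means the local replica may not have caught up yet.
    recently_wrote = float(cookies.get(COOKIE_NAME, 0)) > now

    if current_region != LEADER_REGION and (is_write or recently_wrote):
        return "replay", {"fly-replay": f"region={LEADER_REGION}"}

    headers = {}
    if is_write:
        # Handled on the leader: remember the write for the next few seconds.
        expires = int(now) + STICKY_SECONDS
        headers["Set-Cookie"] = (
            f"{COOKIE_NAME}={expires}; Max-Age={STICKY_SECONDS}; Path=/"
        )
    return "handle", headers
```

A read in a replica region is handled locally, a write there is replayed to the leader, and a read that arrives within ten seconds of a write is also replayed so the user sees their own update.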

# 17th June 2021, 6:39 pm / postgresql, replication, scaling, fly

Best Practices Around Production Ready Web Apps with Docker Compose (via) I asked on Twitter for some tips on Docker Compose and was pointed to this article by Nick Janetakis, which has a whole host of useful tips and patterns I hadn’t encountered before.

# 12th June 2021, 2:36 am / docker

I saw millions compromise their Facebook accounts to fuel fake engagement. Sophie Zhang, ex-Facebook, describes how millions of Facebook users have signed up for “autolikers”—programs that promise likes and engagement for their posts, in exchange for access to their accounts which are then combined into the larger bot farm and used to provide likes to other posts. “Self-compromise was a widespread problem, and possibly the largest single source of existing inauthentic activity on Facebook during my time there. While actual fake accounts can be banned, Facebook is unwilling to disable the accounts of real users who share their accounts with a bot farm.”

# 9th June 2021, 3:40 pm / facebook, social-media

An incomplete list of skills senior engineers need, beyond coding. By Camille Fournier, author of my favourite book on engineering management “The Manager’s Path”. Number one is “How to run a meeting, and no, being the person who talks the most in the meeting is not the same thing as running it”.

# 6th June 2021, 10:17 pm / careers, management, camillefournier

Apple’s tightly controlled App Store is teeming with scams. I’m quoted in an article in the Washington Post today (linked at the top of the homepage!) explaining how I got scammed on the App Store and spent $19 on a TV remote app with a similar name to the official Samsung app. I mistakenly assumed that the App Store review process wouldn’t allow an app called “Smart Things” to show up in search when I was looking for SmartThings, the official name—and assumed that Samsung were nickel-and-diming their customers rather than expecting the App Store review process to have failed so obviously.

# 6th June 2021, 10:13 pm / appstore, scams, washington-post, press-quotes

The humble hash aggregate (via) Today I learned that “hash aggregate” is the name for the algorithm where you split a list of tuples on a common key, run an aggregation against each resulting group and combine the results back together again. I’d previously thought of this in terms of map/reduce, but hash aggregate is a much older term used widely by SQL engines. I’ve seen it come up in PostgreSQL explain query output (for GROUP BY) before but didn’t know what it meant.
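A toy illustration of the idea in Python, not how any particular SQL engine implements it. The function name and its parameters are made up for this sketch, and it folds each row into a running per-key state in a dict (the hash table) rather than materialising the groups first.

```python
def hash_aggregate(rows, key, initial, step):
    """Group rows by key(row) in a hash table, folding each row into its group."""
    groups = {}
    for row in rows:
        k = key(row)
        # Look up the running state for this key and fold the new row into it.
        groups[k] = step(groups.get(k, initial), row)
    return groups


# Roughly: SELECT animal, SUM(n) FROM rows GROUP BY animal
rows = [("cat", 3), ("dog", 5), ("cat", 4)]
totals = hash_aggregate(
    rows,
    key=lambda row: row[0],
    initial=0,
    step=lambda acc, row: acc + row[1],
)
```

Here `totals` ends up with one entry per distinct animal, each holding the sum of its rows. Swapping `step` for a counter or a max gives the equivalent of COUNT or MAX.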

# 6th June 2021, 4:03 pm / algorithms, mapreduce, sql

Reflected cross-site scripting issue in Datasette (via) Here’s the GitHub security advisory I published for the XSS hole in Datasette. The fix is available in versions 0.57 and 0.56.1, both released today.

# 5th June 2021, 11:14 pm / security, xss, datasette

Datasette 0.57. Released today, Datasette 0.57 has new options for controlling which columns are visible on a table page, a way to show more than the default 30 facet results, a whole bunch of smaller improvements and a fix for a severe cross-site scripting security vulnerability.

# 5th June 2021, 11:12 pm / projects, datasette

explain.dalibo.com (via) By far the best tool I’ve seen for turning the output of PostgreSQL EXPLAIN ANALYZE into something I can actually understand. It produces a tree visualization which includes clear explanations of what each step (such as an “Index Only Scan Node”) actually means.

# 28th May 2021, 5:41 pm / postgresql

M1RACLES: M1ssing Register Access Controls Leak EL0 State. You need to read (or at least scan) all the way to the bottom: this security disclosure is a masterpiece. It not only describes a real flaw in the M1 silicon but also deconstructs the whole culture of over-hyped name-branded vulnerability reports. The TLDR is that you don’t really need to worry about this one, and if you’re writing this kind of thing up for a news article you should read all the way to the end first!

# 26th May 2021, 3:25 pm / journalism, security

HackSoft Django styleguide: services and selectors. HackSoft’s Django styleguide uses the terms “services” and “selectors”. Services are functions that live in services.py and perform business logic operations such as creating new entities that might span multiple Django models. Selectors live in selectors.py and perform more complex database read operations, such as returning objects in a way that respects visibility permissions.

# 24th May 2021, 7:17 pm / django

How to look at the stack with gdb. Useful short tutorial on gdb from first principles.

# 24th May 2021, 6:23 pm / c, debugger, julia-evans

Flat Data. New project from the GitHub OCTO (the Office of the CTO, love that backronym) somewhat inspired by my work on Git scraping: I’m really excited to see GitHub embracing git for CSV/JSON data in this way. Flat incorporates a reusable Action for scraping and storing data (using Deno), a VS Code extension for setting up those workflows and a very nicely designed Flat Viewer web app for browsing CSV and JSON data hosted on GitHub.

# 19th May 2021, 1:05 am / github, deno, git-scraping

No feigning surprise (via) Don’t feign surprise if someone doesn’t know something that you think they should know. Even better: even if you are surprised, don’t let them know! “When people feign surprise, it’s usually to make them feel better about themselves and others feel worse.”

# 17th May 2021, 4:30 pm / communication, teaching, julia-evans

geocode-sqlite. Neat command-line Python utility by Chris Amico: point it at a SQLite database file and it will add latitude and longitude columns and populate them by geocoding one or more of the other fields, using your choice from four currently supported geocoders.

# 17th May 2021, 1:15 am / geocoding, sqlite, chris-amico
