Simon Willison’s Weblog

83 items tagged “postgresql”

2022

Glue code to quickly copy data from one Postgres table to another (via) The Python script that Retool used to migrate 4TB of data between two PostgreSQL databases. I find the structure of this script really interesting—it uses Python to spin up a queue full of ID ranges to be transferred and then starts some threads, but then each thread shells out to a command that runs “psql COPY (SELECT ...) TO STDOUT” and pipes the result to “psql COPY xxx FROM STDIN”. Clearly this works really well (“saturate the database’s hardware capacity” according to a comment on HN), and neatly sidesteps any issues with Python’s GIL. # 19th April 2022, 4:57 pm

Postgres Auditing in 150 lines of SQL (via) I’ve run up against the problem of tracking changes made to rows within a database table so many times, and I still don’t have a preferred solution. This approach to it looks very neat: it uses PostgreSQL triggers to populate a single audit table (as opposed to one audit table per tracked table) and records the previous and current column values for the row using jsonb. # 9th March 2022, 7:19 pm

migra (via) This looks like a very handy tool to have around: run “migra postgresql:///a postgresql:///b” and it will detect and output the SQL alter statements needed to modify the first PostgreSQL database schema to match the second. It’s written in Python, running on top of SQLAlchemy. # 26th February 2022, 11:23 pm

Single dependency stacks (via) Brandur Leach notes that the core services at Crunchy (admittedly a PostgreSQL hosting and consultancy company) have only one stateful dependency – Postgres. No Redis, ElasticSearch or anything else. This means that problems like rate limiting and search, which are often farmed out to external services, are all handled using either PostgreSQL or in-memory mechanisms on their servers. # 9th February 2022, 6:43 pm

Tricking Postgres into using an insane – but 200x faster – query plan. Jacob Martin talks through a PostgreSQL query optimization they implemented at Spacelift, showing in detail how to interpret the results of EXPLAIN (FORMAT JSON, ANALYZE) using the explain.dalibo.com visualization tool. # 18th January 2022, 8:53 pm

2021

Transactionally Staged Job Drains in Postgres. Any time I see people argue that relational databases shouldn’t be used to implement job queues I think of this post by Brandur from 2017. If you write to a queue before committing a transaction you run the risk of a queue consumer trying to read from the database before the new row becomes visible. If you write to the queue after the transaction there’s a risk an error might result in your message never being written. So: write to a relational staging table as part of the transaction, then have a separate process read from that table and write to the queue. # 18th December 2021, 1:34 am

Per-project PostgreSQL (via) Jamey Sharp describes an ingenious way of setting up PostgreSQL instances for each of your local development project, without depending on an always-running shared localhost database server. The trick is a shell script which creates a PGDATA folder in the current folder and then instantiates a PostgreSQL server in --single single user mode which listens on a Unix domain socket in that folder, instead of listening on the network. Jamey then uses direnv to automatically configure that PostgreSQL, initializing the DB if necessary, for each of his project folders. # 3rd September 2021, 3:06 am

Django SQL Dashboard 1.0

Earlier this week I released Django SQL Dashboard 1.0. I also just released 1.0.1, with a bug fix for PostgreSQL 10 contributed by Ryan Cheley.

[... 629 words]

PostgreSQL: nbtree/README (via) The PostgreSQL source tree includes beatifully written README files for different parts of PostgreSQL. Here’s the README for their btree implementation—it continues to be actively maintained (last change was is March) and “git blame” shows that parts of the file date back 25 years, to 1996! # 25th June 2021, 6:09 pm

Hierarchical Structures in PostgreSQL (via) Two techniques I hadn’t seen before: the first is to define a materialized view using a CTE that offers efficient tree queries against a PostgreSQL array of path components (plus a trigger to update the materialized view), the second is with the PostgreSQL ltree extension which ships as part of PostgreSQL and hence should be widely available. # 25th June 2021, 5:19 pm

Multi-region PostgreSQL on Fly (via) Really interesting piece of architectural design from Fly here. Fly can run your application (as a Docker container run using Firecracker) in multiple regions around the world, and they’ve now quietly added PostgreSQL multi-region support. The way it works is that all-but-one region can have a read-only replica, and requests sent to application servers can perform read-only queries against their local region’s replica. If a request needs to execute a SQL update your application code can return a “fly-replay: region=scl” HTTP header and the Fly CDN will transparently replay the request against the region containing the leader database. This also means you can implement tricks like setting a 10s expiring cookie every time the user performs a write, such that their requests in the next 10s will go straight to the leader and avoid them experiencing any replication lag that hasn’t caught up with their latest update. # 17th June 2021, 6:39 pm

explain.dalibo.com (via) By far the best tool I’ve seen for turning the output of PostgreSQL EXPLAIN ANALYZE into something I can actually understand—produces a tree visualization which includes clear explanations of what each step (such as a “Index Only Scan Node”) actually means. # 28th May 2021, 5:41 pm

Django SQL Dashboard

I’ve released the first non-alpha version of Django SQL Dashboard, which provides an interface for running arbitrary read-only SQL queries directly against a PostgreSQL database, protected by the Django authentication scheme. It can also be used to create saved dashboards that can be published or shared internally.

[... 2171 words]

Practical SQL for Data Analysis (via) This is a really great SQL tutorial: it starts with the basics, but quickly moves on to a whole array of advanced PostgreSQL techniques—CTEs, window functions, efficient sampling, rollups, pivot tables and even linear regressions executed directly in the database using regr_slope(), regr_intercept() and regr_r2(). I picked up a whole bunch of tips for things I didn’t know you could do with PostgreSQL here. # 4th May 2021, 3:11 am

Django admin customization, JSON in our PostgreSQL

My progress slowed a bit today as I started digging into some things I’m less familiar with—but I’ve found some tricks that I think will help us out a lot.

[... 1089 words]

Fuzzy Name Matching in Postgres. Paul Ramsey describes how to implement fuzzy name matching in PostgreSQL using the fuzzystrmatch extension and its levenshtein() and soundex() functions, plus functional indexes to query against indexed soundex first and then apply slower Levenshtein. The same tricks should also work against SQLite using the datasette-jellyfish plugin. # 22nd February 2021, 9:16 pm

Cleaning Up Your Postgres Database (via) Craig Kerstiens provides some invaluable tips on running an initial check of the health of a PostgreSQL database, by using queries against the pg_statio_user_indexes table to find the memory cache hit ratio and the pg_stat_user_tables table to see what percentage of queries to your tables are using an index. # 3rd February 2021, 7:32 am

2020

DuckDB (via) This is a really interesting, relatively new database. It’s kind of a weird hybrid between SQLite and PostgreSQL: it uses the PostgreSQL parser but models itself after SQLite in that databases are a single file and the code is designed for use as an embedded library, distributed in a single amalgamation C++ file (SQLite uses a C amalgamation). It features a “columnar-vectorized query execution engine” inspired by MonetDB (also by the DuckDB authors) and is hence designed to run analytical queries really quickly. You can install it using “pip install duckdb”—the resulting module feels similar to Python’s sqlite3, and follows roughly the same DBAPI pattern. # 19th September 2020, 11:43 pm

Some SQL Tricks of an Application DBA (via) This post taught me so many PostgreSQL tricks that I hadn’t seen before. Did you know you can start a transaction, drop an index, run explain and then rollback the transaction (cancelling the index drop) to see what explain would look like without that index? Among other things I also learned what the “correlation” database statistic does: it’s a measure of how close-to-sorted the values in a specific column are, which helps PostgreSQL decide if it should do an index scan or a bitmap scan when making use of an index. # 29th July 2020, 7:04 pm

PostgreSQL full-text search in the Django Admin. Today I figured out how to use PostgreSQL full-text search in the Django admin for my blog, using the get_search_results method on a subclass of ModelAdmin. # 25th July 2020, 11:05 pm

Get Started—Materialize. Materialize is a really interesting new database—“a streaming SQL materialized view engine”. It builds materialized views on top of streaming data sources (such as Kafka)—you define the view using a SQL query, then it figures out how to keep that view up-to-date automatically as new data streams in. It speaks the PostgreSQL protocol so you can talk to it using the psql tool or any PostgreSQL client library. The “get started” guide is particularly impressive: it uses a curl stream of the Wikipedia recent changes API, parsed using a regular expression. And it’s written in Rust, so installing it is as easy as downloading and executing a single binary (though I used Homebrew). # 1st June 2020, 10:11 pm

PostGraphile: Production Considerations. PostGraphile is a tool for building a GraphQL API on top of an existing PostgreSQL schema. Their “production considerations” documentation is particularly interesting because it directly addresses some of my biggest worries about GraphQL: the potential for someone to craft an expensive query that ties up server resources. PostGraphile suggests a number of techniques for avoiding this, including a statement timeout, a query allowlist, pagination caps and (in their “pro” version) a cost limit that uses a calculated cost score for the query. # 27th March 2020, 1:22 am

Generated Columns in SQLite (via) SQLite 3.31.0 released today, and generated columns are the single most notable new feature. PostgreSQL 12 added these in October 2019, and MySQL has had them since 5.7 in October 2015. MySQL and SQLite both offer either “stored” or “virtual” generated columns, with virtual columns being calculated at runtime. PostgreSQL currently only supports stored columns. # 24th January 2020, 4:20 am

2019

db-to-sqlite 1.0 release. I’ve released version 1.0 of my db-to-sqlite tool, which lets you create a SQLite database copy of any database supported by SQLAlchemy (I’ve tested it against MySQL and PostgreSQL). The tool has a bunch of new features: you can use --redact to redact specific columns, specify --table multiple times to copy a subset of tables, and the --all option now efficiently adds all foreign keys at the end of the import. The project now has unit tests which run against MySQL and PostgreSQL in Travis CI. Also included in the README: a shell one-liner for creating a local SQLite copy of a remote Heroku Postgres database based on extracting the connection string from a Heroku config environment variable. # 1st July 2019, 1:35 am

How to Create an Index in Django Without Downtime (via) Excellent advanced tutorial on Django migrations, which uses a desire to create indexes in PostgreSQL without locking the table (with CREATE INDEX CONCURRENTLY) to explain the SeparateDatabaseAndState and atomic features of Django’s migration framework. # 11th April 2019, 3:06 pm

zson (via) “ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents [...] In some cases ZSON can save half of your disk space and give you about 10% more TPS.” # 2nd April 2019, 9:26 pm

django-zombodb (via) The hardest part of working with an external search engine like Elasticsearch is always keeping that index synchronized with your relational database. ZomboDB is a PostgreSQL extension which lets you create a new type of index backed by an external Elasticsearch cluster. Updated rows will be pushed to the index automatically, and custom SQL syntax can then be used to execute searches. django-zombodb is a brand new library by Flávio Juvenal which integrates ZomboDB directly into the Django ORM, letting you add Elasticsearch-backed functionality with just a few lines of extra configuration. It even includes custom Django migrations for enabling the extension in PostgreSQL! # 13th February 2019, 10:14 pm

2018

Optimizing Django Admin Paginator. The Django admin paginator uses a count(*) to calculate the total number of rows, so it knows how many pages to display. This makes it unpleasantly slow over large datasets. Haki Benita has an ingenious solution: drop in a custom paginator which uses the PostgreSQL “SET LOCAL statement_timeout TO 200” statement first, then if a timeout error is raised returns 9999999999 as the count instead. This means small tables get accurate page counts and giant tables load display in the admin within a reasonable time period. # 6th November 2018, 6:17 pm

Experiences with running PostgreSQL on Kubernetes (via) Fascinating interview that makes a solid argument for the idea that running stateful data stores like PostgreSQL or Cassandra is made harder, not easier when you add an orchestration tool like Kubernetes into the mix. # 13th August 2018, 2:30 pm

mycli. Really neat auto-complete enabled MySQL terminal client, built using the excellent python-prompt-toolkit. Has a sister-project for PostgreSQL called pgcli. # 11th June 2018, 7:08 pm