Simon Willison’s Weblog

Friday, 25th June 2021

A Datasette tutorial in Portuguese. Nicolás Linares put together this Datasette tutorial in Portuguese, including an explanation of the project, how to get it up and running on a laptop, how to use it to explore and facet data, how to use plugins (including datasette-vega and datasette-cluster-map) and how to publish data using Vercel. I ran this through Google Translate and I can confirm that it’s a really well constructed tutorial—fantastic to see material like this starting to emerge in languages other than English. # 10:57 pm

Querying Parquet using DuckDB (via) DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files—e.g. “SELECT COUNT(*) FROM ’taxi_2019_04.parquet’”. Performance against large files is fantastic, and the whole thing can be installed just using “pip install duckdb”. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this? # 10:40 pm

PostgreSQL: nbtree/README (via) The PostgreSQL source tree includes beatifully written README files for different parts of PostgreSQL. Here’s the README for their btree implementation—it continues to be actively maintained (last change was is March) and “git blame” shows that parts of the file date back 25 years, to 1996! # 6:09 pm

Hierarchical Structures in PostgreSQL (via) Two techniques I hadn’t seen before: the first is to define a materialized view using a CTE that offers efficient tree queries against a PostgreSQL array of path components (plus a trigger to update the materialized view), the second is with the PostgreSQL ltree extension which ships as part of PostgreSQL and hence should be widely available. # 5:19 pm

Notes on streaming large API responses

I started a Twitter conversation last week about API endpoints that stream large amounts of data as an alternative to APIs that return 100 results at a time and require clients to paginate through all of the pages in order to retrieve all of the data:

[... 1691 words]

2021 » June