Simon Willison’s Weblog

Subscribe
Atom feed

Blogmarks

Filters: Sorted by date

Querying Parquet using DuckDB (via) DuckDB is a relatively new SQLite-style database (released as an embeddable library) with a focus on analytical queries. This tutorial really made the benefits click for me: it ships with support for the Parquet columnar data format, and you can use it to execute SQL queries directly against Parquet files—e.g. “SELECT COUNT(*) FROM ’taxi_2019_04.parquet’”. Performance against large files is fantastic, and the whole thing can be installed just using “pip install duckdb”. I wonder if faceting-style group/count queries (pretty expensive with regular RDBMSs) could be sped up with this?

# 25th June 2021, 10:40 pm / python, parquet, duckdb

PostgreSQL: nbtree/README (via) The PostgreSQL source tree includes beatifully written README files for different parts of PostgreSQL. Here’s the README for their btree implementation—it continues to be actively maintained (last change was is March) and “git blame” shows that parts of the file date back 25 years, to 1996!

# 25th June 2021, 6:09 pm / computer-science, databases, postgresql

Hierarchical Structures in PostgreSQL (via) Two techniques I hadn’t seen before: the first is to define a materialized view using a CTE that offers efficient tree queries against a PostgreSQL array of path components (plus a trigger to update the materialized view), the second is with the PostgreSQL ltree extension which ships as part of PostgreSQL and hence should be widely available.

# 25th June 2021, 5:19 pm / postgresql, sql

Django for Startup Founders: A better software architecture for SaaS startups and consumer apps (via) The opening section of this article has very little to do with Django: it’s an insightful description of the technical challenges faced by a startup that is still seeking product-market fit. Alex then extends that into his own architectural recommendations for startups building with Django to help waste as little time as possible on problems that aren’t core to the product they are building.

# 24th June 2021, 8:43 pm / django, startups

A framework for building Open Graph images. GitHub’s new social preview images are generated by a Node.js script that fetches data from their GraphQL API, generates an HTML version of the card and then grabs a PNG snapshot of it using Puppeteer. It takes an average of 280ms to serve an image and generates around 2 million unique images a day. Interestingly, they found that bumping the available RAM from 512MB up to 513MB had a big effect on performance, because Chromium detects devices on 512MB or less and switches some processes from parallel to sequential.

# 22nd June 2021, 9:25 pm / github, nodejs, puppeteer

What I’ve learned about data recently (via) Laurie Voss talks about the structure of data teams, based on his experience at npm and more recently Netlify. He suggests that Airflow and dbt are the data world’s equivalent of frameworks like Rails: opinionated tools that solve core problems and which mean that you can now hire people who understand how your data pipelines work on their first day on the job.

# 22nd June 2021, 5:09 pm / data, big-data, data-science, laurie-voss

GitLab Culture: The phases of remote adaptation. GitLab claim to be “the world’s largest all-remote company”—1300 employees across 65 countries, with not a single physical office. Lots of interesting thinking in this article about different phases a company can go through to become truly remote-first. “Maximally efficient remote environments will do as little work as possible synchronously, instead focusing the valuable moments where two or more people are online at the same time on informal communication and bonding.” They also expire their Slack messages after 90 days to force critical project information into documents and issue threads.

# 22nd June 2021, 12:37 am / management, remote, gitlab

Multi-region PostgreSQL on Fly (via) Really interesting piece of architectural design from Fly here. Fly can run your application (as a Docker container run using Firecracker) in multiple regions around the world, and they’ve now quietly added PostgreSQL multi-region support. The way it works is that all-but-one region can have a read-only replica, and requests sent to application servers can perform read-only queries against their local region’s replica. If a request needs to execute a SQL update your application code can return a “fly-replay: region=scl” HTTP header and the Fly CDN will transparently replay the request against the region containing the leader database. This also means you can implement tricks like setting a 10s expiring cookie every time the user performs a write, such that their requests in the next 10s will go straight to the leader and avoid them experiencing any replication lag that hasn’t caught up with their latest update.

# 17th June 2021, 6:39 pm / postgresql, replication, scaling, fly

Best Practices Around Production Ready Web Apps with Docker Compose (via) I asked on Twitter for some tips on Docker Compose and was pointed to this article by Nick Janetakis, which has a whole host of useful tips and patterns I hadn’t encountered before.

# 12th June 2021, 2:36 am / docker

I saw millions compromise their Facebook accounts to fuel fake engagement. Sophie Zhang, ex-Facebook, describes how millions of Facebook users have signed up for “autolikers”—programs that promise likes and engagement for their posts, in exchange for access to their accounts which are then combined into the larger bot farm and used to provide likes to other posts. “Self-compromise was a widespread problem, and possibly the largest single source of existing inauthentic activity on Facebook during my time there. While actual fake accounts can be banned, Facebook is unwilling to disable the accounts of real users who share their accounts with a bot farm.”

# 9th June 2021, 3:40 pm / facebook, social-media

An incomplete list of skills senior engineers need, beyond coding. By Camille Fournier, author of my favourite book on engineering management “The Manager’s Path”. Number one is “How to run a meeting, and no, being the person who talks the most in the meeting is not the same thing as running it”.

# 6th June 2021, 10:17 pm / careers, management, camillefournier

Apple’s tightly controlled App Store is teeming with scams. I’m quoted in an article in the Washington Post today (linked at the top of the homepage!) explaining how I got scammed on the App Store and spent $19 on a TV remote app with a similar name to the official Samsung app. I mistakenly assumed that the App Store review process wouldn’t allow an app called “Smart Things” to show up in search when I was looking for SmartThings, the official name—and assumed that Samsung were nickel-and-diming their customers rather than expecting the App Store review process to have failed so obviously.

# 6th June 2021, 10:13 pm / appstore, scams, washington-post, press-quotes

The humble hash aggregate (via) Today I learned that “hash aggregate” is the name for the algorithm where you split a list of tuples on a common key, run an aggregation against each resulting group and combine the results back together again—I’d previously thought if this in terms of map/reduce but hash aggregate is a much older term used widely by SQL engines—I’ve seen it come up in PostgreSQL explain query output (for GROUP BY) before but didn’t know what it meant.

# 6th June 2021, 4:03 pm / algorithms, mapreduce, sql

Reflected cross-site scripting issue in Datasette (via) Here’s the GitHub security advisory I published for the XSS hole in Datasette. The fix is available in versions 0.57 and 0.56.1, both released today.

# 5th June 2021, 11:14 pm / security, xss, datasette

Datasette 0.57. Released today, Datasette 0.57 has new options for controlling which columns are visible on a table page, a way to show more than the default 30 facet results, a whole bunch of smaller improvements and a fix for a severe cross-site scripting security vulnerability.

# 5th June 2021, 11:12 pm / projects, datasette

explain.dalibo.com (via) By far the best tool I’ve seen for turning the output of PostgreSQL EXPLAIN ANALYZE into something I can actually understand—produces a tree visualization which includes clear explanations of what each step (such as a “Index Only Scan Node”) actually means.

# 28th May 2021, 5:41 pm / postgresql

M1RACLES: M1ssing Register Access Controls Leak EL0 State. You need to read (or at least scan) all the way to the bottom: this security disclosure is a masterpiece. It not only describes a real flaw in the M1 silicon but also deconstructs the whole culture of over-hyped name-branded vulnerability reports. The TLDR is that you don’t really need to worry about this one, and if you’re writing this kind if thing up for a news article you should read all the way to the end first!

# 26th May 2021, 3:25 pm / journalism, security

HackSoft Django styleguide: services and selectors. HackSoft’s Django styleguide uses the terms “services” and “selectors”. Services are functions that live in services.py and perform business logic operations such as creating new entities that might span multiple Django models. Selectors live in selectors.py and perform more complex database read operations, such as returning objects in a way that respects visibility permissions.

# 24th May 2021, 7:17 pm / django

How to look at the stack with gdb. Useful short tutorial on gdb from first principles.

# 24th May 2021, 6:23 pm / c, debugger, julia-evans

Flat Data. New project from the GitHub OCTO (the Office of the CTO, love that backronym) somewhat inspired by my work on Git scraping: I’m really excited to see GitHub embracing git for CSV/JSON data in this way. Flat incorporates a reusable Action for scraping and storing data (using Deno), a VS Code extension for setting up those workflows and a very nicely designed Flat Viewer web app for browsing CSV and JSON data hosted on GitHub.

# 19th May 2021, 1:05 am / github, deno, git-scraping

No feigning surprise (via) Don’t feign surprise if someone doesn’t know something that you think they should know. Even better: even if you are surprised, don’t let them know! “When people feign surprise, it’s usually to make them feel better about themselves and others feel worse.”

# 17th May 2021, 4:30 pm / communication, teaching, julia-evans

geocode-sqlite. Neat command-line Python utility by Chris Amico: point it at a SQLite database file and it will add latitude and longitude columns and populate them by geocoding one or more of the other fields, using your choice from four currently supported geocoders.

# 17th May 2021, 1:15 am / geocoding, sqlite, chris-amico

Powering the Python Package Index in 2021. PyPI now serves “nearly 900 terabytes over more than 2 billion requests per day”. Bandwidth is donated by Fastly, a value estimated at 1.8 million dollars per month! Lots more detail about how PyPI has evolved over the past years in this post by Dustin Ingram.

# 14th May 2021, 4:50 am / pypi, python, fastly, dustin-ingram

New Major Versions Released! Flask 2.0, Werkzeug 2.0, Jinja 3.0, Click 8.0, ItsDangerous 2.0, and MarkupSafe 2.0. Huge set of releases from the Pallets team. Python 3.6+ required and comprehensive type annotations. Flask now supports async views, Jinja async templates (used extensively by Datasette) “no longer requires patching”, Click has a bunch of new code around shell tab completion, ItsDangerous supports key rotation and so much more.

# 12th May 2021, 5:37 pm / async, flask, jinja, python

A museum bot (via) Shawn Graham built a Twitter bot, using R, which tweets out random items from the collection at the Canadian Science and Technology Museum—using a Datasette instance that he’s running based on a CSV export of their collections data.

# 5th May 2021, 7:09 pm / museums, twitter, datasette

cinder: Instagram’s performance oriented fork of CPython (via) Instagram forked CPython to add some performance-oriented features they wanted, including a method-at-a-time JIT compiler and a mechanism for eagerly evaluating coroutines (avoiding the overhead of creating a coroutine if an awaited function returns a value without itself needing to await). They’re open sourcing the code to help start conversations about implementing some of these features in CPython itself. I particularly enjoyed the warning that accompanies the repo: this is not intended to be a supported release, and if you decide to run it in production you are on your own!

# 4th May 2021, 10:13 pm / open-source, python

Plot & Vega-Lite. Useful documentation comparing the brand new Observable Plot to Vega-Lite, complete with examples of how to achieve the same thing in both libraries.

# 4th May 2021, 4:32 pm / visualization, d3, observable, observable-plot

Observable Plot (via) This is huge: a brand new high-level JavaScript visualization library from Mike Bostock, the author of D3—partially inspired by Vega-Lite which I’ve used enthusiastically in the past. First impressions are that this is a big step forward for quickly building high-quality visualizations. It’s released under the ISC license which is “functionally equivalent to the BSD 2-Clause and MIT licenses”.

# 4th May 2021, 4:28 pm / open-source, visualization, d3, observable, mike-bostock, observable-plot

Practical SQL for Data Analysis (via) This is a really great SQL tutorial: it starts with the basics, but quickly moves on to a whole array of advanced PostgreSQL techniques - CTEs, window functions, efficient sampling, rollups, pivot tables and even linear regressions executed directly in the database using regr_slope(), regr_intercept() and regr_r2(). I picked up a whole bunch of tips for things I didn't know you could do with PostgreSQL here.

# 4th May 2021, 3:11 am / postgresql, sql, haki-benita

Hosting SQLite databases on Github Pages (via) I've seen the trick of running SQLite compiled to WASM in the browser before, but here it comes with an incredibly clever bonus trick: it uses SQLite's page structure to fetch subsets of the database file via HTTP range requests, which means you can run indexed SQL queries against a 600MB database file while only fetching a few MBs of data over the wire. Absolutely brilliant. Tucked away at the end of the post is another neat trick: making the browser DOM available to SQLite as a virtual table, so you can query and update the DOM of the current page using SQL!

# 2nd May 2021, 6:55 pm / sqlite, webassembly, http-range-requests

Years

Tags