Simon Willison’s Weblog

Subscribe
Atom feed

Blogmarks

Filters: Sorted by date

sqlite-transform. I released a new CLI tool today: sqlite-transform, which lets you run “transformations” against a SQLite database. I built it out of frustration of constantly running into CSV files that use horrible American date formatting—the “sqlite-transform parsedatetime my.db mytable col1” command runs dateutil’s parser against those columns and replaces them with a nice, sortable ISO formatted timestamp. I’ve also added a “sqlite-transform lambda” command that lets you specify Python code directly on the command-line that should be used to transform every value in a specified column.

# 4th November 2019, 2:41 am / cli, projects, sqlite

Why you should use `python -m pip` (via) Brett Cannon explains why he prefers “python -m pip install...” to “pip install...”—it ensures you always know exactly which Python interpreter environment you are installing packages for. He also makes the case for always installing into a virtual environment, created using “python -m venv”.

# 2nd November 2019, 4:41 pm / packaging, python, brett-cannon

Calling C functions from BigQuery with web assembly (via) Google BigQuery lets you define custom SQL functions in JavaScript, and it turns out they expose the WebAssembly.instantiate family of APIs. Which means you can write your UDD in C or Rust, compile it to WebAssembly and run it as part of your query!

# 27th October 2019, 5:55 am / c, sql, rust, webassembly

Azure Readiness Checklist (via) I love a good comprehensive checklist. This one is focused on large projects running on Azure but it’s still fun to browse through if you are hosting elsewhere, mainly as a reminder of quite how much still goes into deploying large web services into production.

# 26th October 2019, 8:32 pm / azure, checklists, deployment

kepler.gl. Uber built this open source geospatial analysis tool for large-scale data sets, and they offer it as a free hosted online tool—just click Get Started on the site. I uploaded two CSV files with 30,000+ latitude/longitude points in them just now and used Kepler to render them as images.

# 25th October 2019, 4:16 am / geospatial, gis, visualization

Thematic map—GIS Wiki. This is a really useful wiki full of GIS information, and the coverage of different types of thematic maps is particularly thorough.

# 21st October 2019, 2:25 am / gis, visualization, wiki

Setting up Datasette, step by step (via) Tobias describes how he runs Datasette on his own server/VPS, using nginx and systemd. I’m doing something similar for some projects and systemd really does feel like the solution to the “ensure a Python process keeps running” problem I’ve been fighting for over a decade. I really like how Tobias creates a dedicated Linux user for each of his deployed Python projects.

# 21st October 2019, 2:20 am / sysadmin, datasette

2018 Central Park Squirrel Census in Datasette (via) The Squirrel Census project released their data! 3,000 squirrel observations in Central Park, each with fur color and latitude and longitude and behavioral observations. I love this data so much. I’ve loaded it into a Datasette running on Glitch.

# 16th October 2019, 6:01 pm / squirrels, datasette

μPlot (via) “An exceptionally fast, tiny time series chart. [...] from a cold start it can create an interactive chart containing 150,000 data points in 40ms. [...] at < 10 KB, it’s likely the smallest and fastest time series plotter that doesn’t make use of WebGL shaders or WASM”

# 14th October 2019, 11:03 pm / canvas, charting, graphing, javascript

goodreads-to-sqlite (via) This is so cool! Tobias Kunze built a Python CLI tool to import your Goodreads data into a SQLite database, inspired by github-to-sqlite and my various other Dogsheep tools. It’s the first Dogsheep style tool I’ve seen that wasn’t built by me—and Tobias’ write-up includes some neat examples of queries you can run against your Goodreads data. I’ve now started using Goodreads and I’m importing my books into my own private Dogsheep Datasette instance.

# 14th October 2019, 4:07 am / books, cli, sqlite, datasette, dogsheep

SQL Murder Mystery in Datasette (via) “A crime has taken place and the detective needs your help. The detective gave you the  crime scene report, but you somehow lost it. You vaguely remember that the crime  was a murder that occurred sometime on ​Jan.15, 2018 and that it took place in SQL  City. Start by retrieving the corresponding crime scene report from the police  department’s database.”—Really fun game to help exercise your skills with SQL by the NU Knight Lab. I loaded their SQLite database into Datasette so you can play in your browser.

# 7th October 2019, 11:37 pm / projects, sql, sqlite, datasette

twitter-to-sqlite 0.6, with track and follow. I shipped a new release of my twitter-to-sqlite command-line tool this evening. It now includes experimental features for subscribing to the Twitter streaming API: you can track keywords or follow users and matching Tweets will be written to a SQLite database in real-time as they come in through the API. Since Datasette supports mutable databases now you can run Datasette against the database and run queries against the tweets as they are inserted into the tables.

# 6th October 2019, 4:54 am / projects, realtime, twitter, dogsheep

Streamlit: Turn Python Scripts into Beautiful ML Tools (via) A really interesting new tool / application development framework. Streamlit is designed to help machine learning engineers build usable web frontends for their work. It does this by providing a simple, productive Python environment which lets you declaratively build up a sort-of Notebook style interface for your code. It includes the ability to insert a DataFrame, geospatial map rendering, chart or image into the application with a single Python function call. It’s hard to describe how it works, but the tutorial and demo worked really well for me: “pip install streamlit” and then “streamlit hello” to get a full-featured demo in a browser, then you can run through the tutorial to start building a real interactive application in a few dozen lines of code.

# 6th October 2019, 3:52 am / python

Get your own Pocket OAuth token (via) I hate it when APIs make you jump through extensive hoops just to get an access token for pulling data directly from your own personal account. I’ve been playing with the Pocket API today and it has a pretty complex OAuth flow, so I built a tiny Flask app on Glitch which helps go through the steps to get an API token for your own personal Pocket account.

# 5th October 2019, 9:56 pm / glitch, mozilla, oauth

Client-Side Certificate Authentication with nginx. I’m intrigued by client-side browser certificates, which allow you to lock down a website such that only browsers with a specific certificate installed can access them. They work on both laptops and mobile phones. I followed the steps in this tutorial and managed to get an nginx instance running which only allows connections from my personal laptop and iPhone.

# 5th October 2019, 5:26 pm / certificates, nginx, security, dogsheep

NGINX: Authentication Based on Subrequest Result (via) TIL about this neat feature of NGINX: you can use the auth_request directive to cause NGINX to make an HTTP subrequest to a separate authentication server for each incoming HTTP request. The authentication server can see the cookies on the incoming request and tell NGINX if it should fulfill the parent request (via a 2xx status code) or if it should be denied (by returning a 401 or 403). This means you can run NGINX as an authenticating proxy in front of any HTTP application and roll your own custom authentication code as a simple webhook-recieving endpoint.

# 4th October 2019, 3:36 pm / authentication, nginx, webhooks

SQL queries don’t start with SELECT. This is really useful. Understanding that SELECT (and associated window functions) happen after the WHERE, GROUP BY and HAVING helps explain why you can’t filter a query based on the results of a window function for example.

# 3rd October 2019, 8:56 pm / sql, julia-evans

Looking back at the Snowden revelations (via) Six years on from the Snowden revelations, crypto researcher Matthew Green reviews their impact and reminds us what we learned. Really interesting.

# 25th September 2019, 5:48 am / cryptography, security

The Distribution of Users’ Computer Skills: Worse Than You Think (via) Research from 2016: “Across 33 rich countries, only 5% of the population has high computer-related abilities, and only a third of people can complete medium-complexity tasks”

# 23rd September 2019, 2:49 pm / jakob-nielsen, usability

genome-to-sqlite. I just found out 23andMe let you export your genome as a zipped TSV file, so I wrote a little Python command-line tool to import it into a SQLite database.

# 19th September 2019, 3:58 pm / genetics, projects, sqlite, datasette

Evolving “nofollow” – new ways to identify the nature of links (via) Slightly confusing announcement from Google: they’re introducing rel=ugc and rel=sponsored in addition to rel=nofollow, and will be treating all three values as “hints” for their indexing system. They’re very unclear as to what the concrete effects of these hints will be, presumably because they will become part of the secret sauce of their ranking algorithm.

# 10th September 2019, 9:16 pm / google, nofollow, seo

sqlite-utils 1.11. Amjith Ramanujam contributed an excellent new feature to sqlite-utils, which I’ve now released as part of version 1.11. Previously you could enable SQLite full-text-search on a table using the .enable_fts() method (or the “sqlite-utils enable-fts” CLI command) but it wouldn’t reflect future changes to the table—you had to use populate_fts() any time you inserted new records. Thanks to Amjith you can now pass create_triggers=True (or --create-triggers) to cause sqlite-utils to automatically add triggers that keeps the FTS index up-to-date any time a row is inserted, updated or deleted from the table.

# 3rd September 2019, 1:05 am / cli, full-text-search, projects, sqlite, sqlite-utils

Subsume JSON a.k.a. JSON ⊂ ECMAScript (via) TIL that JSON isn’t a subset of ECMAScript after all! “In ES2018, ECMAScript string literals couldn’t contain unescaped U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR characters, because they are considered to be line terminators even in that context.”

# 15th August 2019, 10:30 am / json, v8

OPP (Other People’s Problems) (via) Camille Fournier provides a comprehensive guide to picking your battles: in a large organization how can you navigate the enormous array of problems you can see that you’d like to fix, especially when so many of those problems aren’t directly in your area of control?

# 7th August 2019, 1:58 pm / management, camillefournier

Optimizing for the mobile web: Moving from Angular to Preact. Grubhub reduced their mobile web load times from 9-11s to 3-4s by replacing Angular with Preact (and replacing other libraries such as lodash with native JavaScript code). The conversion took 6 months and involved running Angular and Preact simultaneously during the transition—not a huge additional overhead as Preact itself is only 4KB. They used TypeScript throughout and credit it with providing a great deal of confidence and productivity to the overall refactoring.

# 5th August 2019, 12:26 pm / javascript, refactoring, web-performance

Working with many-to-many relationships in sqlite-utils (via) I just released sqlite-utils 1.9 with syntactic sugar support for creating many-to-many relationships for records stored in SQLite databases.

# 4th August 2019, 3:57 am / projects, sqlite, sqlite-utils

Logs vs. metrics: a false dichotomy (via) Nick Stenning discusses the differences between logs and metrics: most notably that metrics can be derived from logs but logs cannot be reconstituted starting with time-series metrics.

# 3rd August 2019, 4:46 pm / logging, observability

PyPI now supports uploading via API token (via) All of my open source Python libraries are set up to automatically deploy new tagged releases as PyPI packages using Circle CI or Travis, but I’ve always get a bit uncomfortable about sharing my PyPI password with those CI platforms to get this to work. PyPI just added scopes authentication tokens, which means I can issue a token that’s only allowed to upload a specific project and see an audit log of when that token was last used.

# 1st August 2019, 4:03 pm / pypi, python

Repository driven development (via) I’m already a big fan of keeping documentation and code in the same repo so you can update them both from within the same code review, but this takes it even further: in repository driven development every aspect of the code and configuration needed to define, document, test and ship a service live in the service repository—all the way down to the configurations for reporting dashboards. This sounds like heaven.

# 24th July 2019, 8:41 am / git

Using memory-profiler to debug excessive memory usage in healthkit-to-sqlite. This morning I figured out how to use the memory-profiler module (and mprof command line tool) to debug memory usage of Python processes. I added the details, including screenshots, to this GitHub issue. It helped me knock down RAM usage for my healthkit-to-sqlite from 2.5GB to just 80MB by making smarter usage of the ElementTree pull parser.

# 24th July 2019, 8:25 am / elementtree, memory, profiling, python, xml

Years

Tags