Simon Willison’s Weblog

Subscribe

Items in Jun, 2018

Filters: Year: 2018 × Month: Jun × Sorted by date


datasette-vega (via) I wrote a visualization plugin for Datasette that uses the excellent Vega “visualization grammar” library to provide bar, line and scatter charts configurable against any Datasette table or SQL query. # 29th June 2018, 3 pm

Migrating Messenger storage to optimize performance (via) Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems). # 27th June 2018, 3:05 pm

mkcert (via) Handy new tool from Filippo Valsorda (a cryptographer at Google) for easily generating TLS certificates for your local development environment. You can use this to get a certificate pair for a localhost web server created with a couple of simple commands. # 26th June 2018, 6:55 pm

ActorDB. Distributed SQL database written in Erlang built on top of SQLite (on top of LMDB), adding replication using the raft consensus algorithm (so sharded with no single-points of failure) and a MySQL protocol interface. Interesting combination of technologies. # 24th June 2018, 9:48 pm

source-map-explorer. Very neat tool for creating a tree map visualization of the size of the components of a bundled JavaScript file created by webpack (or if you’re using create-react-app by “npm run build”). I ran this using “npx source-map-explorer build/static/js/main.d63f3f34.js” (since I don’t like using “npm install -g”). # 24th June 2018, 9:37 pm

Query Parquet files in SQLite. Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data—most CSV files, for example. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42MB (3% of the original)—then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette. # 24th June 2018, 7:44 pm

The Four Golden Signals. “The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”—from the excellent (and free) Google Site Reliability Engineering book. # 22nd June 2018, 9:23 pm

lemongraph. An open-source “log-based transactional graph engine”. Written by the NSA. In Python. It runs on top of LMDB, which is the fast memory-mapped transactional key-value store that was developed by the OpenLDAP project as a replacement for BerkeleyDB. # 22nd June 2018, 9:15 pm

MySQL High Availability at GitHub. Cutting edge high availability case-study: GitHub are now using Consul, raft, their own custom load balancer and their own custom orchestrator replication management toolkit to achieve cross-datacenter failover for their MySQL master/replica clusters. # 20th June 2018, 11:05 pm

Notebook: How to build a Teachable Machine with TensorFlow.js (via) This is a really cool Observable notebook. It explains how to build image classification that runs in the browser on top of Tensorflow.js, and includes interactive demos that hook into your webcam and let you hold up items and use them to train a classifier. Since it’s built on Observable every single underlying line of source code is available to browse as part of the essay. # 20th June 2018, 9:10 pm

Sunsetting React Native at Airbnb. “Due to a variety of technical and organizational issues, we will be sunsetting React Native and putting all of our efforts into making native amazing.” Fascinating write-up from Airbnb (part of a series) based on two years of working with React Native. It’s worth reading this in full: 63% of their engineers they surveyed would have chosen React Native again given the chance and 74% would consider it for a new project—but the larger technical and organizational challenges (in particular the fact that React Native remains a polarizing choice in the mobile world, making it harder to hire great native engineers) mean that Airbnb are migrating back to pure-native for their iOS and Android apps. # 19th June 2018, 9:03 pm

github/gh-ost: Thoughts on Foreign Keys? The biggest challenge I’ve seen with foreign key constraints at scale (at least with MySQL) is how they conflict with online schema migrations using tools like pt-online-schema-change or GitHub’s gh-ost. This is a good explanation of the issue by Shlomi Noach, one of the gh-ost maintainers. # 19th June 2018, 4:12 pm

Datasette 0.23: CSV, SpatiaLite and more (via) The big new feature in 0.23 is CSV export: any Datasette table or query can now be exported as CSV, including the option to get all matching rows in one giant CSV file taking advantage of Python 3 async and Datasette’s efficient keyset pagination. Also in this release: improved support for SpatiaLite and various JSON API improvements including the ability to expand foreign key labels in JSON and CSV responses. # 18th June 2018, 3:34 pm

Django Bakery (via) “A set of helpers for baking your Django site out as flat files”. Released by the LA Times Data Desk, who use it for a large number of projects from election results to data journalism interactives. Statically publishing these projects to S3 lets them handle huge traffic spikes at a very low cost. # 16th June 2018, 1:49 am

Raccoons don’t think ahead very much, so raccoons don’t have very good impulse control. I don’t think the raccoon realized when it started climbing what it was in for.

Suzanne MacDonald, raccoon behavior expert # 15th June 2018, 6:06 pm

One of the ways the internet has changed around us over the years is the blog-o-sphere of MetaFilter’s early years has all but disappeared, and so has the kind of link-sharing culture that went with it.

Josh Millard # 14th June 2018, 2:01 pm

Metafilter financial update and future directions. Recent drops in revenue from Google AdSense and Amazon Affiliates have left MetaFilter (19th birthday coming up next month) with a $8,000/month shortfall. They have an optional monthly subscription which currently brings in $7,500/month (monthly expenses are $38,000) so I’ve opted in and thankfully it looks like a lot of other people are subscribing or upping their subscription. I joined the site nearly 14 years ago and it’s been an important part of my online world ever since. # 14th June 2018, 1:55 pm

Changelog 2018-06-12 / Observable. The ability to download an Observable notebook as a stand-alone ES module and run it anywhere using their open source runtime is fascinating, but it’s also worth reading the changelog for some of the new clever tricks they are pulling using await—“await visibility();” in a notebook cell will cause execution to pause until the cell scrolls into view for example. # 13th June 2018, 3:50 pm

Password Tips From a Pen Tester: Common Patterns Exposed (via) Pipal is a tool for analyzing common patterns in passwords. It turns out if you make people change their password every three months and force at least one uppercase letter plus a number they pick “Winter2018”. # 12th June 2018, 3:35 pm

mycli. Really neat auto-complete enabled MySQL terminal client, built using the excellent python-prompt-toolkit. Has a sister-project for PostgreSQL called pgcli. # 11th June 2018, 7:08 pm

Open Source gives engineers the power to collaborate across legal entities (companies) without involving bizdev. The benefits of this workaround are extraordinary and underappreciated.

Yehuda Katz # 6th June 2018, 9:52 pm

At Harvard we’ve built out an infrastructure to allow us to deploy JupyterHub to courses with authentication managed by Canvas. It has allowed us to easily deploy complex set-ups to students so they can do really cool stuff without having to spend hours walking them through setup. Instructors are writing their lectures as IPython notebooks, and distributing them to students, who then work through them in their JupyterHub environment. Our most ambitious so far has been setting up each student in the course with a p2.xlarge machine with cuda and TensorFlow so they could do deep learning work for their final projects. We supported 15 courses last year, and got deployment time for an implementation down to only 2-3 hours.

Chris Rogers # 5th June 2018, 7:37 pm

Continuous Integration with Travis CI—ZEIT Documentation. One of the neat things about Zeit Now is that since deployments are unlimited and are automatically assigned a unique URL you can set up a continuous integration system like Travis to deploy a brand new copy of every commit or every pull request. This documentation also shows how to have commits to master automatically aliased to a known URL. I have quite a few Datasette projects that are deployed automatically to Now by Travis and the pattern seems to be working great so far. # 1st June 2018, 5:21 pm

Side-channel attacking browsers through CSS3 features. Really clever attack. Sites like Facebook offer iframe widgets which show the user’s name, but due to the cross-origin resource policy cannot be introspected by the site on which they are embedded. By using CSS3 blend modes it’s possible to construct a timing attack where a stack of divs layered over the top of the iframe can be used to derive the embedded content, by taking advantage of blend modes that take different amounts of time depending on the colour of the underlying pixel. Patched in Firefox 60 and Chrome 63. # 1st June 2018, 2:54 pm

asgi-scope (via) I made a tiny (16 lines of code) web application to help understand the ASGI specification for building asynchronous Python applications. It works a little like phpinfo(): it dumps out the ASGI scope created by the incoming request. # 1st June 2018, 2:42 pm

Half of the time when companies say they need “AI” what they really need is a SELECT clause with GROUP BY.

Mat Velloso # 1st June 2018, 2:35 pm