Simon Willison’s Weblog

Blogmarks

The Now CDN (via) Huge announcement from Zeit Now today: all .now.sh deployments are now served through the Cloudflare CDN, which means they benefit from 150 worldwide CDN locations that obey HTTP caching headers. This is particularly relevant for Datasette, since it serves far-future cache headers by default and uses Cloudflare-compatible HTTP/2 push hints to accelerate 302 redirects. This means that both the “datasette publish now” CLI command and the Datasette Publish web app will now result in Cloudflare-accelerated deployments.

# 12th July 2018, 3:34 am / cdn, performance, zeit-now, datasette, cloudflare

Usage of ARIA attributes via HTTP Archive. A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way.

# 12th July 2018, 3:16 am / aria, http, internet-archive, big-data

scrapely. Neat twist on a screen scraping library: this one lets you “train it” by feeding it examples of URLs paired with a dictionary of the data you would like to have extracted from that URL, then uses an instance-based learning algorithm to run against new URLs. Slightly confusing name since it’s maintained by the scrapy team but is a totally independent project from the scrapy web crawling framework.

# 10th July 2018, 8:25 pm / python, scraping

react-from-zero (via) Interesting approach to teaching and understanding React: unlike most other tutorials, this skips Webpack and ES6 entirely and focuses on things you can get running just using a browser and loading code via script tags. It does eventually load Babel to enable client-side JSX transforms, but before that it shows how React can be used by loading react.js and react-dom.js and then calling React.createElement() manually (or by using the 0xeac7 magic symbol and constructing JavaScript objects manually with $$typeof: magicValue).

# 3rd July 2018, 5:27 pm / javascript, react

Digg’s v4 launch: an optimism born of necessity. Riveting behind-the-scenes story of the disastrous Digg V4 launch by former Digg engineer Will Larson.

# 2nd July 2018, 5:25 pm / digg, will-larson

datasette-vega (via) I wrote a visualization plugin for Datasette that uses the excellent Vega “visualization grammar” library to provide bar, line and scatter charts configurable against any Datasette table or SQL query.

# 29th June 2018, 3 pm / plugins, projects, visualization, datasette

Migrating Messenger storage to optimize performance (via) Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems).

# 27th June 2018, 3:05 pm / facebook, migration, mysql, scaling, zero-downtime

mkcert (via) Handy new tool from Filippo Valsorda (a cryptographer at Google) for easily generating TLS certificates for your local development environment. You can use this to get a certificate pair for a localhost web server created with a couple of simple commands.

# 26th June 2018, 6:55 pm / certificates, go, https, filippo-valsorda

ActorDB. Distributed SQL database written in Erlang and built on top of SQLite (on top of LMDB), adding replication using the Raft consensus algorithm (so it is sharded with no single points of failure) and a MySQL protocol interface. Interesting combination of technologies.

# 24th June 2018, 9:48 pm / erlang, scaling, sqlite, big-data

source-map-explorer. Very neat tool for creating a tree map visualization of the size of the components of a bundled JavaScript file created by webpack (or if you’re using create-react-app by “npm run build”). I ran this using “npx source-map-explorer build/static/js/main.d63f3f34.js” (since I don’t like using “npm install -g”).

# 24th June 2018, 9:37 pm / javascript, npm

Query Parquet files in SQLite. Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data (most CSV files, for example). Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42 MB (3% of the original), then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.
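
To see why a columnar layout helps with duplicate column data, here is a toy sketch in Python. The rows are made up, and run-length encoding is used purely as an illustration: this is not a description of the Parquet file format or of Colin's extension.

```python
from itertools import groupby

# Hypothetical row-oriented data, as it might appear in a CSV file:
# (province, census_year, population)
rows = [
    ("Ontario", 2016, 1200),
    ("Ontario", 2016, 950),
    ("Ontario", 2016, 430),
    ("Quebec", 2016, 880),
    ("Quebec", 2016, 610),
]


def to_columns(rows):
    """Pivot rows into one list per column (the columnar layout)."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(column):
    """Collapse each run of consecutive equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


provinces, years, populations = to_columns(rows)

# Columns full of repeated values shrink to a handful of pairs...
encoded_provinces = run_length_encode(provinces)
encoded_years = run_length_encode(years)
# ...while a column of distinct values gains nothing from this encoding.
encoded_populations = run_length_encode(populations)
```

Stored row by row, the repeated province and year values are interleaved with the population figures, so there are no long runs to collapse. Stored column by column, the repeats sit next to each other.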

# 24th June 2018, 7:44 pm / sqlite, big-data, datasette, parquet, colin-dellow

The Four Golden Signals. “The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”—from the excellent (and free) Google Site Reliability Engineering book.
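
A rough sketch of what measuring the four signals could look like for a batch of request records. The `Request` shape, the choice of the 95th percentile for latency, counting only 5xx responses as errors, and the in-flight/capacity ratio for saturation are all assumptions made for the example; the quote does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise latency, traffic, errors and saturation for one time window.

    in_flight and capacity are hypothetical measures of how busy the
    service is (for example, busy workers out of total workers).
    """
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer maths to find ceil(0.95 * n).
    rank = -(-95 * len(durations) // 100)
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": in_flight / capacity,
    }


# Made-up sample: five requests observed over a ten second window.
sample = [
    Request(120, 200),
    Request(80, 200),
    Request(300, 500),
    Request(95, 200),
    Request(110, 404),
]
signals = golden_signals(sample, window_seconds=10, in_flight=8, capacity=10)
```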

# 22nd June 2018, 9:23 pm / monitoring

lemongraph. An open-source “log-based transactional graph engine”. Written by the NSA. In Python. It runs on top of LMDB, which is the fast memory-mapped transactional key-value store that was developed by the OpenLDAP project as a replacement for BerkeleyDB.

# 22nd June 2018, 9:15 pm / graph, nsa, open-source

MySQL High Availability at GitHub. Cutting edge high availability case-study: GitHub are now using Consul, raft, their own custom load balancer and their own custom orchestrator replication management toolkit to achieve cross-datacenter failover for their MySQL master/replica clusters.

# 20th June 2018, 11:05 pm / github, highavailability, mysql, scaling, shlominoach

Notebook: How to build a Teachable Machine with TensorFlow.js (via) This is a really cool Observable notebook. It explains how to build image classification that runs in the browser on top of TensorFlow.js, and includes interactive demos that hook into your webcam and let you hold up items and use them to train a classifier. Since it’s built on Observable, every single underlying line of source code is available to browse as part of the essay.

# 20th June 2018, 9:10 pm / javascript, machine-learning, explorables, tensorflow, observable

Sunsetting React Native at Airbnb. “Due to a variety of technical and organizational issues, we will be sunsetting React Native and putting all of our efforts into making native amazing.” Fascinating write-up from Airbnb (part of a series) based on two years of working with React Native. It’s worth reading this in full: 63% of the engineers they surveyed would have chosen React Native again given the chance and 74% would consider it for a new project, but the larger technical and organizational challenges (in particular the fact that React Native remains a polarizing choice in the mobile world, making it harder to hire great native engineers) mean that Airbnb are migrating back to pure-native for their iOS and Android apps.

# 19th June 2018, 9:03 pm / mobile, react

github/gh-ost: Thoughts on Foreign Keys? The biggest challenge I’ve seen with foreign key constraints at scale (at least with MySQL) is how they conflict with online schema migrations using tools like pt-online-schema-change or GitHub’s gh-ost. This is a good explanation of the issue by Shlomi Noach, one of the gh-ost maintainers.

# 19th June 2018, 4:12 pm / databases, mysql, scaling, sql, shlominoach

Datasette 0.23: CSV, SpatiaLite and more (via) The big new feature in 0.23 is CSV export: any Datasette table or query can now be exported as CSV, including the option to get all matching rows in one giant CSV file taking advantage of Python 3 async and Datasette’s efficient keyset pagination. Also in this release: improved support for SpatiaLite and various JSON API improvements including the ability to expand foreign key labels in JSON and CSV responses.
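
A minimal sketch of the keyset pagination idea, using Python's built-in sqlite3 and csv modules. The `items` table and the function names are hypothetical, and this synchronous version leaves out the async streaming part, so treat it as an illustration of the technique and not as Datasette's actual implementation.

```python
import csv
import io
import sqlite3


def iter_rows_keyset(conn, page_size=2):
    """Yield every row of the hypothetical 'items' table, one page at a time.

    Each query starts after the last primary key already seen, so the
    database never has to skip over earlier rows the way OFFSET does.
    Assumes ids are positive integers.
    """
    last_id = 0
    while True:
        page = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]  # key of the final row on this page


def export_csv(conn):
    """Write the whole table to CSV text by walking the pages in order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "name"])
    writer.writerows(iter_rows_keyset(conn))
    return buffer.getvalue()


# Build a small in-memory table to run the export against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)", [(n,) for n in "abcde"])
csv_text = export_csv(conn)
```

Because each page is an indexed lookup on the primary key, fetching page 10,000 should cost about the same as fetching page 1, which is what makes a single giant export practical.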

# 18th June 2018, 3:34 pm / csv, projects, datasette

Django Bakery (via) “A set of helpers for baking your Django site out as flat files”. Released by the LA Times Data Desk, who use it for a large number of projects from election results to data journalism interactives. Statically publishing these projects to S3 lets them handle huge traffic spikes at a very low cost.

# 16th June 2018, 1:49 am / data-journalism, django, s3, static-generator, ben-welsh

Metafilter financial update and future directions. Recent drops in revenue from Google AdSense and Amazon Affiliates have left MetaFilter (19th birthday coming up next month) with an $8,000/month shortfall. They have an optional monthly subscription which currently brings in $7,500/month (monthly expenses are $38,000) so I’ve opted in and thankfully it looks like a lot of other people are subscribing or upping their subscription. I joined the site nearly 14 years ago and it’s been an important part of my online world ever since.

# 14th June 2018, 1:55 pm / metafilter

Changelog 2018-06-12 / Observable. The ability to download an Observable notebook as a stand-alone ES module and run it anywhere using their open source runtime is fascinating, but it’s also worth reading the changelog for some of the new clever tricks they are pulling using await—“await visibility();” in a notebook cell will cause execution to pause until the cell scrolls into view for example.

# 13th June 2018, 3:50 pm / async, javascript, observable

Password Tips From a Pen Tester: Common Patterns Exposed (via) Pipal is a tool for analyzing common patterns in passwords. It turns out if you make people change their password every three months and force at least one uppercase letter plus a number they pick “Winter2018”.
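
A small sketch of this kind of pattern counting. It is not how Pipal works internally: the two regular expressions and the sample passwords are illustrative assumptions based only on the “Winter2018” pattern described above.

```python
import re
from collections import Counter

# A season name followed by a four digit year, with optional trailing symbols.
SEASON_YEAR = re.compile(
    r"^(Spring|Summer|Autumn|Fall|Winter)(19|20)\d{2}[^A-Za-z0-9]*$"
)
# Any capitalised word followed by digits, e.g. "Password1".
CAPITAL_WORD_DIGITS = re.compile(r"^[A-Z][a-z]+\d+$")


def pattern_of(password):
    """Label a password with the first hypothetical pattern it matches."""
    if SEASON_YEAR.match(password):
        return "season+year"
    if CAPITAL_WORD_DIGITS.match(password):
        return "capitalised word+digits"
    return "other"


# Made-up sample, not real leaked data.
passwords = [
    "Winter2018",
    "Summer2017!",
    "Password1",
    "correct horse battery staple",
    "Winter2018",
]
pattern_counts = Counter(pattern_of(p) for p in passwords)
```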

# 12th June 2018, 3:35 pm / passwords, security

mycli. Really neat auto-complete enabled MySQL terminal client, built using the excellent python-prompt-toolkit. Has a sister-project for PostgreSQL called pgcli.

# 11th June 2018, 7:08 pm / mysql, postgresql, python

Continuous Integration with Travis CI—ZEIT Documentation. One of the neat things about Zeit Now is that since deployments are unlimited and are automatically assigned a unique URL you can set up a continuous integration system like Travis to deploy a brand new copy of every commit or every pull request. This documentation also shows how to have commits to master automatically aliased to a known URL. I have quite a few Datasette projects that are deployed automatically to Now by Travis and the pattern seems to be working great so far.

# 1st June 2018, 5:21 pm / continuous-deployment, continuous-integration, zeit-now, travis

Side-channel attacking browsers through CSS3 features. Really clever attack. Sites like Facebook offer iframe widgets which show the user’s name, but due to the same-origin policy these cannot be introspected by the site on which they are embedded. By using CSS3 blend modes it’s possible to construct a timing attack where a stack of divs layered over the top of the iframe can be used to derive the embedded content, by taking advantage of blend modes that take different amounts of time depending on the colour of the underlying pixel. Patched in Firefox 60 and Chrome 63.

# 1st June 2018, 2:54 pm / css3, security, sidechannel, timing-attack

asgi-scope (via) I made a tiny (16 lines of code) web application to help understand the ASGI specification for building asynchronous Python applications. It works a little like phpinfo(): it dumps out the ASGI scope created by the incoming request.
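
For a sense of the idea, here is a sketch of a scope-dumping application. It is written in the later single-callable ASGI 3.0 style, which may well differ from what asgi-scope itself looked like in 2018, and the `call_app` helper and example scope are hypothetical stand-ins for a real ASGI server.

```python
import asyncio
import pprint


async def app(scope, receive, send):
    """Respond to any HTTP request with a plain-text dump of the ASGI scope."""
    if scope["type"] != "http":
        return  # this sketch ignores websocket and lifespan scopes
    body = pprint.pformat(scope).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain; charset=utf-8")],
    })
    await send({"type": "http.response.body", "body": body})


async def call_app(scope):
    """Hypothetical stand-in for a server: call the app, collect what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app(scope, receive, send)
    return sent


# A hand-written scope with only a few of the keys a real server would supply.
example_scope = {"type": "http", "method": "GET", "path": "/hello", "query_string": b""}
messages = asyncio.run(call_app(example_scope))
```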

# 1st June 2018, 2:42 pm

SpatiaLite — Datasette documentation. Datasette’s documentation now includes extensive coverage of the SpatiaLite extension for SQLite: how to install it, how to import latitude/longitude points, shapefiles and GeoJSON data into SpatiaLite tables, and how to run SQL queries against it that take advantage of spatial indexes. I’m learning SpatiaLite at the moment and filling out the documentation with each new trick I learn as I go—as Mark Pilgrim once taught me, the best way to learn a new technology is to write about it.

# 30th May 2018, 4:34 am / documentation, mark-pilgrim, spatialite, sqlite, datasette

Library of Congress Sustainability of Digital Formats: SQLite. “The Library of Congress Recommended Formats Statement (RFS) includes SQLite as a preferred format for datasets.”

# 28th May 2018, 5:19 pm / sqlite

Beginner’s Guide to Jupyter Notebooks for Data Science (with Tips, Tricks!) (via) If you haven’t yet got on the Jupyter notebooks bandwagon this should help. It’s the single biggest productivity improvement I’ve made to my workflow in a very long time.

# 24th May 2018, 1:58 pm / jupyter, data-science

Showdown: MySQL 8 vs PostgreSQL 10 (via) MySQL 8 makes comparisons between PostgreSQL and MySQL far more interesting, as it closes some of the key feature gaps. Meanwhile the PostgreSQL replication story (long one of MySQL’s key advantages) has improved dramatically in recent versions. This article offers a useful overview of the current differences, including diving into some of the less obvious implementation details that differ between the two.

# 23rd May 2018, 5:02 pm / databases, mysql, postgresql
