Blogmarks
The Now CDN (via) Huge announcement from Zeit Now today: all .now.sh deployments are now served through the Cloudflare CDN, which means they benefit from 150 worldwide CDN locations that obey HTTP caching headers. This is particularly relevant for Datasette, since it serves far-future cache headers by default and uses Cloudflare-compatible HTTP/2 push hints to accelerate 302 redirects. This means that both the “datasette publish now” CLI command and the Datasette Publish web app will now result in Cloudflare-accelerated deployments.
Usage of ARIA attributes via HTTP Archive. A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way.
scrapely. Neat twist on a screen scraping library: this one lets you “train it” by feeding it examples of URLs paired with a dictionary of the data you would like to have extracted from that URL, then uses an instance-based learning algorithm to run against new URLs. Slightly confusing name since it’s maintained by the scrapy team but is a totally independent project from the scrapy web crawling framework.
react-from-zero (via) Interesting approach to teaching and understanding React: unlike most other tutorials this skips Webpack and ES6 entirely and focuses on things you can get running just using a browser and loading code via script tags. It does eventually load Babel to enable client-side JSX transforms, but before that it shows how React can be used by loading react.js and react-dom.js and then calling React.createElement() manually (or by using the 0xeac7 magic symbol and constructing JavaScript objects manually with $$typeof: magicValue).
Digg’s v4 launch: an optimism born of necessity. Riveting behind-the-scenes story of the disastrous Digg V4 launch by former Digg engineer Will Larson.
datasette-vega (via) I wrote a visualization plugin for Datasette that uses the excellent Vega “visualization grammar” library to provide bar, line and scatter charts configurable against any Datasette table or SQL query.
Migrating Messenger storage to optimize performance (via) Fascinating case-study of a truly gargantuan migration. Messenger has over a billion users, and Facebook successfully migrated its backend storage from HBase to their MyRocks database (a fork of MySQL with a storage engine built on their SSD-optimized RocksDB key/value library) without any user-visible downtime. They ended up using two migration paths: one for the 99.9% of regular accounts, and a separate path for extremely high volume accounts (businesses with very active chat bots or support systems).
mkcert (via) Handy new tool from Filippo Valsorda (a cryptographer at Google) for easily generating TLS certificates for your local development environment. You can use this to get a certificate pair for a localhost web server created with a couple of simple commands.
ActorDB. Distributed SQL database written in Erlang built on top of SQLite (on top of LMDB), adding replication using the Raft consensus algorithm (so sharded with no single points of failure) and a MySQL protocol interface. Interesting combination of technologies.
source-map-explorer. Very neat tool for creating a tree map visualization of the size of the components of a bundled JavaScript file created by webpack (or if you’re using create-react-app by “npm run build”). I ran this using “npx source-map-explorer build/static/js/main.d63f3f34.js” (since I don’t like using “npm install -g”).
Query Parquet files in SQLite. Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data—most CSV files, for example. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42MB (3% of the original)—then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.
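To illustrate why storing data column by column helps with duplicate values, here is a minimal sketch of run-length encoding applied to a single column. This is a simplified illustration and not Parquet's actual on-disk encoding; the function name and sample data are made up for the example.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


# Hypothetical column of province codes, as it might appear in a census table.
provinces = ["ON", "ON", "ON", "ON", "QC", "QC", "BC"]
encoded = run_length_encode(provinces)
```

In a row-oriented format like CSV the repeated values are interleaved with every other column, so runs like this never sit next to each other and cannot be collapsed so easily.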
The Four Golden Signals. “The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”—from the excellent (and free) Google Site Reliability Engineering book.
lemongraph. An open-source “log-based transactional graph engine”. Written by the NSA. In Python. It runs on top of LMDB, which is the fast memory-mapped transactional key-value store that was developed by the OpenLDAP project as a replacement for BerkeleyDB.
MySQL High Availability at GitHub. Cutting edge high availability case-study: GitHub are now using Consul, raft, their own custom load balancer and their own custom orchestrator replication management toolkit to achieve cross-datacenter failover for their MySQL master/replica clusters.
Notebook: How to build a Teachable Machine with TensorFlow.js (via) This is a really cool Observable notebook. It explains how to build image classification that runs in the browser on top of Tensorflow.js, and includes interactive demos that hook into your webcam and let you hold up items and use them to train a classifier. Since it’s built on Observable every single underlying line of source code is available to browse as part of the essay.
Sunsetting React Native at Airbnb. “Due to a variety of technical and organizational issues, we will be sunsetting React Native and putting all of our efforts into making native amazing.” Fascinating write-up from Airbnb (part of a series) based on two years of working with React Native. It’s worth reading this in full: 63% of the engineers they surveyed would have chosen React Native again given the chance and 74% would consider it for a new project, but the larger technical and organizational challenges (in particular the fact that React Native remains a polarizing choice in the mobile world, making it harder to hire great native engineers) mean that Airbnb are migrating back to pure-native for their iOS and Android apps.
github/gh-ost: Thoughts on Foreign Keys? The biggest challenge I’ve seen with foreign key constraints at scale (at least with MySQL) is how they conflict with online schema migrations using tools like pt-online-schema-change or GitHub’s gh-ost. This is a good explanation of the issue by Shlomi Noach, one of the gh-ost maintainers.
Datasette 0.23: CSV, SpatiaLite and more (via) The big new feature in 0.23 is CSV export: any Datasette table or query can now be exported as CSV, including the option to get all matching rows in one giant CSV file taking advantage of Python 3 async and Datasette’s efficient keyset pagination. Also in this release: improved support for SpatiaLite and various JSON API improvements including the ability to expand foreign key labels in JSON and CSV responses.
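A minimal sketch of the keyset pagination idea behind streaming a whole table as CSV, using only the standard library. This is not Datasette's implementation: the `stream_csv` function, the `page_size` parameter and the use of `rowid` as the pagination key are assumptions made for illustration.

```python
import csv
import io
import sqlite3


def stream_csv(conn, table, page_size=100):
    """Yield CSV text chunks covering every row in `table`, one page at a time.

    Each page starts after the last rowid already seen, so SQLite can seek
    straight to it instead of scanning past skipped rows as OFFSET would.
    """
    last_rowid = 0
    wrote_header = False
    while True:
        cursor = conn.execute(
            f"select rowid, * from [{table}] where rowid > ? order by rowid limit ?",
            (last_rowid, page_size),
        )
        rows = cursor.fetchall()
        if not rows:
            break
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        if not wrote_header:
            # Column names come from the cursor, so no schema lookup is needed.
            writer.writerow([col[0] for col in cursor.description])
            wrote_header = True
        writer.writerows(rows)
        yield buffer.getvalue()
        # Remember where this page ended; the next query resumes from here.
        last_rowid = rows[-1][0]


# Hypothetical in-memory table used to exercise the generator.
conn = sqlite3.connect(":memory:")
conn.execute("create table items (name text)")
conn.executemany(
    "insert into items (name) values (?)",
    [(f"item-{i}",) for i in range(250)],
)
chunks = list(stream_csv(conn, "items", page_size=100))
```

Because the generator yields one page at a time, a web framework could send each chunk to the client as soon as it is ready rather than building the whole file in memory.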
Django Bakery (via) “A set of helpers for baking your Django site out as flat files”. Released by the LA Times Data Desk, who use it for a large number of projects from election results to data journalism interactives. Statically publishing these projects to S3 lets them handle huge traffic spikes at a very low cost.
Metafilter financial update and future directions. Recent drops in revenue from Google AdSense and Amazon Affiliates have left MetaFilter (19th birthday coming up next month) with an $8,000/month shortfall. They have an optional monthly subscription which currently brings in $7,500/month (monthly expenses are $38,000) so I’ve opted in and thankfully it looks like a lot of other people are subscribing or upping their subscription. I joined the site nearly 14 years ago and it’s been an important part of my online world ever since.
Changelog 2018-06-12 / Observable. The ability to download an Observable notebook as a stand-alone ES module and run it anywhere using their open source runtime is fascinating, but it’s also worth reading the changelog for some of the new clever tricks they are pulling using await—“await visibility();” in a notebook cell will cause execution to pause until the cell scrolls into view for example.
Password Tips From a Pen Tester: Common Patterns Exposed (via) Pipal is a tool for analyzing common patterns in passwords. It turns out if you make people change their password every three months and force at least one uppercase letter plus a number they pick “Winter2018”.
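The “Winter2018” pattern is easy to check for mechanically. The sketch below is a hypothetical regular expression for a season followed by a year; it is not how Pipal itself works, and the season list and optional trailing punctuation are assumptions.

```python
import re

# A capitalised season, a four-digit year starting 19 or 20, and an optional
# trailing punctuation mark (a common way to satisfy "one special character").
SEASON_YEAR = re.compile(r"^(Spring|Summer|Autumn|Fall|Winter)(19|20)\d{2}[!?.]?$")


def looks_like_season_year(password):
    """Return True if the password is just a season name plus a year."""
    return bool(SEASON_YEAR.match(password))
```

A check like this could be used to reject the most predictable choices when a password is set.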
mycli. Really neat auto-complete enabled MySQL terminal client, built using the excellent python-prompt-toolkit. Has a sister-project for PostgreSQL called pgcli.
Continuous Integration with Travis CI—ZEIT Documentation. One of the neat things about Zeit Now is that since deployments are unlimited and are automatically assigned a unique URL you can set up a continuous integration system like Travis to deploy a brand new copy of every commit or every pull request. This documentation also shows how to have commits to master automatically aliased to a known URL. I have quite a few Datasette projects that are deployed automatically to Now by Travis and the pattern seems to be working great so far.
Side-channel attacking browsers through CSS3 features. Really clever attack. Sites like Facebook offer iframe widgets which show the user’s name but which, due to the cross-origin resource policy, cannot be introspected by the site on which they are embedded. By using CSS3 blend modes it’s possible to construct a timing attack where a stack of divs layered over the top of the iframe can be used to derive the embedded content, by taking advantage of blend modes that take different amounts of time depending on the colour of the underlying pixel. Patched in Firefox 60 and Chrome 63.
asgi-scope (via) I made a tiny (16 lines of code) web application to help understand the ASGI specification for building asynchronous Python applications. It works a little like phpinfo(): it dumps out the ASGI scope created by the incoming request.
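For a sense of what such an application can look like, here is a minimal sketch that returns the scope as JSON. It is not the asgi-scope source: it assumes the single-callable `app(scope, receive, send)` form used by current versions of the ASGI specification, and the `demo` helper with its fake scope is invented so the app can be exercised without a server.

```python
import asyncio
import json


async def app(scope, receive, send):
    """Respond to any HTTP request with a JSON dump of the ASGI scope."""
    if scope["type"] != "http":
        return  # other scope types are ignored in this sketch
    # The scope contains bytes (headers, for example), so fall back to repr().
    body = json.dumps(scope, indent=2, default=repr).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json; charset=utf-8")],
    })
    await send({"type": "http.response.body", "body": body})


def demo():
    """Call the app directly with a hand-built scope and collect its messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {
        "type": "http",
        "method": "GET",
        "path": "/",
        "headers": [(b"host", b"example.com")],
    }
    asyncio.run(app(scope, receive, send))
    return sent
```

Under a real ASGI server the scope would also include details such as the client address and query string, which is what makes dumping it a handy way to learn the specification.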
SpatiaLite — Datasette documentation. Datasette’s documentation now includes extensive coverage of the SpatiaLite extension for SQLite: how to install it, how to import latitude/longitude points, shapefiles and GeoJSON data into SpatiaLite tables, and how to run SQL queries against it that take advantage of spatial indexes. I’m learning SpatiaLite at the moment and filling out the documentation with each new trick I learn as I go—as Mark Pilgrim once taught me, the best way to learn a new technology is to write about it.
Library of Congress Sustainability of Digital Formats: SQLite. “The Library of Congress Recommended Formats Statement (RFS) includes SQLite as a preferred format for datasets.”
Beginner’s Guide to Jupyter Notebooks for Data Science (with Tips, Tricks!) (via) If you haven’t yet got on the Jupyter notebooks bandwagon this should help. It’s the single biggest productivity improvement I’ve made to my workflow in a very long time.
Showdown: MySQL 8 vs PostgreSQL 10 (via) MySQL 8 makes comparisons between PostgreSQL and MySQL far more interesting, as it closes some of the key feature gaps. Meanwhile the PostgreSQL replication story (long one of MySQL’s key advantages) has improved dramatically in recent versions. This article offers a useful overview of the current differences, including diving into some of the less obvious implementation details that differ between the two.