Simon Willison’s Weblog

Blogmarks

source-map-explorer. Very neat tool for creating a tree map visualization of the size of the components of a bundled JavaScript file created by webpack (or if you’re using create-react-app by “npm run build”). I ran this using “npx source-map-explorer build/static/js/main.d63f3f34.js” (since I don’t like using “npm install -g”).

# 24th June 2018, 9:37 pm / javascript, npm

Query Parquet files in SQLite. Colin Dellow built a SQLite virtual table extension that lets you query Parquet files directly using SQL. Parquet is interesting because it’s a columnar format that dramatically reduces the space needed to store tables with lots of duplicate column data, which describes most CSV files. Colin reports being able to shrink a 1291 MB CSV file from the Canadian census to an equivalent Parquet file weighing just 42 MB (3% of the original), then running a complex query against the data in just 60ms. I’d love to see someone get this extension working with Datasette.

# 24th June 2018, 7:44 pm / sqlite, big-data, datasette, parquet, colin-dellow

The Four Golden Signals. “The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”—from the excellent (and free) Google Site Reliability Engineering book.

# 22nd June 2018, 9:23 pm / monitoring

lemongraph. An open-source “log-based transactional graph engine”. Written by the NSA. In Python. It runs on top of LMDB, which is the fast memory-mapped transactional key-value store that was developed by the OpenLDAP project as a replacement for BerkeleyDB.

# 22nd June 2018, 9:15 pm / graph, nsa, open-source

MySQL High Availability at GitHub. Cutting-edge high availability case study: GitHub are now using Consul, Raft, their own custom load balancer and their own custom orchestrator replication management toolkit to achieve cross-datacenter failover for their MySQL master/replica clusters.

# 20th June 2018, 11:05 pm / github, highavailability, mysql, scaling, shlominoach

Notebook: How to build a Teachable Machine with TensorFlow.js (via) This is a really cool Observable notebook. It explains how to build image classification that runs in the browser on top of TensorFlow.js, and includes interactive demos that hook into your webcam and let you hold up items and use them to train a classifier. Since it’s built on Observable, every single underlying line of source code is available to browse as part of the essay.

# 20th June 2018, 9:10 pm / javascript, machine-learning, explorables, tensorflow, observable

Sunsetting React Native at Airbnb. “Due to a variety of technical and organizational issues, we will be sunsetting React Native and putting all of our efforts into making native amazing.” Fascinating write-up from Airbnb (part of a series) based on two years of working with React Native. It’s worth reading this in full: 63% of the engineers they surveyed would have chosen React Native again given the chance and 74% would consider it for a new project, but the larger technical and organizational challenges (in particular the fact that React Native remains a polarizing choice in the mobile world, making it harder to hire great native engineers) mean that Airbnb are migrating back to pure-native for their iOS and Android apps.

# 19th June 2018, 9:03 pm / mobile, react

github/gh-ost: Thoughts on Foreign Keys? The biggest challenge I’ve seen with foreign key constraints at scale (at least with MySQL) is how they conflict with online schema migrations using tools like pt-online-schema-change or GitHub’s gh-ost. This is a good explanation of the issue by Shlomi Noach, one of the gh-ost maintainers.

# 19th June 2018, 4:12 pm / databases, mysql, scaling, sql, shlominoach

Datasette 0.23: CSV, SpatiaLite and more (via) The big new feature in 0.23 is CSV export: any Datasette table or query can now be exported as CSV, including the option to get all matching rows in one giant CSV file taking advantage of Python 3 async and Datasette’s efficient keyset pagination. Also in this release: improved support for SpatiaLite and various JSON API improvements including the ability to expand foreign key labels in JSON and CSV responses.
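
Keyset pagination is what makes the "all rows in one CSV" option cheap: each page resumes from the last primary key seen instead of using OFFSET. A minimal synchronous sketch of that idea using only the standard library follows. It is not Datasette's implementation (which is async), and the `items` table, its columns, and the function names are hypothetical.

```python
import csv
import io
import sqlite3


def stream_rows(conn, page_size=100):
    """Yield every row of the hypothetical `items` table, one page at a time."""
    last_id = 0  # assumes integer primary keys greater than zero
    while True:
        page = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]  # resume after the last key seen, no OFFSET needed


def export_csv(conn, out, page_size=100):
    """Write the whole table to `out` as CSV and return the row count."""
    writer = csv.writer(out)
    writer.writerow(["id", "name"])
    count = 0
    for row in stream_rows(conn, page_size):
        writer.writerow(row)
        count += 1
    return count


def demo():
    """Build a 250-row in-memory table and export it in pages of 100."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in range(1, 251)],
    )
    out = io.StringIO()
    count = export_csv(conn, out, page_size=100)
    return count, out.getvalue()
```

Because every query filters on `id > last_id`, each page costs roughly the same regardless of how deep into the table it is, which is the property that makes streaming a very large export practical.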

# 18th June 2018, 3:34 pm / csv, projects, datasette

Django Bakery (via) “A set of helpers for baking your Django site out as flat files”. Released by the LA Times Data Desk, who use it for a large number of projects from election results to data journalism interactives. Statically publishing these projects to S3 lets them handle huge traffic spikes at a very low cost.

# 16th June 2018, 1:49 am / data-journalism, django, s3, static-generator, ben-welsh

Metafilter financial update and future directions. Recent drops in revenue from Google AdSense and Amazon Affiliates have left MetaFilter (19th birthday coming up next month) with an $8,000/month shortfall. They have an optional monthly subscription which currently brings in $7,500/month (monthly expenses are $38,000) so I’ve opted in and thankfully it looks like a lot of other people are subscribing or upping their subscription. I joined the site nearly 14 years ago and it’s been an important part of my online world ever since.

# 14th June 2018, 1:55 pm / metafilter

Changelog 2018-06-12 / Observable. The ability to download an Observable notebook as a stand-alone ES module and run it anywhere using their open source runtime is fascinating, but it’s also worth reading the changelog for some of the new clever tricks they are pulling using await—“await visibility();” in a notebook cell will cause execution to pause until the cell scrolls into view for example.

# 13th June 2018, 3:50 pm / async, javascript, observable

Password Tips From a Pen Tester: Common Patterns Exposed (via) Pipal is a tool for analyzing common patterns in passwords. It turns out if you make people change their password every three months and force at least one uppercase letter plus a number they pick “Winter2018”.
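
To make the "Winter2018" observation concrete, here is a small sketch that counts season-plus-year passwords in a list. This is not how Pipal works internally (Pipal is a separate tool); the regular expression and the function name are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: a capitalised season name followed by a four-digit year,
# optionally with one trailing symbol. Not taken from Pipal.
SEASON_YEAR = re.compile(r"^(Spring|Summer|Autumn|Fall|Winter)(19|20)\d{2}[!@#$%]?$")


def season_year_counts(passwords):
    """Count how many passwords match the season+year pattern, per season."""
    counts = Counter()
    for password in passwords:
        match = SEASON_YEAR.match(password)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passwords like these satisfy a "one uppercase letter plus a number" rule and a three-month rotation policy at the same time, which is why the pattern is worth checking for.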

# 12th June 2018, 3:35 pm / passwords, security

mycli. Really neat auto-complete enabled MySQL terminal client, built using the excellent python-prompt-toolkit. Has a sister-project for PostgreSQL called pgcli.

# 11th June 2018, 7:08 pm / mysql, postgresql, python

Continuous Integration with Travis CI—ZEIT Documentation. One of the neat things about Zeit Now is that since deployments are unlimited and are automatically assigned a unique URL you can set up a continuous integration system like Travis to deploy a brand new copy of every commit or every pull request. This documentation also shows how to have commits to master automatically aliased to a known URL. I have quite a few Datasette projects that are deployed automatically to Now by Travis and the pattern seems to be working great so far.

# 1st June 2018, 5:21 pm / continuous-deployment, continuous-integration, zeit-now, travis

Side-channel attacking browsers through CSS3 features. Really clever attack. Sites like Facebook offer iframe widgets which show the user’s name, but due to the same-origin policy cannot be introspected by the site on which they are embedded. By using CSS3 blend modes it’s possible to construct a timing attack where a stack of divs layered over the top of the iframe can be used to derive the embedded content, by taking advantage of blend modes that take different amounts of time depending on the colour of the underlying pixel. Patched in Firefox 60 and Chrome 63.

# 1st June 2018, 2:54 pm / css3, security, sidechannel, timing-attack

asgi-scope (via) I made a tiny (16 lines of code) web application to help understand the ASGI specification for building asynchronous Python applications. It works a little like phpinfo(): it dumps out the ASGI scope created by the incoming request.
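
For readers who have not seen an ASGI application before, a scope-dumping app can be sketched roughly as follows. This is not the 16 lines from the asgi-scope project; it assumes the single-callable `app(scope, receive, send)` interface from later versions of the ASGI specification, and `call_app` is a hypothetical helper for exercising the app without a server.

```python
import asyncio
import pprint


async def app(scope, receive, send):
    """Respond to any HTTP request with a plain-text dump of the ASGI scope."""
    if scope["type"] != "http":
        return  # ignore lifespan/websocket events in this sketch
    body = pprint.pformat(scope).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain; charset=utf-8")],
    })
    await send({"type": "http.response.body", "body": body})


def call_app(scope):
    """Drive the app without a server and return the messages it sent."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    asyncio.run(app(scope, receive, send))
    return sent
```

Served by an ASGI server, every request would return the dictionary the server built for it, which is a quick way to see which keys (method, path, headers and so on) a given server actually provides.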

# 1st June 2018, 2:42 pm

SpatiaLite — Datasette documentation. Datasette’s documentation now includes extensive coverage of the SpatiaLite extension for SQLite: how to install it, how to import latitude/longitude points, shapefiles and GeoJSON data into SpatiaLite tables, and how to run SQL queries against it that take advantage of spatial indexes. I’m learning SpatiaLite at the moment and filling out the documentation with each new trick I learn as I go—as Mark Pilgrim once taught me, the best way to learn a new technology is to write about it.

# 30th May 2018, 4:34 am / documentation, mark-pilgrim, spatialite, sqlite, datasette

Library of Congress Sustainability of Digital Formats: SQLite. “The Library of Congress Recommended Formats Statement (RFS) includes SQLite as a preferred format for datasets.”

# 28th May 2018, 5:19 pm / sqlite

Beginner’s Guide to Jupyter Notebooks for Data Science (with Tips, Tricks!) (via) If you haven’t yet got on the Jupyter notebooks bandwagon this should help. It’s the single biggest productivity improvement I’ve made to my workflow in a very long time.

# 24th May 2018, 1:58 pm / jupyter, data-science

Showdown: MySQL 8 vs PostgreSQL 10 (via) MySQL 8 makes comparisons between PostgreSQL and MySQL far more interesting, as it closes some of the key feature gaps. Meanwhile the PostgreSQL replication story (long one of MySQL’s key advantages) has improved dramatically in recent versions. This article offers a useful overview of the current differences, including diving into some of the less obvious implementation details that differ between the two.

# 23rd May 2018, 5:02 pm / databases, mysql, postgresql

Hynek Schlawack: Testing & Packaging (via) “How to ensure that your tests run code that you think they are running, and how to measure your coverage over multiple tox runs (in parallel!)”—Hynek makes a convincing argument for putting your packaged Python code in a src/ directory for ease of testing and coverage.

# 22nd May 2018, 10:12 pm / packaging, python, testing, hynek-schlawack

Observable: Downloading and Embedding Notebooks (via) Big news from the Observable team: firstly, they’ve released the open source runtime for their notebooks which means you can now execute the code from a notebook independently of their hosted service. On top of that they’ve constructed an elegant way of exporting and executing notebooks (or specific notebook cells) as ES6 modules and as installable npm package tarballs.

# 22nd May 2018, 12:14 pm / javascript, observable

New in Django 2.0: Database instrumentation. I missed this previously. Django 2.0 shipped with one of my most-wanted features: the ability to easily instrument database calls (for logging and metrics) without having to monkey-patch or run an entirely new database backend. Can’t wait to try this out.
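
As a rough sketch of what this looks like: Django 2.0's `connection.execute_wrapper()` takes a callable with the signature `(execute, sql, params, many, context)`. The `QueryTimer` class below is a hypothetical example of such a callable. It is written without importing Django so it can be read and exercised on its own, and the Django usage appears only in a comment.

```python
import time


class QueryTimer:
    """Callable with the wrapper signature Django passes to execute_wrapper()."""

    def __init__(self):
        self.queries = []  # list of (sql, seconds) tuples

    def __call__(self, execute, sql, params, many, context):
        start = time.perf_counter()
        try:
            # Delegate to the real execute (or the next wrapper in the chain).
            return execute(sql, params, many, context)
        finally:
            self.queries.append((sql, time.perf_counter() - start))


# Intended use inside a Django 2.0+ project (not run here):
#
#     from django.db import connection
#     timer = QueryTimer()
#     with connection.execute_wrapper(timer):
#         list(MyModel.objects.all())   # MyModel is hypothetical
#     print(timer.queries)
```

Recording in a `finally` block means queries that raise an exception are still timed, which tends to be what you want for logging and metrics.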

# 21st May 2018, 9:28 pm / django

VirtualKNN for SpatiaLite. This looks amazing: a special virtual table shipped as part of SpatiaLite 4.4.0 which implements a fast, R-Tree backed mechanism for finding the X nearest points against a geospatial database table. There’s just one catch: it’s only available in 4.4.0, but the most recent “stable” release of SpatiaLite is 4.3.0a from September 2015 so the version you get if you install from apt-get or homebrew doesn’t yet have this functionality. I’d love to figure out a neat way to package and distribute this along with Datasette. I’d also like to figure out a clean way to ship a more recent version of SQLite than the one that is currently packaged with Python 3 (3.16.2, where the latest SQLite release is 3.23.1).
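
To check which SQLite library a given Python build is linked against, the standard `sqlite3` module exposes the version directly. A minimal sketch follows; the helper names are hypothetical.

```python
import sqlite3


def sqlite_versions():
    """Return the SQLite library version as a string and as a comparable tuple."""
    return sqlite3.sqlite_version, sqlite3.sqlite_version_info


def supports(minimum):
    """True if the linked SQLite library is at least `minimum`, e.g. (3, 23, 1)."""
    return sqlite3.sqlite_version_info >= tuple(minimum)
```

Calling `supports((3, 23, 1))` answers whether the interpreter already has the release mentioned above, without needing the `sqlite3` command-line tool.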

# 21st May 2018, 9:23 pm / geospatial, spatialite, sqlite

sqlitebiter. Similar to my csvs-to-sqlite tool, but sqlitebiter handles “CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/SSV/TSV/Google-Sheets”. Most interestingly, it works against HTML pages: run “sqlitebiter -v url 'https://en.wikipedia.org/wiki/Comparison_of_firewalls'” and it will scrape that Wikipedia page and create a SQLite table for each of the HTML tables it finds there.

# 17th May 2018, 10:40 pm / csv, scraping, sqlite, datasette

sql.js Online SQL interpreter (via) This is fascinating: sql.js is a project that compiles the whole of SQLite to JavaScript using Emscripten. The demo is an online SQL interpreter which lets you import an existing SQLite database from your filesystem and run queries against it directly in your browser.

# 17th May 2018, 9:28 pm / javascript, sqlite

Django #8936: Add view (read-only) permission to admin (closed). Opened 10 years ago. Closed 15 hours ago. I apparently filed this issue during the first DjangoCon back in September 2008, when Adrian and Jacob mentioned on-stage that they would like to see a read-only permission for the Django Admin. Thanks to Olivier Dalang from Fiji and Petr Dlouhý from Prague it’s going to be a feature shipping in Django 2.1. Open source is a beautiful thing.

# 17th May 2018, 1:40 pm / django, django-admin, djangocon, open-source

How to number rows in MySQL. MySQL’s user variables can be used to add a “rank” or “row_number” column to a database query that shows the ranking of a row against a specific unique value. This means you can return the first N rows for any given column—for example, given a list of articles return just the first three tags for each article. I’ve recently found myself using this trick for a few different things—once you know it, chances to use it crop up surprisingly often.
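
The MySQL trick amounts to keeping a counter that resets whenever the grouping column changes. The same logic is sketched here in plain Python rather than SQL, so it shows the idea and not the MySQL syntax. The function name and the `(article_id, tag)` row shape are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def first_n_per_group(rows, n=3):
    """Number rows within each group and keep those ranked 1..n.

    `rows` is an iterable of (group_key, value) pairs, e.g. (article_id, tag).
    """
    ranked = []
    # sorted() is stable, so rows keep their input order within each group.
    ordered = sorted(rows, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        for row_number, (_, value) in enumerate(group, start=1):
            if row_number > n:
                break  # counter exceeded n; skip the rest of this group
            ranked.append((key, row_number, value))
    return ranked
```

Given tags for several articles, this returns at most three `(article_id, row_number, tag)` tuples per article, matching the "first three tags for each article" example.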

# 16th May 2018, 9:06 pm / mysql

isomorphic-git (via) A pure-JavaScript implementation of the git protocol and underlying tools which works both server-side (Node.js) AND in the client, using an emulation of the fs API. Given the right CORS headers it can clone a GitHub repository over HTTPS right into your browser. Impressive.

# 16th May 2018, 8:54 pm / git, javascript, cors
