Simon Willison’s Weblog

Blogmarks

TensorFlow 101. Concise, readable introduction to TensorFlow, with Python examples you can execute (and visualize) in Jupyter.

# 8th November 2017, 5:57 pm / python, tensorflow

spaCy. “Industrial-strength Natural Language Processing in Python”. Exciting alternative to nltk—spaCy is mostly written in Cython, makes bold performance claims and ships with a range of pre-built statistical models covering multiple different languages. The API design is clean and intuitive and spaCy even includes an SVG visualizer that works with Jupyter.

# 8th November 2017, 4:43 pm / nlp, python, spacy

Redis Streams and the Unified Log. In which Brandur Leach explores the new Kafka-style streams functionality coming to Redis 4.0, and shows an example of a robust at-least-once processing architecture built on a combination of Redis streams and PostgreSQL transactions. I really like the pattern of writing log records to a staging table in PostgreSQL first in order to bundle them up in the same transaction as the originating state change, then having a separate process read them from that table and publish them to Redis.

# 8th November 2017, 4:37 pm / postgresql, redis, brandur-leach

ZEIT – 6x Faster Now Uploads with HTTP/2 (via) Fantastic optimization write-up by Pranay Prakash. The Now deployment tool works by computing a hash for every local file in a project, then uploading just the ones that are missing. Pranay switched to uploading over HTTP/2 using the fetch-h2 library and got a 6x speedup for larger projects.
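
The "hash every file, upload only what is missing" step might look roughly like this. It is a sketch, not Now's actual code: the choice of SHA-1, the function names, and the idea that the server reports a set of digests it already holds are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_hashes(root):
    """Map each file's SHA-1 hex digest to its path relative to root."""
    root = Path(root)
    hashes = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            # Files with identical content share a digest and upload once.
            hashes[digest] = path.relative_to(root)
    return hashes


def files_to_upload(root, already_on_server):
    """Return relative paths whose content the server does not yet hold."""
    return sorted(
        path.as_posix()
        for digest, path in file_hashes(root).items()
        if digest not in already_on_server
    )
```

The upload itself (over HTTP/2 in the write-up) is left out; this only covers deciding which files need to go.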

# 8th November 2017, 1:04 am / nodejs, zeit-now, http2

Feature Visualization (via) Another gorgeous paper published on Distill, the journal that prides itself on including interactive visualizations to help provide clear explanations of machine learning.

# 7th November 2017, 8:48 pm / machine-learning, explorables

GOV.UK Registers (via) Canonical sources of “lists of information” intended for use by GDS teams building software for the UK government, but available for anyone. 17 registers are “ready for use”, 45 are “in progress”. Covers things like the FCO’s country list, the official list of prison estates, and DEFRA’s list of public bodies in England that manage drainage systems.

# 7th November 2017, 3:31 pm / datagov, government, open-data, gov-uk

Pull request #4120 · python/cpython. I just had my first ever change merged into Python! It was a one sentence documentation improvement (on how to cancel SQLite operations) but it was fascinating seeing how Python’s GitHub flow is set up—clever use of labels, plus a bot that automatically checks that you have signed a copy of their CLA.

# 7th November 2017, 2:06 pm / github, open-source, python, sqlite

Cloud SQL for PostgreSQL adds high availability and replication. Google Cloud Platform now offers PostgreSQL with automatic asynchronous disk-level replication to a separate instance in a different availability zone, via their new “Regional Disks” feature. Between this, Heroku, Citus and Amazon RDS the appeal of a self-maintained PostgreSQL instance continues to fall.

# 7th November 2017, 1:49 pm / google, highavailability, postgresql

Something is wrong on the internet. James Bridle takes a fascinating and deeply troubling dive into the world of Kids’ YouTube videos, which appear to be increasingly algorithmically generated and are evolving in a very dark direction.

# 7th November 2017, 12:40 pm / james-bridle, youtube

How Balanced does Database Migrations with Zero-Downtime. I’m fascinated by the idea of “pausing” traffic during a blocking site maintenance activity (like a database migration) and then un-pausing when the operation is complete, so end clients just see some of their requests taking a few seconds longer than expected. I first saw this trick described by Braintree. Balanced wrote about a neat way of doing this just using HAProxy, which lets you live reconfigure the maxconn setting for your backend down to zero (causing traffic to be queued up) and then bring the setting back up again a few seconds later to un-pause those requests.

# 7th November 2017, 11:36 am / haproxy, highavailability, http, migrations, scaling, zero-downtime

Secondary indexing with Redis. I haven’t seen this section of the official Redis documentation before, and it’s absolutely fantastic—well worth reading the whole thing. It talks through various ways in which you can set up indexes in Redis, mainly by leaning on sorted sets—which it turns out will binary lexicographically sort items with the same score. This makes it easy to implement autocomplete with Redis—but if you use them creatively you can implement subject/predicate/object graph searches or even N-dimensional range queries as well.
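
To make the autocomplete trick concrete, here is a small in-memory stand-in for a sorted set whose members all share the same score. In Redis itself you would `ZADD` each word with score 0 and query with `ZRANGEBYLEX`; the `LexIndex` class below is hypothetical and only emulates that ordering so the sketch runs without a server.

```python
import bisect


class LexIndex:
    """Emulates a Redis sorted set in which every member has score 0."""

    def __init__(self):
        # Kept sorted by raw bytes, which is how Redis orders equal scores.
        self._members = []

    def add(self, word):
        member = word.encode("utf-8")
        i = bisect.bisect_left(self._members, member)
        if i == len(self._members) or self._members[i] != member:
            self._members.insert(i, member)

    def complete(self, prefix, limit=10):
        # Roughly: ZRANGEBYLEX key "[prefix" "[prefix\xff" LIMIT 0 limit
        lo = prefix.encode("utf-8")
        hi = lo + b"\xff"  # 0xff never appears in valid UTF-8
        start = bisect.bisect_left(self._members, lo)
        end = bisect.bisect_right(self._members, hi)
        return [m.decode("utf-8") for m in self._members[start:end][:limit]]
```

Because the members are kept in byte order, every completion for a prefix sits in one contiguous range, which is why a single range query is enough.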

# 7th November 2017, 2 am / redis

I’m a Unicorn. I got to try out Animoji on an iPhone X, and it was amazing.

# 7th November 2017, 1:50 am / emoji

How technology helped a blind athlete run free at the New York Marathon. Fascinating piece on technology to help blind people better navigate the world, combining GPS and chest-mounted ultrasonic sonar.

# 6th November 2017, 4:58 pm / accessibility

Alt-texts: The Ultimate Guide. By Daniel Göransson, a web developer with vision impairment who uses a screen reader. This is the best, most practical guide to writing image alt text I’ve seen. Just one of the neat tips contained within: consider ending your alt text in a period, so the screen reader knows to pause.

# 6th November 2017, 4:54 pm / accessibility, alt-text

walrus. Fascinating collection of Python utilities for working with Redis, by Charles Leifer. There are a ton of interesting ideas in here. It starts with Python object wrappers for Redis so you can interact with lists, sets, sorted sets and Redis hashes using Python-like objects. Then it gets really interesting: walrus ships with implementations of autocomplete, rate limiting, a graph engine (using a sorted set hexastore) and an ORM-style models mechanism which manages secondary indexes and even implements basic full-text search.
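
The hexastore idea can be sketched in plain Python. This is not walrus's API: the `Hexastore` class and its key format are hypothetical, a sorted list stands in for the Redis sorted set, and values are assumed to contain no colons. Each triple is stored under all six orderings of subject, predicate and object, so any combination of known fields becomes a prefix scan.

```python
import bisect
from itertools import permutations


class Hexastore:
    def __init__(self):
        self._keys = []  # sorted list standing in for a Redis sorted set

    def add(self, s, p, o):
        triple = {"s": s, "p": p, "o": o}
        for order in permutations("spo"):
            # e.g. "spo:alice:knows:bob:" and "osp:bob:alice:knows:"
            key = "".join(order) + ":" + "".join(triple[c] + ":" for c in order)
            i = bisect.bisect_left(self._keys, key)
            if i == len(self._keys) or self._keys[i] != key:
                self._keys.insert(i, key)

    def query(self, s=None, p=None, o=None):
        given = {"s": s, "p": p, "o": o}
        known = [c for c in "spo" if given[c] is not None]
        order = known + [c for c in "spo" if given[c] is None]
        # Pick the ordering that puts the known fields first, then scan
        # the contiguous range of keys sharing that prefix.
        prefix = "".join(order) + ":" + "".join(given[c] + ":" for c in known)
        results = []
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            parts = dict(zip(order, self._keys[i].split(":")[1:4]))
            results.append((parts["s"], parts["p"], parts["o"]))
            i += 1
        return results
```

The cost is six index entries per triple; the benefit is that no query shape needs a full scan.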

# 6th November 2017, 1:14 am / python, redis, charles-leifer

direnv (via) A shell extension (for bash, zsh and others) which can automatically set and unset environment variables when you cd into specific directories. Useful for managing things like a project’s GOPATH or automatically activating Python virtual environments.

# 5th November 2017, 7:59 pm / bash, shell, zsh

Landsat on AWS (via) TIL Amazon make data from the Landsat 8 satellite available for free on S3 (though they are no doubt hoping you’ll pay for EC2 instances to process the data). “All new Landsat 8 scenes are made available each day, often within hours of production. The satellite images the entire Earth every 16 days at a roughly 30 meter resolution”.

# 5th November 2017, 7:56 pm / aws, satellite

Try hosting on PyPy by simonw. I had a go at hosting my blog on PyPy. Thanks to the combination of Travis CI, Sentry and Heroku it was pretty easy to give it a go—I had to swap psycopg2 for psycopg2cffi and switch to the currently undocumented pypy3-5.8.0 Heroku runtime (pypy3-5.5.0 is only compatible with Python 3.3, which Django 2.0 does not support). I ran it in production for a few minutes and didn’t get any Sentry errors but did end up using more Heroku dyno memory than I’m comfortable with—see the graph I posted in a comment. I’m going to stick with CPython 3.6 for the moment. Amusingly I did almost all of the work on this on my phone! Travis CI means it’s easy to create and test a branch through GitHub’s web UI, and deploying a tested branch to Heroku is then just a button click.

# 5th November 2017, 7:17 pm / pypy, python, heroku, travis, sentry

Super Fast String Matching in Python (via) Interesting technique for calculating string similarity at scale in Python, with much better performance than Levenshtein distances. The trick here uses TF/IDF against N-Grams, plus a CSR (Compressed Sparse Row) scipy matrix to run the calculations. Includes clear explanations of each of these concepts.
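
A dependency-free sketch of the core idea: TF/IDF weights over character trigrams, compared by cosine similarity. The article gets its speed from a scipy CSR matrix; here plain dicts act as the sparse vectors, so this shows the technique rather than the performance. The function names and the particular smoothed IDF formula are choices made for this example.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalised sparse vector (dict) per input string."""
    counts = [Counter(ngrams(name, n)) for name in names]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(names)
    vectors = []
    for c in counts:
        # Smoothed IDF: n-grams that appear in few strings weigh more.
        vec = {
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())
```

For example, `tfidf_vectors(["McDonalds Corp", "McDonald's Corporation", "Burger King"])` yields vectors where the first two score far closer to each other than either does to the third.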

# 5th November 2017, 3:26 pm / python, scipy

Docker.qcow2 never shrinks—disk space usage leak in docker for mac (via) Interesting year-long thread on disk usage by Docker for Mac, including a bunch of potential workarounds for if it swallows too much disk space.

# 5th November 2017, 3:06 pm / docker

On Being A Senior Engineer. Thoughts on characteristics of a mature engineer from John Allspaw back in 2012. So much good thinking in here—my favourite piece of writing on the subject.

# 5th November 2017, 6:16 am / john-allspaw, careers

Docker Containers on the Desktop (via) Jessie Frazelle’s classic explanation from 2015 of how she runs every desktop application on her Linux machine in its own Docker container.

# 5th November 2017, 4:16 am / linux, docker

gillyb/sensitive: A native desktop version of the kibana sense plugin. I love using the Sense UI for developing against Elasticsearch, but it’s infuriatingly hard to obtain these days. You can install it as a Kibana plugin but I work with multiple Elasticsearch instances and I don’t want to have to get it installed on all of them. Until recently I was using a Chrome extension for it, but that’s now been disabled as containing malware and removed from the Chrome extension store. I’ve now switched to Sensitive, which packages Sense up as a native OS X application using Electron.

# 4th November 2017, 7:35 pm / elasticsearch, electron

Animoji karaoke performing Bohemian Rhapsody (via) Animoji just might be the most important advance in computer science in a decade.

# 4th November 2017, 7:29 pm / emoji

My blog: Items tagged “askmetafilter”. I’ve imported all 55 of my answers to questions on Ask MetaFilter (to accompany my previous Quora import) going back to 2005.

# 4th November 2017, 4:43 am / ask-metafilter

The Story Behind the Chicago Newspaper That Bought a Bar (via) Absolutely fascinating story—the Chicago Sun-Times bought a bar back in 1976 to investigate corrupt city inspectors, staffing it with journalists and with photographers hidden in a back room.

# 3rd November 2017, 3:27 pm / journalism

Connecting to Google Sheets with Python. Useful guide to interacting with Google Sheets via the gspread python library, including how to work with Google’s unintuitive “service account keys”.

# 3rd November 2017, 4:13 am / googlespreadsheet, python

How Adversarial Attacks Work. Adversarial attacks against machine learning classifiers involve constructing an input that deliberately produces the wrong classification. This article shows how these can be constructed, and includes examples generated using PyTorch which produce a sports car that gets identified as a toaster and a photo of Sylvester Stallone that gets classified as Keanu Reeves.
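
One common way to construct such an input is the fast gradient sign method: nudge every input feature by a small step in the direction that increases the classifier's loss. The sketch below applies it to a tiny hand-written logistic regression rather than the PyTorch image models from the article; the weights, inputs and function names are made up for illustration.

```python
import math


def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w, b, x, y, eps):
    """Shift each feature of x by eps in the loss-increasing direction."""
    p = predict(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0          # hypothetical trained weights
x, y = [0.5, 0.2], 1             # an input correctly classified as class 1
x_adv = fgsm(w, b, x, y, eps=0.5)
```

Here `predict(w, b, x)` is above 0.5 while `predict(w, b, x_adv)` falls below it, even though no feature moved by more than `eps`.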

# 2nd November 2017, 8:25 pm / machine-learning, python

A Minimalist Guide to SQLite. Pretty comprehensive actually—covers the sqlite3 command line app, importing CSVs, integrating with Python, Pandas and Jupyter notebooks, visualization and more.

# 2nd November 2017, 1:23 am / pandas, python, sqlite, jupyter

Hemingway Editor. Hemingway is a web-based editor that applies style checks to your writing. It looks for complicated words, unnecessary adverbs and sentences that are hard to read. It highlighted the previous sentence as hard to read. It gave this whole paragraph a Grade 8 readability score.

# 1st November 2017, 8:38 pm / writing
