Simon Willison’s Weblog

January 2018

Jan. 2, 2018

Most infosec bugs are really boring after a while. But processor ones are always crazy and fascinating because processors are basically a hornet's nest of witchcraft and mayhem stacked on top of each other all the way down.

Matt Tait

# 9:54 am / security

Jan. 8, 2018

Himalayan Database: From Visual FoxPro GUI to JSON API with Datasette (via) The Himalayan Database is a compilation of records for all expeditions that have climbed in the Nepalese Himalaya, originally compiled by journalist Elizabeth Hawley over several decades. The database is published as a Visual FoxPro database—here Raffaele Messuti provides step-by-step instructions for extracting the data from the published archive, converting it to CSV using dbfcsv and then converting the CSVs to SQLite using csvs-to-sqlite so you can browse them using Datasette.

# 7:26 pm / csv, datasette

Statistical NLP on OpenStreetMap. libpostal is ferociously clever: it’s a library for parsing and understanding worldwide addresses, built on top of a machine learning model trained on millions of addresses from OpenStreetMap. Al Barrentine describes how it works in this fascinating and detailed essay.

# 7:33 pm / machine-learning, nlp, openstreetmap

[On Meltdown's impact on hosting costs] The reality is that we have been living with borrowed performance. The new reality is that security is too important and can not be exchanged for speed. Time to profile, tune and optimize.

Miguel de Icaza

# 7:35 pm / migueldeicaza, security

csvkit. “A suite of command-line tools for converting to and working with CSV”—includes a huge range of utilities for things like converting Excel and JSON to CSV, grepping, sorting and extracting a subset of columns, combining multiple CSV files together and exporting CSV to a relational database. Worth reading through the tutorial which shows how the different commands can be piped together.

# 9:03 pm / csv, datasette

Jan. 9, 2018

ftfy—fix unicode that’s broken in various ways (via) I shipped a small web UI wrapper around the excellent Python FTFY library, which can take broken unicode strings and suggest a sequence of operations that can be applied to get back sensible text.
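
To give a feel for the kind of operation involved, here is a minimal standard-library sketch of one common repair: text whose UTF-8 bytes were decoded with the wrong codec. It does not use ftfy's API, and the `fix_mojibake` helper and its `wrong_encoding` parameter are hypothetical names invented for this illustration. The real library detects and chains many more fixes than this.

```python
def fix_mojibake(text: str, wrong_encoding: str = "windows-1252") -> str:
    """Undo one kind of breakage: UTF-8 bytes decoded with the wrong codec.

    Hypothetical helper for illustration only; this is not part of ftfy.
    """
    try:
        # Re-encode with the codec that was wrongly used, then decode as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not broken in this particular way, so leave it alone.
        return text


# "It’s" after its UTF-8 bytes were misread as Windows-1252.
broken = "It\u00e2\u20ac\u2122s"
fixed = fix_mojibake(broken)
print(fixed)
```

Strings that cannot be round-tripped this way are returned unchanged, so the helper is safe to call on text that was never broken.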

# 3:22 am / unicode, projects, zeit-now

Jan. 10, 2018

How to compile and run the SQLite JSON1 extension on OS X. Thanks, Stack Overflow! I’ve been battling this one for a while—it turns out you can download the SQLite source bundle, compile just the json1.c file using gcc and load that extension in Python’s sqlite3 module (or with Datasette’s --load-extension= option) to gain access to the full suite of SQLite JSON functions—json(), json_extract() etc.
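
A minimal sketch of the Python side, assuming the extension has already been compiled. The `open_with_json1` helper and the `./json1` path mentioned in the comment are hypothetical; `enable_load_extension()` and `load_extension()` are the standard `sqlite3` methods, though some Python builds are compiled without extension loading. Many current SQLite builds ship the JSON functions by default, in which case no extension path is needed.

```python
import sqlite3


def open_with_json1(extension_path=None):
    """Open an in-memory database, optionally loading a compiled extension.

    Hypothetical helper: extension_path would point at the library built
    from json1.c (for example "./json1"); None assumes the JSON functions
    are already compiled into this SQLite build.
    """
    conn = sqlite3.connect(":memory:")
    if extension_path is not None:
        conn.enable_load_extension(True)
        conn.load_extension(extension_path)
        # Switch loading off again so later SQL cannot load arbitrary code.
        conn.enable_load_extension(False)
    return conn


conn = open_with_json1()
doc = '{"name": "datasette", "tags": ["sqlite", "json"]}'
row = conn.execute(
    "select json_extract(?, '$.name'), json_extract(?, '$.tags[1]')",
    (doc, doc),
).fetchone()
print(row)
```

If the query fails with "no such function: json_extract", the build lacks the JSON functions and the compiled extension path needs to be passed in.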

# 9:01 pm / osx, sqlite, datasette, json, stackoverflow

Jan. 11, 2018

Incident report: npm. Fascinating insight into the challenges involved in managing a massive scale community code repository. An algorithm incorrectly labeled a legit user as spam, an NPM staff member acted on the report, dependent package installations started failing and because the package had been removed as spam other users were able to try and fix the bug by publishing fresh copies of the missing package to the same namespace.

# 5:27 pm / spammers, security, npm

Jan. 13, 2018

Notes on Kafka in Python. Useful review by Matthew Rocklin of the three main open source Python Kafka client libraries as of October 2017.

# 7:40 pm / python, kafka

Telling stories through your commits. Joel Chippendale’s excellent guide to writing a useful commit history. I spend a lot of time on my commit messages, because when I’m trying to understand code later on they are the only form of documentation that is guaranteed to remain up-to-date against the code at that exact point in time. These tips are clear, concise, readable and include some great examples.

# 7:44 pm / sourcecontrol, git

Jan. 14, 2018

How the industry-breaking Spectre bug stayed secret for seven months. It’s pretty amazing that the bug only became public knowledge a week before the intended embargo date, considering the number of individuals and companies that had to be looped in. The biggest public clues were patches being applied in public to the Linux kernel—one smart observer noted that the page table issue “has all the markings of a security patch being readied under pressure from a deadline.”

# 4:53 pm / security

A SIM Switch Account Takeover (Mine). Someone walked into a T-Mobile store with a fake ID in his name and stole Albert Wenger’s SIM identity, then used it to gain access to his Yahoo mail account, reset his Twitter password and post a tweet boosting a specific cryptocurrency. His accounts with Google Authenticator 2FA stayed safe.

# 8:37 pm / identitytheft, security, sms

Jan. 17, 2018

Datasette Publish: a web app for publishing CSV files as an online database

I’ve just released Datasette Publish, a web tool for turning one or more CSV files into an online database with a JSON API.

[... 863 words]

API 2.0: Log-In with ZEIT, New Docs & More. Here’s Zeit’s write-up of their brand new API 2.0, which adds OAuth support and allows anything that can be done with their command-line tools to be achieved via their public API as well. This is the enabling technology that allowed me to build Datasette Publish.

# 3:23 pm / zeit-now

Generating polygon representing a rough 100km circle around latitude/longitude point using Python. A question I posted to the GIS Stack Exchange—I found my own answer using a Python library called geog, then someone else posted a better solution using pyproj.
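
For a rough idea of the geometry, here is a standard-library sketch that uses neither geog nor pyproj. It treats the Earth as a sphere with a 6371 km radius (an assumption, so expect small errors compared with an ellipsoidal calculation) and walks around the compass using the great-circle destination formula. The function names are my own.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def circle_polygon(lat, lon, radius_km=100.0, num_points=36):
    """Return a closed ring of (longitude, latitude) pairs around a point."""
    lat1 = math.radians(lat)
    lon1 = math.radians(lon)
    d = radius_km / EARTH_RADIUS_KM  # angular distance in radians
    ring = []
    for i in range(num_points):
        bearing = 2 * math.pi * i / num_points
        lat2 = math.asin(
            math.sin(lat1) * math.cos(d)
            + math.cos(lat1) * math.sin(d) * math.cos(bearing)
        )
        lon2 = lon1 + math.atan2(
            math.sin(bearing) * math.sin(d) * math.cos(lat1),
            math.cos(d) - math.sin(lat1) * math.sin(lat2),
        )
        # Normalise longitude into the range [-180, 180).
        lon_deg = (math.degrees(lon2) + 540) % 360 - 180
        ring.append((lon_deg, math.degrees(lat2)))
    ring.append(ring[0])  # repeat the first point to close the ring
    return ring


def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance between two points on the same sphere."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    dphi = phi_b - phi_a
    dlam = math.radians(lon_b - lon_a)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi_a) * math.cos(phi_b) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


ring = circle_polygon(37.7749, -122.4194)
```

The points come back in (longitude, latitude) order, which is the order GeoJSON expects, and `haversine_km` is only there to sanity-check that each vertex sits about 100km from the centre.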

# 8:57 pm / gis, python

Jan. 18, 2018

The biggest bottleneck in web performance today is CPU. Compared to seven years ago, there’s 5x more JavaScript downloaded on the top 1000 websites over the last seven years, and 3x more CSS. Half of web activity comes from mobile devices with a smaller CPU and limited battery power.

Steve Souders

# 2:39 pm / steve-souders, web-performance

Jan. 19, 2018

GaretJax/django-click (via) I’ve been using Click to write command-line tools in Python recently (both datasette and csvs-to-sqlite use it) and it’s a delightful way of composing simple and complex CLI interfaces. I’ve always found Django’s default management command syntax hard to fit in my head—django-click means I can combine the two.

# 11:19 pm / django

Jan. 20, 2018

How to turn a list of JSON objects into a Datasette. ramadis on GitHub cleaned up data on 184,879 crimes reported in Buenos Aires since 2016 and shared them on GitHub as a JSON file. Here are my notes on how to use Pandas to convert JSON into SQLite and publish it using Datasette.
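
The linked notes use Pandas. As a dependency-free alternative, here is a sketch of the same idea using only the standard library: read a list of JSON objects and write them into a SQLite table. The `json_to_sqlite` helper, the `crimes` table name and the sample field names are all invented for illustration, and the sketch assumes flat objects whose values are strings, numbers or null.

```python
import json
import sqlite3


def quote_identifier(name):
    """Quote a table or column name for use in SQLite SQL."""
    return '"{}"'.format(name.replace('"', '""'))


def json_to_sqlite(records, conn, table="crimes"):
    """Create a table from a list of flat dicts and insert every record."""
    # Collect column names in first-seen order across all records.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    conn.execute(
        "create table {} ({})".format(
            quote_identifier(table),
            ", ".join(quote_identifier(c) for c in columns),
        )
    )
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        "insert into {} values ({})".format(quote_identifier(table), placeholders),
        # Keys missing from a record are stored as NULL.
        ([record.get(c) for c in columns] for record in records),
    )
    conn.commit()
    return columns


# Made-up sample data standing in for the real JSON file.
records = json.loads(
    '[{"id": 1, "barrio": "Palermo", "tipo": "Hurto"},'
    ' {"id": 2, "barrio": "Recoleta"}]'
)
conn = sqlite3.connect(":memory:")
columns = json_to_sqlite(records, conn)
count = conn.execute('select count(*) from "crimes"').fetchone()[0]
```

Swap `":memory:"` for a filename to get a database file on disk that Datasette can then serve.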

# 1:07 am / datasette, pandas, json, sqlite

Jan. 21, 2018

Nicaraguan Address System (via) “Instead of street names or numbers Nicaraguans use reference points from where they start describing a certain address. [...] There are instances, however, in which the reference points do not exist anymore!”

# 4:32 pm / gis

Jan. 25, 2018

I spent more time on my iPhone X review than anything I’ve written in years, and it went to paper twice. (Here’s a scan of my second printed draft, with handwritten revisions.) My thing is that I don’t use my favorite pen — which, of course, has black ink — but instead a pen with red ink. Editing is an angry, bloody act and therefore must be done in red.

John Gruber

# 1:43 pm / john-gruber, writing

Jan. 26, 2018

django-postgres-copy (via) Really neat Django queryset add-on which exposes the PostgreSQL COPY statement for importing (and exporting) CSV data. MyModel.objects.from_csv("filename.csv"). Built by the team of data journalists at the California Civic Data Coalition.

# 12:43 am / csv, postgresql, django

Domains Search for Web: Instant, Serverless & Global (via) The team at Zeit are pioneering a whole bunch of fascinating web engineering architectural patterns. Their new domain name autocomplete search uses Next.js and server-side rendering on first load, then switches to client-side rendering from then on. It can then load results asynchronously over a custom WebSocket protocol as the microservices on the backend finish resolving domain availability from the various different TLD providers.

# 1:14 am / zeit-now, microservices, websockets

Jan. 27, 2018

How did the Roman Republic determine its budget? Fascinating answer on the AskHistorians subreddit about how taxation worked in the Roman Republic. Since the republic was almost permanently at war, and was very good at it, no taxes were levied on Roman citizens in Italy from 167 B.C. onwards.

# 4:51 pm / historians

Jan. 28, 2018

Analyzing my Twitter followers with Datasette

I decided to do some ad-hoc analysis of my social network on Twitter this afternoon… and since everything is more fun if you bundle it up into a SQLite database and publish it to the internet, I performed the analysis using Datasette.

[... 1,314 words]

If I tweeted a throwaway comment in appreciation for McDonald’s apple pies and some other randos on Twitter happened to also tweet similar thoughts over the last few months, it doesn’t mean by extrapolation that ‘Millennials Can’t Get Enough Of McDonald’s Apple Pies’.  The Twitter search box is not a polling agency and Twitter doesn’t include everybody’s thoughts on everything. Just some people’s thoughts on some things.

Nick Walker

# 4:18 pm / journalism, twitter

6M observations total! Where has iNaturalist grown in 80 days with 1 million new observations? Citizen science app iNaturalist is seeing explosive growth at the moment—they’ve been around for nearly a decade but 1/6 of the observations posted to the site were added in just the past few months. Having tried the latest version of their iPhone app it’s easy to see why: snap a photo of some nature and upload it to the app and it will use surprisingly effective machine learning to suggest the genus or even the individual species. Submit the observation and within a few minutes other iNaturalist community members will confirm the identification or suggest a correction. It’s brilliantly well executed and an utter delight to use.

# 8:18 pm / crowdsourcing, machine-learning, computer-vision, science, citizenscience, inaturalist

Jan. 29, 2018

SQLite: The Spellfix1 Virtual Table (via) A SQLite extension that lets you create a spellfix1 virtual table which can power “fuzzy” search, by suggesting corrections for misspelled words. I haven’t tried this yet but it looks pretty powerful, including a configurable edit distance and the ability to set up custom “soundslike” terms for words with known unusual spellings.
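
To illustrate the general idea of edit-distance based suggestions, here is a plain Python sketch using Levenshtein distance. This is not how spellfix1 is implemented and it does not touch the extension at all. The `suggest` helper, its `max_distance` parameter and the sample vocabulary are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute
                )
            )
        previous = current
    return previous[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, candidate), candidate) for candidate in vocabulary)
    return [candidate for distance, candidate in scored if distance <= max_distance]


vocabulary = ["datasette", "database", "dataset", "sqlite"]
suggestions = suggest("datasete", vocabulary)
print(suggestions)
```

Raising `max_distance` widens the net, which is the same trade-off a configurable edit distance gives you: more corrections found, but more false matches too.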

# 5:24 am / sqlite

Datasette Demo (video) from the SF Python Meetup

I gave a short talk about Datasette last month at the SF Python Meetup Holiday Party. They’ve just posted the video, so here it is:

[... 63 words]

[On SQLite] The JSON interface is like, "we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet."

Paul Ford

# 4:29 pm / json, paul-ford, sqlite

Jan. 31, 2018

Observable Beta (via) Observable just released their beta, and it’s quite something. It’s by Mike Bostock (d3), Jeremy Ashkenas (Backbone, CoffeeScript) and Tom MacWright (Mapbox Studio). The easiest way to describe it is Jupyter notebooks for JavaScript supporting reactive programming—so code is evaluated as you type and you can add interactive widgets (like sliders and canvas views) to construct explorable visualizations on the fly.

# 4:46 pm / jupyter, d3, javascript, observable, jeremy-ashkenas, mike-bostock, tom-macwright
