Simon Willison’s Weblog


April 2018

April 2, 2018

rubber-docker/linux.c. rubber-docker is a workshop that talks through building a simple Docker clone from scratch in Python. I particularly liked this detail: linux.c is a Python extension written in C that exposes a small collection of Linux syscalls that are needed for the project: clone, mount, pivot_root, setns, umount, umount2 and unshare. Just reading through this module gives a really nice overview of how some of Docker’s underlying magic actually works.

# 6:18 pm / docker, python

April 3, 2018

Ask HN: What are the best MOOCs you’ve taken? Most useful Hacker News thread I’ve seen in a while: a torrent of great recommendations for online courses to learn everything from machine learning to astrophysics to songwriting.

# 5:17 pm / hacker-news, education

gron. Ingenious tool for working with JSON on the command line: run “gron URL/filepath” to transform a JSON document into a multi-line assignment structure designed to be easy to run grep against. Grep it, then pipe it back into “gron --ungron” to convert the filtered data back to JSON again. It solves a similar problem to jq—which is addressed in the README: “gron’s primary purpose is to make it easy to find the path to a value in a deeply nested JSON blob when you don’t already know the structure; much of jq’s power is unlocked only once you know that structure”.
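
To make the flatten-then-grep idea concrete, here is a minimal Python sketch of the transformation. It is an approximation and not gron's actual implementation or exact output format: the `gron_lines` helper and the sample document are made up for illustration, and the rule for choosing dot versus bracket notation is an assumption.

```python
import json


def gron_lines(value, path="json"):
    """Yield one assignment line per node of a parsed JSON value."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key in sorted(value):
            # Assumption: dot notation for identifier-like keys, brackets otherwise.
            if key.isidentifier():
                child = f"{path}.{key}"
            else:
                child = f"{path}[{json.dumps(key)}]"
            yield from gron_lines(value[key], child)
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, item in enumerate(value):
            yield from gron_lines(item, f"{path}[{index}]")
    else:
        # Scalars (strings, numbers, booleans, null) are emitted as JSON literals.
        yield f"{path} = {json.dumps(value)};"


# Hypothetical sample document.
doc = json.loads('{"user": {"name": "Ada", "tags": ["x", "y"]}}')
lines = list(gron_lines(doc))

# The "grep" step: every line carries its full path, so a plain substring match works.
matches = [line for line in lines if "name" in line]
print("\n".join(matches))
```

Because each line is a complete path plus a value, a filtered subset still holds enough information to rebuild a smaller JSON document, which is what makes the round trip back through “gron --ungron” possible.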

# 9:16 pm / json, jq

April 6, 2018

Parsing CSV using ANTLR and Python 3. I’ve been trying to figure out how to use ANTLR grammars from Python—this is the first example I’ve found that really clicked for me.

# 2:33 pm / antlr, csv, parsing, python

Typesense (via) A new (to me) open source search engine, with a focus on being “typo-tolerant” and offering great, fast autocomplete, which is incredibly important now that most searches take place using a mobile phone keyboard. Similar to Elasticsearch or Solr in that it runs as an HTTP server that you send JSON to via POST and GET, and it offers read-only replicas for scaling and high availability. And since it’s 2018, if you have Docker running (I use Docker for Mac) you can start up a test instance with a one-line shell command.

# 5:07 pm / open-source, search, autocomplete

April 7, 2018

Cookies-over-HTTP Bad (via) Mike West from the Chrome security team proposes a way for browsers to start discouraging the use of tracking cookies sent over HTTP—which represent a significant threat to user privacy from network attackers. It’s a clever piece of thinking: browsers would slowly ramp up the forced expiry deadline for non-HTTPS cookies, further encouraging sites to switch to HTTPS cookies while giving them ample time to adapt.

# 2:39 pm / privacy, cookies, https

April 8, 2018

Scientific results today are as often as not found with the help of computers. That’s because the ideas are complex, dynamic, hard to grab ahold of in your mind’s eye. And yet by far the most popular tool we have for communicating these results is the PDF—literally a simulation of a piece of paper. Maybe we can do better.

James Somers

# 1:14 pm / ipython, science, jupyter

awesome-falsehood: Curated list of falsehoods programmers believe in (via) I really like the general category of “falsehoods programmers believe”, and Kevin Deldycke has done an outstanding job curating this collection. Categories covered include date and time, email, human identity, geography, addresses, internationalization and more. This is a particularly good example of the “awesome lists” format in that each link is accompanied by a useful description.

# 7:57 pm / programming, internationalisation

April 9, 2018

So Fishing Times’s ad department is selling access to the prime Fishing Times readership. But the Data Lords can say, ‘we can show your ad just to Fishing Times readers when they’re on Facebook, or on some meme site, on the Times or TPM or really anywhere.’ Because the Data Lords have the data and they can track and target you. The publication’s role as the gatekeeper to an audience is totally undercut because the folks who control the data and the targeting can follow those readers anywhere and purchase the ads at the lowest price.

Josh Marshall

# 3:16 pm / advertising

Datasette 0.15: sort by column (via) I’ve released the latest version of Datasette to PyPI. The key new feature is the ability to sort tables by column, using clickable column headers or directly via the new _sort= and _sort_desc= querystring parameters.
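
A small sketch of building those URLs in Python. Only the `_sort` and `_sort_desc` parameter names come from the release; the host, database, table and column names here are hypothetical, as is the `sorted_table_url` helper.

```python
from urllib.parse import urlencode


def sorted_table_url(table_url, column, descending=False):
    """Append a _sort= or _sort_desc= querystring parameter to a table URL."""
    param = "_sort_desc" if descending else "_sort"
    return f"{table_url}?{urlencode({param: column})}"


# Hypothetical local instance, database, table and column.
url = sorted_table_url(
    "http://localhost:8001/mydb/mytable", "created", descending=True
)
print(url)
```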

# 5:25 pm / projects, datasette

elasticsearch-dump. Neat open source utility by TaskRabbit for importing and exporting data in bulk from Elasticsearch. It can copy data from one Elasticsearch cluster directly to another or to an intermediary file, making it a swiss-army knife for migrating data around. I successfully used the “docker run” incantation to execute it without needing to worry about having the correct version of Node.js installed.

# 10:10 pm / docker, nodejs, elasticsearch

April 10, 2018

Deckset for Mac (via) $29 desktop Mac application that creates presentations using a cleverly designed markdown dialect. You edit the underlying markdown in your standard text editor and the Deckset app shows a preview of the presentation and lets you hit “play” to run it or export it as a PDF.

# 9:34 pm / markdown, presentations

GitHub for Nonprofits (via) TIL GitHub provide legally recognized nonprofits with free organization accounts with unlimited users and unlimited private repos—and they’ve registered 30,000 nonprofit accounts through the program as of May 2017.

# 9:55 pm / github

April 11, 2018

Visualizing disk IO activity using log-scale banded graphs (via) This is a neat data visualization trick: to display rates of disk I/O, it splits the rate into a GB, MB and KB section on a stacked chart. This means that if you are getting jitter in the order of KBs even while running at 400+MB/second you can see the jitter in the KB section.
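
Here is my rough guess at the arithmetic behind the bands, as a Python sketch. I am assuming decimal units (1 KB = 1,000 bytes) and a simple divmod split; the original article may compute its bands differently, and `split_into_bands` is a name I made up.

```python
def split_into_bands(bytes_per_second):
    """Split a byte rate into whole GB, MB and KB components (decimal units)."""
    gb, remainder = divmod(bytes_per_second, 1000 ** 3)
    mb, remainder = divmod(remainder, 1000 ** 2)
    kb = remainder // 1000  # leftover bytes below 1 KB are dropped
    return {"GB": gb, "MB": mb, "KB": kb}


# Two hypothetical samples that differ by only 40 KB/second.
first = split_into_bands(412_345_678)
second = split_into_bands(412_385_678)
print(first, second)
```

The two samples share the same MB value, so the difference only shows up in the KB band, which is where the chart would make it visible.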

# 5:04 pm / visualization

April 12, 2018

The Academic Vanity Honeypot phishing scheme. Twitter thread describing a nasty phishing attack where an academic receives an email from a respected peer congratulating them on a recent article and suggesting further reading. The further reading link is a phishing site that emulates the victim’s institution’s login page.

# 3:07 pm / security, phishing

Mozilla Telemetry: In-depth Data Pipeline (via) Detailed behind-the-scenes look at an extremely sophisticated big data telemetry processing system built using open source tools. Some of this is unsurprising (S3 for storage, Spark and Kafka for streams) but the details are fascinating. They use a custom nginx module for the ingestion endpoint and have a “tee” server written in Lua and OpenResty which lets them route some traffic to alternative backends.

# 3:44 pm / nginx, big-data, analytics, mozilla, lua, kafka

Wireless Telegraphy Register (via) Russ Garrett used Datasette to build a browsable interface to the UK’s register of business radio licenses, using data from Ofcom.

# 4:08 pm / datasette

What do you mean “average”? (via) Lovely example of an interactive explorable demonstrating mode/mean/median, built as an Observable notebook using D3.
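
The same three measures can be tried out with Python's standard `statistics` module. The numbers below are invented to show how a single outlier pulls the mean away from the other two.

```python
from statistics import mean, median, mode

# Hypothetical salaries (in thousands) with one large outlier.
salaries = [30, 30, 30, 40, 50, 60, 250]

average = mean(salaries)    # sum of values divided by their count
middle = median(salaries)   # middle value once sorted
commonest = mode(salaries)  # most frequently occurring value

print(average, middle, commonest)
```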

# 4:41 pm / d3, observable, explorables

April 14, 2018

Datasette 0.18: units (via) This release features the first Datasette feature that was entirely designed and implemented by someone else (yay open source)—Russ Garrett wanted unit support (Hz, ft etc) for his Wireless Telegraphy Register project. It’s a really neat implementation: you can tell Datasette what units are in use for a particular database column and it will display the correct SI symbols on the page. Specifying units also enables unit-aware filtering: if Datasette knows that a column is measured in meters you can now query it for all rows that are less than 50 feet for example.
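
The conversion behind a query like that is simple arithmetic. This sketch shows only the idea and is not Datasette's implementation; the column values are hypothetical.

```python
FEET_TO_METERS = 0.3048  # one international foot is defined as exactly 0.3048 m

# Hypothetical column values, stored in meters.
heights_m = [5.0, 12.5, 20.0, 100.0]

# "Less than 50 feet" becomes a threshold in the column's own unit.
limit_m = 50 * FEET_TO_METERS
below_limit = [height for height in heights_m if height < limit_m]
print(limit_m, below_limit)
```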

# 3:56 pm / open-source, datasette

April 15, 2018

The way I would talk about myself as a senior engineer is that I’d say “I know how I would solve the problem” and because I know how I would solve it I could also teach someone else to do it. And my theory is that the next level is that I can say about myself “I know how others would solve the problem”. Let’s make that a bit more concrete. You make that sentence: “I can anticipate how the API choices that I’m making, or the abstractions that I’m introducing into a project, how they impact how other people would solve a problem.”

Malte Ubl

# 5:23 pm / api-design, careers

April 17, 2018

Datasette 0.19: Plugins Documentation (via) I’ve released the first preview of Datasette’s new plugin support, which uses the pluggy package originally developed for py.test. So far the only two plugin hooks are for SQLite connection creation (allowing custom SQL functions to be registered) and Jinja2 template environment initialization (for custom template tags), but this release is mainly about exercising the plugin registration mechanism and starting to gather feedback. Lots more to come.
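
To illustrate what the connection hook makes possible, here is a standalone sketch using only Python's `sqlite3` module. It deliberately leaves out the pluggy registration step: `register_custom_functions` is a hypothetical stand-in for the body of a plugin hook, and `reverse_string` is an invented SQL function.

```python
import sqlite3


def reverse_string(text):
    """Return the input string reversed."""
    return text[::-1]


def register_custom_functions(conn):
    # Hypothetical stand-in for what a connection-creation hook might do:
    # expose a Python callable as a one-argument SQL function.
    conn.create_function("reverse_string", 1, reverse_string)


conn = sqlite3.connect(":memory:")
register_custom_functions(conn)
result = conn.execute("select reverse_string('datasette')").fetchone()[0]
print(result)
```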

# 3:59 am / datasette, plugins

A rating system for open data proposed by Tim Berners-Lee, founder of the World Wide Web. To score the maximum five stars, data must (1) be available on the Web under an open licence, (2) be in the form of structured data, (3) be in a non-proprietary file format, (4) use URIs as its identifiers (see also RDF), (5) include links to other data sources (see linked data). To score 3 stars, it must satisfy all of (1)-(3), etc.

Five stars of open data

# 4:20 am / opendata, tim-berners-lee

Suppose a runaway success novel/tv/film franchise has "Bob" as the evil bad guy. Reams of fanfictions are written with "Bob" doing horrible things. People endlessly talk about how bad "Bob" is on twitter. Even the New York times writes about Bob latest depredations, when he plays off current events.

Your name is Bob. Suddenly all the AIs in the world associate your name with evil, death, killing, lying, stealing, fraud, and incest. AIs silently, slightly ding your essays, loan applications, uber driver applications, and everything you write online. And no one believes it's really happening. Or the powers that be think it's just a little accidental damage because the AI overall is still, overall doing a great job of sentiment analysis and fraud detection.

Daniel Von Fange

# 8:51 pm / machine-learning

Text Embedding Models Contain Bias. Here’s Why That Matters (via) Excellent discussion from the Google AI team of the enormous challenge of building machine learning models without accidentally encoding harmful bias in a way that cannot be easily detected.

# 8:54 pm / machine-learning, ai, generative-ai

April 19, 2018

What’s New in MySQL 8.0. MySQL 8 has lots of exciting improvements: Window functions, SRS aware spatial types for GIS, utf8mb4 by default, a ton of JSON improvements and atomic DDL. I no longer feel at a significant disadvantage when I have to use MySQL in place of PostgreSQL.

# 4:03 pm / mysql

Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter (via) ScriptedForms is a fascinating Jupyter hack that lets you construct dynamic documents defined using markdown that provide form fields and evaluate Python code instantly as you interact with them.

# 4:05 pm / jupyter, markdown, python

The best of Python: a collection of my favorite articles from 2017 and 2018 (so far). Gergely Szerovay has brought together an outstandingly interesting selection of Python articles from the last couple of years of activity of the Python community on Medium. A whole load of gems in here that I hadn’t seen before.

# 6:28 pm / python

How to Use Static Type Checking in Python 3.6 (via) Useful introduction to optional static typing in Python 3.6, including how to use mypy, PyCharm and the Atom mypy plugin.
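
For a taste of what the annotations look like, here is a small example of my own (not taken from the article). The function and its arguments are hypothetical.

```python
from typing import List, Optional


def first_long_word(words: List[str], min_length: int) -> Optional[str]:
    """Return the first word with at least min_length characters, or None."""
    for word in words:
        if len(word) >= min_length:
            return word
    return None


found = first_long_word(["a", "mypy", "python"], 5)
print(found)

# A type checker such as mypy should flag a call like
# first_long_word(["a"], "5"), because "5" is a str and not an int.
```

Python ignores the annotations at runtime, so the checking only happens when a separate tool is run over the file.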

# 6:30 pm / statictyping, python, mypy

Intro to Threads and Processes in Python (via) I really like the diagrams in this article which compares the performance of Python threads and processes for different types of task via the excellent concurrent.futures library.
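
A minimal sketch of why `concurrent.futures` makes this kind of comparison easy: the two executor classes share an interface, so the same code can run against either. The workload and the `run_with` helper are my own invention and not the article's benchmark.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_of_squares(n):
    """A small CPU-bound task: sum the squares of 0..n-1."""
    return sum(i * i for i in range(n))


def run_with(executor_class, inputs, max_workers=2):
    """Run sum_of_squares over inputs using the given executor class."""
    with executor_class(max_workers=max_workers) as executor:
        # map() returns results in the same order as the inputs.
        return list(executor.map(sum_of_squares, inputs))


if __name__ == "__main__":
    # The guard matters for process pools on platforms that spawn new interpreters.
    inputs = [10, 100, 1000]
    print(run_with(ThreadPoolExecutor, inputs))
    print(run_with(ProcessPoolExecutor, inputs))
```

Wrapping each `run_with` call in a timer would give a rough comparison for a given task type.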

# 6:32 pm / multiprocessing, threading, python

How to rewrite your SQL queries in Pandas, and more (via) I still haven’t fully internalized the idioms needed to manipulate DataFrames in pandas. This tutorial helps a great deal—it shows the Pandas equivalents for a host of common SQL queries.

# 6:34 pm / pandas, sql, python
