Simon Willison’s Weblog

On programming, sql, projects, https, computervision, ...


Recent entries

Datasette plugins, and building a clustered map visualization two hours ago

Datasette now supports plugins!

Last Saturday I asked Twitter for examples of Python projects with successful plugin ecosystems. pytest was the clear winner: the pytest plugin compatibility table (an ingenious innovation that I would love to eventually copy for Datasette) lists 457 plugins, and even the core pytest system itself is built as a collection of default plugins that can be replaced or over-ridden.

Best of all: pytest’s plugin mechanism is available as a separate package: pluggy. And pluggy was exactly what I needed for Datasette.

You can follow the ongoing development of the feature in issue #14. This morning I released Datasette 0.20 with support for a number of different plugin hooks: plugins can add custom template tags and SQL functions, and can also bundle their own static assets, JavaScript, CSS and templates. The hooks are described in some detail in the Datasette Plugins documentation.

datasette-cluster-map

I also released my first plugin: datasette-cluster-map. Once installed, it looks out for database tables that have a latitude and longitude column. When it finds them, it draws all of the points on an interactive map using Leaflet and Leaflet.markercluster.

Let’s try it out on some polar bears!

Polar Bears on a cluster map

The USGS Alaska Science Center have released a delightful set of data entitled Sensor and Location data from Ear Tag PTTs Deployed on Polar Bears in the Southern Beaufort Sea 2009 to 2011. It’s a collection of CSV files, which means it’s trivial to convert it to SQLite using my csvs-to-sqlite tool.

Having created the SQLite database, we can deploy it to a hosting account on Zeit Now alongside the new plugin like this:

# Make sure we have the latest datasette
pip3 install datasette --upgrade
# Deploy polar-bears.db to now with an increased default page_size
datasette publish now \
    --install=datasette-cluster-map \
    --extra-options "--page_size=500" \
    polar-bears.db

The --install option is new in Datasette 0.20 (it works for datasette publish heroku as well)—it tells the publishing provider to pip install the specified package. You can use it more than once to install multiple plugins, and it accepts a path to a zip file in addition to the name of a PyPI package.

Explore the full demo at https://datasette-cluster-map-demo.now.sh/polar-bears

Visualize any query on a map

Since the plugin inserts itself at the top of any Datasette table view with latitude and longitude columns, there are all sorts of neat tricks you can do with it.

I also loaded the San Francisco tree list (thanks, Department of Public Works) into the demo. Impressively, you can click “load all” on this page and Leaflet.markercluster will load in all 189,144 points and display them on the same map… and it works fine on my laptop and my phone. Computers in 2018 are pretty good!

But since it’s a Datasette table, we can filter it. Here’s a map of every New Zealand Xmas Tree in San Francisco (8,683 points). Here’s every tree where the Caretaker is Friends of the Urban Forest. Here’s every palm tree planted in 1990:

Palm trees planted in 1990

The plugin currently only works against columns called latitude and longitude… but if your columns are called something else, don’t worry: you can craft a custom SQL query that aliases your columns and everything will work as intended. Here’s an example against some more polar bear data:

select *, "Capture Latitude" as latitude, "Capture Longitude" as longitude
from [USGS_WC_eartag_deployments_2009-2011]

Writing your own plugins

I’m really excited to see what people invent. If you want to have a go, your first stop should be the Plugins documentation. If you want an example of a simple plugin (including the all-important mechanism for packaging it up using setup.py) take a look at datasette-cluster-map on GitHub.
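
To give a flavour of what a plugin looks like, here's a minimal sketch of a hook that registers a custom SQL function. The exact hook names and signatures are spelled out in the Plugins documentation, so treat this as illustrative:

from datasette import hookimpl

@hookimpl
def prepare_connection(conn):
    # Called with each new SQLite connection Datasette opens; registering
    # a function here means "select hello_world()" works in SQL queries.
    conn.create_function('hello_world', 0, lambda: 'Hello world!')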

And if you have any thoughts, ideas or suggestions on how the plugin mechanism can be further employed please join the conversation on issue #14. I’ve literally just got started with Datasette’s plugin hooks, and I’m very keen to hear about things people want to build that aren’t yet supported.

Building a combined stream of recent additions using the Django ORM 26 days ago

I’m a big believer in the importance of a “recent additions” feed. Any time you’re building an application that involves users adding and editing records it’s useful to have a page somewhere that shows the most recent objects that have been created across multiple different types of data.

I’ve used a number of techniques to build these in the past—from an extra database table (e.g. the Django Admin’s LogEntry model) to a Solr or Elasticsearch index that exists just to serve recent additions.

For a recent small project I found myself needing a recent additions feed and realized that there’s a new, simple way to build one thanks to the QuerySet.union() method introduced in Django 1.11 back in April 2017.

Consider a number of different ORM models that can be added by users, each with a created timestamp field.

Prior to QuerySet.union(), building a combined recent additions feed across multiple models was difficult: it’s easy to show recent additions for a single model, but how can we intersperse and paginate additions made to models stored across more than one table?

Using .union() to combine records from different models

Consider the following three models:

class Project(models.Model):
    name = models.CharField(max_length=128)
    description = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

class Image(models.Model):
    project = models.ForeignKey(
        Project, related_name='images', on_delete=models.CASCADE
    )
    image = models.ImageField()
    created = models.DateTimeField(auto_now_add=True)

class Comment(models.Model):
    project = models.ForeignKey(
        Project, related_name='comments', on_delete=models.CASCADE
    )
    comment = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

Let’s build a single QuerySet that returns objects from all three models ordered by their created dates, most recent first.

Using .values() we can reduce these different models to a common subset of fields, which we can then .union() together like so:

recent = Project.objects.values(
    'pk', 'created'
).union(
    Image.objects.values('pk', 'created'),
    Comment.objects.values('pk', 'created'),
).order_by('-created')[:4]

Now if we print out list(recent) it will look something like this:

[{'created': datetime.datetime(2018, 3, 24, 1, 27, 23, 625195, tzinfo=<UTC>),
  'pk': 28},
 {'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15},
 {'created': datetime.datetime(2018, 3, 23, 20, 14, 3, 31648, tzinfo=<UTC>),
  'pk': 5},
 {'created': datetime.datetime(2018, 3, 23, 18, 57, 36, 585376, tzinfo=<UTC>),
  'pk': 11}]

We’ve successfully combined recent additions from three different tables! Here’s what the SQL for that looks like:

>>> from django.db import connection
>>> print(connection.queries[-1]['sql'])
(SELECT "myapp_project"."id", "myapp_project"."created" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created" FROM "myapp_comment")

There’s just one problem: we got back a bunch of pk and created records, but we don’t know which model each of those rows represents.

Using .annotate() to add a type constant to the rows

We can fix this by using Django’s annotate() method combined with a Value() object to attach a constant string to each record specifying the type of the row it represents. Here’s how to do that for a single model:

>>> from django.db.models import Value, CharField
>>> list(Image.objects.annotate(
...     type=Value('image', output_field=CharField())
... ).values('pk', 'type', 'created')[:2])
[{'created': datetime.datetime(2018, 3, 22, 17, 16, 33, 964900, tzinfo=<UTC>),
  'pk': 3,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 22, 17, 49, 47, 527907, tzinfo=<UTC>),
  'pk': 4,
  'type': 'image'}]

We’ve added the key/value pair 'type': 'image' to every record returned from the queryset. Now let’s do that to all three of our models and combine the results using .union():

recent = Project.objects.annotate(
    type=Value('project', output_field=CharField())
).values(
    'pk', 'created', 'type'
).union(
    Image.objects.annotate(
        type=Value('image', output_field=CharField())
    ).values('pk', 'created', 'type'),
    Comment.objects.annotate(
        type=Value('comment', output_field=CharField())
    ).values('pk', 'created', 'type'),
).order_by('-created')[:4]

If we run list(recent) we get this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'pk': 14,
  'type': 'comment'}]

This is looking pretty good! We’ve successfully run a single SQL UNION query across three different tables and returned the combined results in reverse chronological order. Thanks to the type column we know which model each record corresponds to.

Inflating the full referenced objects

Now we need to inflate those primary key references into full ORM objects from their corresponding tables.

The most efficient way to do this is to collect together the IDs for each type and then run a single SQL query per type to load the full objects.

Here’s code that does exactly that: it first collects the list of primary keys that need to be loaded for each type, then executes an efficient SQL IN query against each type to fetch the underlying objects:

records = list(recent)

type_to_queryset = {
    'image': Image.objects.all(),
    'comment': Comment.objects.all(),
    'project': Project.objects.all(),
}

# Collect the pks we need to load for each type:
to_load = {}
for record in records:
    to_load.setdefault(record['type'], []).append(record['pk'])

# Fetch them 
fetched = {}
for type, pks in to_load.items():
    for object in type_to_queryset[type].filter(pk__in=pks):
        fetched[(type, object.pk)] = object

# Annotate 'records' with loaded objects
for record in records:
    key = (record['type'], record['pk'])
    record['object'] = fetched[key]

After running the above code, records looks like this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

We can now feed this to a template and use it to render our recent additions page.
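
As a rough sketch, the view side might look something like this (the view name, template name and the build_recent_records() helper are placeholders standing in for the code above):

from django.shortcuts import render

def recent_additions(request):
    # build_recent_records() is a placeholder for the union +
    # bulk-load code above; it returns the list of dicts shown there.
    records = build_recent_records()
    return render(request, 'recent_additions.html', {'records': records})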

Wrapping it in a re-usable function

Here’s a function that implements the above in a re-usable way:

from django.db.models import Value, CharField

def combined_recent(limit, **kwargs):
    datetime_field = kwargs.pop('datetime_field', 'created')
    querysets = []
    for key, queryset in kwargs.items():
        querysets.append(
            queryset.annotate(
                recent_changes_type=Value(
                    key, output_field=CharField()
                )
            ).values('pk', 'recent_changes_type', datetime_field)
        )
    union_qs = querysets[0].union(*querysets[1:])
    records = []
    for row in union_qs.order_by('-{}'.format(datetime_field))[:limit]:
        records.append({
            'type': row['recent_changes_type'],
            'when': row[datetime_field],
            'pk': row['pk']
        })
    # Now we bulk-load each object type in turn
    to_load = {}
    for record in records:
        to_load.setdefault(record['type'], []).append(record['pk'])
    fetched = {}
    for key, pks in to_load.items():
        for item in kwargs[key].filter(pk__in=pks):
            fetched[(key, item.pk)] = item
    # Annotate 'records' with loaded objects
    for record in records:
        record['object'] = fetched[(record['type'], record['pk'])]
    return records

This is also available as a gist.

I can now use that function to combine arbitrary querysets (provided they share a created datestamp field) like so:

recent = combined_recent(
    20,
    project=Project.objects.all(),
    image=Image.objects.all(),
    comment=Comment.objects.all(),
)

This will return the most recent 20 records across all three types, with the results looking like this:

[{'when': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'when': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'when': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'when': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

Efficient object loading with select/prefetch_related

If you’re going to render these objects on a page, it’s pretty likely you’ll need to load additional data about them. My example models above are deliberately simplified, but in any serious Django project it’s likely they will have additional references to other tables.

We can apply Django’s magic select_related() and prefetch_related() methods directly to the querysets we pass to the function, like so:

recent = combined_recent(
    20,
    project=Project.objects.all().prefetch_related('tags'),
    image=Image.objects.all().select_related('uploaded_by'),
    comment=Comment.objects.all().select_related('author'),
)

Django’s query optimizer is smart enough to ignore those calls entirely when building the initial union queries, so even with the above extras the initial union query will still look like this:

(SELECT "myapp_project"."id", "myapp_project"."created", 'project' AS "recent_changes_type" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created", 'image' AS "recent_changes_type" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created", 'comment' AS "recent_changes_type" FROM "myapp_comment")
ORDER BY (2) DESC LIMIT 20

The select_related() and prefetch_related() clauses will then be incorporated into the subsequent SQL queries that are used to efficiently inflate the full objects from the database.

Taking it further

There are a bunch of interesting extensions that can be made to this pattern.

Want pagination? The initial unioned queryset can be paginated using offset/limit by slicing the queryset, or using the Django Paginator class.
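
As a minimal sketch of the Paginator approach, reusing the annotated union from earlier but without the [:4] slice (Paginator needs an unsliced queryset so it can count and slice it itself):

from django.core.paginator import Paginator
from django.db.models import Value, CharField

# The same union as before, ordered but *not* sliced.
union_qs = Project.objects.annotate(
    type=Value('project', output_field=CharField())
).values('pk', 'created', 'type').union(
    Image.objects.annotate(
        type=Value('image', output_field=CharField())
    ).values('pk', 'created', 'type'),
    Comment.objects.annotate(
        type=Value('comment', output_field=CharField())
    ).values('pk', 'created', 'type'),
).order_by('-created')

paginator = Paginator(union_qs, 20)  # 20 records per page
page_one = paginator.page(1)         # applies OFFSET/LIMIT via slicing
for row in page_one.object_list:
    print(row['type'], row['pk'], row['created'])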

Want more efficient pagination (since offset/limit tends to get slow after the first few thousand rows)? We’re ordering by created already which means it’s not difficult to build efficient range-based pagination, requesting all records where the created date is less than the earliest date seen on the previous page.
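
A minimal sketch of that range-based approach, using the combined_recent() function from above: filter each queryset before passing it in, keyed on the when value of the last record from the previous page.

# oldest_seen: the 'when' value of the final record on the previous page
next_page = combined_recent(
    20,
    project=Project.objects.filter(created__lt=oldest_seen),
    image=Image.objects.filter(created__lt=oldest_seen),
    comment=Comment.objects.filter(created__lt=oldest_seen),
)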

Since everything is based on regular Django querysets, it’s possible to build all kinds of variants of the recent additions feed. So far we’ve just built one showing all changes across an entire application, but it’s not hard to apply additional filters to only show changes made by a specific user, or changes made relating to a specific foreign key relationship. If you can represent it as a collection of querysets that each expose a created column you can combine them into a single feed.
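
For example, here's a sketch of a per-user feed, assuming Image has an uploaded_by field and Comment an author field as in the earlier select_related() example, plus a hypothetical owner field on Project:

recent_for_user = combined_recent(
    20,
    project=Project.objects.filter(owner=user),
    image=Image.objects.filter(uploaded_by=user),
    comment=Comment.objects.filter(author=user),
)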

You don’t even need to use records that share a created column: if you have objects with columns of differing names you can use an annotation to alias those columns, like so:

recent = combined_recent(
    20,
    project=Project.objects.annotate(
        when=models.F('updated')
    ).prefetch_related('tags'),
    image=Image.objects.annotate(
        when=models.F('uploaded_at')
    ).select_related('uploaded_by'),
    comment=Comment.objects.annotate(
        when=models.F('commented_at')
    ).select_related('created_by'),
    datetime_field='when'
)

I haven’t extensively load-tested this pattern, but I expect it will work fine for databases with tens-of-thousands of records but may start running into trouble if you have millions of records (though an index on the created column should help a lot). If you need a recent additions feed on something larger scale than that you should probably look at a separate logging table or an external index in something like Elasticsearch instead.
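
The simplest version of that index is db_index=True on the created field itself; a minimal sketch:

class Comment(models.Model):
    # ... other fields as before ...
    # db_index=True adds a single-column index on created, which the
    # created__lt range filters (and per-table ordering) can take advantage of.
    created = models.DateTimeField(auto_now_add=True, db_index=True)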

For another interesting thing you can do with .union() check out my article on Implementing faceted search with Django and PostgreSQL.

Datasette Demo (video) from the SF Python Meetup two months ago

I gave a short talk about Datasette last month at the SF Python Meetup Holiday Party. They’ve just posted the video, so here it is:

I showed how I built San Francisco Tree Search using Datasette, csvs-to-sqlite and data from the San Francisco Department of Public Works.

Analyzing my Twitter followers with Datasette two months ago

I decided to do some ad-hoc analysis of my social network on Twitter this afternoon… and since everything is more fun if you bundle it up into a SQLite database and publish it to the internet I performed the analysis using Datasette.

The end result

Here’s the Datasette database containing all of my Twitter followers: https://simonw-twitter-followers.now.sh/simonw-twitter-followers-b9bff3a

Much more interesting though are the queries I can now run against it. A few examples:

The thing I find most exciting about this use-case for Datasette is that it allows you to construct entire mini-applications using just a SQL query encoded in a URL. Type queries into the textarea, iterate on them until they do something useful, add some :named parameters (which generate form fields) and bookmark the resulting URL. It’s an incredibly powerful way to build custom interfaces for exploring data.

The rest of this post will describe how I pulled the data from Twitter and turned it into a SQLite database for publication with Datasette.

Fetching my followers

To work with the Twitter API, we first need credentials. Twitter still mostly uses the OAuth 1 model of authentication which is infuriatingly complicated, requiring you to sign parameters using two pairs of keys and secrets. OAuth 2 mostly uses a single access token sent over TLS to avoid the signing pain, but Twitter’s API dates back to the times when API client libraries with robust TLS were not a safe assumption.

Since I have to re-figure out the Twitter API every few years, here’s how I got it working this time. I created a new Twitter app using the form on https://apps.twitter.com/ (which is surprisingly hard to find if you start out on the https://developer.twitter.com/ portal). Having created the app I navigated to the “Keys and Access Tokens” tab, scrolled down and clicked the “Create my access token” button. Then I grabbed the four magic tokens from the following spots on the page:

Twitter application setup

Now in Python I can make properly signed calls to the Twitter API like so:

from requests_oauthlib import OAuth1Session
twitter = OAuth1Session(
    client_key='...',
    client_secret='...',
    resource_owner_key='...',
    resource_owner_secret='...'
)
print(twitter.get(
    'https://api.twitter.com/1.1/users/show.json?screen_name=simonw'
).json())

The Twitter API has an endpoint for retrieving everyone who follows an account as a paginated JSON list: followers/list. At some point in the past few years Twitter got really stingy with their rate limits—most endpoints, including followers/list only allow 15 requests every 15 minutes! You can request up to 200 followers at a time, but with 15,000 followers that meant the full fetch would take 75 minutes. So I set the following running in a Jupyter notebook and went for a walk with the dog.

from requests_oauthlib import OAuth1Session
import urllib.parse
import time

twitter = OAuth1Session(...)
url = 'https://api.twitter.com/1.1/followers/list.json'

def fetch_followers(cursor=-1):
    r = twitter.get(url + '?'+ urllib.parse.urlencode({
        'count': 200,
        'cursor': cursor
    }))
    return r.headers, r.json()

cursor = -1
users = []
while cursor:
    headers, body = fetch_followers(cursor)
    print(headers)
    users.extend(body['users'])
    print(len(users))
    cursor = body['next_cursor']
    time.sleep(70)

A couple of hours later I had a users list with 15,281 user dictionaries in it. I wrote that to disk for safe keeping:

import json
json.dump(users, open('twitter-followers.json', 'w'), indent=4)

Converting that JSON into a SQLite database

I wrote some notes on How to turn a list of JSON objects into a Datasette using Pandas a few weeks ago. This works really well, but we need to do a bit of cleanup first: Pandas prefers a list of flat dictionaries, but the Twitter API has given us back some nested structures.

I won’t do a line-by-line breakdown of it, but here’s the code I ended up using. The expand_entities() function replaces Twitter’s ugly t.co links with their expanded_url alternatives—then clean_user() flattens a nested user into a simple dictionary:

def expand_entities(s, entities):
    for key, ents in entities.items():
        for ent in ents:
            if 'url' in ent:
                replacement = ent['expanded_url'] or ent['url']
                s = s.replace(ent['url'], replacement)
    return s

def clean_user(user):
    if user['description'] and 'description' in user['entities']:
        user['description'] = expand_entities(
            user['description'], user['entities']['description']
        )
    if user['url'] and 'url' in user['entities']:
        user['url'] = expand_entities(user['url'], user['entities']['url'])
    if 'entities' in user:
        del user['entities']
    if 'status' in user:
        del user['status']

for user in users:
    clean_user(user)

I now have a nice flat list of user dictionaries—a subset of which is provided here for illustration.

One additional step: SQLite’s built-in functions for handling date and time prefer ISO formatted timestamps, but previewing the DataFrame in Jupyter shows that the data I pulled from Twitter has dates in a different format altogether. I can fix this with a one-liner using the ever-handy dateutil library:

from dateutil.parser import parse
import pandas as pd
df = pd.DataFrame(users)
df['created_at'] = df['created_at'].apply(lambda s: parse(s).isoformat())

Here’s the before and after:

df.apply() illustrated

Now that the data has been flattened and cleaned up in the DataFrame, I can convert it to a SQLite table like so:

import sqlite3
conn = sqlite3.connect('/tmp/followers.db')
df.to_sql('followers', conn)
conn.close()

Now I can run datasette /tmp/followers.db to preview what I’ve got so far.

Extracting columns and setting up full-text search

This all works fine, but it’s not quite the finished product I demonstrated above. My desired final state has two additional features: common values in the lang, location, time_zone and translator_type columns have been pulled out into lookup tables, and I’ve enabled SQLite full-text search against a subset of the columns.

Normally I would use the -c and -f arguments to my csvs-to-sqlite tool to do this (see my write-up here), but that tool only works against CSV files on disk. I want to work with an in-memory Pandas DataFrame.

So I reverse-engineered my own code and figured out how to apply the same transformations from an interactive Python prompt instead. It ended up looking like this:

from csvs_to_sqlite import utils

conn = sqlite3.connect('/tmp/simonw-twitter-followers.db')

# Define columns I want to refactor:
foreign_keys = {
    'time_zone': ('time_zone', 'value'),
    'translator_type': ('translator_type', 'value'),
    'location': ('location', 'value'),
    'lang': ('lang', 'value'),
}
new_frames = utils.refactor_dataframes(conn, [df], foreign_keys)

# Save my refactored DataFrame to SQLite
utils.to_sql_with_foreign_keys(
    conn, new_frames[0], 'followers',
    foreign_keys, None, index_fks=True
)

# Create the full-text search index across these columns:
fts = ['screen_name', 'description', 'name', 'location']
utils.generate_and_populate_fts(conn, ['followers'], fts, foreign_keys)

conn.close()

Final step: publishing with Datasette

Having run datasette /tmp/simonw-twitter-followers.db to confirm locally that I got the results I was looking for, the last step was to publish it to the internet. As always, I used Zeit Now via the datasette publish command for this final step:

tmp $ datasette publish now simonw-twitter-followers.db \
    --title="@simonw Twitter followers, 27 Jan 2018"
> Deploying /private/var/.../datasette under simonw
> Ready! https://datasette-cmpznehuku.now.sh (copied to clipboard) [14s]
> Synced 2 files (11.29MB) [0ms] 
> Initializing…
> Building
> ▲ docker build
Sending build context to Docker daemon 11.85MB
> Step 1 : FROM python:3
...
> Deployment complete!

Then I ran now alias to assign a permanent, more memorable URL:

 now alias https://datasette-cmpznehuku.now.sh simonw-twitter-followers.now.sh

Elsewhere

Yesterday

17th April 2018

  • Text Embedding Models Contain Bias. Here's Why That Matters (via) Excellent discussion from the Google AI team of the enormous challenge of building machine learning models without accidentally encoding harmful bias in a way that cannot be easily detected. #
  • Suppose a runaway success novel/tv/film franchise has “Bob” as the evil bad guy. Reams of fanfictions are written with “Bob” doing horrible things. People endlessly talk about how bad “Bob” is on twitter. Even the New York times writes about Bob latest depredations, when he plays off current events. Your name is Bob. Suddenly all the AIs in the world associate your name with evil, death, killing, lying, stealing, fraud, and incest. AIs silently, slightly ding your essays, loan applications, uber driver applications, and everything you write online. And no one believes it’s really happening. Or the powers that be think it’s just a little accidental damage because the AI overall is still, overall doing a great job of sentiment analysis and fraud detection.

    Daniel Von Fange #

  • A rating system for open data proposed by Tim Berners-Lee, founder of the World Wide Web. To score the maximum five stars, data must (1) be available on the Web under an open licence, (2) be in the form of structured data, (3) be in a non-proprietary file format, (4) use URIs as its identifiers (see also RDF), (5) include links to other data sources (see linked data). To score 3 stars, it must satisfy all of (1)-(3), etc.

    Five stars of open data #

  • Datasette 0.19: Plugins Documentation (via) I’ve released the first preview of Datasette’s new plugin support, which uses the pluggy package originally developed for py.test. So far the only two plugin hooks are for SQLite connection creation (allowing custom SQL functions to be registered) and Jinja2 template environment initialization (for custom template tags), but this release is mainly about exercising the plugin registration mechanism and starting to gather feedback. Lots more to come. #

15th April 2018

  • The way I would talk about myself as a senior engineer is that I’d say “I know how I would solve the problem” and because I know how I would solve it I could also teach someone else to do it. And my theory is that the next level is that I can say about myself “I know how others would solve the problem”. Let’s make that a bit more concrete. You make that sentence: “I can anticipate how the API choices that I’m making, or the abstractions that I’m introducing into a project, how they impact how other people would solve a problem.”

    Malte Ubl #

14th April 2018

  • Datasette 0.18: units (via) This release features the first Datasette feature that was entirely designed and implemented by someone else (yay open source)—Russ Garrett wanted unit support (Hz, ft etc) for his Wireless Telegraphy Register project. It’s a really neat implementation: you can tell Datasette what units are in use for a particular database column and it will display the correct SI symbols on the page. Specifying units also enables unit-aware filtering: if Datasette knows that a column is measured in meters you can now query it for all rows that are less than 50 feet for example. #

12th April 2018

  • What do you mean "average"? (via) Lovely example of an interactive explorable demonstrating mode/mean/median, built as an Observable notebook using D3. #
  • Wireless Telegraphy Register (via) Russ Garrett used Datasette to build a browsable interface to the UK’s register of business radio licenses, using data from Ofcom. #
  • Mozilla Telemetry: In-depth Data Pipeline (via) Detailed behind-the-scenes look at an extremely sophisticated big data telemetry processing system built using open source tools. Some of this is unsurprising (S3 for storage, Spark and Kafka for streams) but the details are fascinating. They use a custom nginx module for the ingestion endpoint and have a “tee” server written in Lua and OpenResty which lets them route some traffic to an alternative backend. #
  • The Academic Vanity Honeypot phishing scheme. Twitter thread describing a nasty phishing attack where an academic receives an email from a respected peer congratulating them on a recent article and suggesting further reading. The further reading link is a phishing site that emulates the victim’s institution’s login page. #

11th April 2018

  • Visualizing disk IO activity using log-scale banded graphs (via) This is a neat data visualization trick: to display rates of disk I/O, it splits the rate into a GB, MB and KB section on a stacked chart. This means that if you are getting jitter in the order of KBs even while running at 400+MB/second you can see the jitter in the KB section. #

10th April 2018

  • GitHub for Nonprofits (via) TIL GitHub provide legally recognized nonprofits with free organization accounts with unlimited users and unlimited private repos—and they’ve registered 30,000 nonprofit accounts through the program as of May 2017. #
  • Deckset for Mac (via) $29 desktop Mac application that creates presentations using a cleverly designed markdown dialect. You edit the underlying markdown in your standard text editor and the Deckset app shows a preview of the presentation and lets you hit “play” to run it or export it as a PDF. #

9th April 2018

  • elasticsearch-dump. Neat open source utility by TaskRabbit for importing and exporting data in bulk from Elasticsearch. It can copy data from one Elasticsearch cluster directly to another or to an intermediary file, making it a swiss-army knife for migrating data around. I successfully used the “docker run” incantation to execute it without needing to worry about having the correct version of Node.js installed. #
  • Datasette 0.15: sort by column (via) I’ve released the latest version of Datasette to PyPI. The key new feature is the ability to sort tables by column, using clickable column headers or directly via the new _sort= and _sort_desc= querystring parameters. #
  • So Fishing Times’s ad department is selling access to the prime Fishing Times readership. But the Data Lords can say, ‘we can show your ad just to Fishing Times readers when they’re on Facebook, or on some meme site, on the Times or TPM or really anywhere.’ Because the Data Lords have the data and they can track and target you. The publication’s role as the gatekeeper to an audience is totally undercut because the folks who control the data and the targeting can follow those readers anywhere and purchase the ads at the lowest price.

    Josh Marshall #

8th April 2018

  • awesome-falsehood: Curated list of falsehoods programmers believe in (via) I really like the general category of “falsehoods programmers believe”, and Kevin Deldycke has done an outstanding job curating this collection. Categories covered include date and time, email, human identity, geography, addresses, internationalization and more. This is a particularly good example of the “awesome lists” format in that each link is accompanied by a useful description. #
  • Scientific results today are as often as not found with the help of computers. That’s because the ideas are complex, dynamic, hard to grab ahold of in your mind’s eye. And yet by far the most popular tool we have for communicating these results is the PDF—literally a simulation of a piece of paper. Maybe we can do better.

    James Somers #

7th April 2018

  • Cookies-over-HTTP Bad (via) Mike West from the Chrome security team proposes a way for browsers to start discouraging the use of tracking cookies sent over HTTP—which represent a significant threat to user privacy from network attackers. It’s a clever piece of thinking: browsers would slowly ramp up the forced expiry deadline for non-HTTPS cookies, further encouraging sites to switch to HTTPS cookies while giving them ample time to adapt. #