Simon Willison’s Weblog

Weeknotes: ONA19, twitter-to-sqlite, datasette-rure

I’ve decided to start writing weeknotes for the duration of my JSK fellowship. Here goes!

I started the fellowship last week, and this week I’ve been attending the Online News Association conference in New Orleans along with the other fellows.

Natalie and I have had a lot of fun exploring New Orleans, including several niche museums (documented in this Twitter thread, which I expect to continue adding to for years to come).

ONA is the largest digital news conference on the calendar, and as such it’s a great opportunity to meet all kinds of different journalists and get a start on figuring out how best to approach the fellowship.

twitter-to-sqlite

I’ve been working on this tool for a couple of weeks now, but this week it really started to prove its worth. twitter-to-sqlite is a utility I’m building to fetch data from Twitter and write it into a SQLite database. It has a comprehensive README, as do all of my recent projects: I’ve been doing this for long enough that I know that time spent on documentation will pay me back enormously in just a few months.

My initial goal for this tool was to use it for personal analytics, partly inspired by my Twitter followers project from last year. Then ONA sent out a spreadsheet with the Twitter names of most of the conference attendees and I realised there was an opportunity to use it for something more interesting.

I won’t share the attendee list here because it’s intended to be private to conference attendees, but the short version of what I did with it is this:

First, load the attendee list into a database using csvs-to-sqlite (after first converting the XLS file to CSV using Numbers):

$ csvs-to-sqlite attendees.csv ona.db

Next, fetch the full Twitter profiles for every one of those attendees. I added a --sql option to twitter-to-sqlite for this (issue #8) and ran the following:

$ twitter-to-sqlite users-lookup ona.db --sql="select Twitter from attendees"

Now I can run datasette ona.db and execute full-text searches against the complete Twitter profiles of those attendees—great for figuring out e.g. who is attending the conference from the Washington Post.

I took this a step further: can I identify people I follow on Twitter who are attending the conference? I can pull the list of IDs of people I follow like so:

$ twitter-to-sqlite friends-ids ona.db simonw

Now I can see which of my Twitter friends are attending the conference by loading it into Datasette and using this query:

select * from users
where screen_name in (select Twitter from attendees)
and id in (select followed_id from following where follower_id = 12497)

This is not a particularly efficient SQL query… and it doesn’t matter! The great thing about working against tiny SQLite databases that contain just a few thousand rows of data is that you can nest SQL queries like this with little concern for their performance—this one runs on my laptop in 18.213ms.

datasette-rure

SQLite has a REGEXP operator, which is documented thus:

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named “regexp” is added at run-time, then the “X REGEXP Y” operator will be implemented as a call to “regexp(Y,X)”.

Python’s sqlite3 module lets you register custom functions, so this should be an easy fix. I’ve been holding off on implementing this in Datasette for a while though because Datasette allows user-provided queries, and Python’s regular expressions have a nasty quality: you can easily construct a regular expression and input that will hang the Python interpreter.

So I needed a safe regular expression library that I could trust not to hang my intepreter on the wrong input.

Google have such a library—re2—but the various Python bindings for it aren’t cleanly installable via pip install.

I asked about this on Twitter and Alex Willmer pointed me to rure-python—a Python wrapper for the Rust regular expression library, which “guarantees linear time searching using finite automata”—exactly what I was looking for!

So I built a quick Datasette plugin: datasette-rure, which provides a regexp() function to enable the REGEXP oaperator, powered by the Rust regular expression engine.

I also added regexp_match(pattern, text, index) and regexp_matches(pattern, text) functions. Here are their examples from the README:

select regexp_match('.*( and .*)', title) as n from articles where n is not null
-- Returns the ' and X' component of any matching titles, e.g.
--     and Recognition
--     and Transitions Their Place
-- etc
select regexp_matches(
    'hello (?P<name>\w+) the (?P<species>\w+)',
    'hello bob the dog, hello maggie the cat, hello tarquin the otter'
)
-- Returns a JSON array:
-- [{"name": "bob", "species": "dog"},
--  {"name": "maggie", "species": "cat"},
--  {"name": "tarquin", "species": "otter"}]

Returning JSON from a SQLite custom function is a fun trick: it means you can then loop over the returned rows or even join them against other tables within SQL using the bundled json_each() function.

I deployed an interactive demo of Datasette running the plugin. I’ve been trying to do this for all of the plugins I release—datasette-jq is another recent example (the interactive demos are linked from the README).

This is Weeknotes: ONA19, twitter-to-sqlite, datasette-rure by Simon Willison, posted on 13th September 2019.

Tagged , , , , ,

Previous: My JSK Fellowship: Building an open source ecosystem of tools for data journalism