Simon Willison’s Weblog

On facebook, sqlite, datasette, scaling, brandurleach, ...


Recent entries

Analyzing US Election Russian Facebook Ads 10 days ago

Two interesting data sources have emerged in the past few weeks concerning the Russian impact on the 2016 US elections.

FiveThirtyEight published nearly 3 million tweets from accounts associated with the Russian “Internet Research Agency”—see my article and searchable tweet archive here.

Separately, the House Intelligence Committee Minority released 3,517 Facebook ads that were reported to have been bought by the Russian Internet Research Agency as a set of redacted PDF files.

Exploring the Russian Facebook Ad spend

The initial data was released as zip files full of PDFs, one of the least friendly formats you can use to publish data.

Ed Summers took on the intimidating task of cleaning that up. His results are incredible: he used the pytesseract OCR library and PyPDF2 to extract both the images and the associated metadata and convert the whole lot into a single 3.9MB JSON file.

I wrote some code to convert his JSON file to SQLite (more on the details later) and the result can be found here:

Here’s an example search for “cops” ordered by the USD equivalent spent on the ad (some of the spends are in rubles, so I convert those to USD using today’s exchange rate of 0.016).

Search ads for cops, order by USD descending

One of the most interesting things about this data is that it includes the Facebook ad targetting options that were used to promote the ads. I’ve built a separate interface for browsing those—you can see the most frequently applied targets:

Top targets

And by browsing through the different facets you can construct e.g. a search for all ads that targeted people interested in both interests:Martin Luther King and interests:Police Brutality is a Crime:[“d6ade”%2C“40c27”]

New tooling under the hood

I ended up spinning up several new projects to help process and explore this data.


The first is a new library called sqlite-utils. If data is already in CSV I tend to convert it using csvs-to-sqlite, but if data is in a less tabular format (JSON or XML for example) I have to hand-write code. Here’s a script I wrote to process the XML version of the UK Register of Members Interests for example.

My goal with sqlite-utils is to take some of the common patterns from those scripts and make them as easy to use as possible, in particular when running inside a Jupyter notebook. It’s still very early, but the script I wrote to process the Russian ads JSON is a good example of the kind of thing I want to do with it.


The second new tool is a new Datasette plugin (and corresponding plugin hook) called datasette-json-html. I used this to solve the need to display both rendered images and customized links as part of the regular Datasette instance.

It’s a pretty crazy solution (hence why it’s implemented as a plugin and not part of Datasette core) but it works surprisingly well. The basic idea is to support a mini JSON language which can be detected and rendered as HTML. A couple of examples:

  "img_src": "",
  "width": 200

Is rendered as an HTML <img src=""> element.

    "label": "location:United States",
    "href": "/russian-ads/display_ads?_target=ec3ac"
    "label": "interests:Martin Luther King",
    "href": "/russian-ads/display_ads?_target=d6ade"
    "label": "interests:Jr.",
    "href": "/russian-ads/display_ads?_target=8e7b3"

Is rendered as a comma-separated list of HTML links.

Why use JSON for this? Because SQLite has some incredibly powerful JSON features, making it trivial to output JSON as part of the result of a SQL query. Most interestingly of all it has json_group_array() which can work as an aggregation function to combine a set of related rows into a single JSON array.

The display_ads page shown above is powered by a SQL view. Here’s the relevant subset of that view:

    case when image is not null then
        json_object("img_src", "" || image, "width", 200)
        "no image"
    end as img,
            "href", "/russian-ads/display_ads?_target="
                || urllib_quote_plus(
    ) as targeting
from ads
    join ad_targets on = ad_targets.ad_id
    join targets on ad_targets.target_id =
group by limit 10

I’m using SQLite’s JSON functions to dynamically assemble the JSON format that datasette-json-html knows how to render. I’m delighted at how well it works.

I’ve turned off arbitrary SQL querying against the main Facebook ads Datasette instance, but there’s a copy running at if you want to play with these queries.

Weird implementation details

The full source code for my implementation is available on GitHub.

I ended up using an experimental plugin hook to enable additional custom filtering on Datasette views in order to support showing ads against multiple m2m targets, but hopefully that will be made unnecessary as work on Datasette’s support for m2m relationships progresses.

I also experimented with YAML to generate the metadata.json file as JSON strings aren’t a great way of representing multi-line HTML and SQL. And if you want to see some really convoluted SQL have a look at how the canned query for the faceted targeting interface works.

This was a really fun project, which further stretched my ideas about what Datasette should be capable of out of the box. I’m hoping that the m2m work will make a lot of these crazy hacks redundant.

Analyzing US Election troll tweets with Datasette 10 days ago

FiveThirtyEight published nearly 3 million tweets from accounts associated with the Russian “Internet Research Agency”, based on research by Darren Linvill and Patrick Warren at at Clemson University.

FiveThirtyEight’s tweets were shared as CSV, so I’ve used my csvs-to-sqlite tool to convert them and used Datasette to publish them in a searchable, browsable interface:

The data is most interesting if you apply faceting. Here’s the full set of tweets faceted by author, language, region, post type and account category:

Faceted search interface showing Russian Troll Tweets

The minimal source code for this Datasette instance is on GitHub.

Documentation unit tests 19 days ago

Keeping documentation synchronized with an evolving codebase is difficult. Without extreme discipline, it’s easy for documentation to get out-of-date as new features are added.

One thing that can help is keeping the documentation for a project in the same repository as the code itself. This allows you to construct the ideal commit: one that includes the code change, the updated unit tests AND the accompanying documentation all in the same unit of work.

When combined with a code review system (like Phabricator or GitHub pull requests) this pattern lets you enforce documentation updates as part of the review process: if a change doesn’t update the relevant documentation, point that out in your review!

Good code review systems also execute unit tests automatically and attach the results to the review. This provides an opportunity to have the tests enforce other aspects of the codebase: for example, running a linter so that no-one has to waste their time arguing over standardize coding style.

I’ve been experimenting with using unit tests to ensure that aspects of a project are covered by the documentation. I think it’s a very promising technique.

Introspect the code, introspect the docs

The key to this trick is introspection: interogating the code to figure out what needs to be documented, then parsing the documentation to see if each item has been covered.

I’ll use my Datasette project as an example. Datasette’s module contains three relevant tests:

  • test_config_options_are_documented checks that every one of Datasette’s configuration options are documented.
  • test_plugin_hooks_are_documented ensures all of the plugin hooks (powered by pluggy) are covered in the plugin documentation.
  • test_view_classes_are_documented iterates through all of the *View classes (corresponding to pages in the Datasette user interface) and makes sure they are covered.

In each case, the test uses introspection against the relevant code areas to figure out what needs to be documented, then runs a regular expression against the documentation to make sure it is mentioned in the correct place.

Obviously the tests can’t confirm the quality of the documentation, so they are easy to cheat: but they do at least protect against adding a new option but forgetting to document it.

Testing that Datasette’s view classes are covered

Datasette’s view classes use a naming convention: they all end in View. The current list of view classes is DatabaseView, TableView, RowView, IndexView and JsonDataView.

Since these classes are all imported into the module (in order to be hooked up to URL routes) the easiest way to introspect them is to import that module, then run dir(app) and grab any class names that end in View. We can do that with a Python list comprehension:

from datasette import app
views = [v for v in dir(app) if v.endswith("View")]

I’m using reStructuredText labels to mark the place in the documentation that addresses each of these classes. This also ensures that each documentation section can be linked to, for example:

The reStructuredText syntax for that label looks like this:

.. _TableView:


The table page is the heart of Datasette...

We can extract these labels using a regular expression:

from pathlib import Path
import re

docs_path = Path(__file__).parent.parent / 'docs'
label_re = re.compile(r'\.\. _([^\s:]+):')

def get_labels(filename):
    contents = (docs_path / filename).open().read()
    return set(label_re.findall(contents))

Since Datasette’s documentation is spread across multiple *.rst files, and I want the freedom to document a view class in any one of them, I iterate through every file to find the labels and pull out the ones ending in View:

def documented_views():
    view_labels = set()
    for filename in docs_path.glob("*.rst"):
        for label in get_labels(filename):
            first_word = label.split("_")[0]
            if first_word.endswith("View"):
    return view_labels

We now have a list of class names and a list of labels across all of our documentation. Writing a basic unit test comparing the two lists is trivial:

def test_view_documentation():
    view_labels = documented_views()
    view_classes = set(v for v in dir(app) if v.endswith("View"))
    assert view_labels == view_classes

Taking advantage of pytest

Datasette uses pytest for its unit tests, and documentation unit tests are a great opportunity to take advantage of some advanced pytest features.


The first of these is parametrization: pytest provides a decorator which can be used to execute a single test function multiple times, each time with different arguments.

This example from the pytest documentation shows how parametrization works:

import pytest
@pytest.mark.parametrize("test_input,expected", [
    ("3+5", 8),
    ("2+4", 6),
    ("6*9", 42),
def test_eval(test_input, expected):
    assert eval(test_input) == expected

pytest treats this as three separate unit tests, even though they share a single function definition.

We can combine this pattern with our introspection to execute an independent unit test for each of our view classes. Here’s what that looks like:

@pytest.mark.parametrize("view", [v for v in dir(app) if v.endswith("View")])
def test_view_classes_are_documented(view):
    assert view in documented_views()

Here’s the output from pytest if we execute just this unit test (and one of our classes is undocumented):

$ pytest -k test_view_classes_are_documented -v
=== test session starts ===
collected 249 items / 244 deselected

tests/[DatabaseView] PASSED [ 20%]
tests/[IndexView] PASSED [ 40%]
tests/[JsonDataView] PASSED [ 60%]
tests/[RowView] PASSED [ 80%]
tests/[TableView] FAILED [100%]

=== FAILURES ===

view = 'TableView'

    @pytest.mark.parametrize("view", [v for v in dir(app) if v.endswith("View")])
    def test_view_classes_are_documented(view):
>       assert view in documented_views()
E       AssertionError: assert 'TableView' in {'DatabaseView', 'IndexView', 'JsonDataView', 'RowView', 'Table2View'}
E        +  where {'DatabaseView', 'IndexView', 'JsonDataView', 'RowView', 'Table2View'} = documented_views()

tests/ AssertionError
=== 1 failed, 4 passed, 244 deselected in 1.13 seconds ===


There’s a subtle inefficiency in the above test: for every view class, it calls the documented_views() function—and that function then iterates through every *.rst file in the docs/ directory and uses a regular expression to extract the labels. With 5 view classes and 17 documentation files that’s 85 executions of get_labels(), and that number will only increase as Datasette’s code and documentation grow larger.

We can use pytest’s neat fixtures to reduce this to a single call to documented_views() that is shared across all of the tests. Here’s what that looks like:

def documented_views():
    view_labels = set()
    for filename in docs_path.glob("*.rst"):
        for label in get_labels(filename):
            first_word = label.split("_")[0]
            if first_word.endswith("View"):
    return view_labels

@pytest.mark.parametrize("view_class", [
    v for v in dir(app) if v.endswith("View")
def test_view_classes_are_documented(documented_views, view_class):
    assert view_class in documented_views

Fixtures in pytest are an example of dependency injection: pytest introspects every test_* function and checks if it has a function argument with a name matching something that has been annotated with the @pytest.fixture decorator. If it finds any matching arguments, it executes the matching fixture function and passes its return value in to the test function.

By default, pytest will execute the fixture function once for every test execution. In the above code we use the scope="session" argument to tell pytest that this particular fixture should be executed only once for every pytest command-line execution of the tests, and that single return value should be passed to every matching test.

What if you haven’t documented everything yet?

Adding unit tests to your documentation in this way faces an obvious problem: when you first add the tests, you may have to write a whole lot of documentation before they can all pass.

Having tests that protect against future code being added without documentation is only useful once you’ve added them to the codebase—but blocking that on documenting your existing features could prevent that benefit from ever manifesting itself.

Once again, pytest to the rescue. The @pytest.mark.xfail decorator allows you to mark a test as “expected to fail”—if it fails, pytest will take note but will not fail the entire test suite.

This means you can add deliberately failing tests to your codebase without breaking the build for everyone—perfect for tests that look for documentation that hasn’t yet been written!

I used xfail when I first added view documentation tests to Datasette, then removed it once the documentation was all in place. Any future code in pull requests without documentation will cause a hard test failure.

Here’s what the test output looks like when some of those tests are marked as “expected to fail”:

$ pytest tests/
collected 31 items

tests/ ..........................XXXxx.                [100%]

============ 26 passed, 2 xfailed, 3 xpassed in 1.06 seconds ============

Since this reports both the xfailed and the xpassed counts, it shows how much work is still left to be done before the xfail decorator can be safely removed.

Structuring code for testable documentation

A benefit of comprehensive unit testing is that it encourages you to design your code in a way that is easy to test. In my experience this leads to much higher code quality in general: it encourages separation of concerns and cleanly decoupled components.

My hope is that documentation unit tests will have a similar effect. I’m already starting to think about ways of restructuring my code such that I can cleanly introspect it for the areas that need to be documented. I’m looking forward to discovering code design patterns that help support this goal.

Datasette Facets two months ago

Datasette 0.22 is out with the most significant new feature I’ve added since the initial release: faceted browse.

Datasette lets you deploy an instant web UI and JSON API for any SQLite database. csvs-to-sqlite makes it easy to create a SQLite database out of any collection of CSV files. Datasette Publish is a web app that can run these combined tools against CSV files you upload from your browser. And now the new Datasette Facets feature lets you explore any CSV file using faceted navigation with a couple of clicks.

Exploring characters from Marvel comics

Let’s use facets to explore every character in the Marvel Universe.

FiveThirtyEight have published a CSV file of 16,376 characters from Marvel comics, scraped from Wikia as part of the research for their 2014 story Comic Books Are Still Made By Men, For Men And About Men.

Here’s that CSV file loaded into the latest version of Datasette:

Marvel characters explored using Datasette Facets

We start by applying the identity status, alignment and gender facets. Then we filter down to just the bad characters with a public identity, and apply the eye colour facet. Now we can filter to just the 20 bad characters with a public identity and yellow eyes.

At each stage along the way we could see numerical summaries of the other facets. That’s a pretty sophisticated piece of analysis we’ve been able to run with just a few clicks (and it works responsively on mobile as well).

I’ve published a full copy of everything else in the FiveThirtyEight data repository, which means you can find plenty more examples of facets in action at—one example: Actions under the Antiquities Act, faceted by states, pres_or_congress, action and current_agency.

Analyzing GSA IT Standards with Datasette Publish

The US government’s General Services Administration have a GitHub account, and they use it to publish a repository of assorted data as CSVs.

Let’s take one of those CSVS and analyze it with Datasette Facets, using the Datasette Publish web app to upload and process the CSV.

We’ll start with the it-standards.csv file, downloaded from their repo. We’ll upload it to Datasette Publish and add some associated metadata:

GSA IT standards published using Datasette Publish

Here’s the result, with the Status and Deployment Type facets applied. And here’s a query showing just SaaS tools with status Approved—Preferred.

European Power Stations

The Open Power System Data project publishes data about electricity systems. They publish data in a number of formats, including SQLite databases. Let’s take their conventional_power_plants.sqlite file and explore it with Datasette. With Datasette installed, run the following commands in your terminal:

datasette conventional_power_plants.sqlite

This will start Datasette running at ready for you to explore the data.

Next we can publish the SQLite database directly to the internet using the datasette publish command-line tool:

$ datasette publish now conventional_power_plants.sqlite \
    --source="Open Power System Data. 2018. Data Package Conventional power plants. Version 2018-02-27" \
    --source_url="" \
    --title="Conventional power plants" \
> Deploying /private/var/folders/jj/fngnv0810tn2lt_kd3911pdc0000gp/T/tmpufvxrzgp/datasette under simonw
> [in clipboard] (sfo1) [11s]
> Synced 3 files (1.28MB) [11s]
> Building…
> ▲ docker build
Sending build context to Docker daemon 1.343 MBkB
> Step 1/7 : FROM python:3
> 3: Pulling from library/python
> 3d77ce4481b1: Already exists
> 534514c83d69: Already exists
> Successfully built da7ac223e8aa
> Successfully tagged
> ▲ Storing image
> Build completed
> Verifying instantiation in sfo1
> [0] Serve! files=('conventional_power_plants.sqlite',) on port 8001
> [0] [2018-05-20 22:51:51 +0000] [1] [INFO] Goin' Fast @
> [0] [2018-05-20 22:51:51 +0000] [1] [INFO] Starting worker [1]

Finally, let’s give it a nicer URL using now alias:

 now alias

The result can now be seen at

Here’s every conventional power plant in Europe faceted by country, energy source and technology.

Power Plant Facets

Implementation notes

I love faceted search engines. One of my first approaches to understanding any new large dataset has long been to throw it into a faceted search engine and see what comes out. In the past I’ve built them using Solr, Elasticsearch, PostgreSQL and even Whoosh. I guess it was inevitable that I’d try to build one with SQLite.

You can follow the development of Datasette Facets in the now-closed issue #255 on GitHub.

Facets are requested by appending one or more ?_facet=colname parameters to the URL. This causes Datasette to run the following SQL query for each of those specified columns:

select colname as value, count(*) as count
from tablename where (current where clauses)
group by colname order by count desc limit 31

For large tables, this could get expensive. Datasette supports time limits for SQLite queries, and facets are given up to 200ms (by default, this limit can be customized) to finish executing. If the query doesn’t complete in the given time the user sees a warning that the facet could not be displayed.

We ask for 31 values in the limit clause even though we only display 30. This lets us detect if there are more values available and show a ... indicator to let the user know that the facets were truncated.

Datasette also suggests facets that you might want to apply. This is implemented using another query, this time run against every column that is not yet being used as a facet. If a table has 20 columns this means 20 queries, so they run with an even tighter 50ms time limit. The query looks like this:

select distinct colname
from tablename where (current where clauses)
limit 31

All we are doing here is trying to determine if the column in question has less than 30 unique values. The limit clause here means that if you run this query against a column with entirely distinct values (the primary key for example) the query will terminate extremely quickly—after it has found just the first 31 values.

Once the query has executed, we count the distinct values and check to see if this column, when used as a facet:

  • Will return 30 or less unique options
  • Will return more than one unique option
  • Will return less unique options than the current total number of filtered rows

If the query takes longer than 50ms we terminate it and do not suggest that column as a potential facet.

Facets via JSON

As with everything in Datasette, the facets you can view in your browser are also available as part of the JSON API (which ships with CORS headers so you can easily fetch data from JavaScript running in a browser on any web page).

To get back JSON, add .json to the path (before the first ?). Here’s that power plants example returned as JSON:

Please let me know if you build something interesting with Datasette Facets!



  • Compiling SQLite for use with Python Applications (via) Charles Leifer’s recent tutorial on how to compile and build the latest SQLite (with window function support) for use from Python via his pysqlite3 library. #
  • coleifer/pysqlite3. Now that the pysqlite package is bundled as part of the Python standard library the original open source project is no longer actively maintained, and has not been upgraded for Python 3. Charles Leifer has been working on pysqlite3, a stand-alone package of the module. Crucially, this should enable compiling the latest version of SQLite (via the amalgamation package) without needing to upgrade the version that ships with the operating system. #
  • Window Functions in SQLite 3.25.0. The next release of SQLite (apparently die for release in September) will add window functions, as specified in various SQL standards and already available in PostgreSQL. This is going to dramatically improve SQLite as an engine for performing analytical queries, especially across time series data. It’s also going to further emphasize the need for people to be able to upgrade their SQLite versions beyond those provided by the operating system—the default Ubuntu run by Travis CI still only ships with SQLite 3.8 for example. #

13th August 2018

  • Experiences with running PostgreSQL on Kubernetes (via) Fascinating interview that makes a solid argument for the idea that running stateful data stores like PostgreSQL or Cassandra is made harder, not easier when you add an orchestration tool like Kubernetes into the mix. #

11th August 2018

  • With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

    Hyrum's Law #

9th August 2018

  • Using achievement stats to estimate sales on steam (via) Really interesting data leak exploit here: Valve’s Steam API was showing the percentage of users that gained a specific achievement up to 16 decimal places—which inadvertently leaked their exact usage statistics, since if 0.012782207690179348 percent of players get an achievement the only possible input is 8 players out of 62,587. #
  • Easy explainer: a “blockchain” is a linked list with an append-only restriction, and appending is made incredibly expensive but super parallelizable, so when things work well a big group of people can work together and it’s too expensive for a small evil group to compete. [...] Does your problem benefit from storing information in an append-only list, and relying on a central authority to manage it is so bad that it’s worth paying the enormous append costs to have a bunch of Chinese servers manage it for you? Then *maybe* look at a blockchain.

    Tab Atkins #

7th August 2018

  • Securing Web Sites Made Them Less Accessible (via) This is fascinating: the move to HTTP everywhere breaks local HTTP caching servers (like Squid) which are still used in remote areas that get their internet by a high latency satellite connection. #

6th August 2018

  • Faust: Python Stream Processing (via) A new open source stream processing system released by Robinhood, created by Vineet Goel and Celery creator Ask Solem. The API looks delightful, making very smart use of Python decorators and async/await. The initial release requires Kafka but they plan to support multiple backends, hopefully including Redis Streams. #
  • How to Read an RFC. An extremely useful guide to reading RFCs by Mark Nottingham. I didn’t know most of the stuff in here. #
  • OWASP Top 10 2007-2017: The Fall of CSRF. I was surprised to learn recently that CSRF didn’t make it into the 2017 OWASP Top 10 security vulnerabilities (after featuring almost every year since the list started). The credited reason is that web frameworks do a good enough job protecting against CSRF by default that it’s no longer a top-ten problem. Defaults really do matter. #

2nd August 2018

  • Datasette unit tests: monkeytype_call_traces (via) Faceted browse against every function call that occurs during the execution of Datasette’s test suite. I used Instagram’s MonkeyType tool to generate this, which can run Python code and generates a SQLite database of all of the traced calls. It’s intended to be used to automatically add mypy annotations to your code, but since it produces a SQLite database as a by-product I’ve started exploring the intermediary format using Datasette. Generating this was as easy as running “monkeytype run `which pytest`” in the Datasette root directory. #

28th July 2018

  • The death of a TLD. Sony have terminated their .xperia TLD. Ben Cox used Certificate Transparency logs to evaluate the 11 total TLDs that have been abandoned since the gTLD gold rush started—since HTTPS is becoming the default now these logs of issued certificates are a great indicator of which domains (or TLDs) are being actively used. The only deleted TLD with legitimate looking certificates (apparently for a mail server) was .mcdonalds #
  • Big tech warns of 'Japan's millennium bug' ahead of Akihito's abdication (via) Emperor Akihito’s abdication in April 2019 triggers a new era, and the Japanese calendar counts years from the coronation of the current emperor. The era hasn’t changed since 1989 and a great deal of software is unable to handle a change. To make things more complicated... the name of the new era will be announced in late February, but it needs to be represented in unicode as a single new character... and the next version of Unicode (v12) is due out in early March. There may have to be a Unicode 12.1 released shortly afterwards that includes the new codepoint. #

26th July 2018

  • Datasette: publish_subcommand hook + default plugins mechanism, used for publish heroku/now (via) I just landed a new plugin hook to Datasette master: publish_subcommand, which lets you define new publisher subcommands for the “datasette publish” CLI tool in addition to Heroku and Zeit Now. As part of this I’ve refactored the heroku/now publisher implementations into two default plugins that ship as part of Datasette—I hope to use this pattern for other core functionality in the future. #

25th July 2018