Simon Willison’s Weblog

On javascript, github, datasette, mysql, https, ...

 

Recent entries

The interesting ideas in Datasette 11 days ago

Datasette (previously) is my open source tool for exploring and publishing structured data. There are a lot of ideas embedded in Datasette. I realized that I haven’t put many of them into writing.

Publishing read-only data
Bundling the data with the code
SQLite as the underlying data engine
Far-future cache expiration
Publishing as a core feature
License and source metadata
Facet everything
Respect for CSV
SQL as an API language
Optimistic query execution with time limits
Keyset pagination
Interactive demos based on the unit tests
Documentation unit tests

Publishing read-only data

Datasette provides a read-only API to your data. It makes no attempt to deal with writes. Avoiding writes entirely is fundamental to a plethora of interesting properties, many of which are expanded on further below. In brief:

  • Hosting web applications with no read/write persistence requirements is incredibly cheap in 2018—often free (both ZEIT Now and Heroku have generous free tiers). This is a big deal: even having to pay a few dollars a month is enough to disincentivise sharing data, since now you have to figure out who will pay and ensure the payments don’t expire in the future.
  • Being read-only makes it trivial to scale: just add more instances, each with their own copy of the data. All of the hard problems in scaling web applications that relate to writable data stores can be skipped entirely.
  • Since the database file is opened using SQLite’s immutable mode, we can accept arbitrary SQL queries with no risk of them corrupting the data.

Any time your data changes, you need to publish a brand new copy of the whole database. With the right hosting this is easy: deploy a brand new copy of your data and application in parallel to your existing live deployment, then switch over incoming HTTP traffic to your API at the load balancer level. Heroku and Zeit Now both support this strategy out of the box.

Bundling the data with the code

Since the data is read-only and is encapsulated in a single binary SQLite database file, we can bundle the data as part of the app. This means we can trivially create and publish Docker images that provide both the data and the API and UI for accessing it. We can also publish to any hosting provider that will allow us to run a Python application, without also needing to provision a mutable database.

The datasette package command takes one or more SQLite databases and bundles them together with the Datasette application in a single Docker image, ready to be deployed anywhere that can run Docker containers.

SQLite as the underlying data engine

Datasette encourages people to use SQLite as a standard format for publishing data.

Relational databases are great: once you know how to use them, you can represent any data you can imagine using a carefully designed schema.

What about data that’s too unstructured to fit a relational schema? SQLite includes excellent support for JSON data—so if you can’t shape your data to fit a table schema you can instead store it as text blobs of JSON—and use SQLite’s JSON functions to filter by or extract specific fields.
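
As an illustration of that approach, here is a minimal sketch using Python’s sqlite3 module. The events table and its fields are hypothetical, and it assumes a SQLite build that includes the JSON1 extension (most Python distributions do):

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events (id integer primary key, data text)")
conn.execute(
    "insert into events (data) values (?)",
    (json.dumps({"type": "signup", "plan": "pro"}),),
)

# Filter and extract fields buried inside the JSON text using json_extract()
rows = conn.execute(
    "select id, json_extract(data, '$.plan') as plan "
    "from events where json_extract(data, '$.type') = 'signup'"
).fetchall()
print(rows)  # [(1, 'pro')]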

What about binary data? Even that’s covered: SQLite will happily store binary blobs. My datasette-render-images plugin (live demo here) is one example of a tool that works with binary image data stored in SQLite blobs.

What if my data is too big? Datasette is not a “big data” tool, but if your definition of big data is something that won’t fit in RAM, that threshold is growing all the time (2TB of RAM on a single AWS instance now costs less than $4/hour).

I’ve personally had great results from multiple GB SQLite databases and Datasette. The theoretical maximum size of a single SQLite database is around 140TB.

SQLite also has built-in support for surprisingly good full-text search, and thanks to being extensible via modules has excellent geospatial functionality in the form of the SpatiaLite extension. Datasette benefits enormously from this wider ecosystem.
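
Here is a hedged sketch of what that full-text support looks like from Python, using the widely available FTS4 module (the docs table and its contents are invented):

import sqlite3

conn = sqlite3.connect(":memory:")
# FTS4 is compiled into most SQLite builds; FTS5 works much the same way
conn.execute("create virtual table docs using fts4(title, body)")
conn.execute(
    "insert into docs (title, body) values (?, ?)",
    ("Lundy", "An island in the Bristol Channel with a letterboxing trail"),
)

# MATCH runs a full-text query against the indexed columns
print(conn.execute(
    "select title from docs where docs match 'letterboxing'"
).fetchall())  # [('Lundy',)]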

The reason most developers avoid SQLite for production web applications is that it doesn’t deal brilliantly with large volumes of concurrent writes. Since Datasette is read-only we can entirely ignore this limitation.

Far-future cache expiration

Since the data in a Datasette instance never changes, why not cache calls to it forever?

Datasette sends a far future HTTP cache expiry header with every API response. This means that browsers will only ever fetch data the first time a specific URL is accessed, and if you host Datasette behind a CDN such as Fastly or Cloudflare each unique API call will hit Datasette just once and then be cached essentially forever by the CDN.

This means it’s safe to deploy a JavaScript app using an inexpensively hosted Datasette-backed API to the front page of even a high traffic site—the CDN will easily take the load.

Zeit added Cloudflare to every deployment (even their free tier) back in July, so if you are hosted there you get this CDN benefit for free.

What if you re-publish an updated copy of your data? Datasette has that covered too. You may have noticed that every Datasette database gets a hashed suffix automatically when it is deployed:

https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4

This suffix is based on the SHA256 hash of the entire database file contents—so any change to the data will result in new URLs. If you query a previous suffix Datasette will notice and redirect you to the new one.
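
The suffix can be reproduced with a few lines of Python. This is a sketch rather than Datasette’s exact code, and the seven-character truncation is inferred from the example URL above:

import hashlib

def db_hash_suffix(path, length=7):
    # Hash the entire database file; any change to the data produces a
    # different suffix, and therefore brand new (separately cached) URLs.
    with open(path, "rb") as fp:
        digest = hashlib.sha256(fp.read()).hexdigest()
    return digest[:length]

# e.g. "fivethirtyeight-" + db_hash_suffix("fivethirtyeight.db")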

If you know you’ll be changing your data, you can build your application against the non-suffixed URL. This will not be cached and will always 302 redirect to the correct version (and these redirects are extremely fast).

https://fivethirtyeight.datasettes.com/fivethirtyeight/alcohol-consumption%2Fdrinks.json

The redirect sends an HTTP/2 push header such that if you are running behind a CDN that understands push (such as Cloudflare) your browser won’t have to make two requests to follow the redirect. You can use the Chrome DevTools to see this in action:

Chrome DevTools showing a redirect initiated by an HTTP/2 push

And finally, if you need to opt out of HTTP caching for some reason you can disable it on a per-request basis by including ?_ttl=0 in the URL query string—for example, if you want to return a random member of the Avengers it doesn’t make sense to cache the response:

https://fivethirtyeight.datasettes.com/fivethirtyeight?sql=select+*+from+[avengers%2Favengers]+order+by+random()+limit+1&_ttl=0

Publishing as a core feature

Datasette aims to reduce the friction for publishing interesting data online as much as possible.

To this end, Datasette includes a “publish” subcommand:

# deploy to Heroku
datasette publish heroku mydatabase.db
# Or deploy to Zeit Now
datasette publish now mydatabase.db

These commands take one or more SQLite databases, upload them to a hosting provider, configure a Datasette instance to serve them and return the public URL of the newly deployed application.

Out of the box, Datasette can publish to either Heroku or to Zeit Now. The publish_subcommand plugin hook means other providers can be supported by writing plugins.
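
Roughly speaking, a plugin for a new provider registers an extra command with that publish click group. The sketch below is illustrative: the myhost provider and its body are invented, but the hook name matches the one described above.

import click
from datasette import hookimpl


@hookimpl
def publish_subcommand(publish):
    # "publish" is the click group behind "datasette publish"; this adds
    # a hypothetical "datasette publish myhost mydatabase.db" command.
    @publish.command()
    @click.argument("files", type=click.Path(exists=True), nargs=-1)
    def myhost(files):
        for path in files:
            click.echo("Would upload {} and deploy Datasette to serve it".format(path))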

License and source metadata

Datasette believes that data should be accompanied by source information and a license, whenever possible. The metadata.json file that can be bundled with your data supports these. You can also provide source and license information when you run datasette publish:

datasette publish now fivethirtyeight.db \
    --source="FiveThirtyEight" \
    --source_url="https://github.com/fivethirtyeight/data" \
    --license="CC BY 4.0" \
    --license_url="https://creativecommons.org/licenses/by/4.0/"

When you use these options Datasette will create the corresponding metadata.json file for you as part of the deployment.
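
If you prefer to manage the file yourself, the generated metadata is just JSON with keys matching those options. Here is a sketch that writes out an equivalent file from Python, with values taken from the example above:

import json

metadata = {
    "source": "FiveThirtyEight",
    "source_url": "https://github.com/fivethirtyeight/data",
    "license": "CC BY 4.0",
    "license_url": "https://creativecommons.org/licenses/by/4.0/",
}

# Serve it alongside the database: datasette fivethirtyeight.db -m metadata.json
with open("metadata.json", "w") as fp:
    json.dump(metadata, fp, indent=4)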

Facet everything

I really love faceted search: it’s the first tool I turn to whenever I want to start understanding a collection of data. I’ve built faceted search engines on top of Solr, Elasticsearch and PostgreSQL and many of my favourite tools (like Splunk and Datadog) have it as a core feature.

Datasette automatically attempts to calculate facets against every table. You can read more about the Datasette Facets feature here—as a huge faceted search fan it’s one of my all-time favourite features of the project. Now I can add SQLite to the list of technologies I’ve used to build faceted search!

Respect for CSV

CSV is by far the most common format for sharing and publishing data online. Almost every useful data tool has the ability to export to it, and it remains the lingua franca of spreadsheet import and export.

It has many flaws: it can’t easily represent nested data structures, escaping rules for values containing commas are inconsistently implemented and it doesn’t have a standard way of representing character encoding.

Datasette aims to promote SQLite as a much better default format for publishing data. I would much rather download a .db file full of pre-structured data than download a .csv and then have to re-structure it as a separate piece of work.

But interacting well with the enormous CSV ecosystem is essential. Datasette has deep CSV export functionality: any data you can see, you can export—including the results of arbitrary SQL queries. If your query can be paginated Datasette can stream down every page in a single CSV file for you.

Datasette’s sister-tool csvs-to-sqlite handles the other side of the equation: importing data from CSV into SQLite tables. And the Datasette Publish web application allows users to upload their CSVs and have them deployed directly to their own fresh Datasette instance—no command line required.

SQL as an API language

A lot of people these days are excited about GraphQL, because it allows API clients to request exactly the data they need, including traversing into related objects in a single query.

Guess what? SQL has been able to do that since the 1970s!

There are a number of reasons most APIs don’t allow people to pass them arbitrary SQL queries:

  • Security: we don’t want people messing up our data
  • Performance: what if someone sends an accidental (or deliberate) expensive query that exhausts our resources?
  • Hiding implementation details: if people write SQL against our API we can never change the structure of our database tables

Datasette has answers to all three.

On security: the data is read-only, using SQLite’s immutable mode. You can’t damage it with a query—INSERT and UPDATEs will simply throw harmless errors.
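
You can see the same protection with plain Python and sqlite3. This sketch assumes a local copy of the database and uses an illustrative table name:

import sqlite3

# Open the file in SQLite's immutable mode via a URI; SQLite treats the
# database as if it can never change, so writes fail instead of touching it.
conn = sqlite3.connect("file:fivethirtyeight.db?immutable=1", uri=True)

try:
    conn.execute("insert into some_table (name) values ('mischief')")
except sqlite3.OperationalError as e:
    print("Write rejected:", e)  # e.g. "attempt to write a readonly database"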

On performance: SQLite has a mechanism for canceling queries that take longer than a certain threshold. Datasette sets this to one second by default, though you can alter that configuration if you need to (I often bump it up to ten seconds when exploring multi-GB data on my laptop).
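
Datasette’s own implementation is more involved, but the underlying SQLite mechanism can be sketched with sqlite3’s progress handler, which lets you abort a query once a deadline has passed:

import sqlite3
import time

def execute_with_time_limit(conn, sql, ms=1000):
    # SQLite calls the handler every 1,000 virtual machine instructions;
    # returning a truthy value aborts the running query with an error.
    deadline = time.monotonic() + ms / 1000
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 1000)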

On hidden implementation details: since we are publishing static data rather than maintaining an evolving API, we can mostly ignore this issue. If you are really worried about it you can take advantage of canned queries and SQL view definitions to expose a carefully selected forward-compatible view into your data.

Optimistic query execution with time limits

I mentioned Datasette’s SQL time limits above. These aren’t just there to avoid malicious queries: the idea of “optimistic SQL evaluation” is baked into some of Datasette’s core features.

Consider suggested facets—where Datasette inspects any table you view and tries to suggest columns that are worth faceting against.

The way this works is Datasette loops over every column in the table and runs a query to see if there are fewer than 20 unique values for that column. On a large table this could take a prohibitive amount of time, so Datasette sets an aggressive timeout on those queries: just 50ms. If the query fails to run in that time it is silently dropped and the column is not listed as a suggested facet.
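
Reusing the execute_with_time_limit() helper sketched earlier, a simplified approximation of that check might look like this (Datasette’s real queries and thresholds differ):

import sqlite3

def suggest_facets(conn, table, max_values=20, time_limit_ms=50):
    suggestions = []
    columns = [row[1] for row in conn.execute("PRAGMA table_info([{}])".format(table))]
    for column in columns:
        try:
            distinct = execute_with_time_limit(
                conn,
                "select count(distinct [{}]) from [{}]".format(column, table),
                ms=time_limit_ms,
            )[0][0]
        except sqlite3.OperationalError:
            continue  # query interrupted by the time limit: skip this column
        if 1 < distinct < max_values:
            suggestions.append(column)
    return suggestions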

Datasette’s JSON API provides a mechanism for JavaScript applications to use that same pattern. If you add ?_timelimit=20 to any Datasette API call, the underlying query will only get 20ms to run. If it goes over you’ll get a very fast error response from the API. This means you can design your own features that attempt to optimistically run expensive queries without damaging the performance of your app.
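
From client code that pattern might look something like the sketch below. The instance URL and query are placeholders, and the exact error payload returned on a timeout is not shown here:

import requests

params = {
    "sql": "select count(*) from some_big_table",
    "_timelimit": 20,  # give the query just 20ms to complete
}
resp = requests.get("https://example-datasette.com/mydb.json", params=params)

if resp.status_code == 200:
    rows = resp.json()["rows"]
else:
    # The query blew its 20ms budget: fail fast and fall back to something cheaper
    rows = []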

Keyset pagination

SQL pagination using OFFSET/LIMIT has a fatal flaw: if you request page number 300 at 20 per page, the underlying SQL engine needs to calculate and sort all 6,000 preceding rows before it can return the 20 you have requested.

This does not scale at all well.

Keyset pagination (often known by other names, including cursor-based pagination) is a far more efficient way to paginate through data. It works against ordered data. Each page is returned with a token representing the last record you saw, then when you request the next page the engine merely has to filter for records that are greater than that tokenized value and scan through the next 20 of them.

(Actually, it scans through 21. By requesting one more record than you intend to display you can detect if another page of results exists—if you ask for 21 but get back 20 or fewer you know you are on the last page.)

Datasette’s table view includes a sophisticated implementation of keyset pagination.

Datasette defaults to sorting by primary key (or SQLite rowid). This is perfect for efficient pagination: running a select against the primary key column for values greater than X is one of the fastest range scan queries any database can support. This allows users to paginate as deep as they like without paying the offset/limit performance penalty.
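
A minimal sketch of that rowid-based case follows; Datasette’s real implementation also serializes the token into the _next= URL parameter rather than returning it directly:

PAGE_SIZE = 20

def fetch_page(conn, table, after=None):
    # Ask for one extra row so we can tell whether another page exists.
    if after is None:
        sql = "select rowid, * from [{}] order by rowid limit {}".format(
            table, PAGE_SIZE + 1)
        rows = conn.execute(sql).fetchall()
    else:
        sql = ("select rowid, * from [{}] where rowid > ? "
               "order by rowid limit {}").format(table, PAGE_SIZE + 1)
        rows = conn.execute(sql, (after,)).fetchall()
    has_next = len(rows) > PAGE_SIZE
    next_token = rows[PAGE_SIZE - 1][0] if has_next else None
    return rows[:PAGE_SIZE], next_token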

This is also how the “export all rows as CSV” option works: when you select that option, Datasette opens a stream to your browser and internally starts keyset-pagination over the entire table. This keeps resource usage in check even while streaming back millions of rows.

Here’s where Datasette gets fancy: it handles keyset pagination for any other sort order as well. If you sort by any column and click “next” you’ll be requesting the next set of rows after the last value you saw. And this even works for columns containing duplicate values: If you sort by such a column, Datasette actually sorts by that column combined with the primary key. The “next” pagination token it generates encodes both the sorted value and the primary key, allowing it to correctly serve you the next page when you click the link.
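
The compound case can be sketched as a query generator. The table and column names here are placeholders, and in Datasette the :v and :p values are decoded from the pagination token rather than passed in raw:

def next_page_sql(table, sort_column, page_size=20):
    # :v and :p are the last row's sorted value and primary key, taken from
    # the "next" token; the compound WHERE clause resumes exactly where the
    # previous page left off, even when the sorted column contains duplicates.
    return (
        "select rowid, * from [{table}] "
        "where [{col}] > :v or ([{col}] = :v and rowid > :p) "
        "order by [{col}], rowid "
        "limit {limit}"
    ).format(table=table, col=sort_column, limit=page_size + 1)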

Try clicking “next” on this page to see keyset pagination against a sorted column in action.

Interactive demos based on the unit tests

I love interactive demos. I decided it would be useful if every single release of Datasette had a permanent interactive demo illustrating its features.

Thanks to Zeit Now, this was pretty easy to set up. I’ve actually taken it a step further: every successful push to master on GitHub is also deployed to a permanent URL.

Some examples:

The database that is used for this demo is the exact same database that is created by Datasette’s unit test fixtures. The unit tests are already designed to exercise every feature, so reusing them for a live demo makes a lot of sense.

You can view this test database on your own machine by checking out the full Datasette repository from GitHub and running the following:

python tests/fixtures.py fixtures.db metadata.json
datasette fixtures.db -m metadata.json

Here’s the code in the Datasette Travis CI configuration that deploys a live demo for every commit and every released tag.

Documentation unit tests

I wrote about the Documentation unit tests pattern back in July.

Datasette’s unit tests include some assertions that ensure that every plugin hook, configuration setting and underlying view class is mentioned in the documentation. A commit or pull request that adds or modifies these without also updating the documentation (or at least ensuring there is a corresponding heading in the docs) will fail its tests.

Learning more

Datasette’s documentation is in pretty good shape now, and the changelog provides a detailed overview of new features that I’ve added to the project. I presented Datasette at the PyBay conference in August and I’ve published my annotated slides from that talk. I was interviewed about Datasette for the Changelog podcast in May and my notes from that conversation include some of my favourite demos.

Datasette now has an official Twitter account—you can follow @datasetteproj there for updates about the project.

Letterboxing on Lundy 27 days ago

Last week Natalie and I spent a delightful two days with our friends Hannah and Adam on the beautiful island of Lundy in the Bristol Channel, 12 miles off the coast of North Devon.

I’ve been wanting to visit Lundy for years. The island is managed by the Landmark Trust, a UK charity who look after historic buildings and make them available as holiday rentals.

Our first experience with the Landmark Trust was the original /dev/fort back in 2008 when we rented a Napoleonic Sea Fortress on Alderney in the Channel Islands. Ever since then I’ve been keeping an eye out for opportunities to try out more of their properties: just two weeks ago we stayed in Wortham Manor and used it as a staging ground to help prepare a family wedding.

I cannot recommend the Landmark Trust experience strongly enough: each property is unique and fascinating, they are kept in great condition and if you split the cost of a larger rental among a group of friends the price can be comparable to a youth hostel.

Lundy is their Crown Jewels: they’ve been looking after the island since the 1960s and now offer 23 self-catering properties there.

We took the ferry out on Tuesday morning (a truly horrific two hour voyage) and back again on Thursday evening (thankfully much calmer). Once on Lundy we stayed in Castle Keep South, a two bedroom house in the keep of a castle built in the 13th century by Henry III, after he retook the island from the apparently traitorous William de Marisco (who was then hanged, drawn and quartered for good measure—apparently one of the first ever uses of that punishment). Lundy has some very interesting history attached to it.

The island itself is utterly spectacular. Three miles long, half a mile wide, surrounded by craggy cliffs and mostly topped with ferns and bracken. Not a lot of trees except for the more sheltered eastern side. A charming population of sheep, goats, Lundy Ponies and some highland cattle with extremely intimidating horns.

A highland cow that looks like Boris Johnson

(“They’re complete softies. We call that one Boris because he looks like Boris Johnson”—a lady who works in the Tavern)

Lundy has three light houses (two operational, one retired), the aforementioned castle, a charming little village, a church and numerous fascinating ruins and isolated buildings, many of which you can stay in. It has the remains of two crashed WWII German Heinkel He 111 bombers (which we eventually tracked down).

It also hosts what is quite possibly the world’s best Letterboxing trail.

Letterboxing?

Letterboxing is an outdoor activity that is primarily pursued in the UK. It consists of weatherproof boxes hidden in remote locations, usually under a pile of rocks, containing a notebook and a custom stamp. The location of the boxes is provided by a set of clues. Given the clues, your challenge is to find all of the boxes and collect their stamps in your notebook.

On Lundy the clues can be purchased from the village shop.

I had dabbled with Letterboxing a tiny bit in the past but it hadn’t really clicked with me until Natalie (a keen letterboxer) encouraged us to give it a go on Lundy.

It ended up occupying almost every waking moment of our time there, and taking us to every far-flung corner of the island.

There are 28 letterboxes on Lundy. We managed to get 27 of them—and we would have got them all, if the last one hadn’t been located on a beach that was shut off from the public due to grey seals using it to raise their newly born pups! The pups were cute enough that we forgave them.

To give you an idea for how it works, here’s the clue for letterbox 27, “The Ugly”:

The Ugly: From the lookout hut near the flagpole on the east side of Millcombe, walk in the direction of the South Light until it is on a bearing of 130°, Bramble Villa 210° and Millcombe due west. The letterbox is beneath you.

There were letterboxes in lighthouses, letterboxes in ruins, letterboxes perilously close to cliff-faces, letterboxes in church pews, letterboxes in quarries, letterboxes in caves. If you thought that letterboxing was for kids, after scrabbling down more perilous cliff paths than I can count I can assure you it isn’t!

On Thursday I clocked up 24,000 steps walking 11 miles and burned 1,643 calories. For comparison, when I ran the half marathon last year I only burned 1,222. These GPS tracks from my Apple Watch give a good impression of how far we ended up walking on our second day of searching.

When we checked the letterboxing log book in the Tavern on Wednesday evening we found most people who attempt to hit all 28 letterboxes spread it out over a much more sensible timeframe. I’m not sure that I would recommend trying to fit it in to just two days, but it’s hard to imagine a better way of adding extra purpose to an exploration of the island.

Should you attempt letterboxing on Lundy (and if you can get out there you really should consider it), a few tips:

  • If in doubt, look for the paths. Most of the harder to find letterboxes were at least located near an obvious worn path.
  • “Earthquake” is a nightmare. The clue really didn’t help us—we ended up performing a vigorous search of most of the area next to (not inside) the earthquake fault.
  • The iPhone compass app is really useful for finding bearings. We didn’t use a regular compass at all.
  • If you get stuck, check for extra clues in the letterboxing log book in the tavern. This helped us crack Earthquake.
  • There’s more than one pond. The quarry pond is very obvious once you find it.
  • Take as many different maps as you can find—many of the clues reference named landmarks that may not appear on the letterboxing clue map. We forgot to grab an offline copy of Lundy in the Google Maps app and regretted it.

If you find yourself in Ilfracombe on the way to or from Lundy, the Ilfracombe Museum is well worth your time. It’s a classic example in the genre of “eccentric collects a wide variety of things, builds a museum for them”. Highlights include a cupboard full of pickled bats and a drawer full of 100-year-old wedding cake samples.

The subset of reStructuredText worth committing to memory one month ago

reStructuredText is the standard for documentation in the Python world.

It’s a bit weird. It’s like Markdown but older, more feature-filled and in my experience significantly harder to remember.

There are plenty of guides and cheatsheets out there, but when writing simple documentation for software projects I think there’s a subset that is worth committing to memory. I’ll describe that subset here.

First though: when writing reStructuredText having a live preview render is extremely useful. I use rst.ninjs.org for this. If you don’t trust that hosted version (it round-trips your documentation through the server in order to render it) you can run a local copy instead using the underlying source code.

Paragraphs

Paragraphs work the same way as Markdown and plain text. They are nice and easy.

This is the first paragraph. No need to wrap the text (though you can wrap at e.g. 80 characters without affecting rendering).

This is the second paragraph.

Headings

reStructuredText section headings are a little surprising.

Markdown has multiple levels of heading, each with a different number of prefix hashes:

# Markdown heading level 1
## Markdown heading level 2
..
###### Markdown heading level 6

In reStructuredText there is no single format for these different levels. Instead, the format you use first will be treated as an H1, the next format as an H2 and so on. Here’s the description from the official documentation:

Sections are identified through their titles, which are marked up with adornment: “underlines” below the title text, or underlines and matching “overlines” above the title. An underline/overline is a single repeated punctuation character that begins in column 1 and forms a line extending at least as far as the right edge of the title text. Specifically, an underline/overline character may be any non-alphanumeric printable 7-bit ASCII character. […] There may be any number of levels of section titles, although some output formats may have limits (HTML has 6 levels).

This is deeply confusing. I suggest instead standardizing on the following:

=====================
 This is a heading 1
=====================

This heading has = signs both above and below, and they extend past the text by a single character in each direction.

This is a heading 2
===================

This is a heading 3
-------------------

This is a heading 4
~~~~~~~~~~~~~~~~~~~

If you need more levels, you can invent them using whatever character you like—but try to stay consistent within your project.

Bulleted lists

As with headings, you can use a variety of characters for these. I suggest sticking with asterisks.

A blank line is required before starting a bulleted list.

* A bullet point
* Another bullet point

If you decide to wrap your text (I tend not to) you must maintain the indentation on the wrapped lines:

* A bulleted list item. Since the text is wrapped each subsequent
  line of text must be indented by two spaces.
* Second list item.

Nested lists are supported, but you MUST leave a blank line above the first inner list bullet point or they won’t work:

* This is the first bullet list item. Here comes a sub-list:

  * Hello sublist
  * Sublist two

* Back to the parent list.

Inline markup

I only use three inline markup features: bold, italic and code.

**Bold text** is surrounded by two asterisks.

*Italic text* is one asterisk.

``inline code`` uses two backticks at either side of the code.

Links

Links are my least favorite feature of reStructuredText. There are several different ways of including them, but the one I use most often (and hence have committed to memory) is this one:

`a link, note the trailing underscores <http://example.com>`__

So that’s a backtick at the start, then the link text, then the URL wrapped in angle brackets, then another backtick and then TWO underscores to finish it off.

Why two underscores? Because if you only use one, the text part of the link is remembered and can be used to duplicate your link later on—see example below. In my experience this is more trouble than it’s worth.

A more complex link syntax example (documented here) looks like this:

See the `Python home page`_ for info.

This link_ is an alias to the link above.

.. _Python home page: http://www.python.org
.. _link: `Python home page`_

I can’t remember this at all, so I stick with the anonymous hyperlink syntax instead.

Code blocks

The easiest way to embed a block of code is like this:

::

    # This is a code example
    print("It needs to be indented")

The :: indicates that a code block is coming up. The blank line between the :: and the start of the indented code is required.

Most renderers have the ability to apply syntax highlighting. To specify that a block should have syntax highlighting for a specific language, replace the :: in the above example with one of the following:

.. code-block:: sql

.. code-block:: javascript

.. code-block:: python

Images

There are plenty of options for embedding images, but the most basic syntax (worth remembering) looks like this:

.. image:: full_text_search.png

This will embed an image of that filename that sits in the same directory as the document itself.

Internal references

In my opinion this is the key feature that makes reStructuredText more powerful than Markdown for larger documentation projects.

Again, there is a vast and complex array of options around this, but the key thing to remember is how to add a reference name to a specific section and how to link to that section later on.

Names are applied to section headings, by adding some magic text before the heading itself. For example:

.. _full_text_search:

Full-text search
================

Note the format: two periods, then a space, then an underscore, then the label, then a colon at the end.

The label full_text_search is now associated with that heading. I can link to it from any page in my documentation project like so:

:ref:`full_text_search`

Note that the leading underscore isn’t included in this reference.

The link text displayed will be the text of the heading, in this case “Full-text search”. If I want to replace that link text with something custom, I can do so like this:

Learn about the :ref:`search feature <full_text_search>`.

This syntax is similar to the inline hyperlink syntax described above.

Learning more

I extracted the patterns I describe in this post from the Datasette documentation—I encourage you to dig around in the source code to see how it all works.

The definitive guide to reStructuredText is the reStructuredText Markup Specification. My favourite of the various quick references is the Restructured Text (reST) and Sphinx CheatSheet by Thomas Cokelaer.

I’m a huge fan of Read the Docs for hosting documentation—it’s the key reason I use reStructuredText in my projects. Unsurprisingly, they offer extensive documentation to help you make the most of their platform.

Analyzing US Election Russian Facebook Ads two months ago

Two interesting data sources have emerged in the past few weeks concerning the Russian impact on the 2016 US elections.

FiveThirtyEight published nearly 3 million tweets from accounts associated with the Russian “Internet Research Agency”—see my article and searchable tweet archive here.

Separately, the House Intelligence Committee Minority released 3,517 Facebook ads that were reported to have been bought by the Russian Internet Research Agency as a set of redacted PDF files.

Exploring the Russian Facebook Ad spend

The initial data was released as zip files full of PDFs, one of the least friendly formats you can use to publish data.

Ed Summers took on the intimidating task of cleaning that up. His results are incredible: he used the pytesseract OCR library and PyPDF2 to extract both the images and the associated metadata and convert the whole lot into a single 3.9MB JSON file.

I wrote some code to convert his JSON file to SQLite (more on the details later) and the result can be found here:

https://russian-ira-facebook-ads.datasettes.com/

Here’s an example search for “cops” ordered by the USD equivalent spent on the ad (some of the spends are in rubles, so I convert those to USD using today’s exchange rate of 0.016).

Search ads for cops, order by USD descending

One of the most interesting things about this data is that it includes the Facebook ad targeting options that were used to promote the ads. I’ve built a separate interface for browsing those—you can see the most frequently applied targets:

Top targets

And by browsing through the different facets you can construct e.g. a search for all ads that targeted people interested in both interests:Martin Luther King and interests:Police Brutality is a Crime: https://russian-ira-facebook-ads.datasettes.com/russian-ads-919cbfd/display_ads?_targets_json=["d6ade"%2C"40c27"]

New tooling under the hood

I ended up spinning up several new projects to help process and explore this data.

sqlite-utils

The first is a new library called sqlite-utils. If data is already in CSV I tend to convert it using csvs-to-sqlite, but if data is in a less tabular format (JSON or XML for example) I have to hand-write code. Here’s a script I wrote to process the XML version of the UK Register of Members Interests for example.

My goal with sqlite-utils is to take some of the common patterns from those scripts and make them as easy to use as possible, in particular when running inside a Jupyter notebook. It’s still very early, but the script I wrote to process the Russian ads JSON is a good example of the kind of thing I want to do with it.
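
The core pattern it aims to support looks roughly like this sketch (the file name, table name and exact API details are illustrative, since the library was very young at the time of writing):

import json
import sqlite_utils

# Load a JSON file full of dictionaries (e.g. the Russian ads data)
with open("ads.json") as fp:
    ads = json.load(fp)

db = sqlite_utils.Database("russian-ads.db")
# insert_all() creates the table, inferring columns from the records themselves
db["ads"].insert_all(ads)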

datasette-json-html

The second new tool is a new Datasette plugin (and corresponding plugin hook) called datasette-json-html. I used this to solve the need to display both rendered images and customized links as part of the regular Datasette instance.

It’s a pretty crazy solution (hence why it’s implemented as a plugin and not part of Datasette core) but it works surprisingly well. The basic idea is to support a mini JSON language which can be detected and rendered as HTML. A couple of examples:

{
  "img_src": "https://raw.githubusercontent.com/edsu/irads/03fb4b/site/images/0771.png",
  "width": 200
}

Is rendered as an HTML <img src=""> element.

[
  {
    "label": "location:United States",
    "href": "/russian-ads/display_ads?_target=ec3ac"
  },
  {
    "label": "interests:Martin Luther King",
    "href": "/russian-ads/display_ads?_target=d6ade"
  },
  {
    "label": "interests:Jr.",
    "href": "/russian-ads/display_ads?_target=8e7b3"
  }
]

Is rendered as a comma-separated list of HTML links.

Why use JSON for this? Because SQLite has some incredibly powerful JSON features, making it trivial to output JSON as part of the result of a SQL query. Most interestingly of all it has json_group_array() which can work as an aggregation function to combine a set of related rows into a single JSON array.

The display_ads page shown above is powered by a SQL view. Here’s the relevant subset of that view:

select ads.id,
    case when image is not null then
        json_object("img_src", "https://raw.githubusercontent.com/edsu/irads/03fb4b/site/" || image, "width", 200)
    else
        "no image"
    end as img,
    json_group_array(
        json_object(
            "label", targets.name,
            "href", "/russian-ads/display_ads?_target="
                || urllib_quote_plus(targets.id)
        )
    ) as targeting
from ads
    join ad_targets on ads.id = ad_targets.ad_id
    join targets on ad_targets.target_id = targets.id
group by ads.id limit 10

I’m using SQLite’s JSON functions to dynamically assemble the JSON format that datasette-json-html knows how to render. I’m delighted at how well it works.

I’ve turned off arbitrary SQL querying against the main Facebook ads Datasette instance, but there’s a copy running at https://russian-ira-facebook-ads-sql-allowed.now.sh/russian-ads if you want to play with these queries.

Weird implementation details

The full source code for my implementation is available on GitHub.

I ended up using an experimental plugin hook to enable additional custom filtering on Datasette views in order to support showing ads against multiple m2m targets, but hopefully that will be made unnecessary as work on Datasette’s support for m2m relationships progresses.

I also experimented with YAML to generate the metadata.json file as JSON strings aren’t a great way of representing multi-line HTML and SQL. And if you want to see some really convoluted SQL have a look at how the canned query for the faceted targeting interface works.

This was a really fun project, which further stretched my ideas about what Datasette should be capable of out of the box. I’m hoping that the m2m work will make a lot of these crazy hacks redundant.

Elsewhere

10th October 2018

  • github-debug.com (via) This is a neat trick: GitHub have a dedicated site for their support engineers to send you to if you can’t connect to them. The site tests download speeds from their various domains and then lets you click a button to have GitHub run a traceroute/ping from their servers to your detected IP address and output the results (use devtools to spy on their API method for doing this). Then you can paste the results into a message to their support team. Turns out fastly-debug.com and dropbox-debug.com implement a similar pattern for those services as well. #

8th October 2018

  • The ASGI specification provides an opportunity for Python to hit a productivity/performance sweet-spot for a wide range of use-cases, from writing high-volume proxy servers through to bringing large-scale web applications to market at speed.

    Tom Christie #

  • Relational databases are a commodity now, but they power a much larger fraction of the world’s economy than AI ever will. And no company has a “relational database strategy”.

    Erik Bernhardsson #

19th September 2018

  • Build impossible programs. Delightful talk by Julia Evans describing how she went about building a Ruby profiler in Rust despite having no knowledge of Ruby internals and only beginner’s knowledge of Rust. #
  • Sqorn (via) JavaScript library for building SQL queries that makes really smart usage of ES6 tagged template literals. The magic of tagged template literals is that they let you intercept and process interpolated values, making them ideally suited to escaping parameters in SQL queries. Sqorn takes that basic ability and layers on some really interesting API design to allow you to further compose queries. #

18th September 2018

  • Extended Validation Certificates are Dead. Troy Hunt has been writing about the flaws of Extended Validation certificates for a while. Now iOS 12 is out and Mobile Safari no longer displays their visual indicator in the URL bar (and desktop Safari will stop doing so next week when Mac OS Mojave ships). EV certificates are being dropped by many of the larger companies that were using them. “This turned out to be a long blog post because every time I sat down to write, more and more evidence on the absolute pointlessness of EV presented itself”. #

2nd September 2018

  • Tech Notes: TypeScript at Google (via) In which Evan Martin provides some fascinating colour on the state of JavaScript tooling within Google, which has some unique challenges given that Gmail is 14 years old now and Google have evolved their own internal JavaScript stack which differs widely from the rest of the industry (mainly because it predates most of the successful open source tools). “Which leads me to the middle path, which my little team has been pursuing: incrementally adopt some external tooling where it makes sense, by figuring out how to make it interoperate with our existing code base.” #
  • A tour of JavaScript timers on the web (via) Do you understand the differences between setTimeout, setInterval, setImmediate, requestAnimationFrame and requestIdleCallback? I didn’t. #
  • The (broken) economics of OSS (via) This is worth reading: a very well thought-out summary of the challenges of financially supporting open source infrastructure projects in a world of cloud providers. Matt Klein is the creator of the Envoy proxy at Lyft. One of his conclusions is that the open source fellowship model (where foundations provide a full time salary to key maintainers) deserves more attention. #

1st September 2018

  • When you’re pumping messages around the Internet between heterogeneous codebases built by people who don’t know each other, shit is gonna happen. That’s the whole basis of the Web: You can safely ignore an HTTP header or HTML tag you don’t understand, and nothing breaks. It’s great because it allows people to just try stuff out, and the useful stuff catches on while the bad ideas don’t break anything.

    Tim Bray #

31st August 2018

  • Advice for a new executive, by Chad Dickerson (via) Lara Hogan shares the advice she was given by Chad Dickerson (CTO and then CEO of Etsy) when she first became VP Engineering at Kickstarter. There is so much good material in here. I can vouch for the “peer support group” recommendation: Natalie and I benefited from that through Y Combinator and ended up building our own founder peer support group when we moved our startup back to London. Having a confidential trusted group with which to discuss the challenges of growing a company was invaluable. #

27th August 2018

  • In too many organizations, deploy code is a technical backwater, an accumulation of crufty scripts and glue code, forked gems and interns’ earnest attempts to hack up Capistrano.  It usually gives off a strong whiff of “sloppily evolved from many 2 am patches with no code review”. This is insane.  Deploy software is the most important software you have.  Treat it that way: recruit an owner, allocate real time for development and testing, bake in metrics and track them over time.

    Charity Majors #

25th August 2018

  • Serverless for data scientists (via) Slides and accompanying notes from a talk by Mike Lee Williams at PyBay, providing an overview of Zappa and diving a bit more deeply into pywren, which makes it trivial to parallelize a function across a set of AWS lambda instances (serverless Python map() execution essentially). I really like this format for sharing presentations—I used something similar for my own PyBay talk. #
  • Computational and Inferential Thinking: The Foundations of Data Science. Free online textbook written for the UC Berkeley Foundations of Data Science class. The examples are all provided as Jupyter notebooks, using the mybinder web application to allow students to launch interactive notebooks for any of the examples without having to install any software on their own machines. #
  • The Future of Notebooks: Lessons from JupyterCon (via) It sounds like reactive notebooks (where cells keep track of their dependencies on other cells and re-evaluate when those update) were a hot topic at JupyterCon this year. #
  • Most administrators will force users to change their password at regular intervals, typically every 30, 60 or 90 days. This imposes burdens on the user (who is likely to choose new passwords that are only minor variations of the old) and carries no real benefits as stolen passwords are generally exploited immediately. [...] Regular password changing harms rather than improves security, so avoid placing this burden on users. However, users must change their passwords on indication or suspicion of compromise.

    UK National Cyber Security Centre #

  • In case you missed it: @GoogleColab can open any @ProjectJupyter notebook directly from @github! To run the notebook, just replace “github.com” with “colab.research.google.com/github/” in the notebook URL, and it will be loaded into Colab.

    Jake VanderPlas #

  • Honeycomb changelog (via) Too few hosted services have detailed user-facing changelogs. This one from Honeycomb (a metrics, tracing and observability platform) is a particularly great example. I especially like the use of animated screenshots, something I’ve been evangelizing pretty heavily recently for internal communication at work. #
  • I don't like Jupyter Notebooks - a presentation by Joel Grus (via) Fascinating talk by Joel Grus at the Jupyter conference in New York. He highlights some of the drawbacks of the Jupyter way of working, including the huge confusion that can come from the ability to execute cells out of order (something Observable notebooks solve brilliantly using spreadsheet-style reactive cell associations). He also makes strong arguments that notebooks encourage a way of working that discourages people from producing stable, repeatable and well tested code. #