Simon Willison’s Weblog


Recent entries

Deploying an asynchronous Python microservice with Sanic and Zeit Now two days ago

Back in 2008 Natalie Downe and I deployed what today we would call a microservice: json-head, a tiny Google App Engine app that allowed you to make an HTTP HEAD request against a URL and get back the HTTP headers as JSON. One of our initial use-cases for this was Natalie’s addSizes.js, an unobtrusive jQuery script that could annotate links to PDFs and other large files with their corresponding file size pulled from the Content-Length header. Another potential use-case is detecting broken links, since the API can be used to spot 404 status codes (as in this example).

At some point in the following decade json-head.appspot.com stopped working. Today I’m bringing it back, mainly as an excuse to try out the combination of Python 3.5 async, the Sanic microframework and Zeit’s brilliant Now deployment platform.

First, a demo. https://json-head.now.sh/?url=https://simonwillison.net/ returns the following:

[
    {
        "ok": true,
        "headers": {
            "Date": "Sat, 14 Oct 2017 18:37:52 GMT",
            "Content-Type": "text/html; charset=utf-8",
            "Connection": "keep-alive",
            "Set-Cookie": "__cfduid=dd0b71b4e89bbaca5b27fa06c0b95af4a1508006272; expires=Sun, 14-Oct-18 18:37:52 GMT; path=/; domain=.simonwillison.net; HttpOnly; Secure",
            "Cache-Control": "s-maxage=200",
            "X-Frame-Options": "SAMEORIGIN",
            "Via": "1.1 vegur",
            "CF-Cache-Status": "HIT",
            "Vary": "Accept-Encoding",
            "Server": "cloudflare-nginx",
            "CF-RAY": "3adca70269a51e8f-SJC",
            "Content-Encoding": "gzip"
        },
        "status": 200,
        "url": "https://simonwillison.net/"
    }
]

Given a URL, json-head.now.sh performs an HTTP HEAD request and returns the resulting status code and the HTTP headers. Results are returned with the Access-Control-Allow-Origin: * header so you can call the API using fetch() or XMLHttpRequest from JavaScript running on any page.
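
As a quick illustration of how you might consume the API (my own sketch, not part of the original service), here is the equivalent call from Python using requests; JavaScript in a browser would hit the same URL with fetch() and get the same JSON back thanks to that CORS header:

import requests

response = requests.get(
    'https://json-head.now.sh/',
    params={'url': 'https://simonwillison.net/'},
)
# The API returns a list with one object per requested URL
for result in response.json():
    if result['ok']:
        print(result['url'], result['status'], result['headers'].get('Content-Type'))
    else:
        print(result['url'], 'failed:', result['error'])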

Sanic and Python async/await

A key new feature added to Python 3.5 back in September 2015 was built-in syntactic support for coroutine control via the async/await statements. Python now has some serious credibility as a platform for asynchronous I/O (the concept that got me so excited about Node.js back in 2009). This has led to an explosion of asynchronous innovation around the Python community.

json-head is the perfect application for async - it’s little more than a dumbed-down HTTP proxy, accepting incoming HTTP requests, making its own requests elsewhere and then returning the results.

Sanic is a Flask-like web framework built specifically to take advantage of async/await in Python 3.5. It’s designed for speed - built on top of uvloop, a Python wrapper for libuv (which itself was originally built to power Node.js). uvloop’s self-selected benchmarks are extremely impressive.

Zeit Now

To host this new microservice, I chose Zeit Now. It’s a truly beautiful piece of software design.

Now lets you treat deployments as immutable. Every time you deploy you get a brand new URL. You can then interact with your deployment directly, or point an existing alias to it if you want a persistent URL for your project.

Deployments are free, and deployed code stays available forever due to some clever engineering behind the scenes.

Best of all: deploying a project takes just a single command: type now and the code in your current directory will be deployed to their cloud and assigned a unique URL.

Now was originally built for Node.js projects, but last August Zeit added Docker support. If the directory you run it in contains a Dockerfile, running now will upload, build and run the corresponding image.

There’s just one thing missing: good examples of how to deploy Python projects to Now using Docker. I’m hoping this article can help fill that gap.

Here’s the complete Dockerfile I’m using for json-head:

FROM python:3
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
EXPOSE 8006
CMD ["python", "json_head.py"]

I’m using the official Docker Python image as a base, copying the current directory into the image, using pip install to install dependencies, then exposing port 8006 (for no reason other than that it’s the port I use for my local development environment) and running the json_head.py script. Now is smart enough to forward incoming HTTP traffic on port 80 to the port that was exposed by the container.

If you set up Now yourself (npm install -g now or use one of their installers) you can deploy my code directly from GitHub to your own instance with a single command:

$ now simonw/json-head
> Didn't find directory. Searching on GitHub...
> Deploying GitHub repository "simonw/json-head" under simonw
> Ready! https://simonw-json-head-xqkfgorgei.now.sh (copied to clipboard) [1s]
> Initializing…
> Building
> ▲ docker build
> Sending build context to Docker daemon 7.168 kB
> Step 1 : FROM python:3
> 3: Pulling from library/python
> ... lots more stuff here ...

Initial implementation

Here’s my first working version of json-head using Sanic:

from sanic import Sanic
from sanic import response
import aiohttp

app = Sanic(__name__)

async def head(session, url):
    try:
        async with session.head(url) as response:
            return {
                'ok': True,
                'headers': dict(response.headers),
                'status': response.status,
                'url': url,
            }
    except Exception as e:
        return {
            'ok': False,
            'error': str(e),
            'url': url,
        }

@app.route('/')
async def handle_request(request):
    url = request.args.get('url')
    if url:
        async with aiohttp.ClientSession() as session:
            head_info = await head(session, url)
            return response.json(
                head_info,
                headers={
                    'Access-Control-Allow-Origin': '*'
                },
            )
    else:
        return response.html('Try /?url=xxx')

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8006)

This exact code is deployed at https://json-head-thlbstmwfi.now.sh/ - since Now deployments are free, there’s no reason not to leave work-in-progress examples hosted as throwaway deployments.

In addition to Sanic, I’m also using the handy aiohttp asynchronous HTTP library - which features API design clearly inspired by my all-time favourite HTTP library, requests.

The key new pieces of syntax to understand in the above code are the async and await statements. async def is used to declare a function that acts as a coroutine. Coroutines need to be executed inside an event loop (which Sanic handles for us), but gain the ability to use the await statement.

The await statement is the real magic here: it suspends the current coroutine until the coroutine it is calling has finished executing. It is this that allows us to write asynchronous code without descending into a messy hell of callback functions.
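
As a minimal standalone illustration (my own sketch, using asyncio directly rather than Sanic’s event loop), note how each await hands control back to the event loop until the awaited coroutine completes:

import asyncio

async def pretend_fetch(name, delay):
    # await suspends this coroutine; the event loop is free to run others
    await asyncio.sleep(delay)
    return '{} finished after {}s'.format(name, delay)

async def main():
    # Each await pauses main() until the awaited coroutine has returned
    print(await pretend_fetch('first', 1))
    print(await pretend_fetch('second', 2))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())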

Adding parallel requests

So far we haven’t really taken advantage of what async I/O can do - if every incoming HTTP request results in a single outgoing HTTP request then async may help us scale to serve more incoming requests at once, but it’s not really giving us any new functionality.

Executing multiple outbound HTTP requests in parallel is a much more interesting use-case. Let’s add support for multiple ?url= parameters, such as the following:

https://json-head.now.sh/?url=https://simonwillison.net/&url=https://www.google.com/

[
    {
        "ok": true,
        "headers": {
            "Date": "Sat, 14 Oct 2017 19:35:29 GMT",
            "Content-Type": "text/html; charset=utf-8",
            "Connection": "keep-alive",
            "Set-Cookie": "__cfduid=ded486c1faaac166e8ae72a87979c02101508009729; expires=Sun, 14-Oct-18 19:35:29 GMT; path=/; domain=.simonwillison.net; HttpOnly; Secure",
            "Cache-Control": "s-maxage=200",
            "X-Frame-Options": "SAMEORIGIN",
            "Via": "1.1 vegur",
            "CF-Cache-Status": "EXPIRED",
            "Vary": "Accept-Encoding",
            "Server": "cloudflare-nginx",
            "CF-RAY": "3adcfb671c862888-SJC",
            "Content-Encoding": "gzip"
        },
        "status": 200,
        "url": "https://simonwillison.net/"
    },
    {
        "ok": true,
        "headers": {
            "Date": "Sat, 14 Oct 2017 19:35:29 GMT",
            "Expires": "-1",
            "Cache-Control": "private, max-age=0",
            "Content-Type": "text/html; charset=ISO-8859-1",
            "P3P": "CP=\"This is not a P3P policy! See g.co/p3phelp for more info.\"",
            "Content-Encoding": "gzip",
            "Server": "gws",
            "X-XSS-Protection": "1; mode=block",
            "X-Frame-Options": "SAMEORIGIN",
            "Set-Cookie": "1P_JAR=2017-10-14-19; expires=Sat, 21-Oct-2017 19:35:29 GMT; path=/; domain=.google.com",
            "Alt-Svc": "quic=\":443\"; ma=2592000; v=\"39,38,37,35\"",
            "Transfer-Encoding": "chunked"
        },
        "status": 200,
        "url": "https://www.google.com/"
    }
]

We’re now accepting multiple URLs and executing multiple HEAD requests - and because Python 3.5 async makes it easy to run them in parallel, our overall response time should match that of the slowest individual HEAD request rather than the sum of all of them.

Here’s an implementation that adds support for multiple, parallel outbound HTTP requests:

import asyncio  # needed for asyncio.gather() below

@app.route('/')
async def handle_request(request):
    urls = request.args.getlist('url')
    if urls:
        async with aiohttp.ClientSession() as session:
            head_infos = await asyncio.gather(*[
                head(session, url) for url in urls
            ])
            return response.json(
                head_infos,
                headers={'Access-Control-Allow-Origin': '*'},
            )
    else:
        return response.html(INDEX)  # INDEX (not shown) holds the HTML for the index page

We’re using the asyncio module from the Python 3.5 standard library here - in particular the gather function. asyncio.gather takes a list of coroutines and returns a future aggregating their results. This future will resolve (and return to the corresponding await statement) as soon as all of those coroutines have returned their values.
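
To make that behaviour concrete, here is a tiny standalone sketch (mine, not from json-head): both coroutines run concurrently, the total time is roughly that of the slowest one, and the results come back in the order the coroutines were passed in:

import asyncio

async def delayed(value, seconds):
    await asyncio.sleep(seconds)
    return value

async def main():
    # Takes ~2 seconds rather than 3, and prints ['slow', 'fast']
    results = await asyncio.gather(delayed('slow', 2), delayed('fast', 1))
    print(results)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())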

My final code for json-head can be found on GitHub. As I hope I’ve demonstrated, the combination of Python 3.5+, Sanic and Now makes deploying asynchronous Python microservices trivially easy.

Changelogs to help understand the fires in the North Bay six days ago

The situation in the counties north of San Francisco is horrifying right now. I’ve repurposed some of the tools I built for the Irma Response project last month to collect and track some data that might be of use to anyone trying to understand what’s happening up there. I’m sharing these now in the hope that they might prove useful.

I’m scraping a number of sources relevant to the crisis, and making the data available in a repository on GitHub. Because it’s a git repository, changes to those sources are tracked automatically. The value I’m providing here isn’t so much the data itself, it’s the history of the data. If you need to see what has changed and when, my repository’s commit log should have the answers for you. Or maybe you’ll just want to occasionally hit refresh on this history of changes to srcity.org/610/Emergency-Information to see when they edited the information.
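
The underlying pattern is simple enough to sketch - this is an illustrative outline with assumed names, not the actual code from north_bay.py, which handles more formats and writes more descriptive commit messages:

import json
import subprocess

import requests

def scrape_to_git(url, filename, repo_dir):
    # Fetch the source and write it into the repository checkout
    data = requests.get(url).json()
    with open('%s/%s' % (repo_dir, filename), 'w') as fp:
        # Stable key ordering and indentation keep the git diffs readable
        json.dump(data, fp, indent=2, sort_keys=True)
    subprocess.check_call(['git', 'add', filename], cwd=repo_dir)
    # git commit exits non-zero when nothing changed, so don't treat that as an error
    subprocess.call(['git', 'commit', '-m', 'Updated %s' % filename], cwd=repo_dir)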

The sources I’m tracking right now are:

  1. The Santa Rosa Fire Department’s Emergency Information page. This is being maintained by hand so it’s not a great source of structured data, but it has key details like the location and availability of shelters and it’s useful to know what was changed and when. History of changes to that page.
  2. PG&E power outages. This is probably the highest quality dataset with the neatest commit messages. The commit history of these shows exactly when new outages are reported and how many customers were affected.
  3. Road Conditions in the County of Sonoma. If you want to understand how far the fire has spread, this is a useful source of data as it shows which roads have been closed due to fire or other reasons. History of changes.
  4. California Highway Patrol Incidents, extracted from a KML feed on quickmap.dot.ca.gov. Since these cover the whole state of California there’s a lot of stuff in here that isn’t directly relevant to the North Bay, but the incidents that mention fire still help tell the story of what’s been happening. History of changes.

The code for the scrapers can be found in north_bay.py. Please leave comments, feedback or suggestions on other useful potential sources of data in this GitHub issue.

Recovering missing content from the Internet Archive eight days ago

When I restored my blog last weekend I used the most recent SQL backup of my blog’s database from back in 2010. I thought it had all of my content from before I started my 7 year hiatus, but in watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. Turns out the SQL backup I was working from was missing some content.

Thank goodness then for the Wayback Machine at the Internet Archive! I tried some of the missing URLs there and found they had been captured and preserved. But how to get them back?

A quick search turned up wayback-machine-downloader, an open-source Ruby script that claims to be able to “download an entire website from the Internet Archive Wayback Machine”. I gem installed it and tried it out (after some cargo cult incantations to work around some weird certificate errors I was seeing):

rvm osx-ssl-certs update all
gem update --system
gem install wayback_machine_downloader

wayback_machine_downloader http://simonwillison.net/

And it worked! I left it running overnight and came back to a folder containing 18,952 HTML files, neatly arranged in a directory structure that matched my site:

$ find . | more
.
./simonwillison.net
./simonwillison.net/2002
./simonwillison.net/2002/Aug
./simonwillison.net/2002/Aug/1
./simonwillison.net/2002/Aug/1/cetis
./simonwillison.net/2002/Aug/1/cetis/index.html
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial/index.html
...

I tarred them up into an archive and backed them up to Dropbox.

Next challenge: how to restore the missing content?

I’m a recent and enthusiastic adopter of Jupyter notebooks. As a huge fan of development in a REPL I’m shocked I was so late to this particular party. So I fired up Jupyter and used it to start playing with the data.

Here’s the final version of my notebook. I ended up with a script that did the following:

  • Load in the full list of paths from the tar archive, and filter for just the ones matching the /YYYY/Mon/DD/slug/ format used for my blog content
  • Talk to my local Django development environment and load in the full list of actual content URLs represented in that database.
  • Calculate the difference between the two - those are the 213 items that need to be recovered.
  • For each of those 213 items, load the full HTML that had been saved by the Internet Archive and feed it into the BeautifulSoup HTML parsing library.
  • Detect if each one is an entry, a blogmark or a quotation. Scrape the key content out of each one based on the type.
  • Scrape the tags for each item, using this delightful one-liner: [a.text for a in soup.findAll('a', {'rel': 'tag'})] (see the sketch after this list)
  • Scrape the comments for each item separately. These were mostly spam, so I haven’t yet recovered these for publication (I need to do some aggressive spam filtering first). I have however stashed them in the database for later processing.
  • Write all of the scraped data out to a giant JSON file and upload it to a gist (a nice cheap way of giving it a URL).
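
Here is a rough sketch of the scraping step: the rel="tag" one-liner is from the list above, but the other selectors and the parse_archived_item name are illustrative guesses rather than the exact notebook code:

from bs4 import BeautifulSoup

def parse_archived_item(html):
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'tags': [a.text for a in soup.findAll('a', {'rel': 'tag'})],
        # The h2 selector is an assumption about my old templates
        'title': soup.find('h2').text if soup.find('h2') else None,
    }

with open('simonwillison.net/2002/Aug/1/cetis/index.html') as fp:
    print(parse_archived_item(fp.read()))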

Having executed the above script, I now have a JSON file containing the parsed content for all of the missing items found in the Wayback Machine. All I needed then was a script which could take that JSON and turn it into records in the database. I implemented that as a custom Django management command and deployed it to Heroku.
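
The shape of such a management command is roughly as follows - a simplified sketch rather than the real import_blog_json, which handles entries, blogmarks and quotations separately and uses different JSON keys:

# blog/management/commands/import_blog_json.py (illustrative sketch)
import json
from urllib.request import urlopen

from django.core.management.base import BaseCommand
from django.utils.dateparse import parse_datetime

from blog.models import Entry, Tag

class Command(BaseCommand):
    help = 'Import recovered blog content from a JSON file hosted at a URL'

    def add_arguments(self, parser):
        parser.add_argument('--url_to_json', required=True)
        parser.add_argument('--tag_with', default=None)

    def handle(self, *args, **options):
        items = json.loads(urlopen(options['url_to_json']).read().decode('utf-8'))
        for item in items:
            # Hypothetical keys - the real JSON carries more fields per item
            obj = Entry.objects.create(
                title=item['title'],
                body=item['body'],
                created=parse_datetime(item['created']),
            )
            if options['tag_with']:
                tag, _ = Tag.objects.get_or_create(tag=options['tag_with'])
                obj.tags.add(tag)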

Last step: shell into a Heroku dyno (using heroku run bash) and run the following:

./manage.py import_blog_json \
    --url_to_json=https://gist.github.com/simonw/5a5bc1f58297d2c7d68dd7448a4d6614/raw/28d5d564ae3fe7165802967b0f9c4eff6091caf0/recovered-blog-content.json \
    --tag_with=recovered

The result: 213 recovered items (which I tagged with recovered so I could easily browse them). Including the most important entry on my whole site, my write-up of my wedding!

So thank you very much to the Internet Archive team, and thank you Hartator for your extremely useful wayback-machine-downloader tool.

Implementing faceted search with Django and PostgreSQL 11 days ago

I’ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL “LIKE” queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotations) and any combination of the above. Some example searches:

It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.

Screenshot of my faceted search interface

I love this kind of search interface, because the counts tell you so much more about the underlying data. Turns out I was most active in quoting people talking about JavaScript back in 2007, for example.

I usually build faceted search engines using either Solr or Elasticsearch (though the first version of search on this blog was actually powered by Hyper Estraier) - but I’m hosting this blog as simply and inexpensively as possible on Heroku and I don’t want to shell out for a SaaS search solution or run an Elasticsearch instance somewhere myself. I thought I’d have to go back to using Google Custom Search.

Then I read Postgres full-text search is Good Enough! by Rachid Belaid - closely followed by Postgres Full-Text Search With Django by Nathan Shafer - and I decided to have a play with the new PostgreSQL search functionality that was introduced in Django 1.10.

… and wow! Full-text search is yet another example of a feature that’s been in PostgreSQL for nearly a decade now, incrementally improving with every release to the point where it’s now really, really good.

At its most basic level a search system needs to handle four things:

  • It needs to take user input and find matching documents.
  • It needs to understand and ignore stopwords (common words like “the” and “and”) and apply stemming - knowing that “ridicule” and “ridiculous” should be treated as the same root, for example. Both of these features need to be language-aware.
  • It needs to be able to apply relevance ranking, calculating which documents are the best match for a search query.
  • It needs to be fast - working against some kind of index rather than scanning every available document in full.

Modern PostgreSQL ticks all of those boxes. Let’s put it to work.

Simple search without an index

Here’s how to execute a full-text search query against a simple text column:

from blog.models import Entry
from django.contrib.postgres.search import SearchVector

results = Entry.objects.annotate(
    searchable=SearchVector('body')
).filter(searchable='django')

The generated SQL looks something like this:

SELECT "blog_entry"."id", ...,
to_tsvector(COALESCE("blog_entry"."body", %s)) AS "searchable"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body", %s))
    @@ (plainto_tsquery(%s)) = true
ORDER BY "blog_entry"."created" DESC

The SearchVector class constructs a stemmed, stopword-removed representation of the body column ready to be searched. The resulting queryset contains entries that are a match for “django”.

My blog entries are stored as HTML, but I don’t want search to include those HTML tags. One (extremely un-performant) solution is to use Django’s Func helper to apply a regular expression inside PostgreSQL to strip tags before they are considered for search:

from django.db.models import Value, F, Func

results = Entry.objects.annotate(
    searchable=SearchVector(
        Func(
            F('body'), Value('<.*?>'), Value(''), Value('g'),
            function='regexp_replace'
        )
    )
).filter(searchable='http')

Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL to_tsvector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.

This works, but performance isn’t great. PostgreSQL ends up having to scan every row and construct a list of search vectors for each one every time you execute a query.

If you want it to go fast, you need to add a special search vector column to your table and then create the appropriate index on it. As of Django 1.11 this is trivial:

from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex

class Entry(models.Model):
    # ...
    search_document = SearchVectorField(null=True)

    class Meta:
        indexes = [
            GinIndex(fields=['search_document'])
        ]

Django’s migration system will automatically add both the field and the special GIN index.
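
The generated migration ends up looking roughly like this (a sketch - the dependency and the auto-generated index name will differ in practice):

import django.contrib.postgres.indexes
import django.contrib.postgres.search
from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0001_initial'),  # assumed previous migration
    ]

    operations = [
        migrations.AddField(
            model_name='entry',
            name='search_document',
            field=django.contrib.postgres.search.SearchVectorField(null=True),
        ),
        migrations.AddIndex(
            model_name='entry',
            index=django.contrib.postgres.indexes.GinIndex(
                fields=['search_document'], name='blog_entry_search_gin',
            ),
        ),
    ]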

What’s trickier is populating that search_document field. Django does not yet support an easy way to populate it directly in your initial INSERT call, instead recommending that you populate it with a SQL UPDATE statement after the fact. Here’s a snippet that will populate the field for everything in that table (and strip tags at the same time):

def strip_tags_func(field):
    return Func(
        F(field), Value('<.*?>'), Value(''), Value('g'),
        function='regexp_replace'
    )
 
Entry.objects.update(
    search_document=(
        SearchVector('title', weight='A') +
        SearchVector(strip_tags_func('body'), weight='C')
    )
)

I’m using a neat feature of the SearchVector class here: it can be concatenated together using the + operator, and each component can be assigned a weight of A, B, C or D. These weights affect ranking calculations later on.

Updates using signals

We could just set this up to run periodically (as I did in my initial implementation), but we can get better real-time results by ensuring this field gets updated automatically when the rest of the model is modified. Some people solve this with PostgreSQL triggers, but I’m still more comfortable handling this kind of thing in Python code - so I opted to use Django’s signals mechanism instead.

Since I need to run search queries across three different types of blog content - Entries, Blogmarks and Quotations - I added a method to each model that returns the text fragments corresponding to each of the weight values. Here’s that method for my Quotation model:

class Quotation(models.Model):
    quotation = models.TextField()
    source = models.CharField(max_length=255)
    tags = models.ManyToManyField(Tag, blank=True)

    def index_components(self):
        return {
            'A': self.quotation,
            'B': ' '.join(self.tags.values_list('tag', flat=True)),
            'C': self.source,
        }

As you can see, I’m including the tags that have been assigned to the quotation in the searchable document.

Here are my signals - loaded once via an import statement in my blog application’s AppConfig.ready() method:

import operator
from functools import reduce

from django.contrib.postgres.search import SearchVector
from django.db import transaction
from django.db.models import Value
from django.db.models.signals import post_save, m2m_changed
from django.dispatch import receiver

# BaseModel, Tag and the concrete models are defined in this app's models.py

@receiver(post_save)
def on_save(sender, **kwargs):
    if not issubclass(sender, BaseModel):
        return
    transaction.on_commit(make_updater(kwargs['instance']))

@receiver(m2m_changed)
def on_m2m_changed(sender, **kwargs):
    instance = kwargs['instance']
    model = kwargs['model']
    if model is Tag:
        transaction.on_commit(make_updater(instance))
    elif isinstance(instance, Tag):
        for obj in model.objects.filter(pk__in=kwargs['pk_set']):
            transaction.on_commit(make_updater(obj))

def make_updater(instance):
    components = instance.index_components()
    pk = instance.pk

    def on_commit():
        search_vectors = []
        for weight, text in components.items():
            search_vectors.append(
                SearchVector(Value(text), weight=weight)
            )
        instance.__class__.objects.filter(pk=pk).update(
            search_document=reduce(operator.add, search_vectors)
        )
    return on_commit

(The full code can be found here).

The on_save handler is pretty straightforward - it checks whether the model that was just saved has my BaseModel as a base class, then calls make_updater to get a function to be executed by the transaction.on_commit hook.

The on_m2m_changed handler is significantly more complicated. There are a number of scenarios in which this will be called - I’m reasonably confident that the idiom I use here will capture all of the modifications that should trigger a re-indexing operation.

Running a search now looks like this:

from django.contrib.postgres.search import SearchQuery

results = Entry.objects.filter(
    search_document=SearchQuery('django')
)

We need one more thing though: we need to sort our search results by relevance. PostgreSQL has pretty good relevance built in, and sorting by the relevance score can be done by applying a Django ORM annotation:

from django.contrib.postgres.search import SearchRank

query = SearchQuery('ibm')

results = Entry.objects.filter(
    search_document=query
).annotate(
    rank=SearchRank(F('search_document'), query)
).order_by('-rank')

We now have basic full text search implemented against a single Django model, making use of a GIN index. This is lightning fast.

Searching multiple tables using queryset.union()

My site has three types of content, represented in three different models and hence three different underlying database tables.

I’m using an abstract base model to define common fields shared by all three: the created date, the slug (used to construct permalink urls) and the search_document field populated above.
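
That base class looks something like this - a simplified sketch of my models.py, with field options such as max_length being assumptions:

from django.contrib.postgres.search import SearchVectorField
from django.db import models

class BaseModel(models.Model):
    created = models.DateTimeField()
    slug = models.SlugField(max_length=64)
    search_document = SearchVectorField(null=True)

    class Meta:
        abstract = True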

As of Django 1.11 it’s possible to combine queries across different tables using the SQL union operator.

Here’s what that looks like for running a search across three tables, all with the same search_document search vector field. I need to use .values() to restrict the querysets I am unioning to the same subset of fields:

query = SearchQuery('django')
rank_annotation = SearchRank(F('search_document'), query)
qs = Blogmark.objects.annotate(
    rank=rank_annotation,
).filter(
    search_document=query
).values('pk', 'created', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
    ).filter(
        search_document=query
    ).values('pk', 'created', 'rank'),
).order_by('-rank')[:5]

# Output
<QuerySet [
    {'pk': 186, 'rank': 0.875179, 'created': datetime.datetime(2008, 4, 8, 13, 48, 18, tzinfo=<UTC>)},
    {'pk': 134, 'rank': 0.842655, 'created': datetime.datetime(2007, 10, 20, 13, 46, 56, tzinfo=<UTC>)},
    {'pk': 1591, 'rank': 0.804502, 'created': datetime.datetime(2009, 9, 28, 23, 32, 4, tzinfo=<UTC>)},
    {'pk': 5093, 'rank': 0.788616, 'created': datetime.datetime(2010, 2, 26, 19, 22, 47, tzinfo=<UTC>)},
    {'pk': 2598, 'rank': 0.786928, 'created': datetime.datetime(2007, 1, 26, 12, 38, 46, tzinfo=<UTC>)}
]>

This is not enough information though - I have the primary keys, but I don’t know which type of model they belong to. In order to retrieve the actual resulting objects from the database I need to know which type of content is represented by each of those results.

I can achieve that using another annotation:

qs = Blogmark.objects.annotate(
    rank=rank_annotation,
    type=models.Value('blogmark', output_field=models.CharField())
).filter(
    search_document=query
).values('pk', 'type', 'rank').union(
    Entry.objects.annotate(
        rank=rank_annotation,
        type=models.Value('entry', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
    Quotation.objects.annotate(
        rank=rank_annotation,
        type=models.Value('quotation', output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'),
).order_by('-rank')[:5]

# Output:
<QuerySet [
    {'pk': 186, 'type': u'quotation', 'rank': 0.875179},
    {'pk': 134, 'type': u'quotation', 'rank': 0.842655},
    {'pk': 1591, 'type': u'entry', 'rank': 0.804502},
    {'pk': 5093, 'type': u'blogmark', 'rank': 0.788616},
    {'pk': 2598, 'type': u'blogmark', 'rank': 0.786928}
]>

Now I just need to write a function which can take a list of types and primary keys and return the full objects from the database:

def load_mixed_objects(dicts):
    """
    Takes a list of dictionaries, each of which must at least have a 'type'
    and a 'pk' key. Returns a list of ORM objects of those various types.
    Each returned ORM object has a .original_dict attribute populated.
    """
    to_fetch = {}
    for d in dicts:
        to_fetch.setdefault(d['type'], set()).add(d['pk'])
    fetched = {}
    for key, model in (
        ('blogmark', Blogmark),
        ('entry', Entry),
        ('quotation', Quotation),
    ):
        ids = to_fetch.get(key) or []
        objects = model.objects.prefetch_related('tags').filter(pk__in=ids)
        for obj in objects:
            fetched[(key, obj.pk)] = obj
    # Build list in same order as dicts argument
    to_return = []
    for d in dicts:
        item = fetched.get((d['type'], d['pk'])) or None
        if item:
            item.original_dict = d
        to_return.append(item)
    return to_return

One last challenge: when I add filtering by type, I’m going to want to selectively union together only a subset of these querysets. I need a queryset to start unions against, but I don’t yet know which queryset I will be using. I can abuse Django’s queryset.none() method to create an empty ValuesQuerySet in the correct shape like this:

qs = Entry.objects.annotate(
    type=models.Value('empty', output_field=models.CharField()),
    rank=rank_annotation
).values('pk', 'type', 'rank').none()

Now I can progressively build up my union in a loop like this:

for klass in (Entry, Blogmark, Quotation):
    qs = qs.union(klass.objects.annotate(
        rank=rank_annotation,
        # Label each row with its own lower-cased class name ('entry', 'blogmark', 'quotation')
        type=models.Value(klass.__name__.lower(), output_field=models.CharField())
    ).filter(
        search_document=query
    ).values('pk', 'type', 'rank'))

The Django ORM is smart enough to compile away the empty queryset when it constructs the SQL, which ends up looking something like this:

(((SELECT "blog_entry"."id",
            "entry" AS "type",
            ts_rank("blog_entry"."search_document", plainto_tsquery(%s)) AS "rank"
     FROM "blog_entry"
     WHERE "blog_entry"."search_document" @@ (plainto_tsquery(%s)) = TRUE
     ORDER BY "blog_entry"."created" DESC))
 UNION
   (SELECT "blog_blogmark"."id",
           "blogmark" AS "type",
           ts_rank("blog_blogmark"."search_document", plainto_tsquery(%s)) AS "rank"
    FROM "blog_blogmark"
    WHERE "blog_blogmark"."search_document" @@ (plainto_tsquery(%s)) = TRUE
    ORDER BY "blog_blogmark"."created" DESC))
UNION
  (SELECT "blog_quotation"."id",
          "quotation" AS "type",
          ts_rank("blog_quotation"."search_document", plainto_tsquery(%s)) AS "rank"
   FROM "blog_quotation"
   WHERE "blog_quotation"."search_document" @@ (plainto_tsquery(%s)) = TRUE
   ORDER BY "blog_quotation"."created" DESC)

Applying filters

So far, our search engine can only handle user-entered query strings. If I am going to build a faceted search interface I need to be able to handle filtering as well. I want the ability to filter by year, tag and type.

The key difference between filtering and querying (borrowing these definitions from Elasticsearch) is that querying is loose - it involves stemming and stopwords - while filtering is exact. Additionally, querying affects the calculated relevance score while filtering does not - a document either matches the filter or it doesn’t.

Since PostgreSQL is a relational database, filtering can be handled by simply constructing extra SQL where clauses using the Django ORM.

Each of the filters I need requires a slightly different approach. Filtering by type is easy - I just selectively include or exclude that model from my union queryset.
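
In code that amounts to something like this sketch (make_queryset is described below; the variable names here are assumptions):

selected_types = request.GET.getlist('type')
for type_name, klass in (
    ('entry', Entry),
    ('blogmark', Blogmark),
    ('quotation', Quotation),
):
    if selected_types and type_name not in selected_types:
        continue  # this type is filtered out, so leave it out of the union
    qs = qs.union(
        make_queryset(klass, type_name).values('pk', 'type', 'rank')
    )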

Year and month work like this:

selected_year = request.GET.get('year', '')
selected_month = request.GET.get('month', '')
if selected_year:
    qs = qs.filter(created__year=int(selected_year))
if selected_month:
    qs = qs.filter(created__month=int(selected_month))

Tags involve a join through a many-to-many relationship against the Tags table. We want to be able to apply more than one tag, for example this search for all items tagged both python and javascript. Django’s ORM makes this easy:

selected_tags = request.GET.getlist('tag')
for tag in selected_tags:
    qs = qs.filter(tags__tag=tag)

Adding facet counts

There is just one more ingredient needed to complete our faceted search: facet counts!

Again, the way we calculate these is different for each of our filters. For types, we need to call .count() on a separate queryset for each of the types we are searching:

queryset = make_queryset(Entry, 'entry')
type_counts['entry'] = queryset.count()

(the make_queryset function is defined here)
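
Reconstructed from how it is used, make_queryset looks roughly like this (a sketch, not the exact code - the real version also applies the active year, month and tag filters):

def make_queryset(klass, type_name):
    return klass.objects.annotate(
        rank=rank_annotation,
        # Label every row with its content type for later retrieval
        type=models.Value(type_name, output_field=models.CharField()),
    ).filter(
        search_document=query
    ).order_by()  # clear default ordering so counts and unions behave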

For years we can do this:

from django.db.models.functions import TruncYear

for row in queryset.order_by().annotate(
    year=TruncYear('created')
).values('year').annotate(n=models.Count('pk')):
    year_counts[row['year']] = year_counts.get(
        row['year'], 0
    ) + row['n']

Tags are trickiest. Let’s take advantage of the fact that Django’s ORM knows how to construct sub-selects if you pass another queryset to the __in operator.

tag_counts = {}
type_name = 'entry'
queryset = make_queryset(Entry, 'entry')
for tag, count in Tag.objects.filter(**{
    '%s__in' % type_name: queryset
}).annotate(
    n=models.Count('tag')
).values_list('tag', 'n'):
    tag_counts[tag] = tag_counts.get(tag, 0) + count

Rendering it all in a template

Having constructed the various facet counts in the view function, the template itself is really simple:

{% if type_counts %}
    <h3>Types</h3>
    <ul>
        {% for t in type_counts %}
            <li><a href="{% add_qsarg "type" t.type %}">{{ t.type }}</a> {{ t.n }}</li>
        {% endfor %}
    </ul>
{% endif %}
{% if year_counts %}
    <h3>Years</h3>
    <ul>
        {% for t in year_counts %}
            <li><a href="{% add_qsarg "year" t.year|date:"Y" %}">{{ t.year|date:"Y" }}</a> {{ t.n }}</li>
        {% endfor %}
    </ul>
{% endif %}
{% if tag_counts %}
    <h3>Tags</h3>
    <ul>
        {% for t in tag_counts %}
            <li><a href="{% add_qsarg "tag" t.tag %}">{{ t.tag }}</a> {{ t.n }}</li>
        {% endfor %}
    </ul>
{% endif %}

I am using custom template tags here to add arguments to the current URL. I’ve built systems like this in the past where the URLs are instead generated in the view logic, which I think I prefer. As always, perfect is the enemy of shipped.
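
For reference, a template tag along the lines of add_qsarg can be written in a few lines - this is a sketch of the idea rather than my exact implementation, and it assumes the request context processor is enabled:

from django import template

register = template.Library()

@register.simple_tag(takes_context=True)
def add_qsarg(context, name, value):
    # Copy the current query string and add another name=value pair to it
    query_dict = context['request'].GET.copy()
    if str(value) not in query_dict.getlist(name):
        query_dict.appendlist(name, value)
    return '?' + query_dict.urlencode()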

And because the results are just a Django queryset, we can use Django’s pagination helpers for the pagination links.
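
In the view that looks something like this (a sketch - the page size and parameter handling are assumptions):

from django.core.paginator import Paginator

paginator = Paginator(qs, 30)
page = paginator.page(int(request.GET.get('page', '1')))
# Each item in page.object_list is one of the {'pk', 'type', 'rank'} dicts
results = load_mixed_objects(page.object_list)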

The final implementation

The full current version of the code at time of writing can be seen here. You can follow my initial implementation of this feature through the following commits: 7e3a0217 c7e7b30c 7f6b524c a16ddb5e 7055c7e1 74c194d9 f3ffc100 6c24d9fd cb88c2d4 2c262c75 776a562a b8484c50 0b361c78 1322ada2 79b1b13d 3955f41b 3f5ca052.

And that’s how I built faceted search on top of PostgreSQL and Django! I don’t have my blog comments up and running yet, so please post any thoughts or feedback over on this GitHub issue or over on this thread on Hacker News.

Elsewhere

Yesterday

  • An interactive explanation of quadtrees (via) Neat explorable explanation of quadtrees, using interactives built on top of D3. #
  • TL;DR on the KRACK WPA2 stuff - you can repeatedly resend the 3rd packet in a WPA2 handshake and it'll reset the key state, which leads to nonce reuse, which leads to trivial decryption with known plaintext. Can be easily leveraged to dump TCP SYN traffic and hijack connections.

    Graham Sutherland #

15th October 2017

  • Explorable Explanations. I’m fascinated by web articles and essays that embed interactive visualizations - taking advantage of the unique capabilities of the medium to help explain complex concepts. Explorable Explanations collects exactly these, under the banner of “learning through play”. They also gather tools and tutorials to help build more of them. #

13th October 2017

  • What's New In DevTools (Chrome 62). Some really neat stuff. Highlights include top-level "await" support in the console, the ability to take screenshots of specific HTML nodes, CSS grid highlighting and the ability to drop a .HAR file onto the network panel in order to view it as a waterfall. #
  • The Xi Text Engine CRDT (via) Xi is "a modern editor with a backend written in Rust" - an open-source text editor project from Google built on some very interesting computer science (Conflict-free Replicated Data Types). It's a native editor with server-backed synchronization as a first-class concept. #
  • Whatever weird thing you imagine might happen, something weirder probably did happen. Reporters tried to keep up, but it was too strange. As Max Read put it in New York Magazine, Facebook is “like a four-dimensional object, we catch slices of it when it passes through the three-dimensional world we recognize.” No one can quite wrap their heads around what this thing has become, or all the things this thing has become.

    Alexis C. Madrigal #

  • How to Do Code Reviews Like a Human. Some very well thought out advice on giving constructive reviews, including simple but effective language tricks for keeping suggestions positive and helpful. #

12th October 2017

  • Exploding Git Repositories. Kate Murphy describes how git is vulnerable to a similar attack to the XML "billion laughs" recursive entity expansion attack - you can create a tiny git repository that acts as a "git bomb", expanding 12 root objects to over a billion files using recursive blob references. #
  • Dead End Thrills. Duncan Harris is a photographer who works in the medium of video game screen captures. #
  • How the CIA Staged Sham Academic Conferences to Thwart Iran’s Nuclear Program. “The importance of a conference may be measured not only by the number of Nobel Prize winners or Oxford dons it attracts, but by the number of spies. U.S. and foreign intelligence officers flock to conferences for the same reason that Army recruiters concentrate on low-income neighborhoods: They make the best hunting grounds. While a university campus may have only one or two professors of interest to an intelligence service, the right conference — on drone technology, perhaps, or ISIS — may have dozens.” #

11th October 2017

  • Cypress (via) Promising looking new open source testing framework for full-blown web integration testing - a modern alternative to Selenium. I spent five minutes playing with the demo and was really impressed by it - especially their "time travel" feature which lets you hover over a passed test and see the state of the browser when each of those assertions was executed. #
  • My essential django package list. Insightful list of Django packages - many of which I hadn’t seen before - by Serafeim Papastefanos, each with a handy explanation of what it’s useful for and why. #
  • WPO Stats (via) “Case studies and experiments demonstrating the impact of web performance optimization (WPO) on user experience and business metrics.“ #

10th October 2017

  • The Absurdly Underestimated Dangers of CSV Injection. This is horrifying. A plain old CSV file intended for import into Excel can embed formulas (a value prefixed with an equals symbol) which can execute system commands - with a big honking security prompt that most people will likely ignore. Even worse: they can embed IMPORTXML() functions that can silently leak data from the rest of the sheet to an external URL - and those will work against Google Sheets as well as Excel. #

9th October 2017

  • What happened: Hillary’s view. Lawrence Lessig’s review of Hillary Clinton’s book, including his thoughts on the failures of commercial media in covering the story of Russian interference during those crucial closing weeks before the election. #
  • I have come to the conclusion that the real heroes of ideas are not the people who have them – they are the people who buy them

    David Gluckman #

8th October 2017

  • fd (via) "A simple, fast and user-friendly alternative to find." Written in rust, with a saner default command-line syntax than the regular find command. Microbenchmark shows it running 7x faster. Install it on OS X using "brew install fd". #
  • Removing MediaWiki from SPA: Cool URIs don't change (via) Detailed write-up from Anna Shipman describing how she archived an old MediaWiki as static content using recursive wget and some cunning application of mod_rewrite. #
  • Generate dynamic titles for /search/ pages. Fun little enhancement to my faceted search implementation: I now generate dynamic titles for each search results page describing the search, e.g. "Blogmarks tagged security in Feb, 2005" or "“python” in quotations tagged ruby, python in 2007". #
  • Dillinger. I really like this online Markdown editor. It has source syntax highlighting, live previews of the generated HTML and it constantly syncs to localStorage so you won't lose your work if you accidentally shut your browser window. The code is also available open source on GitHub. #
  • C is a bit like Latin these days. We no longer write everything in it, but knowing it affords deeper knowledge of more-recent languages.

    Norman Wilson #

  • China Demonstrates Quantum Encryption By Hosting a Video Call. This reads like pure science fiction. “Pan’s team first established a connection and generated a secure key between a ground station in Xinglong and the Micius satellite as it passed overhead, orbiting about 500 kilometers above Earth. [...] Next, the Chinese team waited for Micius to pass over Vienna, where their collaborators at the Austria Academy of Sciences were waiting to also receive the key from the satellite. Then, with the keys in hand, the groups initiated a video conference and used those keys to encrypt the video data through a standard VPN protocol.” #
  • Facets. New open source visualization and data exploration tool from Google ("Disclaimer: This is not an official Google product", whatever that means). It's intended for visualizing machine learning datasets but it's obviously useful outside of ML as well - any time you need to understand a large dataset this looks like it could be extremely useful. Ships with example jupyter notebooks and an easy mechanism for embedding the Facets interactive UI directly inside a notebook cell. #

7th October 2017

  • Sanic. "Sanic is a Flask-like Python 3.5+ web server that's written to go fast [...] On top of being Flask-like, Sanic supports async request handlers. This means you can use the new shiny async/await syntax from Python 3.5, making your code non-blocking and speedy". #
  • uvloop: Blazing fast Python networking. “uvloop makes asyncio fast. In fact, it is at least 2x faster than nodejs, gevent, as well as any other Python asynchronous framework. The performance of uvloop-based asyncio is close to that of Go programs.” #

6th October 2017

  • SQL Fiddle demonstrating the PostgreSQL to_tsvector() function (via) SQL Fiddle is amazing - it's an interactive pastebin that lets you execute queries against MySQL, PostgreSQL, Oracle, SQLite & SQL Server, and then share both the input and the results by sending around the resulting URL. Here I'm using it to demonstrate that stripping tags before indexing documents in PostgreSQL is unnecessary because the to_tsvector() function already does that for you. #