Simon Willison’s Weblog

Building a combined stream of recent additions using the Django ORM

25th March 2018

I’m a big believer in the importance of a “recent additions” feed. Any time you’re building an application that involves users adding and editing records it’s useful to have a page somewhere that shows the most recent objects that have been created across multiple different types of data.

I’ve used a number of techniques to build these in the past, ranging from an extra database table (e.g. the Django Admin’s LogEntry model) to a Solr or Elasticsearch index that exists just to serve recent additions.

For a recent small project I found myself needing a recent additions feed and realized that there’s a new, simple way to build one thanks to the QuerySet.union() method introduced in Django 1.11 back in April 2017.

Consider a number of different ORM models that can be added by users, each with a created timestamp field.

Prior to QuerySet.union(), building a combined recent additions feed across multiple models was difficult: it’s easy to show recent additions for a single model, but how can we intersperse and paginate additions made to models stored across more than one table?

Using .union() to combine records from different models

Consider the following three models:

class Project(models.Model):
    name = models.CharField(max_length=128)
    description = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

class Image(models.Model):
    project = models.ForeignKey(
        Project, related_name='images', on_delete=models.CASCADE
    )
    image = models.ImageField()
    created = models.DateTimeField(auto_now_add=True)

class Comment(models.Model):
    project = models.ForeignKey(
        Project, related_name='comments', on_delete=models.CASCADE
    )
    comment = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

Let’s build a single QuerySet that returns objects from all three models ordered by their created dates, most recent first.

Using .values() we can reduce these different models to a common subset of fields, which we can then .union() together like so:

recent = Project.objects.values(
    'pk', 'created'
).union(
    Image.objects.values('pk', 'created'),
    Comment.objects.values('pk', 'created'),
).order_by('-created')[:4]

Now if we print out list(recent) it will look something like this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15},
 {'created': datetime.datetime(2018, 3, 24, 1, 27, 23, 625195, tzinfo=<UTC>),
  'pk': 28},
 {'created': datetime.datetime(2018, 3, 23, 20, 14, 3, 31648, tzinfo=<UTC>),
  'pk': 5},
 {'created': datetime.datetime(2018, 3, 23, 18, 57, 36, 585376, tzinfo=<UTC>),
  'pk': 11}]

We’ve successfully combined recent additions from three different tables! Here’s what the SQL for that looks like:

>>> from django.db import connection
>>> print(connection.queries[-1]['sql'])
(SELECT "myapp_project"."id", "myapp_project"."created" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created" FROM "myapp_comment")

There’s just one problem: we got back a bunch of pk and created records, but we don’t know which model each of those rows represents.

Using .annotate() to add a type constant to the rows

We can fix this by using Django’s annotate() method combined with a Value() object to attach a constant string to each record specifying the type of the row it represents. Here’s how to do that for a single model:

>>> from django.db.models import Value, CharField
>>> list(Image.objects.annotate(
...     type=Value('image', output_field=CharField())
... ).values('pk', 'type', 'created')[:2])
[{'created': datetime.datetime(2018, 3, 22, 17, 16, 33, 964900, tzinfo=<UTC>),
  'pk': 3,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 22, 17, 49, 47, 527907, tzinfo=<UTC>),
  'pk': 4,
  'type': 'image'}]

We’ve added the key/value pair 'type': 'image' to every record returned from the queryset. Now let’s do that to all three of our models and combine the results using .union():

recent = Project.objects.annotate(
    type=Value('project', output_field=CharField())
).values(
    'pk', 'created', 'type'
).union(
    Image.objects.annotate(
        type=Value('image', output_field=CharField())
    ).values('pk', 'created', 'type'),
    Comment.objects.annotate(
        type=Value('comment', output_field=CharField())
    ).values('pk', 'created', 'type'),
).order_by('-created')[:4]

If we run list(recent) we get this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'pk': 14,
  'type': 'comment'}]

This is looking pretty good! We’ve successfully run a single SQL UNION query across three different tables and returned the combined results in reverse chronological order. Thanks to the type column we know which model each record corresponds to.
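
To see what the database is doing here independently of the ORM, here’s a minimal sketch using Python’s built-in sqlite3 module. The table names, columns and sample rows are invented for illustration (this is not the schema Django generates for the models above), and timestamps are stored as ISO 8601 strings so that sorting them as text matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE project (id INTEGER PRIMARY KEY, created TEXT);
    CREATE TABLE image (id INTEGER PRIMARY KEY, created TEXT);
    CREATE TABLE comment (id INTEGER PRIMARY KEY, created TEXT);
    INSERT INTO project VALUES (15, '2018-03-24T15:46:35');
    INSERT INTO image VALUES (29, '2018-03-24T15:50:03');
    INSERT INTO comment VALUES (14, '2018-03-24T07:53:15');
    INSERT INTO comment VALUES (15, '2018-03-24T15:51:29');
""")

# Each branch selects the same columns plus a constant naming its source
# table. ORDER BY and LIMIT apply to the combined result, not to one branch.
sql = """
    SELECT id, created, 'project' AS type FROM project
    UNION
    SELECT id, created, 'image' AS type FROM image
    UNION
    SELECT id, created, 'comment' AS type FROM comment
    ORDER BY created DESC
    LIMIT 3
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

Note that project 15 and comment 15 share a primary key. Without the constant type column there would be no way to tell those two rows apart.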

Inflating the full referenced objects

Now we need to inflate those primary key references into a full ORM object from each corresponding table.

The most efficient way to do this is to collect together the IDs for each type and then run a single SQL query per type to load the full objects.

Here’s code that does exactly that: it first collects the list of primary keys that need to be loaded for each type, then executes an efficient SQL IN query against each type to fetch the underlying objects:

records = list(recent)

type_to_queryset = {
    'image': Image.objects.all(),
    'comment': Comment.objects.all(),
    'project': Project.objects.all(),
}

# Collect the pks we need to load for each type:
to_load = {}
for record in records:
    to_load.setdefault(record['type'], []).append(record['pk'])

# Fetch them with one query per type
fetched = {}
for type_name, pks in to_load.items():
    for obj in type_to_queryset[type_name].filter(pk__in=pks):
        fetched[(type_name, obj.pk)] = obj

# Annotate 'records' with loaded objects
for record in records:
    key = (record['type'], record['pk'])
    record['object'] = fetched[key]

After running the above code, records looks like this:

[{'created': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'created': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'created': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'created': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

We can now feed this to a template and use it to render our recent additions page.
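
One edge case worth guarding against: if an object is deleted between the union query and the per-type fetch, the fetched[key] lookup will raise a KeyError. Here’s a hedged sketch of a variant that drops such records instead. The name inflate and the loader callables are invented for this illustration and are not part of Django. Plain callables stand in for the querysets so the example runs without a database. In a Django project each loader might be something like lambda pks: Image.objects.filter(pk__in=pks).

```python
from collections import namedtuple

# Stand-in for a model instance: anything with a .pk attribute works here.
Obj = namedtuple('Obj', ['pk', 'label'])


def inflate(records, loaders):
    """Attach an 'object' to each record using one loader call per type.

    `loaders` maps a type name to a callable that takes a list of pks and
    returns the matching objects. Records whose object can no longer be
    found are left out of the result.
    """
    # Group the pks we need by type
    to_load = {}
    for record in records:
        to_load.setdefault(record['type'], []).append(record['pk'])

    # One bulk call per type
    fetched = {}
    for type_name, pks in to_load.items():
        for obj in loaders[type_name](pks):
            fetched[(type_name, obj.pk)] = obj

    # Keep the original ordering, skipping anything that went missing
    inflated = []
    for record in records:
        obj = fetched.get((record['type'], record['pk']))
        if obj is not None:
            inflated.append(dict(record, object=obj))
    return inflated


# Fake "tables": comment 14 was deleted after the union query ran
images = {29: Obj(29, 'an image')}
comments = {15: Obj(15, 'a comment')}
loaders = {
    'image': lambda pks: [images[pk] for pk in pks if pk in images],
    'comment': lambda pks: [comments[pk] for pk in pks if pk in comments],
}
records = [
    {'type': 'comment', 'pk': 15},
    {'type': 'image', 'pk': 29},
    {'type': 'comment', 'pk': 14},
]
result = inflate(records, loaders)
print([(r['type'], r['pk'], r['object'].label) for r in result])
```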

Wrapping it in a re-usable function

Here’s a function that implements the above in a re-usable way:

from django.db.models import CharField, Value


def combined_recent(limit, **kwargs):
    datetime_field = kwargs.pop('datetime_field', 'created')
    querysets = []
    for key, queryset in kwargs.items():
        querysets.append(
            queryset.annotate(
                recent_changes_type=Value(
                    key, output_field=CharField()
                )
            ).values('pk', 'recent_changes_type', datetime_field)
        )
    union_qs = querysets[0].union(*querysets[1:])
    records = []
    for row in union_qs.order_by('-{}'.format(datetime_field))[:limit]:
        records.append({
            'type': row['recent_changes_type'],
            'when': row[datetime_field],
            'pk': row['pk']
        })
    # Now we bulk-load each object type in turn
    to_load = {}
    for record in records:
        to_load.setdefault(record['type'], []).append(record['pk'])
    fetched = {}
    for key, pks in to_load.items():
        for item in kwargs[key].filter(pk__in=pks):
            fetched[(key, item.pk)] = item
    # Annotate 'records' with loaded objects
    for record in records:
        record['object'] = fetched[(record['type'], record['pk'])]
    return records

This is also available as a gist.

I can now use that function to combine arbitrary querysets (provided they share a created datestamp field) like so:

recent = combined_recent(
    20,
    project=Project.objects.all(),
    image=Image.objects.all(),
    comment=Comment.objects.all(),
)

This will return the most recent 20 records across all three types, with the results looking like this:

[{'when': datetime.datetime(2018, 3, 24, 15, 51, 29, 116511, tzinfo=<UTC>),
  'object': <Comment: a comment>,
  'pk': 15,
  'type': 'comment'},
 {'when': datetime.datetime(2018, 3, 24, 15, 50, 3, 901320, tzinfo=<UTC>),
  'object': <Image: Image object (29)>,
  'pk': 29,
  'type': 'image'},
 {'when': datetime.datetime(2018, 3, 24, 15, 46, 35, 42123, tzinfo=<UTC>),
  'object': <Project: Recent changes demo>,
  'pk': 15,
  'type': 'project'},
 {'when': datetime.datetime(2018, 3, 24, 7, 53, 15, 222029, tzinfo=<UTC>),
  'object': <Comment: Here is another comment>,
  'pk': 14,
  'type': 'comment'}]

Efficient object loading with select/prefetch_related

If you’re going to render these objects on a page, it’s pretty likely you’ll need to load additional data about them. My example models above are deliberately simplified, but in any serious Django project it’s likely they will have additional references to other tables.

We can apply Django’s magic select_related() and prefetch_related() methods directly to the querysets we pass to the function, like so:

recent = combined_recent(
    20,
    project=Project.objects.all().prefetch_related('tags'),
    image=Image.objects.all().select_related('uploaded_by'),
    comment=Comment.objects.all().select_related('author'),
)

Django’s query optimizer is smart enough to ignore those calls entirely when building the initial union queries, so even with the above extras the initial union query will still look like this:

(SELECT "myapp_project"."id", "myapp_project"."created", 'project' AS "recent_changes_type" FROM "myapp_project")
 UNION (SELECT "myapp_image"."id", "myapp_image"."created", 'image' AS "recent_changes_type" FROM "myapp_image")
 UNION (SELECT "myapp_comment"."id", "myapp_comment"."created", 'comment' AS "recent_changes_type" FROM "myapp_comment")
ORDER BY (2) DESC LIMIT 20

The select_related() and prefetch_related() clauses will then be incorporated into the subsequent SQL queries that are used to efficiently inflate the full objects from the database.

Taking it further

There are a bunch of interesting extensions that can be made to this pattern.

Want pagination? The initial unioned queryset can be paginated using offset/limit by slicing the queryset, or using the Django Paginator class.
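
As a rough sketch of the slicing approach, the only arithmetic involved is turning a page number into slice bounds. The helper name page_bounds is hypothetical, and the commented-out line at the end assumes a union_qs variable like the one inside combined_recent().

```python
def page_bounds(page_number, per_page=20):
    """Return (start, end) slice bounds for a 1-indexed page number."""
    if page_number < 1:
        raise ValueError('page_number must be 1 or greater')
    start = (page_number - 1) * per_page
    return start, start + per_page


start, end = page_bounds(3, per_page=20)
print(start, end)

# With Django the ordered union queryset could then be sliced like so,
# which translates into LIMIT/OFFSET in the generated SQL:
# union_qs.order_by('-created')[start:end]
```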

Want more efficient pagination (since offset/limit tends to get slow after the first few thousand rows)? We’re ordering by created already which means it’s not difficult to build efficient range-based pagination, requesting all records where the created date is less than the earliest date seen on the previous page.
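
Here’s a minimal sketch of how that range-based filter could be applied, assuming the querysets are held in a dict keyed by type name. The helper older_than is a hypothetical name of my own. It relies only on each queryset having a .filter() method, and uses Django’s standard __lt lookup.

```python
def older_than(querysets, before, datetime_field='created'):
    """Restrict each queryset to rows strictly older than `before`.

    `querysets` is a dict of type name -> queryset, in the same shape as
    the keyword arguments passed to combined_recent(). When `before` is
    None (the first page) the querysets are returned unfiltered.
    """
    if before is None:
        return dict(querysets)
    # Builds e.g. {'created__lt': before}
    lookup = {'{}__lt'.format(datetime_field): before}
    return {
        name: queryset.filter(**lookup)
        for name, queryset in querysets.items()
    }
```

Each page after the first would then be something like combined_recent(20, **older_than(querysets, before)), where before is the 'when' value of the last record on the previous page. A strict less-than comparison will skip any records that share the exact boundary timestamp, so a production version would probably want a tie-breaker on top of this.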

Since everything is based on regular Django querysets, it’s possible to build all kinds of variants of the recent additions feed. So far we’ve just built one showing all changes across an entire application, but it’s not hard to apply additional filters to only show changes made by a specific user, or changes made relating to a specific foreign key relationship. If you can represent it as a collection of querysets that each expose a created column you can combine them into a single feed.

You don’t even need to use records that share a created column: if you have objects with columns of differing names you can use an annotation to alias those columns, like so:

recent = combined_recent(
    20,
    project=Project.objects.annotate(
        when=models.F('updated')
    ).prefetch_related('tags'),
    image=Image.objects.annotate(
        when=models.F('uploaded_at')
    ).select_related('uploaded_by'),
    comment=Comment.objects.annotate(
        when=models.F('commented_at')
    ).select_related('created_by'),
    datetime_field='when'
)

I haven’t extensively load-tested this pattern. I expect it will work fine for databases with tens of thousands of records, but it may start running into trouble once you have millions (though an index on the created column should help a lot). If you need a recent additions feed at a larger scale than that, you should probably look at a separate logging table or an external index in something like Elasticsearch instead.

For another interesting thing you can do with .union() check out my article on Implementing faceted search with Django and PostgreSQL.

This is Building a combined stream of recent additions using the Django ORM by Simon Willison, posted on 25th March 2018.