Notes on streaming large API responses

25th June 2021

I started a Twitter conversation last week about API endpoints that stream large amounts of data as an alternative to APIs that return 100 results at a time and require clients to paginate through all of the pages in order to retrieve all of the data:

Any unexpected downsides to offering streaming HTTP API endpoints that serve up eg 100,000 JSON objects in one go rather than asking users to paginate 100 at a time over 1,000 requests, assuming efficient implementation of that streaming endpoint?

— Simon Willison (@simonw) June 17, 2021

I got a ton of great replies. I tried to tie them together in a thread attached to the tweet, but I’m also going to synthesize them into some thoughts here.

Bulk exporting data

The more time I spend with APIs, especially with regard to my Datasette and Dogsheep projects, the more I realize that my favourite APIs are the ones that let you extract all of your data as quickly and easily as possible.

There are generally three ways an API might provide this:

  • Click an “export everything” button, then wait for a while for an email to show up with a link to a downloadable zip file. This isn’t really an API, in particular since it’s usually hard if not impossible to automate that initial “click”, but it’s still better than nothing. Google’s Takeout is one notable implementation of this pattern.
  • Provide a JSON API which allows users to paginate through their data. This is a very common pattern, although it can run into difficulties: what happens if new data is added while you are paginating through the original data, for example? Some systems only allow access to the first N pages too, for performance reasons.
  • Provide a single HTTP endpoint you can hit that will return ALL of your data—potentially dozens or hundreds of MBs of it—in one go.

It’s that last option that I’m interested in talking about today.

Efficiently streaming data

It used to be that most web engineers would quickly discount the idea of an API endpoint that streams out an unlimited number of rows. HTTP requests should be served as quickly as possible! Anything more than a couple of seconds spent processing a request is a red flag that something should be reconsidered.

Almost everything in the web stack is optimized for quickly serving small requests. But over the past decade the tide has turned somewhat: Node.js made async web servers commonplace, WebSockets taught us to handle long-running connections, and in the Python world asyncio and ASGI provided a firm foundation for handling long-running requests using smaller amounts of RAM and CPU.

I’ve been experimenting in this area for a few years now.

Datasette has the ability to use ASGI trickery to stream all rows from a table (or filtered table) as CSV, potentially returning hundreds of MBs of data.
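As a rough illustration of the technique (a generic sketch, not Datasette’s actual code), here is a minimal ASGI application that streams CSV output in chunks, with fetch_pages() standing in for whatever produces batches of rows:

import asyncio
import csv
import io

async def fetch_pages():
    # Placeholder: real code would await a database query for each batch
    for page in range(3):
        await asyncio.sleep(0)
        yield [(page * 2, "first"), (page * 2 + 1, "second")]

async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/csv; charset=utf-8")],
    })
    async for rows in fetch_pages():
        buffer = io.StringIO()
        csv.writer(buffer).writerows(rows)
        # more_body=True keeps the response open for further chunks
        await send({
            "type": "http.response.body",
            "body": buffer.getvalue().encode("utf-8"),
            "more_body": True,
        })
    # A final empty message with more_body=False ends the response
    await send({"type": "http.response.body", "body": b"", "more_body": False})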

Django SQL Dashboard can export the full results of a SQL query as CSV or TSV, this time using Django’s StreamingHttpResponse (which does tie up a full worker process, but that’s OK if you restrict it to a controlled number of authenticated users).
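The Django pattern looks roughly like this sketch, adapted from the StreamingHttpResponse documentation rather than taken from Django SQL Dashboard itself. The Echo class is a pseudo-buffer whose write() simply returns what it was given, so each CSV row becomes one chunk of the response:

import csv

from django.http import StreamingHttpResponse

class Echo:
    # Pseudo-buffer: write() hands the value straight back, so csv.writer
    # produces strings we can stream instead of writing to a file
    def write(self, value):
        return value

def export_csv(request):
    def rows():
        # Placeholder generator: real code would iterate a queryset
        # or a server-side cursor here
        yield ["id", "name"]
        for i in range(100000):
            yield [i, f"item {i}"]

    writer = csv.writer(Echo())
    response = StreamingHttpResponse(
        (writer.writerow(row) for row in rows()),
        content_type="text/csv",
    )
    response["Content-Disposition"] = 'attachment; filename="export.csv"'
    return response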

VIAL implements streaming responses to offer an “export from the admin” feature. It also has an API-key-protected search API which can stream out all matching rows in JSON or GeoJSON.

Implementation notes

The key thing to watch out for when implementing this pattern is memory usage: if your server buffers 100MB+ of data any time it needs to serve an export request you’re going to run into trouble.

Some export formats are friendlier for streaming than others. CSV and TSV are pretty easy to stream, as is newline-delimited JSON.
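Newline-delimited JSON needs almost no machinery at all, since every line is a complete JSON document in its own right. A minimal sketch:

import json

def stream_ndjson(rows):
    # Each row becomes one line of JSON, so the output remains usable
    # even if the stream is cut off part-way through
    for row in rows:
        yield json.dumps(row) + "\n"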

Regular JSON requires a bit more thought: you can output a [ character, then output each row in a stream with a comma suffix, then skip the comma for the last row and output a ]. Doing that requires peeking ahead (looping two at a time) to verify that you haven’t yet reached the end.

Or... Martin De Wulf pointed out that you can output the first row, then output every other row with a preceding comma—which avoids the whole “iterate two at a time” problem entirely.
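Here’s that trick sketched as a Python generator, yielding a JSON array one string fragment at a time:

import json

def stream_json_array(rows):
    # Emit the first row on its own, then prefix every later row with a
    # comma: no need to look ahead for the end of the sequence
    yield "["
    iterator = iter(rows)
    try:
        first = next(iterator)
    except StopIteration:
        yield "]"
        return
    yield json.dumps(first)
    for row in iterator:
        yield "," + json.dumps(row)
    yield "]"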

The next challenge is efficiently looping through every database result without first pulling them all into memory.

PostgreSQL (and the psycopg2 Python module) offers server-side cursors, which means you can stream results through your code without loading them all at once. I use these in Django SQL Dashboard.
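With psycopg2 that means giving the cursor a name, which turns it into a PostgreSQL server-side cursor that fetches rows in batches. A sketch, in which the connection string and process() are placeholders:

import psycopg2

connection = psycopg2.connect("dbname=example")  # placeholder DSN

with connection.cursor(name="streaming_export") as cursor:
    # itersize controls how many rows are pulled from the server per round trip
    cursor.itersize = 2000
    cursor.execute("select * from items order by id")
    for row in cursor:
        process(row)  # placeholder: write the row out to the response stream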

Server-side cursors make me nervous though, because they seem likely to tie up resources in the database itself. So the other technique I would consider here is keyset pagination.

Keyset pagination works against any data that is ordered by a unique column—it works especially well against a primary key (or other indexed column). Each page of data is retrieved using a query something like this:

select * from items order by id limit 21

Note the limit 21—if we are retrieving pages of 20 items we ask for 21, since then we can use the last returned item to tell if there is a next page or not.

Then for subsequent pages take the 20th id value and ask for things greater than that:

select * from items where id > 20 order by id limit 21

Each of these queries is fast to respond (since it’s against an ordered index) and uses a predictable, fixed amount of memory. Using keyset pagination we can loop through an arbitrarily large table of data, streaming each page out one at a time, without exhausting any resources.

And since each query is small and fast, we don’t need to worry about huge queries tying up database resources either.
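Putting those two queries together, a keyset pagination loop might look something like this sketch, which assumes a table called items with an integer primary key id and a DB-API connection using %s placeholders (as psycopg2 does):

def keyset_pages(connection, page_size=20):
    # Ask for page_size + 1 rows each time: the extra row tells us whether
    # another page exists without needing a separate count query
    last_id = None
    while True:
        with connection.cursor() as cursor:
            if last_id is None:
                cursor.execute(
                    "select * from items order by id limit %s",
                    [page_size + 1],
                )
            else:
                cursor.execute(
                    "select * from items where id > %s order by id limit %s",
                    [last_id, page_size + 1],
                )
            rows = cursor.fetchall()
        if not rows:
            return
        yield rows[:page_size]
        if len(rows) <= page_size:
            return  # no extra row came back, so that was the last page
        last_id = rows[page_size - 1][0]  # id of the last row we just yielded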

What can go wrong?

I really like these patterns. They haven’t bitten me yet, though I’ve not deployed them for anything at truly huge scale. So I asked Twitter what kind of problems I should look for.

Based on the Twitter conversation, here are some of the challenges that this approach faces.

Challenge: restarting servers

If the stream takes a significantly long time to finish then rolling out updates becomes a problem. You don’t want to interrupt a download but also don’t want to wait forever for it to finish to spin down the server.

— Adam Lowry (@robotadam) June 17, 2021

This came up a few times, and is something I hadn’t considered. If your deployment process involves restarting your servers (and it’s hard to imagine one that doesn’t) you need to take long-running connections into account when you do that. If there’s a user halfway through a 500MB stream you can either truncate their connection or wait for them to finish.

Challenge: how to return errors

If you’re streaming a response, you start with an HTTP 200 code... but then what happens if an error occurs half-way through, potentially while paginating through the database?

You’ve already started sending the response, so you can’t change the status code to a 500. Instead, you need to write some kind of error to the stream that’s being produced.

If you’re serving up a huge JSON document, you can at least make that JSON become invalid, which should indicate to your client that something went wrong.
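One way to do that, sketched here on top of the array-streaming generator shown earlier, is to write an error object into the stream and deliberately leave the array unclosed, so the document fails to parse:

import json

def stream_json_array_with_errors(rows):
    yield "["
    first = True
    try:
        for row in rows:
            yield ("" if first else ",") + json.dumps(row)
            first = False
    except Exception as exc:
        # Emit an error marker and stop WITHOUT the closing "]", so the
        # client sees invalid JSON and knows the export did not complete
        yield ("" if first else ",") + json.dumps({"error": str(exc)})
        return
    yield "]"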

Formats like CSV are harder. How do you let your user know that their CSV data is incomplete?

And what if someone’s connection drops—are they definitely going to notice that they are missing something, or will they assume that the truncated file is all of the data?

Challenge: resumable downloads

If a user is paginating through your API, they get resumability for free: if something goes wrong they can start again at the last page that they fetched.

Resuming a single stream is a lot harder.

The HTTP range mechanism can be used to provide resumable downloads against large files, but it only works if you generate the entire file in advance.

There is a way to design APIs to support this, provided the data in the stream is in a predictable order (which it has to be if you’re using keyset pagination, described above).

Have the endpoint that triggers the download take an optional ?since= parameter, like this:

GET /stream-everything?since=b24ou34
[
    {"id": "m442ecc", "name": "..."},
    {"id": "c663qo2", "name": "..."},
    {"id": "z434hh3", "name": "..."},
]

Here b24ou34 is an identifier—it can be a deliberately opaque token, but it needs to be served up as part of the response.

If the user is disconnected for any reason, they can start back where they left off by passing in the last ID that they successfully retrieved:

GET /stream-everything?since=z434hh3

This still requires some level of intelligence from the client application, but it’s a reasonably simple pattern both to implement on the server and as a client.
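On the server, the ?since= parameter slots naturally into keyset pagination: it is just the starting point for the same “where id > ...” query. A hypothetical Django view, where keyset_pages_after() is an imagined helper wrapping the loop shown earlier and newline-delimited JSON keeps the sketch short:

import json

from django.http import StreamingHttpResponse

def stream_everything(request):
    since = request.GET.get("since")  # last ID the client successfully received

    def generate():
        for page in keyset_pages_after(since):  # hypothetical helper
            for row in page:
                yield json.dumps(row) + "\n"

    return StreamingHttpResponse(generate(), content_type="application/x-ndjson")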

Easiest solution: generate and return from cloud storage

It seems the most robust way to implement this kind of API is the least technically exciting: spin off a background task that generates the large response and pushes it to cloud storage (S3 or GCS), then redirect the user to a signed URL to download the resulting file.

This is easy to scale, and it gives users complete files with content-length headers that they know they can download (and even resume downloading, since range headers are supported by S3 and GCS). It also avoids any issues with server restarts caused by long connections.
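A sketch of that background task using boto3, where the bucket, key and generate_rows() are all placeholders: write the export to a local file, upload it, then hand back a time-limited signed URL for the client to download:

import boto3

s3 = boto3.client("s3")

def export_to_s3(bucket, key, generate_rows):
    # Write the complete export to a local file first, so S3 ends up with
    # a finished object that has a known content-length
    path = "/tmp/export.csv"
    with open(path, "w") as f:
        for chunk in generate_rows():
            f.write(chunk)
    s3.upload_file(path, bucket, key)
    # Hand back a signed URL that expires after an hour
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,
    )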

This is how Mixpanel handle their export feature, and it’s the solution Sean Coates came to when trying to find a workaround for the AWS Lambda/API Gateway response size limit.

If your goal is to provide your users a robust, reliable bulk-export mechanism for their data, export to cloud storage is probably the way to go.

But streaming dynamic responses are a really neat trick, and I plan to keep exploring them!