Building smaller Python Docker images
19th November 2018
Changes are afoot at Zeit Now, my preferred hosting provider for the past year (see previous posts). They have announced Now 2.0, an intriguing new approach to providing auto-scaling immutable deployments. It’s built on top of lambdas, and comes with a whole host of new constraints: code needs to fit into a 5MB bundle for example (though it looks like this restriction will soon be relaxed a little—update November 19th you can now bump this up to 50MB).
Unfortunately, they have also announced their intent to deprecate the existing Now v1 Docker-based solution.
“We will only start thinking about deprecation plans once we are able to accommodate the most common and critical use cases of v1 on v2”—Matheus Fernandes
“When we reach feature parity, we still intend to give customers plenty of time to upgrade (we are thinking at the very least 6 months from the time we announce it)”—Guillermo Rauch
Datasette should be fine—it supports Heroku as an alternative to Zeit Now out of the box, and the publish_subcommand plugin hook makes it easy to add further providers (I’m exploring several new options at the moment).
Datasette Publish is a bigger problem. The whole point of that project is to make it easy for less-technical users to deploy their data as an interactive API to a Zeit Now account that they own themselves. Talking these users through what they need to do to upgrade should v1 be shut down in the future is not an exciting prospect.
So I’m going to start hunting for an alternative backend for Datasette Publish, but in the meantime I’ve had to make some changes to how it works in order to handle a new size limit of 100MB for Docker images deployed by free users.
Zeit appear to have introduced a new limit for free users of their Now v1 platform: Docker images need to be no larger than 100MB.
Datasette Publish was creating final image sizes of around 350MB, blowing way past that limit. I spent some time today figuring out how to get it to produce images within the new limit, and learned a lot about Docker image optimization in the process.
I ended up using Docker’s multi-stage build feature, which allows you to create temporary images during a build, use them to compile dependencies, then copy just the compiled assets into the final image.
An example of the previous Datasette Publish generated Dockerfile can be seen here. Here’s a rough outline of what it does:
- Start with the python:3.6-slim-stretch image
- Install gcc so it can compile Python libraries with binary dependencies (pandas and uvloop for example)
- Add the uploaded CSV files, then run csvs-to-sqlite to convert them into a SQLite database
- Run datasette inspect to cache a JSON file with information about the different tables
- Run datasette serve to serve the resulting web application
There’s a lot of scope for improvement here. The final image has all sorts of cruft that’s not actually needed for serving the image: it has csvs-to-sqlite and all of its dependencies, plus the original uploaded CSV files.
Here’s the workflow I used to build a Dockerfile and check the size of the resulting image. My work-in-progress can be found in the datasette-small repo.
# Build the Dockerfile in the current directory and tag as datasette-small
$ docker build . -t datasette-small
# Inspect the size of the resulting image
$ docker images | grep datasette-small
# Start the container running
$ docker run -d -p 8006:8006 datasette-small
654d3fc4d3343c6b73414c6fb4b2933afc56fbba1f282dde9f515ac6cdbc5339
# Now visit http://localhost:8006/ to see it running
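To check a build against the 100MB limit without reading the `docker images` output by eye, a small helper along these lines may be useful. This is a minimal sketch: `fits_limit` is a hypothetical name, and treating 1MB as 1,000,000 bytes is an assumption about how the limit is measured, which is not stated here. `docker image inspect --format '{{.Size}}'` reports an image's size in bytes.

```shell
#!/bin/sh
# fits_limit SIZE_BYTES LIMIT_MB
# Succeeds (exit status 0) when SIZE_BYTES is no larger than LIMIT_MB megabytes.
# Assumption: 1MB = 1,000,000 bytes.
fits_limit() {
    size_bytes=$1
    limit_bytes=$(( $2 * 1000000 ))
    [ "$size_bytes" -le "$limit_bytes" ]
}

# Usage with a real image (requires Docker, so left commented out):
#   size=$(docker image inspect --format '{{.Size}}' datasette-small)
#   fits_limit "$size" 100 && echo "fits" || echo "too large"

# Standalone demonstration with made-up sizes:
if fits_limit 95000000 100; then echo "95MB image: fits"; fi
if ! fits_limit 350000000 100; then echo "350MB image: too large"; fi
```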
When you start looking for ways to build smaller Dockerfiles, the first thing you will encounter is Alpine Linux. Alpine is a Linux distribution that’s perfect for containers: it builds on top of BusyBox to strip down to the smallest possible image that can still do useful things.
The python:3.6-alpine container should be perfect: it gives you the smallest possible container that can run Python 3.6 applications (including the ability to pip install additional dependencies).
There’s just one problem: in order to install C-based dependencies like pandas (used by csvs-to-sqlite) and Sanic (used by Datasette) you need a compiler toolchain. Alpine doesn’t have this out-of-the-box, but you can install one using Alpine’s apk package manager. Of course, now you’re bloating your container with a bunch of compilation tools that you don’t need to serve the final image.
This is what makes multi-stage builds so useful! We can spin up an Alpine image with the compilers installed, build our modules, then copy the resulting binary blobs into a fresh container.
Here’s the basic recipe for doing that:
FROM python:3.6-alpine as builder
# Install and compile Datasette + its dependencies
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette
# Now build a fresh container, copying across the compiled pieces
FROM python:3.6-alpine
COPY --from=builder /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=builder /usr/local/bin/datasette /usr/local/bin/datasette
This pattern works really well, and produces delightfully slim images. My first attempt at this wasn’t quite slim enough to fit the 100MB limit though, so I had to break out some Docker tools to figure out exactly what was going on.
Part of the magic of Docker is the concept of layers. When Docker builds a container it uses a layered filesystem (UnionFS) and creates a new layer for every executable line in the Dockerfile. This dramatically speeds up future builds (since layers can be reused if they have already been built) and also provides a powerful tool for inspecting different stages of the build.
When you run docker build part of the output is IDs of the different image layers as they are constructed:
datasette-small $ docker build . -t datasette-small
Sending build context to Docker daemon 2.023MB
Step 1/21 : FROM python:3.6-slim-stretch as csvbuilder
---> 971a5d5dad01
Step 2/21 : RUN apt-get update && apt-get install -y python3-dev gcc wget
---> Running in f81485df62dd
Given a layer ID, like 971a5d5dad01, it’s possible to spin up a new container that exposes the exact state of that layer (thanks, Stack Overflow). Here’s how to do that:
docker run -it --rm 971a5d5dad01 sh
The -it argument attaches standard input to the container (-i) and allocates a pseudo-TTY (-t). The --rm option means that the container will be removed when you Ctrl+D back out of it. sh is the command we want to run in the container: using a shell lets us start interacting with it.
Now that we have a shell against that layer, we can use regular unix commands to start exploring it. du -m (the -m option reports sizes in MB) is particularly useful here, as it will show us the largest directories in the filesystem. I pipe it through sort like so:
$ docker run -it --rm abc63755616b sh
# du -m | sort -n
...
58 ./usr/local/lib/python3.6
70 ./usr/local/lib
71 ./usr/local
76 ./usr/lib/python3.5
188 ./usr/lib
306 ./usr
350 .
Straight away we can start seeing where the space is being taken up in our image.
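The same du and sort pipeline can be wrapped in a small function for reuse inside a container shell. This is a minimal sketch: `largest_dirs` is a hypothetical helper name, and the directory tree it is demonstrated on is made up so that the example runs without Docker.

```shell
#!/bin/sh
# largest_dirs DIR [COUNT]
# Prints the COUNT largest directories under DIR (default 5), biggest first,
# with sizes in MB. Errors from unreadable directories are discarded.
largest_dirs() {
    du -m "$1" 2>/dev/null | sort -n -r | head -n "${2:-5}"
}

# Demonstration on a throwaway tree, so it runs without Docker.
demo=$(mktemp -d)
mkdir -p "$demo/big" "$demo/small"
# Random data, so filesystem compression cannot shrink the files.
head -c 5000000 /dev/urandom > "$demo/big/blob"
head -c 1000000 /dev/urandom > "$demo/small/blob"

largest_dirs "$demo" 3
rm -rf "$demo"
```

Inside a container this would be called as `largest_dirs /usr/local`, which mirrors the `du -m /usr/local | sort -n -r | head -n 5` form of the command.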
I spent quite a while inspecting different stages of my builds to try and figure out where the space was going. The Alpine copy recipe worked neatly, but I was still a little over the limit. When I started to dig around in my final image I spotted some interesting patterns. In particular, the /usr/local/lib/python3.6/site-packages/uvloop directory was 17MB!
# du -m /usr/local | sort -n -r | head -n 5
96 /usr/local
95 /usr/local/lib
83 /usr/local/lib/python3.6
36 /usr/local/lib/python3.6/site-packages
17 /usr/local/lib/python3.6/site-packages/uvloop
That seems like a lot of disk space for a compiled C module, so I dug in further…
It turned out the uvloop folder still contained a bunch of files that were used as part of the compilation, including a 6.7MB loop.c file and a bunch of .pyd files that are compiled by Cython. None of these files are needed after the extension has been compiled, but they were there, taking up a bunch of precious space.
So I added the following to my Dockerfile:
RUN find /usr/local/lib/python3.6 -name '*.c' -delete
RUN find /usr/local/lib/python3.6 -name '*.pxd' -delete
RUN find /usr/local/lib/python3.6 -name '*.pyd' -delete
Then I noticed that there were __pycache__ directories that weren’t needed either, so I added this as well:
RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r
The -delete flag didn’t work correctly for that one, so I used xargs rm -r instead.
This shaved off around 15MB, putting me safely under the limit.
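A likely reason -delete fails for __pycache__ is that find cannot delete a directory that still has files in it, and the .pyc files inside do not match the name pattern. The sketch below reproduces the cleanup on a made-up directory tree (hypothetical paths under a temporary directory, not a real site-packages). It uses `-prune -exec rm -r {} +` in place of the xargs pipeline; that variant is an alternative, not the command used in the Dockerfile above.

```shell
#!/bin/sh
# Build a throwaway tree that mimics leftovers from a compiled extension.
# All paths here are made up for illustration.
root=$(mktemp -d)
pkg="$root/site-packages/examplepkg"
mkdir -p "$pkg/__pycache__"
touch "$pkg/loop.c" "$pkg/loop.pxd" "$pkg/loop.so" "$pkg/__init__.py"
touch "$pkg/__pycache__/__init__.cpython-36.pyc"

# Delete build-time sources in one pass; \( ... \) groups the -o alternatives.
find "$root" -type f \( -name '*.c' -o -name '*.pxd' -o -name '*.pyd' \) -delete

# Remove __pycache__ directories together with their contents.
# -prune stops find from descending into a directory it is about to remove.
find "$root" -type d -name '__pycache__' -prune -exec rm -r {} +

# Record what remains (the compiled module and the Python source), then clean up.
remaining=$(cd "$root" && find . -type f | sort)
echo "$remaining"
rm -rf "$root"
```

In a multi-stage build the final stage copies only the end state of the build stage's filesystem, so files deleted in a later RUN of that stage should not reach the final image.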
The above tricks had got me the smallest Alpine Linux image I could create that would still run Datasette… but Datasette Publish also needs to run csvs-to-sqlite in order to convert the user’s uploaded CSV files to SQLite.
csvs-to-sqlite has some pretty heavy dependencies of its own in the form of Pandas and NumPy. Even with the build chain installed I was having trouble installing these under Alpine, especially since building numpy for Alpine is notoriously slow.
Then I realized that thanks to multi-stage builds there’s no need for me to use Alpine at all for this step. I switched back to python:3.6-slim-stretch and used it to install csvs-to-sqlite and compile the CSV files into a SQLite database. I also ran datasette inspect there for good measure.
Then in my final Alpine container I could use the following to copy in just those compiled assets:
COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=csvbuilder data.db data.db
Here’s an example of a full Dockerfile generated by Datasette Publish that combines all of these tricks. To summarize, here’s what it does:
- Spin up a python:3.6-slim-stretch image, named csvbuilder
  - apt-get install -y python3-dev gcc so we can install compiled dependencies
  - pip install csvs-to-sqlite datasette
  - Copy in the uploaded CSV files
  - Run csvs-to-sqlite to convert them into a SQLite database
  - Run datasette inspect data.db to generate an inspect-data.json file with statistics about the tables. This can later be used to reduce startup time for datasette serve
- Spin up a python:3.6-alpine image to use as a builder
  - We need a build chain to compile a copy of datasette for Alpine Linux…
  - apk add --no-cache gcc python3-dev musl-dev alpine-sdk
  - Now we can pip install datasette, plus any requested plugins
  - Reduce the final image size by deleting any __pycache__ directories and *.c, *.pxd and *.pyd files
- Spin up a fresh python:3.6-alpine for our final image
  - Copy in data.db and inspect-data.json from csvbuilder
  - Copy across /usr/local/lib/python3.6 and /usr/local/bin/datasette from the builder image
  - … and we’re done! Expose port 8006 and set datasette serve to run when the container is started
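Put together, the steps above might look roughly like the following. This is a sketch assembled from the fragments shown earlier, not the Dockerfile that Datasette Publish actually generates: the stage name `builder`, the `*.csv` glob, the `--inspect-file` arguments and the exact `datasette serve` command line are assumptions, and the 2018-era base image tags may need updating before they build today. It is written as a shell snippet that writes the Dockerfile into a scratch directory, so it can be generated and inspected without running a build.

```shell
#!/bin/sh
# Write an illustrative three-stage Dockerfile into a scratch directory.
workdir=$(mktemp -d)

cat > "$workdir/Dockerfile" <<'EOF'
# Stage 1: Debian-based image, used only to build the database
FROM python:3.6-slim-stretch as csvbuilder
RUN apt-get update && apt-get install -y python3-dev gcc
RUN pip install csvs-to-sqlite datasette
# Assumption: the uploaded CSV files sit next to the Dockerfile
COPY *.csv ./
RUN csvs-to-sqlite *.csv data.db
RUN datasette inspect data.db --inspect-file inspect-data.json

# Stage 2: Alpine image with a compiler toolchain, used only to build Datasette
FROM python:3.6-alpine as builder
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette
RUN find /usr/local/lib/python3.6 -name '*.c' -delete
RUN find /usr/local/lib/python3.6 -name '*.pxd' -delete
RUN find /usr/local/lib/python3.6 -name '*.pyd' -delete
RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r

# Stage 3: the image that is actually deployed
FROM python:3.6-alpine
COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=csvbuilder data.db data.db
COPY --from=builder /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=builder /usr/local/bin/datasette /usr/local/bin/datasette
EXPOSE 8006
CMD ["datasette", "serve", "data.db", "--host", "0.0.0.0", "--port", "8006", "--inspect-file", "inspect-data.json"]
EOF

# Show the stage boundaries that were written.
grep '^FROM' "$workdir/Dockerfile"
```

With the CSV files copied into that directory, the image could then be built with `docker build -t datasette-small "$workdir"`.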
Now that I’ve finally learned how to take advantage of multi-stage builds I expect I’ll be using them for all sorts of interesting things in the future.