Building smaller Python Docker images
19th November 2018
Changes are afoot at Zeit Now, my preferred hosting provider for the past year (see previous posts). They have announced Now 2.0, an intriguing new approach to providing auto-scaling immutable deployments. It’s built on top of lambdas, and comes with a whole host of new constraints: code needs to fit into a 5MB bundle, for example (though it looks like this restriction will soon be relaxed a little—update 19th November: you can now bump this up to 50MB).
Unfortunately, they have also announced their intent to deprecate the existing Now v1 Docker-based solution.
“We will only start thinking about deprecation plans once we are able to accommodate the most common and critical use cases of v1 on v2”—Matheus Fernandes
“When we reach feature parity, we still intend to give customers plenty of time to upgrade (we are thinking at the very least 6 months from the time we announce it)”—Guillermo Rauch
This is pretty disastrous news for many of my projects, most crucially Datasette and Datasette Publish.
Datasette should be fine—it supports Heroku as an alternative to Zeit Now out of the box, and the publish_subcommand plugin hook makes it easy to add further providers (I’m exploring several new options at the moment).
Datasette Publish is a bigger problem. The whole point of that project is to make it easy for less-technical users to deploy their data as an interactive API to a Zeit Now account that they own themselves. Talking these users through what they need to do to upgrade should v1 be shut down in the future is not an exciting prospect.
So I’m going to start hunting for an alternative backend for Datasette Publish, but in the meantime I’ve had to make some changes to how it works in order to handle a new size limit of 100MB for Docker images deployed by free users.
Building smaller Docker images
Zeit appear to have introduced a new limit for free users of their Now v1 platform: Docker images need to be no larger than 100MB.
Datasette Publish was creating final image sizes of around 350MB, blowing way past that limit. I spent some time today figuring out how to get it to produce images within the new limit, and learned a lot about Docker image optimization in the process.
I ended up using Docker’s multi-stage build feature, which allows you to create temporary images during a build, use them to compile dependencies, then copy just the compiled assets into the final image.
An example of the previous Datasette Publish generated Dockerfile can be seen here. Here’s a rough outline of what it does:
- Start with the python:3.6-slim-stretch image
- apt-installs python3-dev and gcc so it can compile Python libraries with binary dependencies (pandas and uvloop, for example)
- Use pip to install csvs-to-sqlite and datasette
- Add the uploaded CSV files, then run csvs-to-sqlite to convert them into a SQLite database
- Run datasette inspect to cache a JSON file with information about the different tables
- Run datasette serve to serve the resulting web application
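Reassembled from that outline, the old single-stage Dockerfile looked something like this (a rough sketch rather than the exact generated file; the CSV file name and the precise datasette options here are illustrative):
FROM python:3.6-slim-stretch
# Compiler toolchain needed to build pandas, uvloop and friends
RUN apt-get update && apt-get install -y python3-dev gcc
RUN pip install csvs-to-sqlite datasette
# Add the uploaded CSVs and convert them into a SQLite database
ADD example.csv example.csv
RUN csvs-to-sqlite example.csv data.db
# Cache table statistics so datasette serve starts faster
RUN datasette inspect data.db --inspect-file inspect-data.json
EXPOSE 8006
CMD ["datasette", "serve", "data.db", "--host", "0.0.0.0", "--port", "8006", "--inspect-file", "inspect-data.json"]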
There’s a lot of scope for improvement here. The final image contains all sorts of cruft that isn’t actually needed to serve the application: it has csvs-to-sqlite and all of its dependencies, plus the original uploaded CSV files.
Here’s the workflow I used to build a Dockerfile and check the size of the resulting image. My work-in-progress can be found in the datasette-small repo.
# Build the Dockerfile in the current directory and tag as datasette-small
$ docker build . -t datasette-small
# Inspect the size of the resulting image
$ docker images | grep datasette-small
# Start the container running
$ docker run -d -p 8006:8006 datasette-small
654d3fc4d3343c6b73414c6fb4b2933afc56fbba1f282dde9f515ac6cdbc5339
# Now visit http://localhost:8006/ to see it running
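If you want the image size as a single number (in bytes) rather than eyeballing the docker images listing, docker image inspect can report it directly:
# Print the image size in bytes
$ docker image inspect datasette-small --format '{{.Size}}'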
Alpine Linux
When you start looking for ways to build smaller Docker images, the first thing you will encounter is Alpine Linux. Alpine is a Linux distribution that’s perfect for containers: it builds on top of BusyBox to strip down to the smallest possible image that can still do useful things.
The python:3.6-alpine container should be perfect: it gives you the smallest possible container that can run Python 3.6 applications (including the ability to pip install additional dependencies).
There’s just one problem: in order to install C-based dependencies like pandas (used by csvs-to-sqlite) and Sanic (used by Datasette) you need a compiler toolchain. Alpine doesn’t have one out-of-the-box, but you can install one using Alpine’s apk package manager. Of course, now you’re bloating your container with a bunch of compilation tools that aren’t needed in the final image.
This is what makes multi-stage builds so useful! We can spin up an Alpine image with the compilers installed, build our modules, then copy the resulting binary blobs into a fresh container.
Here’s the basic recipe for doing that:
FROM python:3.6-alpine as builder
# Install and compile Datasette + its dependencies
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette
# Now build a fresh container, copying across the compiled pieces
FROM python:3.6-alpine
COPY --from=builder /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=builder /usr/local/bin/datasette /usr/local/bin/datasette
This pattern works really well, and produces delightfully slim images. My first attempt at this wasn’t quite slim enough to fit the 100MB limit though, so I had to break out some Docker tools to figure out exactly what was going on.
Inspecting Docker image layers
Part of the magic of Docker is the concept of layers. When Docker builds an image it uses a layered filesystem (UnionFS) and creates a new layer for each instruction in the Dockerfile. This dramatically speeds up future builds (since layers can be reused if they have already been built) and also provides a powerful tool for inspecting different stages of the build.
When you run docker build, part of the output is the IDs of the different image layers as they are constructed:
datasette-small $ docker build . -t datasette-small
Sending build context to Docker daemon 2.023MB
Step 1/21 : FROM python:3.6-slim-stretch as csvbuilder
---> 971a5d5dad01
Step 2/21 : RUN apt-get update && apt-get install -y python3-dev gcc wget
---> Running in f81485df62dd
Given a layer ID, like 971a5d5dad01, it’s possible to spin up a new container that exposes the exact state of that layer (thanks, Stack Overflow). Here’s how to do that:
docker run -it --rm 971a5d5dad01 sh
The -it argument attaches standard input to the container (-i) and allocates a pseudo-TTY (-t). The --rm option means that the container will be removed when you Ctrl+D back out of it. sh is the command we want to run in the container—using a shell lets us start interacting with it.
Now that we have a shell against that layer, we can use regular unix commands to start exploring it. du -m (m for MB) is particularly useful here, as it will show us the largest directories in the filesystem. I pipe it through sort like so:
$ docker run -it --rm abc63755616b sh
# du -m | sort -n
...
58 ./usr/local/lib/python3.6
70 ./usr/local/lib
71 ./usr/local
76 ./usr/lib/python3.5
188 ./usr/lib
306 ./usr
350 .
Straight away we can start seeing where the space is being taken up in our image.
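Another quick way to see which steps are contributing the most is docker history, which lists every layer in a built image along with its size and the instruction that created it:
# Show the size of each layer in the image
$ docker history datasette-small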
Deleting unnecessary files
I spent quite a while inspecting different stages of my builds to try and figure out where the space was going. The Alpine copy recipe worked neatly, but I was still a little over the limit. When I started to dig around in my final image I spotted some interesting patterns—in particular, the /usr/local/lib/python3.6/site-packages/uvloop directory was 17MB!
# du -m /usr/local | sort -n -r | head -n 5
96 /usr/local
95 /usr/local/lib
83 /usr/local/lib/python3.6
36 /usr/local/lib/python3.6/site-packages
17 /usr/local/lib/python3.6/site-packages/uvloop
That seems like a lot of disk space for a compiled C module, so I dug in further…
It turned out the uvloop folder still contained a bunch of files that were used as part of the compilation, including a 6.7MB loop.c file and a bunch of .pxd and .pyd files that are compiled by Cython. None of these files are needed after the extension has been compiled, but they were there, taking up a bunch of precious space.
So I added the following to my Dockerfile:
RUN find /usr/local/lib/python3.6 -name '*.c' -delete
RUN find /usr/local/lib/python3.6 -name '*.pxd' -delete
RUN find /usr/local/lib/python3.6 -name '*.pyd' -delete
Then I noticed that there were __pycache__ directories that weren’t needed either, so I added this as well:
RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r
(The -delete flag didn’t work correctly for that one, so I used xargs instead.)
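If you prefer, the cleanup can be collapsed into a single RUN instruction; something like this should be equivalent (the -prune stops find from descending into the __pycache__ directories it is about to remove):
RUN find /usr/local/lib/python3.6 \( -name '*.c' -o -name '*.pxd' -o -name '*.pyd' \) -delete && \
    find /usr/local/lib/python3.6 -name '__pycache__' -prune -exec rm -rf {} +
In a multi-stage build this is purely cosmetic, since only the files that survive get copied into the final stage. In a single-stage image it would matter a lot: files removed by a later RUN still exist in the earlier layers underneath, so the deletes have to happen in the same layer that created the files.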
This shaved off around 15MB, putting me safely under the limit.
Running csvs-to-sqlite in its own stage
The above tricks had got me the smallest Alpine Linux image I could create that would still run Datasette… but Datasette Publish also needs to run csvs-to-sqlite in order to convert the user’s uploaded CSV files to SQLite.
csvs-to-sqlite has some pretty heavy dependencies of its own in the form of Pandas and NumPy. Even with the build chain installed I was having trouble installing these under Alpine, especially since building NumPy for Alpine is notoriously slow.
Then I realized that thanks to multi-stage builds there’s no need for me to use Alpine at all for this step. I switched back to python:3.6-slim-stretch and used it to install csvs-to-sqlite and compile the CSV files into a SQLite database. I also ran datasette inspect there for good measure.
Then in my final Alpine container I could use the following to copy in just those compiled assets:
COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=csvbuilder data.db data.db
Tying it all together
Here’s an example of a full Dockerfile generated by Datasette Publish that combines all of these tricks (a simplified sketch of the end result follows this list). To summarize, here’s what it does:
- Spin up a python:3.6-slim-stretch container—call it csvbuilder
  - apt-get install -y python3-dev gcc so we can install compiled dependencies
  - pip install csvs-to-sqlite datasette
  - Copy in the uploaded CSV files
  - Run csvs-to-sqlite to convert them into a SQLite database
  - Run datasette inspect data.db to generate an inspect-data.json file with statistics about the tables. This can later be used to reduce startup time for datasette serve.
- Spin up a python:3.6-alpine container—call it buildit
  - We need a build chain to compile a copy of datasette for Alpine Linux… apk add --no-cache gcc python3-dev musl-dev alpine-sdk
  - Now we can pip install datasette, plus any requested plugins
  - Reduce the final image size by deleting any __pycache__ or *.c, *.pyd and *.pxd files
- Spin up a fresh python:3.6-alpine for our final image
  - Copy in data.db and inspect-data.json from csvbuilder
  - Copy across /usr/local/lib/python3.6 and /usr/local/bin/datasette from buildit
  - …and we’re done! Expose port 8006 and set datasette serve to run when the container is started
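Stitched together, the multi-stage version is shaped roughly like this (again a simplified sketch rather than the exact generated Dockerfile; the CSV handling, plugin installation and precise datasette options will differ):
FROM python:3.6-slim-stretch as csvbuilder
# Stage 1: convert the uploaded CSVs into a SQLite database
RUN apt-get update && apt-get install -y python3-dev gcc
RUN pip install csvs-to-sqlite datasette
ADD example.csv example.csv
RUN csvs-to-sqlite example.csv data.db
RUN datasette inspect data.db --inspect-file inspect-data.json

FROM python:3.6-alpine as buildit
# Stage 2: compile Datasette and its dependencies for Alpine
RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
RUN pip install datasette
# Strip compilation leftovers to save space
RUN find /usr/local/lib/python3.6 \( -name '*.c' -o -name '*.pxd' -o -name '*.pyd' \) -delete
RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r

FROM python:3.6-alpine
# Stage 3: the final image contains only what is needed to serve the data
COPY --from=csvbuilder data.db data.db
COPY --from=csvbuilder inspect-data.json inspect-data.json
COPY --from=buildit /usr/local/lib/python3.6 /usr/local/lib/python3.6
COPY --from=buildit /usr/local/bin/datasette /usr/local/bin/datasette
EXPOSE 8006
CMD ["datasette", "serve", "data.db", "--host", "0.0.0.0", "--port", "8006", "--inspect-file", "inspect-data.json"]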
Now that I’ve finally learned how to take advantage of multi-stage builds I expect I’ll be using them for all sorts of interesting things in the future.