Multi Stage Python

When shipping applications in containers, one is often confronted with overly large final images. Multi-stage builds are a common way to avoid this, especially for compiled languages like Go or Java. In our latest blog post we show how to use multi-stage builds for Python images to bring down image sizes and thereby improve security.

Using Docker multi-stage builds to reduce image size

When writing container images, it is always preferable to have smaller images and to only include what is really needed. This has two main advantages:

  • images take up less storage space
  • security vulnerabilities of software that is not installed cannot be exploited

Example Dockerfile

Let’s take a very simple image that starts from an official Python image, using the slim variant of the current stable Debian release, bullseye. We are going to use Python 3.8, but the exact version is not important. The image is pulled from Docker Hub.

The Dockerfiles are provided in the `docker` directory of this repository and can be built with `./bin/build.sh` on Linux.

Reducing the image size

The all-in-one solution

We are simply going to `COPY` over a `requirements.txt` and install its contents using `pip`, as this is a fairly common use case. I have chosen to install pyodbc because building it requires some system dependencies (a compiler and the unixODBC headers) to be installed first.

Note: The Dockerfiles are minimal reproducible examples and do not follow some common best practices!

```dockerfile
FROM python:3.8-slim-bullseye

RUN apt-get update && \
    apt-get install -y gcc g++ unixodbc-dev

COPY requirements.txt /tmp/requirements.txt

RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
```
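
For reference, building and tagging this image by hand might look like the following. This is only a sketch: the repository's `./bin/build.sh` may do things differently, and the path `docker/original.Dockerfile` is a hypothetical file name. The tag `original` matches the one used in the listing further down.

```bash
# Run from the repository root so that requirements.txt is part of the build context.
# docker/original.Dockerfile is an assumed file name; adjust it to the real one.
docker build -t original -f docker/original.Dockerfile .
```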

After building the image, we get an image size of 422MB:

```bash
$ docker images original
REPOSITORY   TAG       IMAGE ID       CREATED              SIZE
original     latest    09ec99a1bafa   About a minute ago   422MB
```

Introducing: multi-stage

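Multi-stage builds are a neat way to keep build dependencies from blowing up the size of your final image. I am not going to go into detail on how they work. In short, a Dockerfile can contain several `FROM` instructions, each starting a new stage, and a later stage can copy files out of an earlier one while everything else from the earlier stage is left behind.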

I am going to create wheels. From the pip documentation:

“Wheel is a built-package format, and offers the advantage of not recompiling your software during every install.”

And this is exactly what we want: compile the packages once, then install them without compiling again.
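
Outside of a container, the same idea can be tried with two `pip` commands. This is a minimal sketch, assuming a `requirements.txt` in the current directory; the `./wheels` directory name is arbitrary and chosen only for illustration.

```bash
# Step 1: build wheels for everything in requirements.txt.
# This is the step that needs the compilers and headers
# (gcc, g++ and unixodbc-dev in the pyodbc example).
python3 -m pip wheel -w ./wheels -r requirements.txt

# Step 2: install from the pre-built wheels.
# --no-index stops pip from contacting PyPI, so only the files
# found via --find-links (here ./wheels) are used.
python3 -m pip install --no-index --find-links ./wheels -r requirements.txt
```

In the Dockerfile, step 1 belongs in the builder stage and step 2 in the final stage.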

The original Dockerfile is split into two stages: the builder stage installs the build dependencies and creates the wheels. The final stage installs the packages from the wheels without needing any build tools:

```dockerfile
ARG WHEEL_DIST="/tmp/wheels"

FROM python:3.8-slim-bullseye as builder

ARG WHEEL_DIST

RUN apt-get update && \
    apt-get install -y gcc g++ unixodbc-dev

COPY requirements.txt /tmp/requirements.txt

RUN python3 -m pip wheel -w "${WHEEL_DIST}" -r /tmp/requirements.txt


FROM python:3.8-slim-bullseye

ARG WHEEL_DIST

COPY --from=builder "${WHEEL_DIST}" "${WHEEL_DIST}"

WORKDIR "${WHEEL_DIST}"

RUN pip3 --no-cache-dir install *.whl
``` 

The size of the image has gone down significantly:

```bash
$ docker images multi-stage
REPOSITORY    TAG       IMAGE ID       CREATED          SIZE
multi-stage   latest    b90d97997b3b   44 seconds ago   128MB
```

The difference is starker when considering the size of the base image:

```bash
$ docker images python:3.8-slim-bullseye
REPOSITORY   TAG                 IMAGE ID       CREATED      SIZE
python       3.8-slim-bullseye   caf584a25606   5 days ago   122MB
```
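
To make the comparison explicit, subtract the base image size from each result. The sizes are the ones reported by `docker images` above; the variable names are only for illustration.

```bash
#!/usr/bin/env bash
# Image sizes in MB, copied from the `docker images` listings above.
base=122
original=422
multi_stage=128

# How much each build adds on top of the base image.
original_overhead=$((original - base))
multi_stage_overhead=$((multi_stage - base))

echo "all-in-one adds ${original_overhead}MB on top of the base image"
echo "multi-stage adds ${multi_stage_overhead}MB on top of the base image"
```

The all-in-one build adds 300MB to the base image, while the multi-stage build adds only 6MB.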

Increasing security

Smaller images can also be more secure, because they contain less software that could be vulnerable. To illustrate this point, we are going to use Trivy to scan our images for known security vulnerabilities.
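
A scan of all three images might look like the following sketch. It assumes that `trivy` is installed and that the images were built locally under the tags `original` and `multi-stage` used earlier. The counts depend on the state of the vulnerability database at scan time, so a fresh run will likely report different numbers from the ones listed here.

```bash
# Scan the base image and the two images built from it.
# For each image, Trivy reports the findings together with their severity.
for image in python:3.8-slim-bullseye original multi-stage; do
    trivy image "$image"
done
```

To focus on the most serious findings, the report can be filtered, for example with `trivy image --severity HIGH,CRITICAL original`.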

The base image has 85 vulnerabilities:

  • LOW: 12
  • MEDIUM: 35
  • HIGH: 30
  • CRITICAL: 8

The original image that includes the build tools has 331 total vulnerabilities:

  • UNKNOWN: 2
  • LOW: 24
  • MEDIUM: 175
  • HIGH: 109
  • CRITICAL: 21

The multi-stage image again has 85 vulnerabilities, which it “inherits” from the base image, but it does not introduce any new ones.

Summary

When using Python, do utilise wheels and multi-stage builds to decrease the image size and increase the security of your deployments.