Introduction
When building containers we can leverage Docker's layer caching mechanism to speed up builds. How easy this is depends on the CI/CD tool used to build the images. With an offering like Jenkins we usually don't need to worry much, as we can expect a copy of the previously built container image to be present on the build node. Cloud Build, on the other hand, is serverless and starts light and stateless, so it won't have a copy of the previous build. One could simply pull the whole image as the first step of the build, but that isn't optimal: it drags in extra layers, and in many cases, such as a multi-stage Dockerfile, those layers may not even exist in the previous image because the builder stage that holds the dependencies gets discarded.
How to approach the problem then?
We can consider 2 cases here: -
- Single Stage Dockerfile (example python)
- Multistage Dockerfile (example Golang, Java)
But in the end the process is the same for both. In this approach we duplicate the original Dockerfile (name it Dockerfile.builder) and comment out all lines after the build dependencies have been fetched. Next, follow the steps below (a consolidated command sketch follows the list): -
- Try to pull a docker image with the tag builder, appending || true at the end of the command so the step doesn't fail when the image doesn't exist yet.
- Build Dockerfile.builder with --cache-from IMAGE_NAME:builder. If the cache-from image doesn't exist, this step still won't fail. This step ensures that any change in the dependencies is reflected in the builder image.
- Then push the builder image.
- Then build the main Dockerfile, again with --cache-from IMAGE_NAME:builder.
- Then push the main image.
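Putting the steps together, the command sequence in any CI system boils down to the sketch below; IMAGE_NAME and TAG are placeholders for your actual registry path and release tag: -
# Pull the previous builder image; "|| true" keeps the step from failing when the image doesn't exist yet
docker pull IMAGE_NAME:builder || true
# Rebuild the builder image, using the pulled copy as cache
docker build . -t IMAGE_NAME:builder --cache-from=IMAGE_NAME:builder -f Dockerfile.builder
# Push the builder image so the next pipeline run can reuse it as cache
docker push IMAGE_NAME:builder
# Build the main image, using the builder image as cache
docker build . -t IMAGE_NAME:TAG --cache-from=IMAGE_NAME:builder -f Dockerfile
# Push the main image
docker push IMAGE_NAME:TAG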
Pros
- No need to pull complete container images
- Works perfectly with multi-stage Dockerfiles
- No need to always tag the main container image as latest
Cons
- Adds a bit of complexity to the pipeline
- Doubles the number of Dockerfiles, but with little operational overhead since we only need to comment out the lower half of the Dockerfile
- In the case of Python, one could instead use a prebuilt base image with the packages baked in, but that image then has to be maintained separately
Working with single-stage Dockerfile
The flow of the Dockerfile has to be right for caching to help. For example, consider a Python Flask application. You shouldn't write a Dockerfile like this: -
FROM python:3.8.5-slim
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8080
CMD ["gunicorn","wsgi:app"]
Here, notice that when we run COPY . . we copy the code files along with requirements.txt, and the code is what changes most often, so this layer's cache is invalidated on almost every build. Instead we should copy only requirements.txt, fetch the dependencies, and do the COPY . . after that. The core idea is to arrange the container layers so that static content lies in the earlier instructions and dynamic content, like copying the source code, lies at the bottom of the Dockerfile, allowing caching to work just fine. Check the Dockerfile below for reference: -
FROM python:3.8.5-slim
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["/bin/sh","-c","gunicorn wsgi:app"]
Next we write Dockerfile.builder. As you can see it is the same Dockerfile, but everything after fetching the dependencies is commented out: -
FROM python:3.8.5-slim
COPY requirements.txt .
RUN pip3 install -r requirements.txt
#COPY . .
#EXPOSE 8080
#CMD ["/bin/sh","-c","gunicorn wsgi:app"]
Building in local Docker env
Since we are building in a local environment, I will skip the step where we pull the builder image.
# Build the builder image, using its own previous image as cache to skip a rebuild when the dependencies haven't changed
docker build . -t myflaskapp:builder --cache-from=myflaskapp:builder -f Dockerfile.builder
# Building the application container with builder as cache.
docker build . -t myflaskapp:app-v1 --cache-from=myflaskapp:builder -f Dockerfile
We can see a significant speedup in cases where the dependencies are heavy.
Building in CloudBuild Pipeline
TODO
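For reference, here is a minimal cloudbuild.yaml sketch of the steps described earlier; the registry path gcr.io/$PROJECT_ID/myflaskapp and the app-v1 tag are assumptions, so adjust them for your own project: -
steps:
# Pull the previous builder image; "|| exit 0" keeps the step green when the image doesn't exist yet
- name: 'gcr.io/cloud-builders/docker'
  entrypoint: 'bash'
  args: ['-c', 'docker pull gcr.io/$PROJECT_ID/myflaskapp:builder || exit 0']
# Rebuild the builder image, using the pulled copy as cache
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/myflaskapp:builder', '--cache-from', 'gcr.io/$PROJECT_ID/myflaskapp:builder', '-f', 'Dockerfile.builder', '.']
# Push the builder image so the next build can pull it as cache
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/myflaskapp:builder']
# Build the application image, using the builder image as cache
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/myflaskapp:app-v1', '--cache-from', 'gcr.io/$PROJECT_ID/myflaskapp:builder', '-f', 'Dockerfile', '.']
# Push the application image at the end of the build
images: ['gcr.io/$PROJECT_ID/myflaskapp:app-v1']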
A possible objection
One might say that we can get the same performance by using the main docker image as the cache. I won't deny that this works well in situations like the one above, but there are cases where the built binaries are as big as 800MB, which means that by pulling the main image instead of the builder you are pulling 800MB of extra junk. This objection does not hold at all for multi-stage Dockerfiles, as we will see below.
Working with multi-stage Dockerfile
Consider a simple Golang application. Below is the main Dockerfile of our application: -
FROM golang:1.15 as builder
WORKDIR /go/src/app
COPY go.mod .
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o main ./...
FROM alpine:latest
COPY --from=builder /go/src/app/main .
CMD ["./main"]
Looking at this Dockerfile, you can immediately see that the stage named builder downloads all the dependencies for our application but is excluded completely from the final image. So in this case we get no dependency caching by default.
To solve this, our Dockerfile.builder will look like this: -
FROM golang:1.15 as builder
WORKDIR /go/src/app
COPY go.mod .
RUN go mod download
#COPY . .
#RUN CGO_ENABLED=0 GOOS=linux go build -o main ./...
#
#FROM alpine:latest
#COPY --from=builder /go/src/app/main .
#CMD ["./main"]
Looking at this Dockerfile, you will realize we stop at the point where the dependencies are fetched, so the output container image contains only the builder stage.
Building in local Docker env
# Build the builder image, using its own previous image as cache to skip a rebuild when the dependencies haven't changed
docker build . -t godemo:builder --cache-from=godemo:builder -f Dockerfile.builder
# Building the application container with builder as cache.
docker build . -t godemo:app-v1 --cache-from=godemo:builder -f Dockerfile
This performs as well as the single-stage Dockerfile. Here we do have the overhead of pulling two base images, Alpine and Golang, but we still see a significant speedup since we only need to compile the application.
Building in CloudBuild Pipeline
TODO
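The Cloud Build configuration here would be essentially the same as the cloudbuild.yaml sketch in the single-stage section, with godemo in place of myflaskapp; the builder tag again carries only the go mod download layers that the final image discards.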