6 Demystifying Docker
Docker is an open source tool for building, sharing, and running software. It is currently the dominant way software developers capture a development environment and an increasingly popular tool for taking code to production.
Docker has become so popular because it makes code portable. In most cases, the only system prerequisite to run almost any Docker container is Docker itself.1 Everything else comes in the container.
Unlike language-specific environment-as-code tools like {renv} or {venv}, Docker captures the entire reproducibility stack, all the way down to the operating system. If you’ve ever tried to have someone else run code you’ve written, the appeal is obvious.
Docker has so many strengths that it’s easy to believe it will solve all reproducibility problems. It’s worth keeping a little perspective.
While Docker usually ensures that the code inside will run, it doesn’t fully solve reproducibility or IT/Admin concerns. Some highly regulated contexts consider a container insufficiently rigorous for reproducibility purposes.
Running a container also makes it easy to stand things up, but integrations to other services, like data sources and authentication, still must be configured externally.
Lastly, running a container adds one more service between you and the code you’re trying to run. Without a good mental model of how the various services interact, trying to get Docker to work can be a frustrating mess.
This chapter is solely on the local use of Docker for building and running containers. For more on running containers in a production context, including using Kubernetes, see Chapter 17.
Really learning to use Docker is the subject of many books and online tutorials. In this chapter, my aim is to provide a little familiarity with the basic terminology and concepts for how to run other people’s Docker containers and how to create your own. In the lab at the end of the chapter, we’ll practice hosting an API inside a container.
6.1 Container lifecycle
Docker is primarily concerned with the creation, movement, and running of containers. A container is a software entity that packages code and its dependencies down to the operating system. Containers are one way to have completely different environments coexisting side by side on one physical machine.
Containers aren’t the only way to run multiple virtual environments on one host. They’re just the most talked about right now.
And Docker containers aren’t the only type of container. You may run across other kinds, like Apptainer (formerly Singularity), which is often used in high-performance computing contexts.
A Docker Image is an immutable snapshot of a container. When you want to run a container, you pull the image and run it as an instance or container, which is what you’ll actually interact with.
Confusingly, the term container is used both to refer to a running instance (“Here’s my running container”) and to the image it was created from (“I used the newest Ubuntu container”).
I prefer the term instance for the running container to eliminate this confusion.
Images are most often stored in registries, which are similar to Git repositories. The most common registry for public containers is Docker Hub, which allows public and private hosting of images in free and paid tiers. Docker Hub includes official images for operating systems and programming languages, as well as many, many community-contributed containers. Some organizations run their own private registries, usually using registry-as-a-service offerings from cloud providers.2
Images are built from Dockerfiles – the code that defines the image. Dockerfiles are usually stored in a Git repository. It’s common to build and push images in a CI/CD pipeline so changes to the Dockerfile are immediately reflected in the registry.
You can control Docker containers from the Docker Desktop app. If you’re going to be using Docker on a server, you’ll mostly be interacting via the command line interface (CLI). All Docker CLI commands are formatted as docker <command>.
The graphic below shows the different states for a container and the CLI commands to move from one to another.
I’ve included docker pull on the graphic for completeness, but you’ll almost never run it. docker run auto-pulls the image(s) it needs.
Instances run on an underlying machine, called a host. A primary feature – also a liability – of using containers is that they are completely ephemeral. Unless configured otherwise, anything inside an instance when it shuts down vanishes without a trace.
See Appendix D for a cheatsheet with a list of common Docker commands.
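As a hypothetical walkthrough of that lifecycle, the commands below move a single container through its states. The hello-world image is just a convenient public example; any image would do.

```shell
# Pull the image from the registry (docker run would do this automatically)
docker pull hello-world

# Create and start an instance from the image, giving it a name
docker run --name lifecycle-demo hello-world

# Stop the instance (a formality here, since hello-world exits on its own)
docker stop lifecycle-demo

# Remove the stopped instance so the name can be reused
docker rm lifecycle-demo
```

These are command fragments that require a running Docker daemon, so treat them as a sketch to adapt rather than a script to paste.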
6.1.1 Image Names
In order to build, push, pull, or run an image, you’ll need to know which image you’re talking about. Every image has a name that consists of an id and a tag.
If you’re using Docker Hub, container ids take the form <user>/<container name>, so I might have the container alexkgold/my-container. This should look familiar to GitHub users. Other registries may enforce similar conventions for ids, or they may allow ids in any format.
Tags are used to specify versions and variants of containers and come after the id and a colon (:). For example, the official Python Docker image has tags for each version of Python, like python:3, as well as variants for different operating systems and a slim version that saves space by excluding recommended packages.

Some tags, usually used for versions, are immutable. For example, rocker/r-ver is a container built on Ubuntu with a version of R built in; rocker/r-ver:4.3.1 is the version with R 4.3.1.

Other tags are relative to a point in time. If you don’t see a tag on a container name, it’s using the default, latest. Other common relative tags refer to the current development state of the software inside, like devel, release, or stable.
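To make the id-plus-tag format concrete, here are a few pulls of real public images; all of these tags exist on Docker Hub at the time of writing.

```shell
# No tag defaults to :latest
docker pull python

# A version tag pins a specific Python release
docker pull python:3.9

# A variant tag combines a version with a smaller base
docker pull python:3.9-slim

# An immutable version tag from the Rocker project
docker pull rocker/r-ver:4.3.1
```

Again, these require a Docker daemon to run, so they’re illustrative command fragments rather than a tested script.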
6.2 Running Containers
The docker run command runs a container image as an instance. You can run docker run <image name> to get a running container. However, most things you want to do with your instance require several command-line flags.
There are a few flags that are useful for managing how your containers run and get cleaned up.
The --name <name> flag names an instance. If you don’t provide a name, each instance gets a random alphanumeric id on start. Names are useful because they can be reused across successive instances of a container, so they’re easy to remember and to use in code.
The --rm flag automatically removes the instance after it’s done. If you don’t use the --rm flag, the container will stick around until you clean it up manually with docker rm. The --rm flag is especially useful when you’re iterating quickly, because you can’t reuse a name until you remove the container that holds it.
The -d flag runs your container in detached mode. This is useful when you want your container to run in the background and not block your terminal session. It’s useful when running containers in production, but you probably don’t want to use it when you’re trying things out and want to see logs streaming out as the container runs.
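Putting those flags together, a typical invocation might look like the following sketch; the image name my-api is hypothetical.

```shell
# Run a (hypothetical) image in the background with a memorable name,
# removing the instance automatically when it stops
docker run --rm -d --name my-api my-api

# Because we used --name, we can stop it without looking up its id;
# --rm then cleans the stopped instance up for us
docker stop my-api
```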
6.2.1 Getting information in and out
When a container runs, it is isolated from the host. This is a great feature: it means programs running inside the container can address the container’s filesystem and networking without worrying about the host outside. But it also means that using resources on the host requires explicit declarations as part of the docker run command.
In order to get data in or out of a container, you need to mount a shared volume (directory) between the container and the host with the -v flag. You specify a host directory and a container directory separated by a colon (:). Anything in the volume will be available to both the host and the container at the file paths specified.
For example, maybe you’ve got a container that runs a job against data it expects in the /data directory. On your host machine, this data lives at /home/alex/data. You could make this happen with
Terminal
docker run -v /home/alex/data:/data <image name>
Here’s a diagram of how this works.
Similarly, if you have a service running in a container on a particular port, you’ll need to map the container port to a host port with the -p flag.
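For example, if a containerized service listens on port 8080 inside the container, you might expose it on port 80 of the host. The image name my-api is again a stand-in.

```shell
# Map host port 80 to container port 8080 (-p <host port>:<container port>)
docker run -p 80:8080 my-api

# The service is now reachable on the host at http://localhost:80
```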
6.2.2 Other runtime commands
If you want to see what containers you’ve got, docker ps lists them. This is especially useful for getting instance ids if you didn’t bother with names.
To stop a running container, docker stop does so nicely, and docker kill terminates it immediately.
You can view the logs from a container with docker logs.
Lastly, if you need to run a command inside a running instance, you can use docker exec. This is most commonly used to access the command line inside the container, as if SSH-ing to a server, with docker exec -it <container> /bin/bash.
While it’s normal to SSH into a server to poke around, doing the same in a container is somewhat of an anti-pattern. Generally, you should prefer to review logs and adjust Dockerfiles and run commands rather than exec in.
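A typical debugging session with these commands might look like the following; the container name my-api is a stand-in for whatever you named your instance.

```shell
# List running instances (add -a to include stopped ones)
docker ps

# Stream the logs from a named instance
docker logs my-api

# As a last resort, open an interactive shell inside the instance
docker exec -it my-api /bin/bash
```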
6.3 Building Images from Dockerfiles
A Dockerfile is a set of instructions that you use to build a Docker image. If you know how to accomplish something from the command line, you shouldn’t have too much trouble building a Dockerfile to do the same.
One thing to consider when creating Dockerfiles is that the resulting image is immutable, meaning that anything you build into the image is forever frozen in time. You’ll definitely want to set up the versions of R and Python and install system requirements in your Dockerfile. Depending on the purpose of your container, you may want to copy in code, data, and/or R and Python packages, or you may want to mount those in from a volume at runtime.
There are many Dockerfile commands. You can review them all in the Dockerfile documentation, but here are the handful that are enough to build most images.
- FROM – specifies the base image; usually the first line of the Dockerfile.
- RUN – runs any command as if you were sitting at the command line inside the container.
- COPY – copies a file from the host filesystem into the container.
- CMD – specifies what command to run on the container’s shell when it runs; usually the last line of the Dockerfile.3
Every Dockerfile command defines a new layer. A great feature of Docker is that it only rebuilds the layers it needs to when you make changes. For example, take the following Dockerfile:
FROM ubuntu:latest
COPY my-data.csv /data/data.csv
RUN ["head", "/data/data.csv"]
Let’s say I wanted to change the head command to tail. Rebuilding this container would be nearly instantaneous, because the rebuild would only start after the COPY command.
Once you’ve created your Dockerfile, you build it into an image using docker build -t <image name> <build directory>. If you don’t provide a tag, the default tag is latest.
You can then push the image to Docker Hub or another registry using docker push <image name>.
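End to end, building and publishing an image might look like this; the Docker Hub username alexkgold and the v1 tag are illustrative.

```shell
# Build the image from the Dockerfile in the current directory,
# tagging it with an id and version in one step
docker build -t alexkgold/my-container:v1 .

# Log in to Docker Hub (prompts for credentials)
docker login

# Push the tagged image to the registry
docker push alexkgold/my-container:v1
```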
6.4 Comprehension Questions
- Draw a mental map of the relationship between the following: Dockerfile, Docker Image, Docker Registry, Docker Container
- When would you want to use each of the following flags for docker run? When wouldn’t you? -p, --name, -d, --rm, -v
- What are the most important Dockerfile commands?
6.5 Lab: Putting an API in a Container
Putting an API into a container is a popular way to host them. In this lab, we’re going to put the Penguin Model Prediction API from Chapter 2 into a container.
If you’ve never used Docker before, start by installing Docker Desktop on your computer.
You should feel free to write your own Dockerfile to put the API in a container. If you want to make it easy, the {vetiver} package, which you’ll remember auto-generated the API for us, can also auto-generate a Dockerfile. Look at the package documentation for details.
Once you’ve generated your Dockerfile, take a look at it. Here’s the one for my model:
Dockerfile
# Generated by the vetiver package; edit with care
# start with python base image
FROM python:3.9
# create directory in container for vetiver files
WORKDIR /vetiver
# copy and install requirements
COPY vetiver_requirements.txt /vetiver/requirements.txt
#
RUN pip install --no-cache-dir --upgrade -r /vetiver/requirements.txt
# copy app file
COPY app.py /vetiver/app/app.py
# expose port
EXPOSE 8080
# run vetiver API
CMD ["uvicorn", "app.app:api", "--host", "0.0.0.0", "--port", "8080"]
This auto-generated Dockerfile is very nicely commented, so it’s easy to follow.
This container follows the best practices from Chapter 2. We’d expect the model to be updated much more frequently than the container itself, so the model isn’t built into the container. Instead, the container knows how to fetch the model using the {pins} package.
Now build the container using docker build -t penguin-model . (note the trailing dot, which is the build directory).
You can run the container using
docker run --rm -d \
-p 8080:8080 \
--name penguin-model \
penguin-model
If you go to http://localhost:8080, you’ll find that…it doesn’t work? Why? If you run the container attached (remove the -d from the run command), you’ll get some feedback that might be helpful.
In the Dockerfile, we copy app.py into the container. Let’s take a look at that file to see if we can find any hints.
app.py
from vetiver import VetiverModel
import vetiver
import pins

b = pins.board_folder('/data/model', allow_pickle_read=True)
v = VetiverModel.from_pin(b, 'penguin_model', version = '20230422T102952Z-cb1f9')

vetiver_api = vetiver.VetiverAPI(v)
api = vetiver_api.app
Look at that (very long) line 6. The API is connecting to a local directory to pull the model. Is your Spidey-Sense tingling? Something about container filesystem vs host filesystem?
That’s right: we put our model at /data/model on our host machine, but the API inside the container is looking for /data/model inside the container – which doesn’t exist!
This is a case where we need to mount a volume into the container like so:
docker run --rm -d \
-p 8080:8080 \
--name penguin-model \
-v /data/model:/data/model \
penguin-model
And NOW you should be able to get your model up in no time.
6.5.1 Lab Extensions
Right now, logs from the API just stay inside the container instance. But that means that the logs go away when the container does. That’s obviously bad if the container dies because something goes wrong.
How might you make sure that the container’s logs get written somewhere more permanent?
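As a hint, one common approach (a sketch, not the only answer) is to have the API write its logs to a directory that’s mounted from the host, so they survive the instance. The log path /var/log/penguin-api here is hypothetical – it would have to match wherever your API actually writes logs.

```shell
# Mount a host directory over the (hypothetical) path where the API
# writes its logs, in addition to the model volume
docker run --rm -d \
  -p 8080:8080 \
  --name penguin-model \
  -v /data/model:/data/model \
  -v /home/alex/logs:/var/log/penguin-api \
  penguin-model
```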
This was truer before the introduction of M-series chips for Macs. Chip architecture differences fall below the level that a container captures and many popular containers wouldn’t run on new Macs. These issues are getting better over time and will probably fully disappear relatively soon.↩︎
The big three container registries are AWS Elastic Container Registry (ECR), Azure Container Registry, and Google Container Registry.↩︎
You may also see ENTRYPOINT, which sets the command that CMD runs against. Usually the default, /bin/sh -c, which runs CMD in the shell, will be the right choice.↩︎