Create a custom container image for training

Using a custom container image provides the most flexibility for training on Vertex AI. To learn how using a custom container image differs from using a Python training application with a prebuilt container, read Training code requirements.

This guide walks through the following steps:

1. Creating a custom container:
   a. Writing a Dockerfile that sets up your container to work with Vertex AI and includes the dependencies needed for your training application.
   b. Building and running your Docker container locally.
2. Pushing the container image to Artifact Registry.

Before you begin

To configure an Artifact Registry repository and set up Docker in your development environment, follow Artifact Registry's Quickstart for Docker. Specifically, make sure to complete the following steps of the quickstart:

Before you begin
Choose a shell
Create a Docker repository
Configure authentication

Create a custom container image

We recommend two possible workflows for creating a custom container image:

1. Write your training code. Then, use the gcloud CLI's local-run command (gcloud ai custom-jobs local-run) to build and test a custom container image based on your training code without writing a Dockerfile yourself.

   This workflow can be more straightforward if you are not familiar with Docker. If you follow this workflow, you can skip the rest of this section.

2. Write your training code. Then, write a Dockerfile and build a container image based on it. Finally, test the container locally.

   This workflow offers more flexibility, because you can customize your container image as much as you want.

The rest of this section walks through an example of the latter workflow.

Training code

You can write training code using any dependencies in any programming language. Make sure your code meets the training code requirements. If you plan to use hyperparameter tuning, GPUs, or distributed training, make sure to read the corresponding sections of that document; these sections describe specific considerations for using the features with custom containers.
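
As an illustration only, the following is a minimal sketch of what a Python entrypoint script such as task.py might look like. The --epochs flag, the --model-dir flag, and the placeholder training loop are hypothetical, not part of any Vertex AI API. The sketch also assumes that the output location can be read from the AIP_MODEL_DIR environment variable when it is set; check the training code requirements for the variables that apply to your job.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse command-line flags for a hypothetical training entrypoint."""
    parser = argparse.ArgumentParser(description="Example training entrypoint")
    # --epochs lets a local smoke test run far fewer iterations than a real job.
    parser.add_argument("--epochs", type=int, default=10)
    # Assumption: use AIP_MODEL_DIR when it is set, else a local directory.
    parser.add_argument(
        "--model-dir", default=os.environ.get("AIP_MODEL_DIR", "./model")
    )
    return parser.parse_args(argv)


def train(epochs):
    """Placeholder loop; returns the number of epochs completed."""
    completed = 0
    for _ in range(epochs):
        # Real training logic (data loading, optimization steps) goes here.
        completed += 1
    return completed


def main(argv=None):
    args = parse_args(argv)
    completed = train(args.epochs)
    print(f"Finished {completed} epoch(s); would save the model to {args.model_dir}")
    return completed


if __name__ == "__main__":
    main()
```

Exposing the iteration count as a flag keeps the same image usable for both a quick local check and a full training run.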

Create a Dockerfile

Create a Dockerfile to specify all the instructions needed to build your container image.

This section walks through creating a generic example of a Dockerfile to use for custom training. To learn more about creating a container image, read the Docker documentation's quickstart.

For use with Vertex AI, your Dockerfile needs to include commands that cover the following tasks:

Choose a base image
Install additional dependencies
Copy your training code to the image
Configure the entrypoint for Vertex AI to invoke your training code

Your Dockerfile can include additional logic, depending on your needs. For more information about each specific instruction, see the Dockerfile reference.

Dockerfile command	Description	Example(s)
`FROM image:tag`	Specifies a base image and its tag.	Example base images with tags: `pytorch/pytorch:latest` `tensorflow/tensorflow:nightly` `python:2.7.15-jessie` `nvidia/cuda:9.0-cudnn7-runtime`
`WORKDIR /path/to/directory`	Specifies the directory on the image where subsequent instructions are run.	`/root`
`RUN pip install pkg1 pkg2 pkg3`	Installs additional packages using `pip`. Note: if your base image does not have `pip`, you must include a command to install it before you install other packages.	Example packages: `google-cloud-storage` `cloudml-hypertune` `pandas`
`COPY src/training-app.py dest/training-app.py`	Copies the code for your training application into the image. Depending on how your training application is structured, this likely includes multiple files.	Example names of files in your training application: `model.py` `task.py` `data_utils.py`
`ENTRYPOINT ["exec", "file"]`	Sets up the entry point to invoke your training code to run. When you start custom training, you can override this entrypoint by specifying the `command` field in your `ContainerSpec`. You can also specify the `args` field in the `ContainerSpec` to provide additional arguments to the entrypoint (and override the container image's `CMD` instruction if it has one).	`["python", "task.py"]`

The logic in your Dockerfile may vary according to your needs, but in general it resembles this:

# Specifies base image and tag
FROM image:tag
WORKDIR /root

# Installs additional packages
RUN pip install pkg1 pkg2 pkg3

# Downloads training data
RUN curl https://example-url/path-to-data/data-filename --output /root/data-filename

# Copies the trainer code to the docker image.
COPY your-path-to/model.py /root/model.py
COPY your-path-to/task.py /root/task.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]
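
As a rough mental model of how the ENTRYPOINT above combines with the command and args fields of a ContainerSpec, the following sketch computes the argument vector that the container process would start with. The function name effective_argv is hypothetical. The sketch also assumes, following the Docker and Kubernetes convention, that overriding the entrypoint discards the image's CMD; treat that case as an assumption rather than documented Vertex AI behavior.

```python
def effective_argv(entrypoint, cmd=None, command=None, args=None):
    """Return the argument vector the container process starts with.

    entrypoint, cmd: values baked into the image (ENTRYPOINT and CMD).
    command, args: optional overrides, as in a ContainerSpec.
    """
    if command:
        # Assumption (Docker/Kubernetes convention): replacing the
        # entrypoint also discards the image's default CMD.
        base = list(command)
        defaults = []
    else:
        base = list(entrypoint)
        defaults = list(cmd or [])
    # args, when provided, replace the image's CMD; otherwise CMD is used.
    tail = list(args) if args else defaults
    return base + tail


if __name__ == "__main__":
    image_entrypoint = ["python", "task.py"]
    print(effective_argv(image_entrypoint))
    print(effective_argv(image_entrypoint, args=["--epochs", "1"]))
    print(effective_argv(image_entrypoint, command=["python", "model.py"]))
```

With the example Dockerfile, passing args of --epochs 1 would start the process as python task.py --epochs 1.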

(Optional) Adjust your Dockerfile for TPU VMs

If you want to train on Vertex AI using a TPU VM, then you must adjust your Dockerfile to install specially built versions of the tensorflow and libtpu libraries. Learn more about adjusting your container for use with a TPU VM.

Build the container image

Create the correct image URI by using environment variables, and then build the Docker image:

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export REPO_NAME=REPOSITORY_NAME
export IMAGE_NAME=IMAGE_NAME
export IMAGE_TAG=IMAGE_TAG
export IMAGE_URI=us-central1-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${IMAGE_NAME}:${IMAGE_TAG}

docker build -f Dockerfile -t ${IMAGE_URI} ./

In these commands, replace the following:

REPOSITORY_NAME: the name of the Artifact Registry repository that you created in the Before you begin section.
IMAGE_NAME: a name of your choice for your container image.
IMAGE_TAG: a tag of your choice for this version of your container image.

Learn more about Artifact Registry's requirements for naming your container image.
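
If you assemble image URIs in a script rather than in the shell, the same composition can be sketched in Python. The helper name build_image_uri is hypothetical, and the checks cover only two general Docker conventions (lowercase image names and Docker's tag grammar); they are not a complete statement of Artifact Registry's naming requirements.

```python
import re

# Docker's tag grammar: 1 to 128 characters from [A-Za-z0-9_.-],
# not starting with a period or a dash.
_TAG_PATTERN = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def build_image_uri(project_id, repo_name, image_name, image_tag,
                    location="us-central1"):
    """Compose LOCATION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG."""
    if image_name != image_name.lower():
        raise ValueError(f"image name must be lowercase: {image_name!r}")
    if not _TAG_PATTERN.fullmatch(image_tag):
        raise ValueError(f"invalid image tag: {image_tag!r}")
    return (
        f"{location}-docker.pkg.dev/"
        f"{project_id}/{repo_name}/{image_name}:{image_tag}"
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own project, repository, and names.
    print(build_image_uri("my-project", "my-repo", "trainer", "v1"))
```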

Run the container locally (optional)

Verify the container image by running it as a container locally. You likely want to run your training code on a smaller dataset or for fewer iterations than you plan to run on Vertex AI. For example, if the entrypoint script in your container image accepts an --epochs flag to control how many epochs it runs for, you might run the following command:

docker run ${IMAGE_URI} --epochs 1

Push the container to Artifact Registry

If the local run works, you can push the container to Artifact Registry.

First, run gcloud auth configure-docker us-central1-docker.pkg.dev if you have not already done so in your development environment. Then run the following command:

docker push ${IMAGE_URI}
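
The build, local run, and push steps can also be scripted so that the push happens only when the local run succeeds. The following is a minimal sketch under stated assumptions: the helper names release_commands and run_in_order are hypothetical, the --epochs flag exists only if your entrypoint defines it, and Docker must be installed and authenticated for the commands to actually run.

```python
import subprocess


def release_commands(image_uri, smoke_args=("--epochs", "1"),
                     dockerfile="Dockerfile", context="./"):
    """Return the build, local-run, and push commands in order."""
    return [
        ["docker", "build", "-f", dockerfile, "-t", image_uri, context],
        ["docker", "run", image_uri, *smoke_args],
        ["docker", "push", image_uri],
    ]


def run_in_order(commands, runner=subprocess.run):
    """Run each command in turn and return the names of completed steps.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so a failed local run prevents the push.
    """
    completed = []
    for cmd in commands:
        runner(cmd, check=True)
        completed.append(cmd[1])
    return completed
```

To use it, call run_in_order(release_commands(image_uri)) with the image URI that you built earlier.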

Artifact Registry permissions

If you are using an Artifact Registry image from the same Google Cloud project where you're using Vertex AI, then you don't need to configure any additional permissions. You can immediately create a custom training job that uses your container image.

However, if you have pushed your container image to Artifact Registry in a different Google Cloud project from the project where you plan to use Vertex AI, then you must grant the Vertex AI Service Agent for your Vertex AI project permission to pull the image from the other project. Learn more about the Vertex AI Service Agent and how to grant it permissions.

To learn how to grant your Vertex AI Service Agent access to your Artifact Registry repository, read the Artifact Registry documentation about granting repository-specific permissions.
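
To check which of the two cases applies to a given image, you can compare the project in the image URI with your Vertex AI project. The following sketch uses hypothetical helper names and assumes the URI has the form LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG used earlier in this guide; it does not handle domain-scoped project IDs.

```python
def image_project(image_uri):
    """Extract PROJECT from LOCATION-docker.pkg.dev/PROJECT/REPO/IMAGE[:TAG]."""
    host, _, path = image_uri.partition("/")
    if not host.endswith("-docker.pkg.dev"):
        raise ValueError(f"not an Artifact Registry Docker URI: {image_uri!r}")
    parts = path.split("/")
    # Expect at least PROJECT, REPOSITORY, and IMAGE, all non-empty.
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected PROJECT/REPOSITORY/IMAGE in: {image_uri!r}")
    return parts[0]


def needs_cross_project_grant(image_uri, vertex_project_id):
    """True when the image is stored in a different project from Vertex AI."""
    return image_project(image_uri) != vertex_project_id
```

A result of True means the Vertex AI Service Agent needs permission to pull from the other project before the training job can start.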

What's next

Learn more about the concepts involved in using containers.
Learn about additional training code requirements for custom training.
Learn how to create a custom training job or a training pipeline that uses your custom container.