Create a custom container image for training

Using a custom container image provides the most flexibility for training on Vertex AI. To learn how using a custom container image differs from using a Python training application with a pre-built container, read Training code requirements.

This guide walks through the following steps:

  1. Creating a custom container:
    1. Writing a Dockerfile that sets up your container to work with Vertex AI and includes dependencies needed for your training application.
    2. Building and running your Docker container locally.
  2. Pushing the container image to Artifact Registry.

If you want to train using code on your local computer, you can use the gcloud command-line tool's autopackaging feature to create a container image, push the image to Container Registry, and create a CustomJob resource, all with one command. This might be useful if you are not familiar with Docker. (The gcloud tool's local-run command, described in a later section of this guide, provides similar benefits.) If you use autopackaging, you can skip the rest of this guide.
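
As a rough sketch of what the one-command autopackaging flow can look like, the following shell function wraps gcloud ai custom-jobs create with a worker pool spec that points at local code. The function name, the run helper, and the DRY_RUN switch are illustrative and are not part of the gcloud tool. The machine type and executor image are values that you choose. With DRY_RUN=1, the command is printed for review instead of being run.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the autopackaging feature.
# Arguments: region, display name, machine type, executor image URI,
#            local package path, script path (relative to the package path).
autopackage_and_train() {
  run gcloud ai custom-jobs create \
    --region="$1" \
    --display-name="$2" \
    --worker-pool-spec="machine-type=$3,replica-count=1,executor-image-uri=$4,local-package-path=$5,script=$6"
}
```

For example, running DRY_RUN=1 autopackage_and_train us-central1 my-job n1-standard-4 EXECUTOR_IMAGE_URI . task.py prints the full command without creating a job.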

Before you begin

To configure an Artifact Registry repository and set up Docker in your development environment, follow Artifact Registry's Quickstart for Docker. Specifically, make sure to complete the following steps of the quickstart:

  • Before you begin
  • Choose a shell
  • Create a Docker repository
  • Configure authentication

Create a custom container image

We recommend two possible workflows for creating a custom container image:

  • Write your training code. Then, use the gcloud tool's local-run command to build and test a custom container image based on your training code without writing a Dockerfile yourself.

    This workflow can be more straightforward if you are not familiar with Docker. If you follow this workflow, you can skip the rest of this section.

  • Write your training code. Then, write a Dockerfile and build a container image based on it. Finally, test the container locally.

    This workflow offers more flexibility, because you can customize your container image as much as you want.
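
The first workflow above can be sketched as a single call to gcloud ai custom-jobs local-run, which builds an image from your training code and runs it locally. This is a minimal sketch: the function name, the run helper, and the DRY_RUN switch are hypothetical, and the base image, working directory, script, and output image name are values that you supply.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the local-run command.
# Arguments: base image URI, working directory that contains the training code,
#            script path (relative to the working directory), output image name.
build_with_local_run() {
  run gcloud ai custom-jobs local-run \
    --executor-image-uri="$1" \
    --local-package-path="$2" \
    --script="$3" \
    --output-image-uri="$4"
}
```

Running it first with DRY_RUN=1 shows the exact command before anything is built.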

The rest of this section walks through an example of the latter workflow.

Training code

You can write training code using any dependencies in any programming language. Make sure your code meets the training code requirements. If you plan to use hyperparameter tuning, GPUs, or distributed training, make sure to read the corresponding sections of that document; these sections describe specific considerations for using the features with custom containers.

Create a Dockerfile

Create a Dockerfile to specify all the instructions needed to build your container image.

This section walks through creating a generic example of a Dockerfile to use for custom training. To learn more about creating a container image, read the Docker documentation's quickstart.

For use with Vertex AI, your Dockerfile needs to include commands that cover the following tasks:

  • Choose a base image
  • Install additional dependencies
  • Copy your training code to the image
  • Configure the entrypoint for Vertex AI to invoke your training code

Your Dockerfile can include additional logic, depending on your needs. For more information about each specific instruction, see the Dockerfile reference.

Dockerfile command: FROM image:tag
Description: Specifies a base image and its tag.
Example base images with tags:

  • pytorch/pytorch:latest
  • tensorflow/tensorflow:nightly
  • python:2.7.15-jessie
  • nvidia/cuda:9.0-cudnn7-runtime

Dockerfile command: WORKDIR /path/to/directory
Description: Specifies the directory on the image where subsequent instructions are run.
Example: /root

Dockerfile command: RUN pip install pkg1 pkg2 pkg3
Description: Installs additional packages using pip.
Note: If your base image does not have pip, you must include a command to install it before you install other packages.
Example packages:

  • google-cloud-storage
  • cloudml-hypertune
  • pandas

Dockerfile command: COPY src/foo.py dest/foo.py
Description: Copies the code for your training application into the image. Depending on how your training application is structured, this likely includes multiple files.
Example names of files in your training application:

  • model.py
  • task.py
  • data_utils.py

Dockerfile command: ENTRYPOINT ["exec", "file"]
Description: Sets the entrypoint that invokes your training code. When you start custom training, you can override this entrypoint by specifying the command field in your ContainerSpec. You can also specify the args field in the ContainerSpec to provide additional arguments to the entrypoint (and override the container image's CMD instruction if it has one).
Example: ["python", "task.py"]

The logic in your Dockerfile may vary according to your needs, but in general it resembles this:

# Specifies base image and tag
FROM image:tag
WORKDIR /root

# Installs additional packages
RUN pip install pkg1 pkg2 pkg3

# Downloads training data
RUN curl https://example-url/path-to-data/data-filename --output /root/data-filename

# Copies the trainer code to the Docker image.
COPY your-path-to/model.py /root/model.py
COPY your-path-to/task.py /root/task.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]

Build the container image

Create the correct image URI by using environment variables, and then build the Docker image:

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export REPO_NAME=REPOSITORY_NAME
export IMAGE_NAME=IMAGE_NAME
export IMAGE_TAG=IMAGE_TAG
export IMAGE_URI=us-central1-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${IMAGE_NAME}:${IMAGE_TAG}

docker build -f Dockerfile -t ${IMAGE_URI} ./

In these commands, replace the following:

  • REPOSITORY_NAME: the name of the Artifact Registry repository that you created in the Before you begin section.
  • IMAGE_NAME: a name of your choice for your container image.
  • IMAGE_TAG: a tag of your choice for this version of your container image.

Learn more about Artifact Registry's requirements for naming your container image.
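
The image URI assembled above follows the pattern LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. A minimal sketch of a helper that assembles the URI and fails early when a component is empty (for example, when gcloud has no default project configured) might look like the following. The function name image_uri is hypothetical, and the check covers only empty values, not Artifact Registry's full naming rules.

```shell
# Hypothetical helper: assemble an Artifact Registry image URI.
# Arguments: location, project ID, repository name, image name, image tag.
image_uri() {
  if [ "$#" -ne 5 ]; then
    echo "usage: image_uri LOCATION PROJECT_ID REPOSITORY IMAGE TAG" >&2
    return 2
  fi
  # Reject empty components so a malformed URI never reaches docker build.
  for component in "$@"; do
    if [ -z "$component" ]; then
      echo "image_uri: empty component" >&2
      return 1
    fi
  done
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}
```

The result can be captured in a variable, for example IMAGE_URI=$(image_uri us-central1 "${PROJECT_ID}" "${REPO_NAME}" "${IMAGE_NAME}" "${IMAGE_TAG}"), so that an unset value stops the script before the build starts.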

Run the container locally (optional)

Verify the container image by running it as a container locally. You likely want to run your training code on a smaller dataset or for fewer iterations than you plan to run on Vertex AI. For example, if the entrypoint script in your container image accepts an --epochs flag to control how many epochs it runs for, you might run the following command:

docker run ${IMAGE_URI} --epochs 1
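
Because the Dockerfile uses an exec-form ENTRYPOINT, any arguments after the image URI are appended to the entrypoint command. One way to turn the local run into a repeatable check is a small wrapper that reports the container's exit status. This is only a sketch: smoke_test, run, and DRY_RUN are hypothetical names, and the --rm flag (which removes the container after it exits) is an addition to the command shown above.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical smoke test.
# Arguments: image URI, followed by arguments passed to the container's entrypoint.
smoke_test() {
  image=$1
  shift
  if run docker run --rm "$image" "$@"; then
    echo "smoke test passed" >&2
  else
    status=$?
    echo "smoke test failed with exit status $status" >&2
    return "$status"
  fi
}
```

For example, smoke_test "${IMAGE_URI}" --epochs 1 runs one epoch locally and returns a nonzero status if the training code fails.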

Push the container to Artifact Registry

If the local run works, you can push the container image to Artifact Registry.

First, run gcloud auth configure-docker us-central1-docker.pkg.dev if you have not already done so in your development environment. Then run the following command:

docker push ${IMAGE_URI}

Artifact Registry and Container Registry permissions

If you are using an Artifact Registry or Container Registry image from the same Google Cloud project where you're using Vertex AI, then you don't need to configure any further permissions. You can immediately create a custom training job that uses your container image.

However, if you have pushed your container image to Artifact Registry or Container Registry in a different Google Cloud project from the project where you plan to use Vertex AI, then you must grant the Vertex AI Service Agent for your Vertex AI project permission to pull the image from the other project. Learn more about the Vertex AI Service Agent and how to grant it permissions.

Artifact Registry

To learn how to grant your Vertex AI Service Agent access to your Artifact Registry repository, read the Artifact Registry documentation about granting repository-specific permissions.
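
A repository-specific grant might be sketched as follows. The sketch assumes that the Vertex AI Service Agent's email address has the form service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com and that the Artifact Registry Reader role (roles/artifactregistry.reader) is sufficient for pulling images. Confirm both against the linked documentation before relying on them. The function names are hypothetical.

```shell
# Hypothetical helper: derive the Vertex AI Service Agent email address
# from the project number of the project where you use Vertex AI.
# Assumption: the address follows the pattern shown in the printf format.
vertex_ai_service_agent() {
  printf 'service-%s@gcp-sa-aiplatform.iam.gserviceaccount.com\n' "$1"
}

# Hypothetical wrapper: grant the service agent read access to one repository.
# Arguments: repository name, repository location,
#            project ID that hosts the repository,
#            project number of the Vertex AI project.
grant_image_pull_access() {
  gcloud artifacts repositories add-iam-policy-binding "$1" \
    --location="$2" \
    --project="$3" \
    --member="serviceAccount:$(vertex_ai_service_agent "$4")" \
    --role="roles/artifactregistry.reader"
}
```

Defining the functions has no side effects. Only calling grant_image_pull_access changes the repository's IAM policy.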

Container Registry

Access control for Container Registry is based on a Cloud Storage bucket behind the scenes. Follow the Granting permissions section in the Container Registry access control documentation to grant your Vertex AI Service Agent the Storage Object Viewer role (roles/storage.objectViewer) for the appropriate Cloud Storage bucket.
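
As a hedged sketch, the grant can be made with gsutil iam ch on the bucket that backs the registry. The bucket naming used here (artifacts.PROJECT_ID.appspot.com for gcr.io, and REGION.artifacts.PROJECT_ID.appspot.com for regional hosts such as us.gcr.io) is an assumption to verify in the Container Registry documentation, and both function names are hypothetical.

```shell
# Hypothetical helper: derive the Cloud Storage bucket behind a registry host.
# Arguments: project ID, optional storage region prefix (for example us, eu, asia).
# Assumption: bucket names follow the patterns in the printf formats below.
gcr_bucket() {
  if [ -n "${2:-}" ]; then
    printf 'gs://%s.artifacts.%s.appspot.com\n' "$2" "$1"
  else
    printf 'gs://artifacts.%s.appspot.com\n' "$1"
  fi
}

# Hypothetical wrapper: grant Storage Object Viewer on the bucket.
# Arguments: service agent email address, bucket URL.
grant_gcr_viewer() {
  gsutil iam ch "serviceAccount:$1:objectViewer" "$2"
}
```

For example, grant_gcr_viewer SERVICE_AGENT_EMAIL "$(gcr_bucket my-image-project us)" targets the bucket for images hosted on us.gcr.io in the hypothetical project my-image-project.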

What's next