Using a custom container image provides the most flexibility for training on Vertex AI. To learn how using a custom container image differs from using a Python training application with a prebuilt container, read Training code requirements.
This guide walks through the following steps:
- Creating a custom container:
  - Writing a Dockerfile that sets up your container to work with Vertex AI and includes dependencies needed for your training application.
  - Building and running your Docker container locally.
- Pushing the container image to Artifact Registry.
Before you begin
To configure an Artifact Registry API repository and set up Docker in your development environment, follow Artifact Registry's Quickstart for Docker. Specifically, make sure to complete the following steps of the quickstart:
- Before you begin
- Choose a shell
- Create a Docker repository
- Configure authentication
Create a custom container image
We recommend two possible workflows for creating a custom container image:
- Write your training code. Then, use the gcloud CLI's local-run command to build and test a custom container image based on your training code without writing a Dockerfile yourself. This workflow can be more straightforward if you are not familiar with Docker. If you follow this workflow, you can skip the rest of this section.
- Write your training code. Then, write a Dockerfile and build a container image based on it. Finally, test the container locally. This workflow offers more flexibility, because you can customize your container image as much as you want.
The rest of this section walks through an example of the latter workflow.
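For comparison, the first workflow can be sketched as a small Python helper that assembles the local-run invocation. This is a minimal sketch, not an official recipe: the flag names reflect the gcloud reference as best understood here (confirm them with `gcloud ai custom-jobs local-run --help`), and every value passed in is a hypothetical placeholder.

```python
import shlex


def local_run_command(base_image_uri, package_path, script, output_image_uri):
    """Return the argv list for `gcloud ai custom-jobs local-run`.

    All four arguments are caller-supplied placeholders, not values
    taken from this guide.
    """
    return [
        "gcloud", "ai", "custom-jobs", "local-run",
        f"--executor-image-uri={base_image_uri}",   # base image to build on
        f"--local-package-path={package_path}",     # directory holding your code
        f"--script={script}",                       # training script inside that directory
        f"--output-image-uri={output_image_uri}",   # name for the image that gets built
    ]


if __name__ == "__main__":
    argv = local_run_command(
        base_image_uri="BASE_IMAGE_URI",
        package_path=".",
        script="task.py",
        output_image_uri="OUTPUT_IMAGE_URI",
    )
    # Print the command for review; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.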
Training code
You can write training code using any dependencies in any programming language. Make sure your code meets the training code requirements. If you plan to use hyperparameter tuning, GPUs, or distributed training, make sure to read the corresponding sections of that document; these sections describe specific considerations for using these features with custom containers.
Create a Dockerfile
Create a Dockerfile to specify all the instructions needed to build your container image.
This section walks through creating a generic example of a Dockerfile to use for custom training. To learn more about creating a container image, read the Docker documentation's quickstart.
For use with Vertex AI, your Dockerfile needs to include commands that cover the following tasks:
- Choose a base image
- Install additional dependencies
- Copy your training code to the image
- Configure the entrypoint for Vertex AI to invoke your training code
Your Dockerfile can include additional logic, depending on your needs. For more information about each specific instruction, see the Dockerfile reference.
Dockerfile command | Description | Example(s) |
---|---|---|
FROM image:tag | Specifies a base image and its tag. | |
WORKDIR /path/to/directory | Specifies the directory on the image where subsequent instructions are run. | /root |
RUN pip install pkg1 pkg2 pkg3 | Installs additional packages using pip. Note: if your base image does not have pip, install it before you install other packages. | |
COPY src/training-app.py dest/training-app.py | Copies the code for your training application into the image. Depending on how your training application is structured, this likely includes multiple files. | Example names of files in your training application: model.py, task.py |
ENTRYPOINT ["executable", "param1", "param2"] | Sets up the entrypoint to invoke your training code. When you start custom training, you can override this entrypoint by specifying the command field in your ContainerSpec. You can also specify the args field in the ContainerSpec to provide additional arguments to the entrypoint (and override the container image's CMD instruction if it has one). | ["python", "task.py"] |
The logic in your Dockerfile may vary according to your needs, but in general it resembles this:
# Specifies base image and tag
FROM image:tag
WORKDIR /root

# Installs additional packages
RUN pip install pkg1 pkg2 pkg3

# Downloads training data
RUN curl https://example-url/path-to-data/data-filename --output /root/data-filename

# Copies the trainer code to the docker image.
COPY your-path-to/model.py /root/model.py
COPY your-path-to/task.py /root/task.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]
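The ENTRYPOINT set in the image is only a default: the command and args fields of the ContainerSpec can replace or extend it when you submit a job. The sketch below shows one way those fields might be laid out in a worker pool spec. It assumes the snake_case field names accepted by the Vertex AI SDK for Python; the helper name `worker_pool_spec` and the default machine type are illustrative assumptions, not values from this guide.

```python
def worker_pool_spec(image_uri, command=None, args=None,
                     machine_type="n1-standard-4", replica_count=1):
    """Build one worker pool spec dict for a custom-container job.

    machine_type is an assumed default; choose one that suits your workload.
    """
    container_spec = {"image_uri": image_uri}
    if command:
        # Overrides the image's ENTRYPOINT.
        container_spec["command"] = list(command)
    if args:
        # Passed to the entrypoint; overrides the image's CMD if it has one.
        container_spec["args"] = list(args)
    return {
        "machine_spec": {"machine_type": machine_type},
        "replica_count": replica_count,
        "container_spec": container_spec,
    }


if __name__ == "__main__":
    # Keep the image's ENTRYPOINT and only pass extra arguments to it.
    spec = worker_pool_spec("IMAGE_URI", args=["--epochs", "5"])
    print(spec)
```

Leaving `command` unset keeps the entrypoint baked into the image, which is usually what you want when the image was built for a single training script.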
(Optional) Adjust your Dockerfile for TPU VMs
If you want to train on Vertex AI using a TPU VM, then you must adjust your Dockerfile to install specially built versions of the tensorflow and libtpu libraries. Learn more about adjusting your container for use with a TPU VM.
Build the container image
Create the correct image URI by using environment variables, and then build the Docker image:
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export REPO_NAME=REPOSITORY_NAME
export IMAGE_NAME=IMAGE_NAME
export IMAGE_TAG=IMAGE_TAG
export IMAGE_URI=us-central1-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${IMAGE_NAME}:${IMAGE_TAG}
docker build -f Dockerfile -t ${IMAGE_URI} ./
In these commands, replace the following:
- REPOSITORY_NAME: the name of the Artifact Registry repository that you created in the Before you begin section.
- IMAGE_NAME: a name of your choice for your container image.
- IMAGE_TAG: a tag of your choice for this version of your container image.
Learn more about Artifact Registry's requirements for naming your container image.
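The image URI built by the export commands above follows a fixed pattern, which can also be expressed as a small Python function. This is a minimal sketch: the function name `build_image_uri` is hypothetical, the default location mirrors the us-central1 host used above, and the validation only catches empty values or whitespace rather than enforcing Artifact Registry's full naming rules.

```python
def build_image_uri(project_id, repo_name, image_name, image_tag,
                    location="us-central1"):
    """Compose LOCATION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG."""
    parts = {
        "location": location,
        "project_id": project_id,
        "repo_name": repo_name,
        "image_name": image_name,
        "image_tag": image_tag,
    }
    for name, value in parts.items():
        # Light sanity check only; not the registry's complete naming rules.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{name} must be non-empty and contain no whitespace")
    return (f"{location}-docker.pkg.dev/"
            f"{project_id}/{repo_name}/{image_name}:{image_tag}")


if __name__ == "__main__":
    # All argument values here are made-up examples.
    print(build_image_uri("my-project", "my-repo", "trainer", "v1"))
```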
Run the container locally (optional)
Verify the container image by running it as a container locally. You likely want to run your training code on a smaller dataset or for fewer iterations than you plan to run on Vertex AI. For example, if the entrypoint script in your container image accepts an --epochs flag to control how many epochs it runs for, you might run the following command:
docker run ${IMAGE_URI} --epochs 1
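Arguments placed after the image name are appended to the image's ENTRYPOINT, so `--epochs 1` reaches the entrypoint script directly. A minimal sketch of what such a script (here, a stand-in for task.py) might look like follows; the default of 10 epochs and the placeholder loss values are assumptions for illustration, not part of any real training application.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(description="Toy training entrypoint.")
    parser.add_argument("--epochs", type=int, default=10,
                        help="Number of passes over the training data.")
    return parser.parse_args(argv)


def train(epochs):
    """Stand-in for a real training loop; returns one fake loss per epoch."""
    losses = []
    for epoch in range(1, epochs + 1):
        loss = 1.0 / epoch  # placeholder value, not a real metric
        print(f"epoch {epoch}/{epochs} loss={loss:.3f}")
        losses.append(loss)
    return losses


if __name__ == "__main__":
    args = parse_args()
    train(args.epochs)
```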
Push the container to Artifact Registry
If the local run works, you can push the container to Artifact Registry.
First, run gcloud auth configure-docker us-central1-docker.pkg.dev if you have not already done so in your development environment. Then run the following command:
docker push ${IMAGE_URI}
Artifact Registry permissions
If you are using an Artifact Registry image from the same Google Cloud project where you're using Vertex AI, then there is no further need to configure permissions. You can immediately create a custom training job that uses your container image.
However, if you have pushed your container image to Artifact Registry in a different Google Cloud project from the project where you plan to use Vertex AI, then you must grant the Vertex AI Service Agent for your Vertex AI project permission to pull the image from the other project. Learn more about the Vertex AI Service Agent and how to grant it permissions.
To learn how to grant your Vertex AI Service Agent access to your Artifact Registry repository, read the Artifact Registry documentation about granting repository-specific permissions.
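As a rough illustration of the cross-project case, the sketch below assembles a `gcloud artifacts repositories add-iam-policy-binding` command that grants read access on one repository. It rests on two assumptions to verify against the linked documentation: that the Vertex AI Service Agent's address has the form `service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com`, and that `roles/artifactregistry.reader` is sufficient for pulling images. The function names and all argument values are hypothetical.

```python
def service_agent_member(vertex_project_number):
    """IAM member string for the Vertex AI Service Agent (assumed address format)."""
    return ("serviceAccount:service-"
            f"{vertex_project_number}@gcp-sa-aiplatform.iam.gserviceaccount.com")


def grant_reader_command(repo_name, location, image_project_id,
                         vertex_project_number):
    """Return argv that grants the service agent read access to one repository."""
    return [
        "gcloud", "artifacts", "repositories", "add-iam-policy-binding",
        repo_name,
        f"--location={location}",
        f"--project={image_project_id}",   # project that hosts the image
        f"--member={service_agent_member(vertex_project_number)}",
        "--role=roles/artifactregistry.reader",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own repository and project details.
    print(" ".join(grant_reader_command(
        "REPOSITORY_NAME", "us-central1", "IMAGE_PROJECT_ID", "123456789012")))
```

Note that the member string uses the project number of the Vertex AI project, while `--project` names the project that hosts the repository.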
What's next
- Learn more about the concepts involved in using containers.
- Learn about additional training code requirements for custom training.
- Learn how to create a custom training job or a training pipeline that uses your custom container.