Creating a custom container image for training

This guide explains how to build a custom container image to perform custom training. Using a custom container image provides the most flexibility for training on AI Platform (Unified). To learn how using a custom container image differs from using a Python training application with a pre-built container, read Training code requirements.

The guide walks through the following steps:

  1. Creating a custom container:
    1. Writing a Dockerfile that sets up your container to work with AI Platform and includes dependencies needed for your training application.
    2. Building and running your Docker container locally.
  2. Pushing the container image to Artifact Registry.

Before you begin

To configure an Artifact Registry repository and set up Docker in your development environment, follow Artifact Registry's Quickstart for Docker. Specifically, make sure to complete the following steps of the quickstart:

  • Before you begin
  • Choose a shell
  • Create a Docker repository
  • Configure authentication
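If you have not worked through the quickstart yet, those setup steps can be sketched roughly as follows. The repository name and location here are hypothetical placeholders; substitute your own values.

```shell
# Hypothetical values; substitute your own repository name and location.
export LOCATION=us-central1
export REPO_NAME=my-training-repo

# Create a Docker-format repository in Artifact Registry (one-time setup).
gcloud artifacts repositories create ${REPO_NAME} \
    --repository-format=docker \
    --location=${LOCATION} \
    --description="Repository for custom training images"

# Configure Docker to authenticate to the regional Artifact Registry host.
gcloud auth configure-docker ${LOCATION}-docker.pkg.dev
```

These commands require an authenticated gcloud session with permission to administer Artifact Registry in your project; the quickstart covers the details.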

Creating a custom container image

Creating a custom container image involves writing training code and a Dockerfile. After that, build and test your image locally.

Writing training code

You can write training code using any dependencies in any programming language. Make sure your code meets the training code requirements. If you plan to use hyperparameter tuning, GPUs, or distributed training, make sure to read the corresponding sections of that document; these sections describe specific considerations for using these features with custom containers.

Writing a Dockerfile

Write a Dockerfile to specify all the instructions needed to build your container image.

This section walks through creating a generic example of a Dockerfile to use for custom training. To learn more about creating a container image, read the Docker documentation's quickstart.

For use with AI Platform, your Dockerfile needs to include commands that cover the following tasks:

  • Choose a base image
  • Install additional dependencies
  • Copy your training code to the image
  • Configure the entrypoint for AI Platform to invoke your training code

Your Dockerfile can include additional logic, depending on your needs. For more information about each specific instruction, see the Dockerfile reference.

  • FROM image:tag
    Specifies the base image and its tag. Example base images with tags:
      • pytorch/pytorch:latest
      • tensorflow/tensorflow:nightly
      • python:2.7.15-jessie
      • nvidia/cuda:9.0-cudnn7-runtime
  • WORKDIR /path/to/directory
    Specifies the directory on the image where subsequent instructions run. Example: /root
  • RUN pip install pkg1 pkg2 pkg3
    Installs additional packages using pip.
    Note: if your base image does not include pip, you must install it before you install other packages.
    Example packages:
      • google-cloud-storage
      • cloudml-hypertune
      • pandas
  • COPY src/foo.py dest/foo.py
    Copies the code for your training application into the image. Depending on how your training application is structured, this likely involves multiple files. Example names of files in your training application:
      • model.py
      • task.py
      • data_utils.py
  • ENTRYPOINT ["exec", "file"]
    Sets the entry point that invokes your training code. When you start custom training, you can override this entrypoint by specifying the command field in your ContainerSpec. You can also specify the args field in the ContainerSpec to provide additional arguments to the entrypoint (and override the container image's CMD instruction, if it has one). Example: ["python", "task.py"]

The logic in your Dockerfile may vary according to your needs, but in general it resembles this:

# Specifies base image and tag
FROM image:tag
WORKDIR /root

# Installs additional packages
RUN pip install pkg1 pkg2 pkg3

# Downloads training data
RUN curl https://example-url/path-to-data/data-filename --output /root/data-filename

# Copies the trainer code to the Docker image.
COPY your-path-to/model.py /root/model.py
COPY your-path-to/task.py /root/task.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]

Building the container image

Create the correct image URI by using environment variables, and then build the Docker image:

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export REPO_NAME=REPOSITORY_NAME
export IMAGE_NAME=IMAGE_NAME
export IMAGE_TAG=IMAGE_TAG
export IMAGE_URI=us-central1-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${IMAGE_NAME}:${IMAGE_TAG}

docker build -f Dockerfile -t ${IMAGE_URI} ./

In these commands, replace the following:

  • REPOSITORY_NAME: the name of the Artifact Registry repository that you created in the Before you begin section.
  • IMAGE_NAME: a name of your choice for your container image.
  • IMAGE_TAG: a tag of your choice for this version of your container image.

Learn more about Artifact Registry's requirements for naming your container image.
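With hypothetical values filled in, the resulting URI looks like the following. Echoing it is a quick way to sanity-check the name against Artifact Registry's LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG format before you build.

```shell
# Hypothetical example values; substitute your own.
export PROJECT_ID=my-project
export REPO_NAME=my-training-repo
export IMAGE_NAME=my-trainer
export IMAGE_TAG=v1

# Compose the full image URI from the parts above.
export IMAGE_URI=us-central1-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${IMAGE_NAME}:${IMAGE_TAG}

# Print the URI to confirm it is well-formed.
echo ${IMAGE_URI}
```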

Running the container locally (optional)

Verify the container image by running it as a container locally. You likely want to run your training code on a smaller dataset, or for fewer iterations, than you plan to use on AI Platform. For example, if the entrypoint script in your container image accepts an --epochs flag to control how many epochs it runs for, you might run the following command:

docker run ${IMAGE_URI} --epochs 1
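If your trainer exposes other flags, you can pass several overrides in one local test run. The flag names below are hypothetical and depend entirely on how your entrypoint script parses arguments; --rm simply removes the container when it exits so test runs do not accumulate.

```shell
# Run a short local test; --epochs and --batch-size are hypothetical
# flags that your own entrypoint script would need to define.
docker run --rm ${IMAGE_URI} --epochs 1 --batch-size 16
```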

Pushing the container to Artifact Registry

If the local run works, you can push the container to Artifact Registry.

First, run gcloud auth configure-docker us-central1-docker.pkg.dev if you have not already done so in your development environment. Then run the following command:

docker push ${IMAGE_URI}
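After the push completes, you can list the images in the repository to confirm that the new image is present. The repository name below is the hypothetical placeholder used earlier in this guide.

```shell
# List images in the repository; the newly pushed image should appear.
gcloud artifacts docker images list \
    us-central1-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}
```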

Managing Artifact Registry or Container Registry permissions

If you are using an Artifact Registry or Container Registry image from the same Google Cloud project where you're using AI Platform, then there is no further need to configure permissions. You can immediately create a custom training job that uses your container image.

However, if you have pushed your container image to Artifact Registry or Container Registry in a different Google Cloud project from the one where you plan to use AI Platform, then you must grant the AI Platform Service Agent for your AI Platform project permission to pull the image from the other project. Learn more about the AI Platform Service Agent and how to grant it permissions.

Artifact Registry

To learn how to grant your AI Platform Service Agent access to your Artifact Registry repository, read the Artifact Registry documentation about granting repository-specific permissions.
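As a sketch, the grant typically looks like the following. The project number is hypothetical, and the service agent address shown assumes the usual service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com format; confirm the exact address in the AI Platform Service Agent documentation. Run the command in the project that owns the repository.

```shell
# Hypothetical project number of the project where you use AI Platform.
export AIP_PROJECT_NUMBER=123456789012

# Grant the AI Platform Service Agent read access to the repository
# that holds the container image.
gcloud artifacts repositories add-iam-policy-binding ${REPO_NAME} \
    --location=us-central1 \
    --member="serviceAccount:service-${AIP_PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com" \
    --role=roles/artifactregistry.reader
```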

Container Registry

Access control for Container Registry is based on a Cloud Storage bucket behind the scenes. Follow the Granting permissions section in the Container Registry access control documentation to grant your AI Platform Service Agent the Storage Object Viewer role (roles/storage.objectViewer) for the appropriate Cloud Storage bucket.
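Assuming the image lives on the gcr.io host, whose backing bucket is typically named artifacts.PROJECT_ID.appspot.com (regional hosts such as eu.gcr.io use a prefixed bucket name), the grant can be sketched as follows. Both project values are hypothetical placeholders.

```shell
# Hypothetical values: the project that owns the registry, and the
# project number of the project where you use AI Platform.
export REGISTRY_PROJECT_ID=other-project
export AIP_PROJECT_NUMBER=123456789012

# Grant the AI Platform Service Agent read access to the Cloud Storage
# bucket that backs gcr.io for the registry project.
gsutil iam ch \
    serviceAccount:service-${AIP_PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com:roles/storage.objectViewer \
    gs://artifacts.${REGISTRY_PROJECT_ID}.appspot.com
```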

What's next