Using containers on AI Platform Training

This guide explains how to build your own custom container to run jobs on AI Platform Training.

The steps involved in using containers

The following steps show the basic process for training with custom containers:

  1. Set up a Google Cloud project and your local environment.
  2. Create a custom container:
    1. Write a Dockerfile that sets up your container to work with AI Platform Training, and includes dependencies needed for your training application.
    2. Build and test your Docker container locally.
  3. Push the container to Container Registry.
  4. Submit a training job that runs on your custom container.

Using hyperparameter tuning or GPUs requires some adjustments, but the basic process is the same.

Before you begin

Use either Cloud Shell or any environment where the Cloud SDK is installed.

Complete the following steps to set up a GCP account, enable the required APIs, and install and activate the Cloud SDK.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.


  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the AI Platform Training & Prediction, Compute Engine and Container Registry APIs.


  5. Install and initialize the Cloud SDK.
  6. Install Docker.

    If you're using a Linux-based operating system, such as Ubuntu or Debian, add your username to the docker group so that you can run Docker without using sudo:

    sudo usermod -a -G docker ${USER}

    You may need to restart your system after adding yourself to the docker group.

  7. Open Docker. To ensure that Docker is running, run the following Docker command, which returns the current time and date:
    docker run busybox date
  8. Use gcloud as the credential helper for Docker:
    gcloud auth configure-docker
  9. Optional: If you want to run the container locally with GPUs, install nvidia-docker.
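Before moving on, it can help to confirm that the tools from the steps above are actually on your PATH. The following is a minimal sketch, not part of the Cloud SDK or Docker: require_tool is a hypothetical helper name, and the check only covers whether each command exists, not whether it is configured.

```shell
# Preflight check: report whether each required CLI tool is on PATH.
# require_tool is a hypothetical helper, not part of the Cloud SDK or Docker.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the two tools this guide relies on; keep going so every gap is reported.
for tool in gcloud docker; do
  require_tool "$tool" || true
done
```

A "missing" line for either tool means the corresponding install step above still needs to be completed.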

Create a custom container

Creating a custom container involves writing a Dockerfile to set up the Docker image you'll use for your training job. After that, you build and test your image locally.

Dockerfile basics for AI Platform Training

When you create a custom container, you use a Dockerfile to specify all the commands needed to build your image.

This section walks through a generic example of a Dockerfile. You can find specific examples in each custom containers tutorial, and in the related samples.

For use with AI Platform Training, your Dockerfile needs to include commands that cover the following tasks:

  • Choose a base image
  • Install additional dependencies
  • Copy your training code to the image
  • Configure the entry point for AI Platform Training to invoke your training code

Your Dockerfile could include additional logic, depending on your needs. To learn more, see Docker's guidance on writing Dockerfiles. For details on each specific command, see the Dockerfile reference.

Dockerfile command: FROM image:tag
Description: Specifies a basic image and its tag.
Example(s): Example base images with tags:

  • pytorch/pytorch:latest
  • tensorflow/tensorflow:nightly
  • python:2.7.15-jessie
  • nvidia/cuda:9.0-cudnn7-runtime

Dockerfile command: WORKDIR /path/to/directory
Description: Specifies the directory on the image where subsequent instructions are run.
Example(s): /root

Dockerfile command: RUN pip install pkg1 pkg2 pkg3
Description: Installs additional packages using pip. Note: if your base image does not have pip, you must include a command to install it before you install other packages.
Example(s): Example packages:

  • google-cloud-storage
  • cloudml-hypertune
  • pandas

Dockerfile command: COPY src/ dest/
Description: Copies the code for your training application into the image. Depending on how your training application is structured, this likely includes multiple files.
Example(s): Example names of files in your training application:

Dockerfile command: ENTRYPOINT ["exec", "file"]
Description: Sets up the entry point to invoke your training code to run.
Example(s): ["python", ""]

The logic in your Dockerfile may vary according to your needs, but in general it resembles this:

# Specifies base image and tag
FROM image:tag

# Installs additional packages
RUN pip install pkg1 pkg2 pkg3

# Downloads training data
RUN curl https://example-url/path-to-data/data-filename --output /root/data-filename

# Copies the trainer code to the docker image.
COPY your-path-to/ /root/
COPY your-path-to/ /root/

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", ""]

Build and test your Docker container locally

  1. Create the correct image URI by using environment variables, and then build the Docker image. The -t flag names and tags the image with your choices for IMAGE_REPO_NAME and IMAGE_TAG. You can choose a different name and tag for your image.

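    export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    export IMAGE_REPO_NAME=example_custom_container_image
    export IMAGE_TAG=example_image_tag
    # Container Registry image URIs take the form gcr.io/PROJECT_ID/REPO_NAME:TAG
    export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
    docker build -f Dockerfile -t $IMAGE_URI ./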
  2. Verify the image by running it locally. Note that the --epochs flag is passed to the trainer script.

    docker run $IMAGE_URI --epochs 1
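The image URI that docker build and docker push expect for Container Registry follows the pattern gcr.io/PROJECT_ID/REPO_NAME:TAG. The sketch below assembles one and rejects uppercase repository names, which Docker does not accept. make_image_uri is a hypothetical helper, and my-project is a placeholder project ID.

```shell
# make_image_uri is a hypothetical helper that assembles a Container Registry
# URI of the form gcr.io/PROJECT_ID/REPO_NAME:TAG. It rejects repository names
# containing uppercase letters, which Docker does not accept.
make_image_uri() {
  project="$1"; repo="$2"; tag="$3"
  case "$repo" in
    *[[:upper:]]*)
      echo "repository name must be lowercase: $repo" >&2
      return 1
      ;;
  esac
  echo "gcr.io/$project/$repo:$tag"
}

# Example values; the project ID here is a placeholder.
IMAGE_URI=$(make_image_uri my-project example_custom_container_image example_image_tag)
echo "$IMAGE_URI"
```

Catching a malformed name here is cheaper than discovering it when docker build rejects the tag.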

Push the container to Container Registry

If the local run works, you can push the container to Container Registry in your project.

Run gcloud auth configure-docker first if you have not already done so, and then push the image:

docker push $IMAGE_URI
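One way to enforce the "push only if the local run works" rule is to chain the two commands. This is a sketch under assumptions: test_then_push is a hypothetical helper, and the DOCKER variable is an illustrative override that lets you print the commands (DOCKER=echo) instead of running them.

```shell
# test_then_push is a hypothetical helper: it pushes the image only when the
# local smoke test exits successfully. DOCKER can be overridden for a dry run,
# for example DOCKER=echo to print the commands instead of running them.
test_then_push() {
  uri="$1"
  docker_cmd="${DOCKER:-docker}"
  if "$docker_cmd" run "$uri" --epochs 1; then
    "$docker_cmd" push "$uri"
  else
    echo "local run failed; not pushing $uri" >&2
    return 1
  fi
}

# Dry run: prints the two docker commands without contacting a registry.
DOCKER=echo test_then_push gcr.io/my-project/example_custom_container_image:example_image_tag
```

Without the DOCKER override, the function runs the real docker binary against the URI you pass in.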

Manage Container Registry permissions

If you are using a Container Registry image from within the same project you're using to run training on AI Platform Training, then there is no further need to configure permissions for this tutorial, and you can move on to the next step.

Access control for Container Registry is based on a Cloud Storage bucket behind the scenes, so configuring Container Registry permissions is very similar to configuring Cloud Storage permissions.

If you want to pull an image from Container Registry in a different project, you need to allow your AI Platform Training service account to access the image from the other project.

  • Find the underlying Cloud Storage bucket for your Container Registry permissions.
  • Grant a role (such as Storage Object Viewer) that includes the storage.objects.get and storage.objects.list permissions to your AI Platform Training service account.
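For the first step, Container Registry stores images for the gcr.io host in a bucket named artifacts.PROJECT_ID.appspot.com, and regional hosts such as eu.gcr.io add the region as a prefix. The sketch below derives that name. gcr_bucket is a hypothetical helper, the project ID is a placeholder, and domain-scoped project IDs are not handled.

```shell
# gcr_bucket is a hypothetical helper that prints the Cloud Storage bucket
# backing a Container Registry host for a given project.
#   gcr.io            -> artifacts.PROJECT_ID.appspot.com
#   eu.gcr.io (etc.)  -> eu.artifacts.PROJECT_ID.appspot.com
gcr_bucket() {
  project="$1"
  host="${2:-gcr.io}"
  case "$host" in
    gcr.io)   echo "artifacts.$project.appspot.com" ;;
    *.gcr.io) echo "${host%.gcr.io}.artifacts.$project.appspot.com" ;;
    *)        echo "unrecognized registry host: $host" >&2; return 1 ;;
  esac
}

gcr_bucket my-registry-project
gcr_bucket my-registry-project eu.gcr.io
```

With the bucket name in hand, a command along the lines of gsutil iam ch serviceAccount:$SVC_ACCOUNT:objectViewer gs://BUCKET_NAME would grant Storage Object Viewer, assuming SVC_ACCOUNT holds your AI Platform Training service account.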

If you want to push the Docker image to a project that is different from the one you're using to submit AI Platform Training jobs, grant image pulling access to the AI Platform Training service account in the project that has your Container Registry repositories. The service account is in the format of service-$ and can be found in the IAM console.

The following command adds your AI Platform Training service account to your separate Container Registry project:

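# Before running this command, set GCR_PROJ_ID to the ID of the project that
# hosts your Container Registry repositories, and SVC_ACCOUNT to your
# AI Platform Training service account (shown in the IAM console).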

gcloud projects add-iam-policy-binding $GCR_PROJ_ID \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent

See more about how to configure access control for Container Registry.

Submit your training job

Submit the training job to AI Platform Training using the gcloud tool. Pass the URI to your Docker image using the --master-image-uri flag:

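export BUCKET_NAME=custom_containers
export MODEL_DIR=example_model_$(date +%Y%m%d_%H%M%S)
# REGION and JOB_NAME must also be set; these values are examples.
export REGION=us-central1
export JOB_NAME=custom_container_job_$(date +%Y%m%d_%H%M%S)

gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --master-image-uri $IMAGE_URI \
  -- \
  --model-dir=gs://$BUCKET_NAME/$MODEL_DIR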
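Job names are assumed here to follow the rule that they start with a letter and contain only letters, numbers, and underscores. Confirm this against the current job reference. The sketch below generates a timestamped name and checks it before submission. valid_job_name is a hypothetical helper, not a gcloud feature.

```shell
# valid_job_name is a hypothetical helper that checks a name against the rule
# assumed above: starts with a letter; only letters, digits, and underscores.
valid_job_name() {
  case "$1" in
    "" | [![:alpha:]]* | *[![:alnum:]_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# A timestamp suffix keeps names unique across repeated submissions.
JOB_NAME="custom_container_job_$(date +%Y%m%d_%H%M%S)"
if valid_job_name "$JOB_NAME"; then
  echo "job name ok: $JOB_NAME"
else
  echo "invalid job name: $JOB_NAME" >&2
fi
```

A name such as my-job fails the check because of the hyphen, so it is worth validating before calling gcloud.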

Hyperparameter tuning with custom containers

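To do hyperparameter tuning with custom containers, you need to make adjustments in three places: install cloudml-hypertune in your Dockerfile, have your training code accept each hyperparameter as a command-line argument and report each trial's metric, and add a hyperparameter specification to your job request.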

See an example of training with custom containers using hyperparameter tuning.
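On the job-request side, the hyperparameter specification can be supplied as a YAML file. The following is a minimal sketch that writes such a file. The file name, the accuracy metric tag, and the learning-rate parameter are placeholders, and they must match what your own training code reports and parses.

```shell
# Writes a hyperparameter tuning config to a local YAML file.
# The metric tag and the learning-rate parameter are placeholders; they must
# match what your training code reports and the arguments it parses.
CONFIG_FILE="${CONFIG_FILE:-hptuning_config.yaml}"
cat > "$CONFIG_FILE" <<'EOF'
trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 4
    maxParallelTrials: 2
    params:
    - parameterName: learning-rate
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
EOF
echo "wrote $CONFIG_FILE"
```

You would then pass the file with the --config flag of gcloud ai-platform jobs submit training, alongside --master-image-uri.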

Using GPUs with custom containers

For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.

  • Pre-install the CUDA toolkit and cuDNN in your container. Using the nvidia/cuda image as your base image is the recommended way to handle this, because it has the CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.

    If your training configuration uses NVIDIA A100 GPUs, then your container must use CUDA 11 or later.

  • Install additional dependencies, such as wget, curl, pip, and any others needed by your training application.

See an example Dockerfile for training with GPUs.
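The A100 requirement above can be checked mechanically when the CUDA version is encoded at the start of the base image tag, as it is in nvidia/cuda:9.0-cudnn7-runtime. This is a rough sketch: cuda_major and ok_for_a100 are hypothetical helpers, and the 11.x tag used below is illustrative, so check the registry for tags that actually exist.

```shell
# cuda_major is a hypothetical helper: it prints the CUDA major version
# encoded at the start of an nvidia/cuda image tag such as "9.0-cudnn7-runtime".
cuda_major() {
  tag="${1#*:}"        # drop an optional "repo:" prefix
  echo "${tag%%.*}"    # keep everything before the first dot
}

# ok_for_a100 succeeds when the image's CUDA major version is 11 or later.
# Non-numeric tags (for example "latest") are treated as not verifiable.
ok_for_a100() {
  [ "$(cuda_major "$1")" -ge 11 ] 2>/dev/null
}

for image in nvidia/cuda:9.0-cudnn7-runtime nvidia/cuda:11.0-cudnn8-runtime; do
  if ok_for_a100 "$image"; then
    echo "$image: CUDA 11 or later"
  else
    echo "$image: older than CUDA 11, or version not detectable"
  fi
done
```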

Using TPUs with custom containers

If you perform distributed training with TensorFlow, you can use TPUs on your worker VMs. To do this, you must configure your training job to use TPUs and specify the tpuTfVersion field when you submit your training job.

Distributed training with custom containers

When you run a distributed training job with custom containers, you can specify just one image to be used as the master, worker, and parameter server. You also have the option to build and specify different images for the master, worker, and parameter server. In this case, the dependencies would likely be the same in all three images, and you can run different code logic within each image.

In your code, you can use the environment variables TF_CONFIG and CLUSTER_SPEC. These environment variables describe the overall structure of the cluster, and AI Platform Training populates them for you in each node of your training cluster. Learn more about CLUSTER_SPEC.
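When a single image serves every role, an entry-point wrapper can branch on the task type recorded in TF_CONFIG. The sketch below assumes TF_CONFIG is single-line JSON with a "task" object that holds a "type" field. The sample value and the task_type helper are illustrative, and real training code should use a proper JSON parser instead of sed.

```shell
# task_type is a hypothetical helper that pulls task.type out of the JSON in
# TF_CONFIG with sed. This is a rough sketch: it assumes single-line JSON and
# that the "task" object has no nested objects before its "type" field.
task_type() {
  printf '%s\n' "${1:-$TF_CONFIG}" |
    sed -n 's/.*"task"[^}]*"type": *"\([^"]*\)".*/\1/p'
}

# Sample value for illustration only; on AI Platform Training the variable is
# populated for you on each node.
SAMPLE='{"cluster": {"master": ["m-0:2222"], "worker": ["w-0:2222"], "ps": ["ps-0:2222"]}, "task": {"type": "worker", "index": 0}}'

case "$(task_type "$SAMPLE")" in
  master) echo "running master logic" ;;
  worker) echo "running worker logic" ;;
  ps)     echo "running parameter server logic" ;;
  *)      echo "no task type found; running single-node logic" ;;
esac
```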

You can specify your images within the TrainingInput object when you submit a job, or through their corresponding flags in gcloud ai-platform jobs submit training.

For this example, let's assume you have already defined three separate Dockerfiles, one for each type of machine (master, worker, and parameter server). After that, you name, build, test, and push your images to Container Registry. Finally, you submit a training job that specifies your different images along with your machine configuration for your cluster.

First, run gcloud auth configure-docker if you have not already done so.

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export BUCKET_NAME=custom_containers
export MASTER_IMAGE_REPO_NAME=master_image_name
export MASTER_IMAGE_TAG=master_tag
export MASTER_IMAGE_URI=gcr.io/$PROJECT_ID/$MASTER_IMAGE_REPO_NAME:$MASTER_IMAGE_TAG
export WORKER_IMAGE_REPO_NAME=worker_image_name
export WORKER_IMAGE_TAG=worker_tag
export WORKER_IMAGE_URI=gcr.io/$PROJECT_ID/$WORKER_IMAGE_REPO_NAME:$WORKER_IMAGE_TAG
export PS_IMAGE_REPO_NAME=ps_image_name
export PS_IMAGE_TAG=ps_tag
export PS_IMAGE_URI=gcr.io/$PROJECT_ID/$PS_IMAGE_REPO_NAME:$PS_IMAGE_TAG
export MODEL_DIR=distributed_example_$(date +%Y%m%d_%H%M%S)
export REGION=us-central1
export JOB_NAME=distributed_container_job_$(date +%Y%m%d_%H%M%S)

docker build -f Dockerfile-master -t $MASTER_IMAGE_URI ./
docker build -f Dockerfile-worker -t $WORKER_IMAGE_URI ./
docker build -f Dockerfile-ps -t $PS_IMAGE_URI ./

docker run $MASTER_IMAGE_URI --epochs 1
docker run $WORKER_IMAGE_URI --epochs 1
docker run $PS_IMAGE_URI --epochs 1

docker push $MASTER_IMAGE_URI
docker push $WORKER_IMAGE_URI
docker push $PS_IMAGE_URI

gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --master-machine-type complex_model_m \
  --master-image-uri $MASTER_IMAGE_URI \
  --worker-machine-type complex_model_m \
  --worker-image-uri $WORKER_IMAGE_URI \
  --worker-count 9 \
  --parameter-server-machine-type large_model \
  --parameter-server-image-uri $PS_IMAGE_URI \
  --parameter-server-count 3 \
  -- \
  --model-dir=gs://$BUCKET_NAME/$MODEL_DIR
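Because the build, test, and push steps repeat for each role, they can be generated in a loop. The sketch below only prints the commands so that you can review them first. print_image_commands is a hypothetical helper, my-project is a placeholder, and the image names and Dockerfile names follow the example values used above.

```shell
# print_image_commands is a hypothetical helper that prints (does not run) the
# build, test, and push commands for the master, worker, and parameter server
# images, using the example naming scheme ROLE_image_name:ROLE_tag.
print_image_commands() {
  project="$1"
  for role in master worker ps; do
    uri="gcr.io/$project/${role}_image_name:${role}_tag"
    echo "docker build -f Dockerfile-$role -t $uri ./"
    echo "docker run $uri --epochs 1"
    echo "docker push $uri"
  done
}

# Placeholder project ID; substitute your own.
print_image_commands my-project
```

Printing first makes it easy to spot a wrong project or tag before anything is built or pushed.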

Default credential in custom containers

When you run a training job with custom containers, your application by default runs as the Cloud ML Service Agent identity. You can find the service account ID of the Cloud ML Service Agent for your project on the IAM page in the Cloud Console. This ID has the following format:

Replace PROJECT_NUMBER with the project number for your Google Cloud project.

AI Platform Training automatically uses the Cloud ML Service Agent credentials to set up authentication and authorization if you use TensorFlow tfds, Google Cloud client libraries, or other tools that use the Application Default Credentials strategy.

However, if you want your custom container job to access Google Cloud in other ways, you might need to perform additional configuration. For example, if you use gsutil to copy data from Cloud Storage and use the boto library to load credentials from a configuration file, then add a command to your Dockerfile that ensures gsutil uses the default Cloud ML Service Agent credentials:

RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg
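The echo form above depends on the image's /bin/sh interpreting \n as a newline, which not every shell's echo does. A more portable alternative is printf, sketched below. BOTO_CFG is an illustrative variable that defaults to a local path so the snippet can be tried outside a container. In an image, the target would be /etc/boto.cfg.

```shell
# Writes the two-line boto configuration with printf, which interprets \n
# consistently across POSIX shells. BOTO_CFG defaults to a local path here so
# the snippet can be tried outside a container; in an image it would be
# /etc/boto.cfg.
BOTO_CFG="${BOTO_CFG:-./boto.cfg}"
printf '[GoogleCompute]\nservice_account = default\n' > "$BOTO_CFG"

# Show the result so the section header and the setting are on separate lines.
cat "$BOTO_CFG"
```

In a Dockerfile, the same printf command can follow RUN in place of the echo command.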
