This guide explains how to build your own custom container to run jobs on AI Platform Training.
The steps involved in using containers
The following steps show the basic process for training with custom containers:
- Set up a Google Cloud project and your local environment.
- Create a custom container:
  - Write a Dockerfile that sets up your container to work with AI Platform Training and includes the dependencies needed for your training application.
  - Build and test your Docker container locally.
  - Push the container to Container Registry.
- Submit a training job that runs on your custom container.
Using hyperparameter tuning or GPUs requires some adjustments, but the basic process is the same.
Before you begin
Use either Cloud Shell or any environment where the gcloud CLI is installed.
Complete the following steps to set up a GCP account, enable the required APIs, and install and activate the Cloud SDK.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the AI Platform Training & Prediction, Compute Engine, and Container Registry APIs.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
gcloud init
- Install Docker.
  If you're using a Linux-based operating system, such as Ubuntu or Debian, add your username to the docker group so that you can run Docker without using sudo:
  sudo usermod -a -G docker ${USER}
  You may need to restart your system after adding yourself to the docker group.
- Open Docker. To ensure that Docker is running, run the following Docker command, which returns the current time and date:
  docker run busybox date
- Use gcloud as the credential helper for Docker:
  gcloud auth configure-docker
- Optional: If you want to run the container locally using GPUs, install nvidia-docker.
Create a custom container
Creating a custom container involves writing a Dockerfile to set up the Docker image you'll use for your training job. After that, you build and test your image locally.
Dockerfile basics for AI Platform Training
When you create a custom container, you use a Dockerfile to specify all the commands needed to build your image.
This section walks through a generic example of a Dockerfile. You can find specific examples in each custom containers tutorial, and in the related samples.
For use with AI Platform Training, your Dockerfile needs to include commands that cover the following tasks:
- Choose a base image
- Install additional dependencies
- Copy your training code to the image
- Configure the entry point for AI Platform Training to invoke your training code
Your Dockerfile could include additional logic, depending on your needs. Learn more about writing Dockerfiles, and for more information on each specific command, see the Dockerfile reference.
Dockerfile command | Description | Example(s) |
---|---|---|
FROM image:tag | Specifies the base image and its tag. | Example base images with tags |
WORKDIR /path/to/directory | Specifies the directory on the image where subsequent instructions are run. | /root |
RUN pip install pkg1 pkg2 pkg3 | Installs additional packages using pip. Note: if your base image does not have pip, install it before installing other packages. | Example packages |
COPY src/foo.py dest/foo.py | Copies the code for your training application into the image. Depending on how your training application is structured, this likely includes multiple files. | Example names of files in your training application |
ENTRYPOINT ["program", "arg1", "arg2"] | Sets up the entry point to invoke your training code to run. | ["python", "task.py"] |
The logic in your Dockerfile may vary according to your needs, but in general it resembles this:
# Specifies base image and tag
FROM image:tag
WORKDIR /root

# Installs additional packages
RUN pip install pkg1 pkg2 pkg3

# Downloads training data
RUN curl https://example-url/path-to-data/data-filename --output /root/data-filename

# Copies the trainer code to the docker image.
COPY your-path-to/model.py /root/model.py
COPY your-path-to/task.py /root/task.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]
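To make the later docker run and gcloud job-submission examples concrete, here is a minimal sketch of what a task.py entry point could look like. It only assumes the --epochs and --model-dir flags used elsewhere in this guide; the training loop itself is a placeholder for your own code.

# task.py (sketch): parses the flags used in the examples in this guide.
import argparse

def train(epochs, model_dir):
    # Placeholder training loop; replace with your real training code.
    for epoch in range(epochs):
        print(f"epoch {epoch + 1} of {epochs}")
    print(f"would write model artifacts to {model_dir}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--model-dir", default="/tmp/model")
    args = parser.parse_args()
    train(args.epochs, args.model_dir)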
Build and test your Docker container locally
Create the correct image URI by using environment variables, and then build the Docker image. The -t flag names and tags the image with your choices for IMAGE_REPO_NAME and IMAGE_TAG. You can choose a different name and tag for your image.
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=example_custom_container_image
export IMAGE_TAG=example_image_tag
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t $IMAGE_URI ./
Verify the image by running it locally. Note that the --epochs flag is passed to the trainer script.
docker run $IMAGE_URI --epochs 1
Push the container to Container Registry
If the local run works, you can push the container to Container Registry in your project.
Push your container to Container Registry. First, run gcloud auth configure-docker if you have not already done so.
docker push $IMAGE_URI
Manage Container Registry permissions
If you are using a Container Registry image from within the same project you're using to run training on AI Platform Training, then there is no further need to configure permissions for this tutorial, and you can move on to the next step.
Access control for Container Registry is based on a Cloud Storage bucket behind the scenes, so configuring Container Registry permissions is very similar to configuring Cloud Storage permissions.
If you want to pull an image from Container Registry in a different project, you need to allow your AI Platform Training service account to access the image from the other project.
- Find the underlying Cloud Storage bucket for your Container Registry permissions.
- Grant a role (such as Storage Object Viewer) that includes the storage.objects.get and storage.objects.list permissions to your AI Platform Training service account.
If you want to push the Docker image to a project that is different from the one you use to submit AI Platform Training jobs, you should grant image pulling access to the AI Platform Training service account in the project that has your Container Registry repositories. The service account is in the format service-$CMLE_PROJ_NUM@cloud-ml.google.com.iam.gserviceaccount.com and can be found in the IAM console.
The following command adds your AI Platform Training service account to your separate Container Registry project:
export GCR_PROJ_ID=[YOUR-PROJECT-ID-FOR-GCR]
export CMLE_PROJ_NUM=[YOUR-PROJECT-NUMBER-FOR-CMLE-JOB-SUBMISSION]
export SVC_ACCOUNT=service-$CMLE_PROJ_NUM@cloud-ml.google.com.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $GCR_PROJ_ID \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
See more about how to configure access control for Container Registry.
Submit your training job
Submit the training job to AI Platform Training using the gcloud CLI. Pass
the URI to your Docker image using the --master-image-uri
flag:
export BUCKET_NAME=custom_containers
export MODEL_DIR=example_model_$(date +%Y%m%d_%H%M%S)
export REGION=us-central1
export JOB_NAME=custom_container_job_$(date +%Y%m%d_%H%M%S)
gcloud ai-platform jobs submit training $JOB_NAME \
--region $REGION \
--master-image-uri $IMAGE_URI \
-- \
--model-dir=gs://$BUCKET_NAME/$MODEL_DIR \
--epochs=10
Hyperparameter tuning with custom containers
In order to do hyperparameter tuning with custom containers, you need to make the following adjustments:
- In your Dockerfile: install cloudml-hypertune.
- In your training code (a sketch follows below):
  - Use cloudml-hypertune to report the results of each trial by calling its helper function, report_hyperparameter_tuning_metric.
  - Add command-line arguments for each hyperparameter, and handle the argument parsing with an argument parser such as argparse.
- In your job request: add a HyperparameterSpec to the TrainingInput object.
See an example of training with custom containers using hyperparameter tuning.
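The following sketch shows roughly how the training-code pieces fit together. It assumes the cloudml-hypertune package is installed in the image; the learning_rate hyperparameter, the accuracy metric tag, and the evaluation value are hypothetical placeholders.

import argparse
import hypertune  # installed via the cloudml-hypertune package

def main():
    parser = argparse.ArgumentParser()
    # One command-line argument per hyperparameter in your HyperparameterSpec.
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    # ... train with args.learning_rate; compute an evaluation metric ...
    validation_accuracy = 0.0  # placeholder for your real metric

    # Report the trial's result so the tuning service can compare trials.
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="accuracy",
        metric_value=validation_accuracy,
        global_step=args.epochs)

if __name__ == "__main__":
    main()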
Using GPUs with custom containers
For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.
Pre-install the CUDA toolkit and cuDNN in your container. Using the nvidia/cuda image as your base image is the recommended way to handle this, because it has the CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.
If your training configuration uses NVIDIA A100 GPUs, then your container must use CUDA 11 or later.
Install additional dependencies, such as wget, curl, pip, and any others needed by your training application.
See an example Dockerfile for training with GPUs.
Using TPUs with custom containers
If you perform distributed
training with TensorFlow, you can
use TPUs on your worker VMs. To do this, you must configure your training job
to use TPUs and specify the tpuTfVersion
field when you submit
your training job.
Distributed training with custom containers
When you run a distributed training job with custom containers, you can specify just one image to be used as the master, worker, and parameter server. You also have the option to build and specify different images for the master, worker, and parameter server. In this case, the dependencies would likely be the same in all three images, and you can run different code logic within each image.
In your code, you can use the environment variables TF_CONFIG
and
CLUSTER_SPEC
. These environment variables describe the overall structure of
the cluster, and AI Platform Training populates them for you in each node of
your training cluster.
Learn more about CLUSTER_SPEC.
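As a minimal illustration (standard library only), your training code can read these variables to find out which role the current node plays:

import json
import os

# AI Platform Training sets TF_CONFIG and CLUSTER_SPEC on every node.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_spec = json.loads(os.environ.get("CLUSTER_SPEC", "{}"))

# The task entry identifies this node as master, worker, or parameter server,
# which lets the same script (or different images) branch accordingly.
task = tf_config.get("task", {})
print("task type:", task.get("type"), "task index:", task.get("index"))
print("cluster layout:", tf_config.get("cluster"))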
You can specify your images within the TrainingInput object when you submit a job, or through their corresponding flags in gcloud ai-platform jobs submit training.
For this example, let's assume you have already defined three separate Dockerfiles, one for each type of machine (master, worker, and parameter server). After that, you name, build, test, and push your images to Container Registry. Finally, you submit a training job that specifies your different images along with your machine configuration for your cluster.
First, run gcloud auth configure-docker
if you have not already done so.
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export BUCKET_NAME=custom_containers
export MASTER_IMAGE_REPO_NAME=master_image_name
export MASTER_IMAGE_TAG=master_tag
export MASTER_IMAGE_URI=gcr.io/$PROJECT_ID/$MASTER_IMAGE_REPO_NAME:$MASTER_IMAGE_TAG
export WORKER_IMAGE_REPO_NAME=worker_image_name
export WORKER_IMAGE_TAG=worker_tag
export WORKER_IMAGE_URI=gcr.io/$PROJECT_ID/$WORKER_IMAGE_REPO_NAME:$WORKER_IMAGE_TAG
export PS_IMAGE_REPO_NAME=ps_image_name
export PS_IMAGE_TAG=ps_tag
export PS_IMAGE_URI=gcr.io/$PROJECT_ID/$PS_IMAGE_REPO_NAME:$PS_IMAGE_TAG
export MODEL_DIR=distributed_example_$(date +%Y%m%d_%H%M%S)
export REGION=us-central1
export JOB_NAME=distributed_container_job_$(date +%Y%m%d_%H%M%S)
docker build -f Dockerfile-master -t $MASTER_IMAGE_URI ./
docker build -f Dockerfile-worker -t $WORKER_IMAGE_URI ./
docker build -f Dockerfile-ps -t $PS_IMAGE_URI ./
docker run $MASTER_IMAGE_URI --epochs 1
docker run $WORKER_IMAGE_URI --epochs 1
docker run $PS_IMAGE_URI --epochs 1
docker push $MASTER_IMAGE_URI
docker push $WORKER_IMAGE_URI
docker push $PS_IMAGE_URI
gcloud ai-platform jobs submit training $JOB_NAME \
--region $REGION \
--master-machine-type complex_model_m \
--master-image-uri $MASTER_IMAGE_URI \
--worker-machine-type complex_model_m \
--worker-image-uri $WORKER_IMAGE_URI \
--worker-count 9 \
--parameter-server-machine-type large_model \
--parameter-server-image-uri $PS_IMAGE_URI \
--parameter-server-count 3 \
-- \
--model-dir=gs://$BUCKET_NAME/$MODEL_DIR \
--epochs=10
Default credential in custom containers
When you run a training job with custom containers, your application by default runs as the Cloud ML Service Agent identity. You can find the service account ID of the Cloud ML Service Agent for your project on the IAM page in the Google Cloud console. This ID has the following format:
service-PROJECT_NUMBER@cloud-ml.google.com.iam.gserviceaccount.com
Replace PROJECT_NUMBER with the project number for your Google Cloud project.
AI Platform Training automatically uses the Cloud ML Service Agent credentials to set up authentication and authorization if you use TensorFlow Datasets (tfds), Google Cloud client libraries, or other tools that use the Application Default Credentials strategy.
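For example, here is a sketch of training code that relies on Application Default Credentials to write to Cloud Storage; the bucket and object names are placeholders, and it assumes the google-cloud-storage client library is installed in your image:

from google.cloud import storage

# No explicit credentials are supplied: inside an AI Platform Training job,
# the client library resolves the Cloud ML Service Agent identity through
# Application Default Credentials.
client = storage.Client()
bucket = client.bucket("your-bucket-name")  # placeholder bucket name
blob = bucket.blob("models/example/metadata.txt")  # placeholder object name
blob.upload_from_string("example artifact")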
What's next
- Learn more about the concepts involved in using containers.
- Train a PyTorch model using custom containers.
- Learn about distributed training with custom containers.