Train an ML model with custom containers
AI Platform Training supports training in custom containers, allowing users to bring their own Docker containers with any pre-installed ML framework or algorithm to run on AI Platform Training. This tutorial provides an introductory walkthrough showing how to train a PyTorch model on AI Platform Training with a custom container.
Overview
This getting-started guide demonstrates the process of training with custom containers on AI Platform Training, using a basic model that classifies handwritten digits based on the MNIST dataset.
This guide covers the following steps:
- Project and local environment setup
- Create a custom container
- Write a Dockerfile
- Build and test your Docker image locally
- Push the image to Container Registry
- Submit a custom container training job
- Submit a hyperparameter tuning job
- Using GPUs with a custom container
Before you begin
For this getting-started guide, use any environment where the Google Cloud CLI is installed.
Optional: Review conceptual information about training with custom containers.
Complete the following steps to set up a GCP account, enable the required APIs, and install and activate the Cloud SDK.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the AI Platform Training & Prediction, Compute Engine, and Container Registry APIs.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- Install Docker.
  If you're using a Linux-based operating system, such as Ubuntu or Debian, add your username to the docker group so that you can run Docker without using sudo:
  sudo usermod -a -G docker ${USER}
  You may need to restart your system after adding yourself to the docker group.
- Open Docker. To ensure that Docker is running, run the following Docker command, which returns the current time and date:
  docker run busybox date
- Use gcloud as the credential helper for Docker:
  gcloud auth configure-docker
- Optional: If you want to run the container using a GPU locally, install nvidia-docker.
Set up your Cloud Storage bucket
This section shows you how to create a new bucket. You can use an existing bucket, but it must be in the same region where you plan on running AI Platform jobs. Additionally, if it is not part of the project you are using to run AI Platform Training, you must explicitly grant access to the AI Platform Training service accounts.
- Specify a name for your new bucket. The name must be unique across all buckets in Cloud Storage.
  BUCKET_NAME="YOUR_BUCKET_NAME"
  For example, use your project name with -aiplatform appended:
  PROJECT_ID=$(gcloud config list project --format "value(core.project)")
  BUCKET_NAME=${PROJECT_ID}-aiplatform
- Check the bucket name that you created.
  echo $BUCKET_NAME
- Select a region for your bucket and set a REGION environment variable.
  Use the same region where you plan on running AI Platform Training jobs. See the available regions for AI Platform Training services.
  For example, the following code creates REGION and sets it to us-central1:
  REGION=us-central1
- Create the new bucket:
  gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
Download the code for this tutorial
Enter the following command to download the AI Platform Training sample zip file:
wget https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip
Unzip the file to extract the cloudml-samples-master directory.
unzip master.zip
Navigate to the cloudml-samples-master > pytorch > containers > quickstart > mnist directory. The commands in this walkthrough must be run from the mnist directory.
cd cloudml-samples-master/pytorch/containers/quickstart/mnist
Create a custom container
To create a custom container, the first step is to define a Dockerfile to install the dependencies required for the training job. Then, you build and test your Docker image locally to verify it before using it with AI Platform Training.
Write a Dockerfile
The sample Dockerfile provided in this tutorial accomplishes the following steps:
- Uses a Python 2.7 base image that has built-in Python dependencies.
- Installs additional dependencies, including PyTorch, the gcloud CLI, and cloudml-hypertune for hyperparameter tuning.
- Copies the code for your training application into the container.
- Configures the entry point for AI Platform Training to run your training code when the container starts.
Your Dockerfile could include additional logic, depending on your needs. Learn more about writing Dockerfiles.
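As a point of reference, here is a minimal sketch of what such a Dockerfile might look like. The base image tag, package versions, and install paths below are illustrative assumptions; use the actual Dockerfile shipped in the mnist sample directory for this tutorial.
# Illustrative sketch only -- not the Dockerfile shipped with the sample.
FROM python:2.7
WORKDIR /root

# Install PyTorch and the helper library used to report hyperparameter tuning metrics.
RUN pip install torch torchvision cloudml-hypertune

# Install the gcloud CLI so the training code can copy artifacts to Cloud Storage.
RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/root/google-cloud-sdk/bin

# Copy the training application into the container.
COPY mnist.py /root/mnist.py

# AI Platform Training runs this entry point when the container starts;
# job arguments such as --epochs and --model-dir are appended after the image URI.
ENTRYPOINT ["python", "mnist.py"]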
Build and test your Docker image locally
Create the correct image URI by using environment variables, and build the Docker image. The -t flag names and tags the image with your choices for IMAGE_REPO_NAME and IMAGE_TAG. You can choose a different name and tag for your image.
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=mnist_pytorch_custom_container
export IMAGE_TAG=mnist_pytorch_cpu
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t $IMAGE_URI ./
Verify the image by running it locally in a new container. Note that the --epochs flag is passed to the trainer script.
docker run $IMAGE_URI --epochs 1
Push the image to Container Registry
If the local run works, you can push the Docker image to Container Registry in your project.
First, run gcloud auth configure-docker if you have not already done so.
docker push $IMAGE_URI
Submit and monitor the job
Define environment variables for your job request.
- MODEL_DIR names a new timestamped directory within your Cloud Storage bucket where your saved model file is stored after training is complete.
- REGION specifies a valid region for AI Platform Training jobs.
export MODEL_DIR=pytorch_model_$(date +%Y%m%d_%H%M%S)
export REGION=us-central1
export JOB_NAME=custom_container_job_$(date +%Y%m%d_%H%M%S)
Submit the training job to AI Platform Training using the gcloud CLI. Pass the URI to your Docker image using the --master-image-uri flag:
gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --master-image-uri $IMAGE_URI \
  -- \
  --model-dir=gs://$BUCKET_NAME/$MODEL_DIR \
  --epochs=10
After you submit your job, you can monitor the job status and stream logs:
gcloud ai-platform jobs describe $JOB_NAME
gcloud ai-platform jobs stream-logs $JOB_NAME
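After the job finishes, you can optionally list the model directory in your bucket to confirm that the saved model file was written there; the exact file name depends on the sample code.
gcloud storage ls gs://$BUCKET_NAME/$MODEL_DIR/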
Submit a hyperparameter tuning job
There are a few adjustments to make for a hyperparameter tuning job. Take note of these areas in the sample code:
- The sample Dockerfile includes the cloudml-hypertune package in order to install it in the custom container.
- The sample code (mnist.py):
  - Uses cloudml-hypertune to report the results of each trial by calling its helper function, report_hyperparameter_tuning_metric (see the sketch after this list). The sample code reports hyperparameter tuning results after evaluation, unless the job is not submitted as a hyperparameter tuning job.
  - Adds command-line arguments for each hyperparameter, and handles the argument parsing with argparse.
- The job request includes HyperparameterSpec in the TrainingInput object. In this case, we tune --lr and --momentum in order to minimize the model loss.
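As a rough sketch, these two pieces of mnist.py might look like the following. The argument names match the tuned hyperparameters above, but the exact function layout and training loop in the sample may differ.
import argparse
import hypertune  # provided by the cloudml-hypertune package


def get_args():
    # Each tuned hyperparameter arrives as a command-line flag set by the service.
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--momentum', type=float, default=0.5)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--model-dir', default=None)
    return parser.parse_args()


def report_loss(loss, epoch):
    # Report the metric named by hyperparameterMetricTag ("my_loss" in config.yaml)
    # so that the service can compare trials and choose the next parameter values.
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='my_loss',
        metric_value=loss,
        global_step=epoch)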
Create a config.yaml file to define your hyperparameter spec. Redefine MODEL_DIR and JOB_NAME. Define REGION if you have not already done so:
export MODEL_DIR=pytorch_hptuning_model_$(date +%Y%m%d_%H%M%S)
export REGION=us-central1
export JOB_NAME=custom_container_job_hptuning_$(date +%Y%m%d_%H%M%S)
# Creates a YAML file with job request.
cat > config.yaml <<EOF
trainingInput:
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: "my_loss"
    maxTrials: 20
    maxParallelTrials: 5
    enableTrialEarlyStopping: True
    params:
    - parameterName: lr
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.1
    - parameterName: momentum
      type: DOUBLE
      minValue: 0.2
      maxValue: 0.8
EOF
Submit the hyperparameter tuning job to AI Platform Training:
gcloud ai-platform jobs submit training $JOB_NAME \
  --scale-tier BASIC \
  --region $REGION \
  --master-image-uri $IMAGE_URI \
  --config config.yaml \
  -- \
  --epochs=5 \
  --model-dir="gs://$BUCKET_NAME/$MODEL_DIR"
Using GPUs with custom containers
To submit a custom container job using GPUs, you must build a different Docker image than the one you used previously. We've provided an example Dockerfile for use with GPUs that meets the following requirements:
- Pre-install the CUDA toolkit and cuDNN in your container. Using the nvidia/cuda image as your base image is the recommended way to handle this, because it has the CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.
- Install additional dependencies, such as wget, curl, pip, and any others needed by your training application.
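As a rough sketch, a GPU Dockerfile might differ from the CPU one mainly in its base image and in installing a CUDA-enabled PyTorch build. The image tag and packages below are illustrative assumptions; use the Dockerfile-gpu shipped with the sample for this tutorial.
# Illustrative sketch only -- not the Dockerfile-gpu shipped with the sample.
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
WORKDIR /root

# Install basic tools and pip needed by the training application.
RUN apt-get update && \
    apt-get install -y --no-install-recommends wget curl python python-pip && \
    rm -rf /var/lib/apt/lists/*

# Install a CUDA-enabled PyTorch build and the hyperparameter tuning helper.
RUN pip install torch torchvision cloudml-hypertune

# Copy the training application and set the entry point, as in the CPU image.
COPY mnist.py /root/mnist.py
ENTRYPOINT ["python", "mnist.py"]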
Build and test the GPU Docker image locally
Build a new image for your GPU training job using the GPU Dockerfile. To avoid overriding the CPU image, you must re-define IMAGE_REPO_NAME and IMAGE_TAG with different names than you used earlier in the tutorial.
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=mnist_pytorch_gpu_container
export IMAGE_TAG=mnist_pytorch_gpu
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile-gpu -t $IMAGE_URI ./
If you have GPUs available on your machine, and you've installed nvidia-docker, you can verify the image by running it locally:
docker run --runtime=nvidia $IMAGE_URI --epochs 1
Push the Docker image to Container Registry. First, run gcloud auth configure-docker if you have not already done so.
docker push $IMAGE_URI
Submit the job
This example uses the basic GPU scale tier to submit the training job request. See other machine options for training with GPUs.
Redefine MODEL_DIR and JOB_NAME. Define REGION if you have not already done so:
export MODEL_DIR=pytorch_model_gpu_$(date +%Y%m%d_%H%M%S)
export REGION=us-central1
export JOB_NAME=custom_container_job_gpu_$(date +%Y%m%d_%H%M%S)
Submit the training job to AI Platform Training using the gcloud CLI. Pass the URI to your Docker image using the --master-image-uri flag.
gcloud ai-platform jobs submit training $JOB_NAME \
  --scale-tier BASIC_GPU \
  --region $REGION \
  --master-image-uri $IMAGE_URI \
  -- \
  --epochs=5 \
  --model-dir=gs://$BUCKET_NAME/$MODEL_DIR
What's next
- Learn more about the concepts involved in using containers.
- Learn about distributed training with custom containers.