Train an ML model with custom containers

AI Platform Training supports training in custom containers, allowing users to bring their own Docker containers with any pre-installed ML framework or algorithm to run on AI Platform Training. This tutorial provides an introductory walkthrough showing how to train a PyTorch model on AI Platform Training with a custom container.

Overview

This getting-started guide demonstrates the process of training with custom containers on AI Platform Training, using a basic model that classifies handwritten digits based on the MNIST dataset.

This guide covers the following steps:

  • Project and local environment setup
  • Create a custom container
    • Write a Dockerfile
    • Build and test your Docker image locally
  • Push the image to Container Registry
  • Submit a custom container training job
  • Submit a hyperparameter tuning job
  • Using GPUs with a custom container

Before you begin

For this getting-started guide, use any environment where the Google Cloud CLI is installed.

Optional: Review conceptual information about training with custom containers.

Complete the following steps to set up a GCP account, enable the required APIs, and install and activate the Cloud SDK.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the AI Platform Training & Prediction, Compute Engine and Container Registry APIs.

    Enable the APIs

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init
  7. Install Docker.

    If you're using a Linux-based operating system, such as Ubuntu or Debian, add your username to the docker group so that you can run Docker without using sudo:

    sudo usermod -a -G docker ${USER}

    You may need to restart your system after adding yourself to the docker group.

  8. Open Docker. To ensure that Docker is running, run the following Docker command, which returns the current time and date:
    docker run busybox date
  9. Use gcloud as the credential helper for Docker:
    gcloud auth configure-docker
  10. Optional: If you want to run the container locally using a GPU, install nvidia-docker.

Set up your Cloud Storage bucket

This section shows you how to create a new bucket. You can use an existing bucket, but it must be in the same region where you plan to run AI Platform Training jobs. Additionally, if the bucket is not part of the project you are using to run AI Platform Training, you must explicitly grant access to the AI Platform Training service accounts, as sketched below.
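
For example, to grant an AI Platform Training service account access to a bucket in another project, you can add an IAM binding on the bucket. The following is a minimal sketch only: the service account address (service-PROJECT_NUMBER@cloud-ml.google.iam.gserviceaccount.com) and the roles/storage.objectAdmin role shown here are assumptions, so verify the exact service account and roles your jobs need before granting access.

gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:service-PROJECT_NUMBER@cloud-ml.google.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"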

  1. Specify a name for your new bucket. The name must be unique across all buckets in Cloud Storage.

    BUCKET_NAME="YOUR_BUCKET_NAME"

    For example, use your project name with -aiplatform appended:

    PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    BUCKET_NAME=${PROJECT_ID}-aiplatform
  2. Check the bucket name that you specified.

    echo $BUCKET_NAME
  3. Select a region for your bucket and set a REGION environment variable.

    Use the same region where you plan on running AI Platform Training jobs. See the available regions for AI Platform Training services.

    For example, the following code creates REGION and sets it to us-central1:

    REGION=us-central1
  4. Create the new bucket:

    gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION

Download the code for this tutorial

  1. Enter the following command to download the AI Platform Training sample zip file:

    wget https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip
    
  2. Unzip the file to extract the cloudml-samples-master directory.

    unzip master.zip
    
  3. Navigate to the cloudml-samples-master > pytorch > containers > quickstart > mnist directory. The commands in this walkthrough must be run from the mnist directory.

    cd cloudml-samples-master/pytorch/containers/quickstart/mnist
    

Create a custom container

To create a custom container, the first step is to define a Dockerfile to install the dependencies required for the training job. Then, you build and test your Docker image locally to verify it before using it with AI Platform Training.

Write a Dockerfile

The sample Dockerfile provided in this tutorial accomplishes the following steps:

  1. Uses a Python 2.7 base image that has built-in Python dependencies.
  2. Installs additional dependencies, including PyTorch, gcloud CLI, and cloudml-hypertune for hyperparameter tuning.
  3. Copies the code for your training application into the container.
  4. Configures the entry point for AI Platform Training to run your training code when the container starts.

Your Dockerfile could include additional logic, depending on your needs. Learn more about writing Dockerfiles.

# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dockerfile
FROM python:2.7.16-jessie
WORKDIR /root

# Installs pytorch and torchvision.
RUN pip install torch==1.0.0 torchvision==0.2.1

# Installs cloudml-hypertune for hyperparameter tuning.
# It’s not needed if you don’t want to do hyperparameter tuning.
RUN pip install cloudml-hypertune

# Installs google cloud sdk, this is mostly for using gsutil to export model.
RUN wget -nv \
    https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz && \
    mkdir /root/tools && \
    tar xvzf google-cloud-sdk.tar.gz -C /root/tools && \
    rm google-cloud-sdk.tar.gz && \
    /root/tools/google-cloud-sdk/install.sh --usage-reporting=false \
        --path-update=false --bash-completion=false \
        --disable-installation-options && \
    rm -rf /root/.config/* && \
    ln -s /root/.config /config && \
    # Remove the backup directory that gcloud creates
    rm -rf /root/tools/google-cloud-sdk/.install/.backup

# Path configuration
ENV PATH $PATH:/root/tools/google-cloud-sdk/bin
# Make sure gsutil will use the default service account
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

# Copies the trainer code 
RUN mkdir /root/trainer
COPY trainer/mnist.py /root/trainer/mnist.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "trainer/mnist.py"]

Build and test your Docker image locally

  1. Create the image URI by using environment variables, and build the Docker image. The -t flag names and tags the image with your choices for IMAGE_REPO_NAME and IMAGE_TAG; you can use a different name and tag if you prefer.

    export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    export IMAGE_REPO_NAME=mnist_pytorch_custom_container
    export IMAGE_TAG=mnist_pytorch_cpu
    export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
    
    docker build -f Dockerfile -t $IMAGE_URI ./
    
  2. Verify the image by running it locally in a new container. Note that the --epochs flag is passed to the trainer script.

    docker run $IMAGE_URI --epochs 1
    

Push the image to Container Registry

If the local run works, you can push the Docker image to Container Registry in your project.

First, run gcloud auth configure-docker if you have not already done so.

docker push $IMAGE_URI

Submit and monitor the job

  1. Define environment variables for your job request.

    • MODEL_DIR names a new timestamped directory within your Cloud Storage bucket where your saved model file is stored after training is complete.
    • REGION specifies a valid region for AI Platform Training.
    export MODEL_DIR=pytorch_model_$(date +%Y%m%d_%H%M%S)
    export REGION=us-central1
    export JOB_NAME=custom_container_job_$(date +%Y%m%d_%H%M%S)
    
  2. Submit the training job to AI Platform Training using the gcloud CLI. Pass the URI to your Docker image using the --master-image-uri flag:

    gcloud ai-platform jobs submit training $JOB_NAME \
      --region $REGION \
      --master-image-uri $IMAGE_URI \
      -- \
      --model-dir=gs://$BUCKET_NAME/$MODEL_DIR \
      --epochs=10
    
  3. After you submit your job, you can monitor the job status and stream logs:

    gcloud ai-platform jobs describe $JOB_NAME
    gcloud ai-platform jobs stream-logs $JOB_NAME
    

Submit a hyperparameter tuning job

There are a few adjustments to make for a hyperparameter tuning job. Take note of these areas in the sample code:

  • The sample Dockerfile includes the cloudml-hypertune package in order to install it in the custom container.
  • The sample code (mnist.py):
    • Uses cloudml-hypertune to report the results of each trial by calling its helper function, report_hyperparameter_tuning_metric. The sample code reports hyperparameter tuning results after evaluation only when the job is submitted as a hyperparameter tuning job (see the sketch after this list).
    • Adds command-line arguments for each hyperparameter, and handles the argument parsing with argparse.
  • The job request includes HyperparameterSpec in the TrainingInput object. In this case, we tune --lr and --momentum in order to minimize the model loss.
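
To make these pieces concrete, here is a minimal sketch of how a trainer can parse the tuned hyperparameters with argparse and report its metric through cloudml-hypertune. It is a simplified illustration rather than the actual contents of mnist.py, and the metric tag must match the hyperparameterMetricTag value in the config.yaml created below:

import argparse

import hypertune  # installed by the cloudml-hypertune package


def main():
    parser = argparse.ArgumentParser()
    # Command-line arguments for each tuned hyperparameter.
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--momentum", type=float, default=0.5)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    hpt = hypertune.HyperTune()
    for epoch in range(args.epochs):
        # ... train with args.lr and args.momentum, then evaluate ...
        eval_loss = 0.0  # placeholder for the real evaluation loss

        # Report the trial metric; the tag must match hyperparameterMetricTag
        # ("my_loss") in config.yaml.
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag="my_loss",
            metric_value=eval_loss,
            global_step=epoch)


if __name__ == "__main__":
    main()
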
  1. Create a config.yaml file to define your hyperparameter spec. Redefine MODEL_DIR and JOB_NAME. Define REGION if you have not already done so:

    export MODEL_DIR=pytorch_hptuning_model_$(date +%Y%m%d_%H%M%S)
    export REGION=us-central1
    export JOB_NAME=custom_container_job_hptuning_$(date +%Y%m%d_%H%M%S)
    
    # Creates a YAML file with job request.
    cat > config.yaml <<EOF
    trainingInput:
      hyperparameters:
        goal: MINIMIZE
        hyperparameterMetricTag: "my_loss"
        maxTrials: 20
        maxParallelTrials: 5
        enableTrialEarlyStopping: True
        params:
        - parameterName: lr
          type: DOUBLE
          minValue: 0.0001
          maxValue: 0.1
        - parameterName: momentum
          type: DOUBLE
          minValue: 0.2
          maxValue: 0.8
    EOF
    
  2. Submit the hyperparameter tuning job to AI Platform Training:

    gcloud ai-platform jobs submit training $JOB_NAME \
      --scale-tier BASIC \
      --region $REGION \
      --master-image-uri $IMAGE_URI \
      --config config.yaml \
      -- \
      --epochs=5 \
      --model-dir="gs://$BUCKET_NAME/$MODEL_DIR"
    

Using GPUs with custom containers

To submit a custom container job using GPUs, you must build a different Docker image than the one you used previously. We've provided an example Dockerfile for use with GPUs that meets the following requirements:

  • Pre-install the CUDA toolkit and cuDNN in your container. Using the nvidia/cuda image as your base image is the recommended way to handle this, because it has the CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.
  • Install additional dependencies, such as wget, curl, pip, and any others needed by your training application.
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dockerfile-gpu
FROM nvidia/cuda:9.0-cudnn7-runtime

# Installs necessary dependencies.
RUN apt-get update && apt-get install -y --no-install-recommends \
         wget \
         curl \
         python-dev && \
     rm -rf /var/lib/apt/lists/*

# Installs pip.
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python get-pip.py && \
    pip install setuptools && \
    rm get-pip.py

WORKDIR /root

# Installs pytorch and torchvision.
RUN pip install torch==1.0.0 torchvision==0.2.1

# Installs cloudml-hypertune for hyperparameter tuning.
# It’s not needed if you don’t want to do hyperparameter tuning.
RUN pip install cloudml-hypertune

# Installs google cloud sdk, this is mostly for using gsutil to export model.
RUN wget -nv \
    https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz && \
    mkdir /root/tools && \
    tar xvzf google-cloud-sdk.tar.gz -C /root/tools && \
    rm google-cloud-sdk.tar.gz && \
    /root/tools/google-cloud-sdk/install.sh --usage-reporting=false \
        --path-update=false --bash-completion=false \
        --disable-installation-options && \
    rm -rf /root/.config/* && \
    ln -s /root/.config /config && \
    # Remove the backup directory that gcloud creates
    rm -rf /root/tools/google-cloud-sdk/.install/.backup

# Path configuration
ENV PATH $PATH:/root/tools/google-cloud-sdk/bin
# Make sure gsutil will use the default service account
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

# Copies the trainer code 
RUN mkdir /root/trainer
COPY trainer/mnist.py /root/trainer/mnist.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "trainer/mnist.py"]

Build and test the GPU Docker image locally

  1. Build a new image for your GPU training job using the GPU Dockerfile. To avoid overwriting the CPU image, redefine IMAGE_REPO_NAME and IMAGE_TAG with different values from the ones you used earlier in the tutorial.

    export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    export IMAGE_REPO_NAME=mnist_pytorch_gpu_container
    export IMAGE_TAG=mnist_pytorch_gpu
    export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
    
    docker build -f Dockerfile-gpu -t $IMAGE_URI ./
    
  2. If you have GPUs available on your machine, and you've installed nvidia-docker, you can verify the image by running it locally:

    docker run --runtime=nvidia $IMAGE_URI --epochs 1
    
  3. Push the Docker image to Container Registry. First, run gcloud auth configure-docker if you have not already done so.

    docker push $IMAGE_URI
    

Submit the job

This example uses the basic GPU scale tier to submit the training job request. See other machine options for training with GPUs.
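
For example, instead of a predefined scale tier, you can request a custom machine configuration in a config.yaml file and pass it with --config. The following is a minimal sketch; the machine type and accelerator shown are illustrative assumptions, not requirements of this tutorial, and must be available in your chosen region:

# config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4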

  1. Redefine MODEL_DIR and JOB_NAME. Define REGION if you have not already done so:

    export MODEL_DIR=pytorch_model_gpu_$(date +%Y%m%d_%H%M%S)
    export REGION=us-central1
    export JOB_NAME=custom_container_job_gpu_$(date +%Y%m%d_%H%M%S)
    
  2. Submit the training job to AI Platform Training using the gcloud CLI. Pass the URI to your Docker image using the --master-image-uri flag.

    gcloud ai-platform jobs submit training $JOB_NAME \
      --scale-tier BASIC_GPU \
      --region $REGION \
      --master-image-uri $IMAGE_URI \
      -- \
      --epochs=5 \
      --model-dir=gs://$BUCKET_NAME/$MODEL_DIR
    

What's next