Training in a container using Google Kubernetes Engine

This page shows you how to package a training job in an AI Platform Deep Learning Containers image and run that image on a Google Kubernetes Engine (GKE) cluster.

Before you begin

Make sure you have completed the following steps.

  1. Complete the setup steps in the Before you begin section of Getting started with a local deep learning container.

  2. Make sure that billing is enabled for your Google Cloud project.

    Learn how to enable billing

  3. Enable the Google Kubernetes Engine, Compute Engine, and Container Registry APIs.

    Enable the APIs

Open your command line tool

You can follow this guide using either Google Cloud Shell or command-line tools installed locally. Google Cloud Shell comes preinstalled with the gcloud, docker, and kubectl command-line tools used in this tutorial, so if you use Cloud Shell you don't need to install them on your workstation.

Option A: Use Google Cloud Shell

To use Google Cloud Shell:

  1. Go to the Google Cloud Console.

  2. Click the Activate Cloud Shell button at the top of the console window.

    A Cloud Shell session opens inside a new frame at the bottom of the console and displays a command-line prompt.

Option B: Use command-line tools locally

If you prefer to follow this guide on your workstation, you need to install the following tool:

  1. Use the gcloud command-line tool to install kubectl, the Kubernetes command-line tool. kubectl communicates with Kubernetes, the cluster orchestration system that GKE uses to run your Deep Learning Containers:

    gcloud components install kubectl
    

    You already installed the Google Cloud SDK and Docker when you completed the getting started steps.
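Before continuing, it can help to confirm that all three tools are on your PATH. The following is a minimal sketch; check_tools is a hypothetical helper name introduced here, not part of the Google Cloud SDK.

```shell
# Report any command-line tools that are not on PATH.
# check_tools is a hypothetical helper, not part of the Google Cloud SDK.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  # Returns 0 when every tool was found, 1 otherwise.
  return "$missing"
}

# Usage: check_tools gcloud docker kubectl
```

The function prints one line per missing tool to standard error and returns a non-zero status if any tool is absent, so it can also be used as a guard at the top of a script.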

Create a GKE cluster

Run the following command to create a two-node cluster in GKE named pytorch-training-cluster:

gcloud container clusters create pytorch-training-cluster \
    --num-nodes=2 \
    --zone=us-west1-b \
    --accelerator="type=nvidia-tesla-p100,count=1" \
    --machine-type="n1-highmem-2" \
    --scopes="gke-default,storage-rw"

For more information on these settings, see the documentation on creating clusters for running containers.

It may take several minutes for the cluster to be created.

Alternatively, instead of creating a cluster, you can use an existing cluster in your Google Cloud project. If you do this, you may need to run the following command to make sure the kubectl command-line tool has the proper credentials to access your cluster:

gcloud container clusters get-credentials your-existing-cluster

Next, install the NVIDIA GPU device drivers.
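The driver installation is typically done by applying a DaemonSet manifest to the cluster. The following is a sketch only: it assumes your nodes run Container-Optimized OS and that the manifest URL below, taken from the GKE GPU documentation, is still current. Check that documentation before relying on it. install_gpu_drivers is a hypothetical wrapper name.

```shell
# Manifest for the driver-installer DaemonSet on Container-Optimized OS nodes.
# Assumption: this URL matches the one in the current GKE GPU documentation.
DRIVER_MANIFEST_URL="https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml"

# install_gpu_drivers is a hypothetical wrapper used only in this sketch.
install_gpu_drivers() {
  kubectl apply -f "$DRIVER_MANIFEST_URL"
}

# Usage (requires kubectl credentials for the cluster):
#   install_gpu_drivers
#   kubectl get daemonsets --all-namespaces   # confirm the installer is present
```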

Create the Dockerfile

There are many ways to build a container image. These steps show you how to build one that runs a Python script named trainer.py.

To list the available container images, run:

gcloud container images list \
  --repository="gcr.io/deeplearning-platform-release"

To help you select a container, see Choosing a container.

The following example shows how to place a Python script named trainer.py into a PyTorch GPU deep learning container.

To create the Dockerfile, write the following commands to a file named Dockerfile. This step assumes that your model training code is in a directory named model-training-code and that the main Python module in that directory is named trainer.py. In this scenario, the container is removed once the job completes, so configure your training script to write its output to Cloud Storage (see an example of a script that outputs to Cloud Storage) or to other persistent storage.

FROM gcr.io/deeplearning-platform-release/pytorch-gpu
COPY model-training-code /train
CMD ["python", "/train/trainer.py"]
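The Dockerfile expects model-training-code to sit next to it in the build context. As a sketch of that layout, the following commands create it in a scratch directory. The scratch directory and the placeholder trainer.py are assumptions for illustration; substitute your own training code.

```shell
# Create the assumed layout in a scratch directory (path is illustrative).
WORKDIR=$(mktemp -d)
mkdir -p "$WORKDIR/model-training-code"

# Placeholder script; replace it with your real training code.
printf 'print("training placeholder")\n' > "$WORKDIR/model-training-code/trainer.py"

# Same three instructions as the Dockerfile above.
cat > "$WORKDIR/Dockerfile" <<'EOF'
FROM gcr.io/deeplearning-platform-release/pytorch-gpu
COPY model-training-code /train
CMD ["python", "/train/trainer.py"]
EOF

# Show the resulting build context.
find "$WORKDIR" -type f | sort
```

Running docker build from that directory copies model-training-code into /train inside the image, which is where the CMD instruction looks for trainer.py.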

Build and upload the container image

To build and upload the container image to Container Registry, use the following commands:

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=pytorch_custom_container
export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG

docker build -f Dockerfile -t $IMAGE_URI ./

docker push $IMAGE_URI
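If no default project is configured, PROJECT_ID can come back empty and the image URI will be malformed. One way to guard against that is to assemble the URI in a small function that rejects an empty project ID. This is a sketch; make_image_uri is a hypothetical helper name.

```shell
# make_image_uri is a hypothetical helper; it only assembles the string
# gcr.io/PROJECT_ID/REPO_NAME:TAG and does not contact any registry.
make_image_uri() {
  project_id=$1
  repo_name=$2
  tag=$3
  if [ -z "$project_id" ]; then
    echo "error: project ID is empty; run 'gcloud config set project PROJECT_ID'" >&2
    return 1
  fi
  echo "gcr.io/$project_id/$repo_name:$tag"
}

# Usage:
#   IMAGE_URI=$(make_image_uri "$PROJECT_ID" "$IMAGE_REPO_NAME" "$IMAGE_TAG") || exit 1
```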

Deploy your application

Create a file named pod.yaml with the following contents, replacing image-uri with your image's URI (the value of $IMAGE_URI from the previous step).

apiVersion: v1
kind: Pod
metadata:
  name: gke-training-pod
spec:
  containers:
  - name: my-custom-container
    image: image-uri
    resources:
      limits:
        nvidia.com/gpu: 1
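Instead of editing image-uri by hand, you can generate the file from a shell variable. The following sketch writes the same manifest with the URI substituted; write_pod_manifest is a hypothetical helper name.

```shell
# write_pod_manifest is a hypothetical helper: it writes the manifest shown
# above to the given path, with the given image URI in place of image-uri.
write_pod_manifest() {
  image_uri=$1
  out_file=$2
  cat > "$out_file" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gke-training-pod
spec:
  containers:
  - name: my-custom-container
    image: $image_uri
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage: write_pod_manifest "$IMAGE_URI" pod.yaml
```

Note that a Pod's restartPolicy defaults to Always, so Kubernetes restarts the container after trainer.py exits. If you want the training script to run once, one option is to add restartPolicy: Never (or OnFailure) under spec.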

Run the following kubectl command to deploy your application:

kubectl apply -f ./pod.yaml

To track the pod's status, run the following command:

kubectl describe pod gke-training-pod
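To wait for the job to finish rather than checking by hand, you can poll the pod's phase. The following is a sketch; wait_for_pod is a hypothetical helper name. It assumes the pod can reach a terminal phase, which requires a restartPolicy of Never or OnFailure. With the default policy of Always, the pod stays in the Running phase and this loop does not end.

```shell
# wait_for_pod is a hypothetical helper: poll the pod phase until it finishes.
# Arguments: pod name, optional polling interval in seconds (default 10).
# Returns 0 on Succeeded, 1 on Failed, 2 if kubectl itself fails.
wait_for_pod() {
  pod_name=$1
  interval=${2:-10}
  while true; do
    phase=$(kubectl get pod "$pod_name" -o jsonpath='{.status.phase}') || return 2
    echo "pod $pod_name phase: $phase"
    case "$phase" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage:
#   wait_for_pod gke-training-pod
#   kubectl logs gke-training-pod    # print the training script's output
```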