This page shows you how to run a training job in a Deep Learning Containers instance and run that container image on a Google Kubernetes Engine (GKE) cluster.
Before you begin
Before you begin, make sure you have completed the following steps.
Complete the setup steps in the Before you begin section of Getting started with a local deep learning container.
Make sure that billing is enabled for your Google Cloud project.
Enable the Google Kubernetes Engine, Compute Engine, and Container Registry APIs.
Open your command line tool
You can follow this guide using Google Cloud Shell or command-line tools installed locally. Google Cloud Shell comes preinstalled with the command-line tools used in this tutorial, including kubectl. If you use Cloud Shell, you don't need to install these command-line tools on your workstation.
Option A: Use Google Cloud Shell
To use Google Cloud Shell:
Go to the Google Cloud console.
Click the Activate Cloud Shell button at the top of the console window.
A Cloud Shell session opens inside a new frame at the bottom of the console and displays a command-line prompt.
Option B: Use command-line tools locally
If you prefer to follow this guide on your workstation, you need to install the following tool:
Using the gcloud CLI, install the Kubernetes command-line tool.
kubectl is used to communicate with Kubernetes, which is the cluster orchestration system of GKE clusters:
gcloud components install kubectl
Create a GKE cluster
Run the following command to create a two-node cluster in GKE:
gcloud container clusters create pytorch-training-cluster \
    --num-nodes=2 \
    --zone=us-west1-b \
    --accelerator="type=nvidia-tesla-p100,count=1" \
    --machine-type="n1-highmem-2" \
    --scopes="gke-default,storage-rw"
For more information on these settings, see the documentation on creating clusters for running containers.
It may take several minutes for the cluster to be created.
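To confirm that the cluster is ready before you continue, you can query its status. The following is a minimal sketch, not part of the original steps: cluster_status is a hypothetical helper name, and the cluster name and zone in the usage comment are the ones from the create command in this guide.

```shell
# Sketch: print the status of a GKE cluster (for example RUNNING or PROVISIONING).
# cluster_status is a hypothetical helper name used only for illustration.
cluster_status() {
  # $1 = cluster name, $2 = zone
  gcloud container clusters describe "$1" --zone="$2" --format="value(status)"
}

# Usage (requires an authenticated gcloud CLI):
#   cluster_status pytorch-training-cluster us-west1-b
```

A status of RUNNING indicates that the cluster is ready to accept workloads.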
Alternatively, instead of creating a cluster, you can use an existing
cluster in your Google Cloud project. If you do this, you may need
to run the following command to make sure the
kubectl command-line tool
has the proper credentials to access your cluster:
gcloud container clusters get-credentials YOUR_EXISTING_CLUSTER
Create the Dockerfile
There are many ways to build a container image.
These steps show you how to build one to run a Python script.
To view a list of container images available:
gcloud container images list \
    --repository="gcr.io/deeplearning-platform-release"
To help you select the container that you want, see Choosing a container.
The following example will show you how to place a Python script named
trainer.py into a specific PyTorch deep learning container type.
To create the Dockerfile, write the following commands to a file named Dockerfile. This step assumes that you have code to train a machine learning model in a directory named model-training-code and that the main Python module in that directory is named trainer.py. In this scenario, the container is removed once the job completes, so configure your training script to write its output to Cloud Storage (see an example of a script that outputs to Cloud Storage) or to persistent storage.
FROM gcr.io/deeplearning-platform-release/pytorch-gpu
COPY model-training-code /train
CMD ["python", "/train/trainer.py"]
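Because the Dockerfile copies model-training-code into the image and runs trainer.py from it, a build fails or produces an unusable image if either is missing. The following sketch checks for the files this guide assumes before you build. check_build_context is a hypothetical helper name, not part of any Google Cloud tool.

```shell
# Sketch: confirm the files the Dockerfile expects are present before building.
# The file and directory names are the ones assumed in this guide.
check_build_context() {
  # $1 = directory containing the Dockerfile (defaults to the current directory)
  context_dir="${1:-.}"
  for required in "Dockerfile" "model-training-code/trainer.py"; do
    if [ ! -f "$context_dir/$required" ]; then
      echo "missing: $context_dir/$required" >&2
      return 1
    fi
  done
  echo "build context looks complete"
}

# Usage, from the directory that contains the Dockerfile:
#   check_build_context .
```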
Build and upload the container image
To build and upload the container image to Container Registry, use the following commands:
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=pytorch_custom_container
export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG

docker build -f Dockerfile -t $IMAGE_URI ./
docker push $IMAGE_URI
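The image URI is assembled from the project ID, the repository name, and the tag. If no default project is set in the gcloud CLI, PROJECT_ID is empty and the resulting URI is invalid. The following sketch shows one way to assemble the URI and fail early in that case. make_image_uri is a hypothetical helper name introduced for illustration.

```shell
# Sketch: assemble the Container Registry image URI and fail early if any part is empty.
# make_image_uri is a hypothetical helper name used only for illustration.
make_image_uri() {
  # $1 = project ID, $2 = repository name, $3 = tag
  if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
    echo "project ID, repository name, and tag must all be non-empty" >&2
    return 1
  fi
  echo "gcr.io/$1/$2:$3"
}

# Usage with the variables exported in this guide:
#   IMAGE_URI=$(make_image_uri "$PROJECT_ID" "$IMAGE_REPO_NAME" "$IMAGE_TAG") || exit 1
```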
Deploy your application
Create a file named pod.yaml with the following contents, replacing IMAGE_URI with your image's URI.
apiVersion: v1
kind: Pod
metadata:
  name: gke-training-pod
spec:
  containers:
  - name: my-custom-container
    image: IMAGE_URI
    resources:
      limits:
        nvidia.com/gpu: 1
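Instead of editing the placeholder by hand, you can generate the manifest from the IMAGE_URI variable that you exported earlier. The following is a sketch under that assumption. render_pod_manifest is a hypothetical helper name, and the manifest fields are the same as in the pod specification in this guide.

```shell
# Sketch: print the pod manifest with the image URI filled in.
# render_pod_manifest is a hypothetical helper name used only for illustration.
render_pod_manifest() {
  # $1 = full image URI, for example the value of $IMAGE_URI
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gke-training-pod
spec:
  containers:
  - name: my-custom-container
    image: $1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage:
#   render_pod_manifest "$IMAGE_URI" > pod.yaml
```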
Use the kubectl command-line tool to run the following command and deploy your application:
kubectl apply -f ./pod.yaml
To track the pod's status, run the following command:
kubectl describe pod gke-training-pod
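If you want to wait for the job from a script rather than rerun the describe command, you can poll the pod phase. The following sketch is not part of the original steps. wait_for_pod is a hypothetical helper name, and the attempt count and polling interval are arbitrary defaults. It assumes that the pod can reach a terminal phase: a pod reports Succeeded only if its restartPolicy is Never or OnFailure, whereas the Kubernetes default of Always restarts the container when it exits.

```shell
# Sketch: poll the pod phase until it is terminal or the attempts run out.
# Assumption: the pod's restartPolicy lets it stop (Never or OnFailure).
wait_for_pod() {
  # $1 = pod name, $2 = max attempts (default 60), $3 = seconds between polls (default 10)
  pod_name="$1"
  attempts="${2:-60}"
  interval="${3:-10}"
  while [ "$attempts" -gt 0 ]; do
    phase=$(kubectl get pod "$pod_name" -o jsonpath='{.status.phase}')
    echo "phase: $phase"
    case "$phase" in
      Succeeded) return 0 ;;
      Failed) return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  echo "gave up waiting for $pod_name" >&2
  return 2
}

# Usage:
#   wait_for_pod gke-training-pod
```

The function returns 0 when the pod succeeds, 1 when it fails, and 2 when the attempts run out.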