Objectives
This tutorial shows you how to train the TensorFlow ResNet-50 model on Cloud TPU and GKE.
The tutorial leads you through the steps to train the model, using a fake data set provided for testing purposes:
- Create a Cloud Storage bucket to hold your model output.
- Create a GKE cluster to manage your Cloud TPU resources.
- Download a Kubernetes Job spec describing the resources needed to train ResNet-50 with TensorFlow on a Cloud TPU.
- Run the Job in your GKE cluster to start training the model.
- Check the logs and the model output.
Requirements and limitations
Note the following when defining your configuration:
- You must use GKE version 1.13.4-gke.5 or later. You can specify the version by adding the --cluster-version parameter to the gcloud container clusters create command as described below. See the SDK documentation for more information about the version.
- You must use TensorFlow 1.15.3 or later. You should specify the TensorFlow version used by the Cloud TPU in your Kubernetes Pod spec, as described below.
- You must create your GKE cluster and node pools in a zone where Cloud TPU is available. You must also create the Cloud Storage buckets to hold your training data and models in the same region as your GKE cluster. This tutorial uses the us-central1-b zone.
- Each container can request at most one Cloud TPU, but multiple containers in a Kubernetes Pod can each request a Cloud TPU.
- Cluster Autoscaler supports Cloud TPU on GKE 1.11.4-gke.12 and later.
- Your GKE cluster must be a VPC-native cluster.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the GKE and Cloud TPU APIs on the Cloud Console.
When you use Cloud TPU with GKE, your project uses billable components of Google Cloud. Check the Cloud TPU pricing and the GKE pricing to estimate your costs, and follow the instructions to clean up resources when you've finished with them.
Create a Service Account and a Cloud Storage bucket
You need a Cloud Storage bucket to store the results of training your machine learning model.
Open a Cloud Shell window.
Create a variable for your project's ID.
export PROJECT_ID=project-id
Create a Service Account for the Cloud TPU project.
gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
The command returns a Cloud TPU Service Account with the following format:
service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
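The PROJECT_NUMBER in this address is your project's numeric ID. If you need to look it up, one way is to query it with gcloud (a small sketch; the format projection shown simply prints the project number):
$ gcloud projects describe $PROJECT_ID --format="value(projectNumber)"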
Go to the Cloud Storage page on the Cloud Console.
Create a new bucket, specifying the following options:
- A unique name of your choosing.
- Location type: region
- Location: us-central1
- Default storage class: Standard
- Access control: fine-grained
Before using the storage bucket, you need to authorize the Cloud TPU Service Account to access the bucket. Use the Access control link above to set fine-grained ACLs for your Cloud TPU Service Account.
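If you prefer to work from the command line instead of the Cloud Console, the following sketch shows one way to create the bucket and grant the Cloud TPU Service Account write access with gsutil. The bucket name is a placeholder of your choosing; replace PROJECT_NUMBER with your project's number.
# Create a regional, Standard-class bucket in us-central1 (replace bucket-name with a unique name).
$ gsutil mb -l us-central1 -c standard gs://bucket-name
# Grant the Cloud TPU Service Account write access on the bucket's fine-grained ACL.
$ gsutil acl ch -u service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com:WRITE gs://bucket-name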
Create a GKE cluster with Cloud TPU support
You can enable Cloud TPU support on a new GKE cluster.
Creating a new cluster with Cloud TPU support
You can create a cluster with Cloud TPU support by using the Cloud Console or the gcloud tool.
Select an option below to see the relevant instructions:
Console
Follow these instructions to create a GKE cluster with Cloud TPU support:
Go to the GKE page on the Cloud Console.
Click Create cluster.
Specify a Name for your cluster. The name must be unique within the project and zone.
For the Location type, select zonal and then select the desired zone where you plan to use a Cloud TPU resource. For this tutorial, select the us-central1-b zone.
Ensure that the Master version is set to 1.13.4-gke.5 or later, to allow support for Cloud TPU.
From the navigation pane, under the node pool you want to configure, click Security.
Select Allow full access to all Cloud APIs. This ensures that all nodes in the cluster have access to your Cloud Storage bucket. The cluster and the storage bucket must be in the same project for this to work. Note that the Kubernetes Pods by default inherit the scopes of the nodes to which they are deployed. If you want to limit the access on a per Pod basis, see the GKE guide to authenticating with service accounts.
From the navigation pane, under Cluster, click Networking.
Select Enable VPC-native traffic routing (uses alias IP). If a VPC network does not exist for the current project, you need to create one.
From the navigation pane, under Cluster, click Features.
Select Enable Cloud TPU.
Configure the remaining options for your cluster as desired. You can leave the options at their default values.
Click Create.
Connect to the cluster. You can do this by selecting your cluster from the Console Kubernetes clusters page and clicking the Connect button. This displays the gcloud command to run in a Cloud Shell to connect.
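For reference, the connect command that the Console displays typically looks like the following (assuming the cluster name and zone used in this tutorial); it fetches credentials so that kubectl can talk to your cluster:
$ gcloud container clusters get-credentials cluster-name --zone us-central1-b --project project-name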
gcloud
Follow the instructions below to set up your environment and create a GKE cluster with Cloud TPU support, using the gcloud tool:
Install the gcloud components, which you need for running GKE with Cloud TPU:
$ gcloud components install kubectl
Configure gcloud with your Google Cloud project ID:
$ gcloud config set project project-name
Replace project-name with the name of your Google Cloud project.
The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make GCP API calls with your credentials.
Configure gcloud with the zone where you plan to use a Cloud TPU resource. For this tutorial, use the us-central1-b zone:
$ gcloud config set compute/zone us-central1-b
Use the gcloud container clusters create command to create a cluster on GKE with support for Cloud TPU. In the following command, replace cluster-name with a cluster name of your choice:
$ gcloud container clusters create cluster-name \
    --cluster-version=1.16 \
    --scopes=cloud-platform \
    --enable-ip-alias \
    --enable-tpu
Command flag descriptions
- cluster-version: Indicates that the cluster will use the latest Kubernetes 1.16 release. You must use version 1.13.4-gke.5 or later.
- scopes: Ensures that all nodes in the cluster have access to your Cloud Storage bucket. The cluster and the storage bucket must be in the same project for this to work. Note that Kubernetes Pods by default inherit the scopes of the nodes to which they are deployed. Therefore, scopes=cloud-platform gives all Kubernetes Pods running in the cluster the cloud-platform scope. If you want to limit the access on a per-Pod basis, see the GKE guide to authenticating with service accounts.
- enable-ip-alias: Indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.
- enable-tpu: Indicates that the cluster must support Cloud TPU.
- tpu-ipv4-cidr (optional, not specified above): Indicates the CIDR range to use for Cloud TPU. Specify the IP_RANGE in the form IP/20, such as 10.100.0.0/20. If you do not specify this flag, a /20 size CIDR range is automatically allocated and assigned.
When the cluster has been created, you should see a message similar to the following:
NAME            LOCATION       MASTER_VERSION    MASTER_IP     MACHINE_TYPE   NODE_VERSION      NUM_NODES  STATUS
cluster-resnet  us-central1-b  1.16.15-gke.4901  34.71.245.25  n1-standard-1  1.16.15-gke.4901  3          RUNNING
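Optionally, you can confirm that Cloud TPU support is enabled on the new cluster. This sketch assumes the enableTpu field returned by the clusters describe command; it should print a true value for a TPU-enabled cluster:
$ gcloud container clusters describe cluster-name --zone us-central1-b --format="value(enableTpu)"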
Viewing operations
Enabling Cloud TPU support starts an update operation. This operation takes around 5 minutes for zonal clusters and roughly 15 minutes for regional clusters, depending on the cluster's region.
To list every running and completed operation in your cluster, run the following command:
$ gcloud container operations list
To get more information about a specific operation, run the following command:
$ gcloud container operations describe operation-id
Replace operation-id with the ID of the specific operation.
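If you prefer to block until an operation completes instead of polling, gcloud also provides a wait command (shown here with the zone used in this tutorial):
$ gcloud container operations wait operation-id --zone=us-central1-b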
The GKE / ResNet Job spec
The Job spec used by this tutorial (shown below) requests one Preemptible Cloud TPU v2 device with TensorFlow 2.3. It also starts a TensorBoard process.
The lifetime of Cloud TPU nodes is bound to the Kubernetes Pods that request them. The Cloud TPU is created on demand when the Kubernetes Pod is scheduled, and recycled when the Kubernetes Pod is deleted.
apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-tpu
spec:
  template:
    metadata:
      annotations:
        # The Cloud TPUs that will be created for this Job will support
        # TensorFlow 2.3. This version MUST match the
        # TensorFlow version that your model is built on.
        tf-version.cloud-tpus.google.com: "2.3"
    spec:
      restartPolicy: Never
      containers:
      - name: resnet-tpu
        # The official TensorFlow 2.3.0 image.
        # https://hub.docker.com/r/tensorflow/tensorflow
        image: tensorflow/tensorflow:2.3.0
        command:
        - bash
        - -c
        - |
          pip install tf-models-official==2.3.0
          python3 -m official.vision.image_classification.resnet.resnet_ctl_imagenet_main \
            --tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS) \
            --distribution_strategy=tpu \
            --steps_per_loop=500 \
            --log_steps=500 \
            --use_synthetic_data=true \
            --dtype=fp32 \
            --enable_tensorboard=true \
            --train_epochs=90 \
            --epochs_between_evals=1 \
            --batch_size=1024 \
            --model_dir=gs://bucket-name/resnet
        resources:
          limits:
            # Request a single Preemptible v2-8 Cloud TPU device to train the
            # model. A single v2-8 Cloud TPU device consists of 4 chips, each of
            # which has 2 cores, so there are 8 cores in total.
            cloud-tpus.google.com/preemptible-v2: 8
      - name: tensorboard
        image: tensorflow/tensorflow:2.2.0
        command:
        - bash
        - -c
        - |
          pip install tensorboard-plugin-profile==2.3.0 cloud-tpu-client
          tensorboard --logdir=gs://bucket-name/resnet --port=6006
        ports:
        - containerPort: 6006
Create the Job
Follow these steps to create the Job in the GKE cluster:
Using a text editor, create a Job spec, example-job.yaml, and copy/paste in the Job spec shown above. Be sure to replace the bucket-name variable in the --model_dir parameter and in the tensorboard command with the name of your storage bucket.
Run the Job:
$ kubectl create -f example-job.yaml
This command creates the Job, which automatically schedules the Pod, and should return the following output:
job "resnet-tpu" created
Verify that the Kubernetes Pod has been scheduled and Cloud TPU nodes have been provisioned. A Kubernetes Pod requesting Cloud TPU nodes can be pending for 5 minutes before running. You will see output similar to the following until the Kubernetes Pod is scheduled.
$ kubectl get pods -w
Initial output should display:
NAME               READY   STATUS    RESTARTS   AGE
resnet-tpu-cmvlf   0/1     Pending   0          1m
After 5 minutes, you should see something like this:
NAME               READY   STATUS    RESTARTS   AGE
resnet-tpu-cmvlf   1/1     Running   0          6m
When the STATUS shows Running, the training has started. Type CTRL-C to exit the kubectl get pods -w process.
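If the Pod stays in Pending for much longer than expected, the Pod's events usually explain why (for example, insufficient Cloud TPU quota in the zone). The Pod name below is the example name from the output above; replace it with your own:
$ kubectl describe pod resnet-tpu-cmvlf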
View the Cloud TPU and TensorBoard container status and logs
Follow these steps to verify the status and view the logs of the Cloud TPU and TensorBoard instances used by your Kubernetes Pods.
In the Cloud Console, go to the GKE page and use the following steps:
On the left-hand navigation bar, click Workloads.
Select your Job. This takes you to a page that includes a heading Managed Pods.
Under Managed Pods, select your Kubernetes Pod. This takes you to a page that includes a heading Containers.
Under Containers, a list of Containers is displayed. The list includes all Cloud TPU and TensorBoard instances. For each Container, the following information is displayed:
- The run status
- A link to the Container logs
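You can also stream the same logs from the command line with kubectl. The Pod name here is a placeholder; the container names match the Job spec above:
# Logs from the training container
$ kubectl logs resnet-tpu-pod-id -c resnet-tpu
# Logs from the TensorBoard container
$ kubectl logs resnet-tpu-pod-id -c tensorboard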
Using TensorBoard for visualizing metrics and analyzing performance
TensorBoard is a suite of tools designed to present TensorFlow metrics visually. TensorBoard can help identify bottlenecks in processing and suggest ways to improve performance.
TensorFlow Profiler is a TensorBoard plugin for capturing a profile on an individual Cloud TPU or a Cloud TPU Pod which can be visualized on TensorBoard. See the TPU tools documentation for details on the visualization tools and what information can be captured.
Follow these steps to run TensorBoard in the GKE cluster:
Follow the steps for viewing the TensorBoard status to verify that the TensorBoard instance is running in a container.
Port-forward to the TensorBoard Kubernetes Pod:
$ kubectl port-forward pod/resnet-tpu-pod-id 6006
where pod-id is the last set of characters of your GKE Pod name shown on the Console at Kubernetes Engine > Workloads > Managed Pods. For example: resnet-tpu-wxskc.
On the bar at the top right-hand side of the Cloud Shell, click the Web preview button and open port 6006 to view the TensorBoard output. The TensorBoard UI will appear as a tab in your browser.
Select PROFILE from the dropdown menu on the top right side of the TensorBoard page.
Click on the CAPTURE PROFILE button on the PROFILE page.
In the popup menu, select the TPU name address type and enter the TPU name. The TPU name appears in the Cloud Console on the Compute Engine > TPUs page, in the following format:
gke-cluster-name-cluster-id-tpu-tpu-id
For example: gke-demo-cluster-25cee208-tpu-4b90f4c5
Select the CAPTURE button on the pop up menu when you're ready to begin profiling, and wait a few seconds for the profile to complete.
Refresh your browser to see the tracing data under the PROFILE tab on TensorBoard.
For more information on how to capture and interpret profiles, see the TensorFlow profiler guide.
Clean up
When you've finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Cloud Billing account.
Console
Delete your GKE cluster:
Go to the GKE page on the Cloud Console.
Select the checkbox next to the cluster that you want to delete.
Click Delete.
When you've finished examining the data, delete the Cloud Storage bucket that you created during this tutorial:
Go to the Cloud Storage page on the Cloud Console.
Select the checkbox next to the bucket that you want to delete.
Click Delete.
See the Cloud Storage pricing guide for free storage limits and other pricing information.
gcloud
Run the following command to delete your GKE cluster, replacing cluster-name with your cluster name, and project-name with your Google Cloud project name:
$ gcloud container clusters delete cluster-name --zone=us-central1-b --project=project-name
This command deletes the cluster, the container, and the Cloud TPU.
When you've finished examining the data, use the gsutil command to delete the Cloud Storage bucket that you created during this tutorial. Replace bucket-name with the name of your Cloud Storage bucket:
$ gsutil rm -r gs://bucket-name
See the Cloud Storage pricing guide for free storage limits and other pricing information.
What's next
Run more models and dataset retrieval jobs using one of the following Job specs:
- Download and preprocess the COCO dataset on GKE.
- Download and preprocess ImageNet on GKE.
- Train Mask RCNN using Cloud TPU and GKE.
- Train AmoebaNet-D using Cloud TPU and GKE.
- Train Inception v3 using Cloud TPU and GKE.
- Train RetinaNet using Cloud TPU and GKE.