This tutorial shows you how to train the Mask RCNN model on Cloud TPU and GKE.
Objectives
- Create a Cloud Storage bucket to hold your dataset and model output.
- Create a GKE cluster to manage your Cloud TPU resources.
- Download a Kubernetes job spec describing the resources needed to train the Mask RCNN model with TensorFlow on a Cloud TPU.
- Run the job in your GKE cluster, to start training the model.
- Check the logs and the model output.
Costs
This tutorial uses billable components of Google Cloud, including:
- Compute Engine
- Cloud TPU
- Cloud Storage
Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the APIs needed for this tutorial in the Cloud Console.
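If you prefer the command line, you can enable the services from Cloud Shell instead. The exact API list from the original page isn't reproduced here, so this is a sketch assuming the tutorial needs at least the GKE and Cloud TPU APIs:

# Enables the GKE and Cloud TPU APIs (assumed list) for the current project.
$ gcloud services enable container.googleapis.com tpu.googleapis.com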
When you use Cloud TPU with GKE, your project uses billable components of Google Cloud. Check the Cloud TPU pricing and the GKE pricing to estimate your costs, and follow the instructions to clean up resources when you've finished with them.
Create a Cloud Storage bucket
You need a Cloud Storage bucket to store the data you use to train
your model and the training results. The gcloud command-line tool used in this tutorial sets up default permissions for the Cloud TPU Service Account. If you want finer-grain permissions, review the access level permissions.
Go to the Cloud Storage page on the Cloud Console.
Create a new bucket, specifying the following options:
- A unique name of your choosing.
- Location type: Region
- Location: us-central1
- Default storage class: Standard
- Choose how to control access to objects: Set object-level and bucket-level permissions
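If you'd rather create the bucket from the command line, a gsutil equivalent of the console settings above looks roughly like this; the bucket name is a placeholder of your choosing:

# Creates a Standard-class bucket in the us-central1 region.
$ gsutil mb -p ${PROJECT_ID} -c standard -l us-central1 gs://your-maskrcnn-bucket/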
Create a cluster on GKE
Follow the instructions below to set up your environment and create a
GKE cluster with Cloud TPU support,
using the gcloud
command-line tool.
Open a Cloud Shell window.
Configure the gcloud command-line tool to use the project where you want to create the Cloud TPU:

gcloud config set project ${PROJECT_ID}

The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make GCP API calls with your credentials.

Create a Service Account for the Cloud TPU project:
gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
The command returns a Cloud TPU Service Account in the following format:
service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
Authorize Cloud TPU to access your Cloud Storage bucket
You need to give your Cloud TPU read/write access to your Cloud Storage objects. To do that, you must grant the required access to the Service Account used by the Cloud TPU. Follow the guide to grant access to your storage bucket.
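The linked guide covers the Console flow; from the shell, one way to grant that access is with gsutil iam. The storage.admin role shown here is a broad choice made for simplicity; the access level permissions page lists narrower roles if you prefer them:

# Grants the Cloud TPU Service Account admin access to the bucket.
# Replace PROJECT_NUMBER and the bucket name with your own values.
$ gsutil iam ch \
    serviceAccount:service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com:roles/storage.admin \
    gs://your-maskrcnn-bucket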
Specify the zone where you plan to use a Cloud TPU resource. For this example, use the us-central1-b zone:

$ gcloud config set compute/zone us-central1-b
Use the gcloud container clusters command to create a cluster on GKE with support for Cloud TPU. Note that the GKE cluster and its node pools must be created in a zone where Cloud TPU is available, such as the us-central1-b zone you set in the previous step. The following command creates a cluster named tpu-models-cluster:

$ gcloud container clusters create tpu-models-cluster \
  --cluster-version=1.13 \
  --scopes=cloud-platform \
  --enable-ip-alias \
  --enable-tpu \
  --machine-type=n1-standard-4
In the above command:
- --cluster-version=1.13 indicates that the cluster will use the latest Kubernetes 1.13 release. You must use version 1.13.4-gke.5 or later.
- --scopes=cloud-platform ensures that all nodes in the cluster have access to your Cloud Storage bucket in the Google Cloud project you set above. The cluster and the storage bucket must be in the same project. Note that Pods by default inherit the scopes of the nodes to which they are deployed. This flag gives all Pods running in the cluster the cloud-platform scope. If you want to limit the access on a per-Pod basis, see the GKE guide to authenticating with Service Accounts.
- --enable-ip-alias indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.
- --enable-tpu indicates that the cluster must support Cloud TPU.
- --machine-type=n1-standard-4 specifies the machine type required to run this job.
When the command has finished running, a confirmation message appears, similar to this:
kubeconfig entry generated for tpu-models-cluster.
NAME                LOCATION       MASTER_VERSION  MASTER_IP      MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
tpu-models-cluster  us-central1-b  1.13.6-gke.5    35.232.204.86  n1-standard-4  1.13.6-gke.5  3          RUNNING
Process the training data
Your next task is to create a job. In Google Kubernetes Engine, a job is a controller object that represents a finite task. The first job you need to create downloads and processes the COCO dataset used to train the Mask RCNN model.
In your shell environment, create a file named download_and_preprocess_coco_k8s.yaml. You can download this file from GitHub; a simplified sketch of its structure appears after these steps.

Locate the environment variable named DATA_BUCKET. Update the value for this variable to the path where you want to store the COCO dataset on the Cloud Storage bucket. For example: "gs://my-maskrcnn-bucket/coco/"

Run the following command to create the job in your cluster:
kubectl create -f download_and_preprocess_coco_k8s.yaml
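The real download_and_preprocess_coco_k8s.yaml is the file on GitHub; the heredoc below is only a minimal sketch of what a preprocessing Job of this kind looks like, written to a separate file name so it doesn't overwrite your copy. The job name matches the Pod prefix shown later in this section and DATA_BUCKET is the variable named above; the container image, command, and bucket path are placeholder assumptions.

# Minimal sketch only -- not the real spec. Download the actual
# download_and_preprocess_coco_k8s.yaml from GitHub for real runs.
cat > download_and_preprocess_coco_k8s_sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: download-and-preprocess-coco
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: download-and-preprocess-coco
        # Placeholder image; the real spec names the container that
        # downloads and converts the COCO dataset.
        image: gcr.io/your-project/coco-preprocess:latest
        command: ["/bin/bash", "-c", "echo placeholder preprocessing command"]
        env:
        # Cloud Storage path where the processed COCO data is written.
        - name: DATA_BUCKET
          value: "gs://my-maskrcnn-bucket/coco/"
EOF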
Jobs can take a few minutes to start. You can view the status of the jobs running on your cluster by using the following command:
kubectl get pods -w
This command returns information about all jobs running in the cluster. For example:

NAME                                 READY   STATUS    RESTARTS   AGE
download-and-preprocess-coco-xmx9s   0/1     Pending   0          10s
The -w
flag instructs the command to watch for any changes in a pod's status.
After a few minutes, the status for your job should update to Running.
If you encounter any issues with your job, you can delete it and
try again. To do so, run the kubectl delete -f download_and_preprocess_coco_k8s.yaml
command before running the kubectl create
command again.
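If a job stays in Pending or keeps restarting, the Pod's events usually explain why (for example, no node with a free Cloud TPU). A quick way to inspect them, using the Pod name reported by kubectl get pods:

# Replace the Pod name with the one shown by "kubectl get pods".
kubectl describe pod download-and-preprocess-coco-xmx9s
kubectl logs download-and-preprocess-coco-xmx9s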
When the download and preprocessing completes, you are ready to run the model.
Run the Mask RCNN model
Everything is now in place for you to run the Mask RCNN model using Cloud TPU and GKE.
In your shell environment, create a file named mask_rcnn_k8s.yaml. You can download this file from GitHub; a simplified sketch of its structure appears after this step.

Locate the environment variable named DATA_BUCKET. Update the value for this variable to the DATA_BUCKET path you specified in the download_and_preprocess_coco_k8s.yaml spec.

Locate the environment variable named MODEL_BUCKET. Update the value for this variable to the path to your Cloud Storage bucket. This location is where the script saves the training output. If the specified MODEL_BUCKET does not exist, it is created when you run the job.

Run the following command to create the job in your cluster:
kubectl create -f mask_rcnn_k8s.yaml
When you run this command, the following confirmation message appears:
job "mask-rcnn-gke-tpu" created
As mentioned in the Process the training data section, it can take a few minutes for a job to start. View the status of the jobs running on your cluster by using the following command:
kubectl get pods -w
View training status
You can track the training status by using the kubectl
command-line utility.
Run the following command to get the status of the job:
kubectl get pods
During the training process, the status of the pod should be Running.

To get additional information on the training process, run the following command:

$ kubectl logs job-name

Where job-name is the full name of the job, for example, mask-rcnn-gke-tpu-abcd.

You can also check the output on the GKE Workloads dashboard on the Cloud Console.
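If you'd rather not look up the generated Pod name, recent kubectl versions also accept a job reference and can stream the output as it arrives; a sketch using the job name from this tutorial:

# Streams logs from a Pod that belongs to the mask-rcnn-gke-tpu job.
kubectl logs -f job/mask-rcnn-gke-tpu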
Note that it takes a while for the first entry to appear in the logs. You can expect to see something like this:
I0622 18:14:31.617954 140178400511808 tf_logging.py:116] Calling model_fn.
I0622 18:14:40.449557 140178400511808 tf_logging.py:116] Create CheckpointSaverHook.
I0622 18:14:40.697138 140178400511808 tf_logging.py:116] Done calling model_fn.
I0622 18:14:44.004508 140178400511808 tf_logging.py:116] TPU job name worker
I0622 18:14:45.254548 140178400511808 tf_logging.py:116] Graph was finalized.
I0622 18:14:48.346483 140178400511808 tf_logging.py:116] Running local_init_op.
I0622 18:14:48.506665 140178400511808 tf_logging.py:116] Done running local_init_op.
I0622 18:14:49.135080 140178400511808 tf_logging.py:116] Init TPU system
I0622 18:15:00.188153 140178400511808 tf_logging.py:116] Start infeed thread controller
I0622 18:15:00.188635 140177578452736 tf_logging.py:116] Starting infeed thread controller.
I0622 18:15:00.188838 140178400511808 tf_logging.py:116] Start outfeed thread controller
I0622 18:15:00.189151 140177570060032 tf_logging.py:116] Starting outfeed thread controller.
I0622 18:15:07.316534 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
I0622 18:15:07.316904 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
I0622 18:16:13.881397 140178400511808 tf_logging.py:116] Saving checkpoints for 100 into gs://<bucket-name>/mask_rcnn/model.ckpt.
I0622 18:16:21.147114 140178400511808 tf_logging.py:116] loss = 1.589756, step = 0
I0622 18:16:21.148168 140178400511808 tf_logging.py:116] loss = 1.589756, step = 0
I0622 18:16:21.150870 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
I0622 18:16:21.151168 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
I0622 18:17:00.739207 140178400511808 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
I0622 18:17:00.739809 140178400511808 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
I0622 18:17:36.598773 140178400511808 tf_logging.py:116] global_step/sec: 2.65061
I0622 18:17:37.040504 140178400511808 tf_logging.py:116] examples/sec: 2698.56
I0622 18:17:37.041333 140178400511808 tf_logging.py:116] loss = 2.63023, step = 200 (75.893 sec)
View the trained model at gs://<bucket-name>/mask-rcnn/model.ckpt:

$ gsutil ls -r gs://bucket-name/mask-rcnn/model.ckpt
Clean up
When you've finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Cloud Billing account.
If you haven't set the project and zone for this session, do so now. See the instructions earlier in this guide. Then follow this cleanup procedure:
Run the following command to delete your GKE cluster, tpu-models-cluster, replacing project-name with your Google Cloud project name:

$ gcloud container clusters delete tpu-models-cluster --project=project-name
When you've finished examining the data, use the gsutil command to delete the Cloud Storage bucket you created during this tutorial. Replace bucket-name with the name of your Cloud Storage bucket:

$ gsutil rm -r gs://bucket-name
See the Cloud Storage pricing guide for free storage limits and other pricing information.
What's next
- Explore the TPU tools in TensorBoard.
- Run more models and dataset retrieval jobs using one of the following job specs:
  - Download and preprocess the COCO dataset on GKE.
  - Download and preprocess ImageNet on GKE.
  - Train AmoebaNet-D using Cloud TPU and GKE.
  - Train Inception v3 using Cloud TPU and GKE.
  - Train RetinaNet using Cloud TPU and GKE.
- Experiment with more TPU samples.