This page introduces Cloud TPU and shows you where to find information on using Cloud TPU with Google Kubernetes Engine (GKE). Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate TensorFlow machine learning workloads.
Overview
Using GKE to manage your Cloud TPU brings the following advantages:
Easier setup and management: When you use Cloud TPU, you need a Compute Engine VM to run your workload, and a Classless Inter-Domain Routing (CIDR) block for the Cloud TPU. GKE sets up and manages the VM and the CIDR block for you.
Optimized cost: GKE scales your VMs and Cloud TPU nodes automatically based on workloads and traffic. You only pay for Cloud TPU and the VM when you run workloads on them.
Flexible usage: It's a one-line change in your Pod spec to request a different hardware accelerator (CPU, GPU, or TPU):
    kind: Pod
    spec:
      containers:
      - name: example-container
        resources:
          limits:
            cloud-tpus.google.com/v2: 8
            # See the line above for TPU, or below for CPU / GPU.
            # cpu: 2
            # nvidia.com/gpu: 1
Scalability: GKE provides the Deployment API, which lets you scale a workload to hundreds of Pods and Cloud TPU nodes.
Fault tolerance: GKE's Job API, together with the TensorFlow checkpoint mechanism, provides run-to-completion semantics. If a VM instance or Cloud TPU node fails, your training job automatically reruns, resuming from the latest state read from the checkpoint.
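As a rough illustration of the fault-tolerance point, the following is a minimal sketch of a Kubernetes Job that requests the same TPU resource as the Pod spec shown earlier. The Job name, container image, and `backoffLimit` value are hypothetical placeholders, not values from this page. The sketch also assumes that the training program in the container writes TensorFlow checkpoints to durable storage and restores from the latest one on startup. A real manifest may need further fields described in the setup guide.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-tpu-training        # hypothetical Job name
spec:
  backoffLimit: 4                   # assumed retry limit; tune for your workload
  template:
    spec:
      restartPolicy: OnFailure      # restart the container if it exits with an error
      containers:
      - name: example-container
        # Hypothetical image; the program inside is assumed to resume
        # from its latest checkpoint when it starts.
        image: gcr.io/example-project/example-trainer:latest
        resources:
          limits:
            cloud-tpus.google.com/v2: 8
```

With `restartPolicy: OnFailure`, Kubernetes restarts the container after a failure, and the Job controller keeps retrying until the program exits successfully or the retry limit is reached. Resuming from the checkpoint is the responsibility of the training code, not of the Job itself.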
What's next
- Follow the quick guide to setting up Cloud TPU with GKE.
- Learn about best practices for using Cloud TPU for your machine learning tasks.