TPUs


This page introduces Cloud TPU and shows you where to find information on using Cloud TPU with Google Kubernetes Engine (GKE). Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate TensorFlow machine learning workloads.

Overview

Using GKE to manage your Cloud TPU brings the following advantages:

  • Easier setup and management: When you use Cloud TPU, you need a Compute Engine VM to run your workload, and a Classless Inter-Domain Routing (CIDR) block for the Cloud TPU. GKE sets up and manages the VM and the CIDR block for you.

  • Optimized cost: GKE scales your VMs and Cloud TPU nodes automatically based on workloads and traffic. You only pay for Cloud TPU and the VM when you run workloads on them.

  • Flexible usage: It's a one-line change in your Pod spec to request a different hardware accelerator (CPU, GPU, or TPU):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod           # placeholder name
    spec:
      containers:
      - name: example-container
        image: example-image      # placeholder image
        resources:
          limits:
            # Request a Cloud TPU v2 with 8 cores. To use CPU or GPU
            # instead, swap in one of the commented-out lines below.
            cloud-tpus.google.com/v2: 8
            # cpu: 2
            # nvidia.com/gpu: 1
    
  • Scalability: GKE provides the Deployment API, which scales easily to hundreds of Pods and Cloud TPU nodes (see the Deployment sketch after this list).

  • Fault tolerance: GKE's Job API, combined with the TensorFlow checkpoint mechanism, provides run-to-completion semantics: if a failure occurs on a VM instance or Cloud TPU node, your training job automatically reruns, reading the latest state from the checkpoint (see the Job sketch after this list).
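
For scaling, the Deployment below is a minimal sketch: the name, labels, image, and replica count are placeholders, not values from this guide. Each replica is an identical Pod that requests its own Cloud TPU.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tpu-worker            # placeholder name
    spec:
      replicas: 100               # illustrative count; each Pod gets its own Cloud TPU
      selector:
        matchLabels:
          app: tpu-worker
      template:
        metadata:
          labels:
            app: tpu-worker
        spec:
          containers:
          - name: worker
            image: gcr.io/your-project/worker:latest   # placeholder image
            resources:
              limits:
                cloud-tpus.google.com/v2: 8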
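
For fault tolerance, the Job below is a similar sketch with placeholder names and a hypothetical train.py entrypoint; it assumes your training script resumes from the latest checkpoint under its --model_dir path when rerun.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-training-job      # placeholder name
    spec:
      backoffLimit: 4             # rerun the Pod on failure, up to 4 times
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: trainer
            image: gcr.io/your-project/trainer:latest   # placeholder image
            # Hypothetical entrypoint; on a rerun, the script is assumed to
            # restore the latest checkpoint from --model_dir and continue.
            command: ["python", "train.py", "--model_dir=gs://your-bucket/checkpoints"]
            resources:
              limits:
                cloud-tpus.google.com/v2: 8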

What's next

  • Follow the quick guide to set up Cloud TPU with GKE.
  • Learn about best practices for using Cloud TPU for your machine learning tasks.