TPUs

This page introduces Cloud TPU and shows you where to find information on using Cloud TPU with Google Kubernetes Engine. Tensor Processing Units (TPUs) are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate TensorFlow machine learning workloads.

Overview

Using GKE to manage your Cloud TPU brings the following advantages:

  • Easier setup and management: When you use Cloud TPU, you need a Compute Engine VM to run your workload, and a Classless Inter-Domain Routing (CIDR) block for the Cloud TPU. GKE sets up and manages the VM and the CIDR block for you.

  • Optimized cost: GKE scales your VMs and Cloud TPU nodes automatically based on workloads and traffic. You only pay for Cloud TPU and the VM when you run workloads on them.

  • Flexible usage: Requesting a different hardware accelerator (CPU, GPU, or TPU) takes only a one-line change in your Pod spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod
    spec:
      containers:
      - name: example-container
        image: gcr.io/your-project/your-image  # placeholder; use your own image
        resources:
          limits:
            # Request eight Cloud TPU v2 cores for this container.
            cloud-tpus.google.com/v2: 8
            # To request CPU or GPU instead, replace the line above with one of:
            # cpu: 2
            # nvidia.com/gpu: 1

  • Scalability: GKE provides workload APIs (Job and Deployment) that scale easily to hundreds of Pods and Cloud TPU nodes (see the example Job after this list).

  • Fault tolerance: GKE's Job API, combined with the TensorFlow checkpoint mechanism, provides run-to-completion semantics: if a failure occurs on a VM instance or a Cloud TPU node, your training job automatically reruns and resumes from the latest checkpoint.
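To make the scalability and fault-tolerance points concrete, here is a minimal sketch of a Job that runs several TPU training Pods in parallel and restarts failed Pods so they can resume from their latest checkpoint. The Job name, container image, and checkpoint path are hypothetical placeholders, and the checkpoint flag assumes a trainer that accepts a --model_dir argument:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-training-job            # hypothetical name
    spec:
      parallelism: 4                    # run four training Pods at once; raise to scale out
      backoffLimit: 10                  # retry failed Pods up to ten times
      template:
        spec:
          restartPolicy: OnFailure      # reschedule a Pod if its VM or TPU node fails
          containers:
          - name: trainer
            image: gcr.io/your-project/your-trainer   # placeholder training image
            # Writing checkpoints to durable storage lets a rerun resume training
            # instead of starting over (assumes the trainer reads --model_dir).
            args: ["--model_dir=gs://your-bucket/checkpoints"]
            resources:
              limits:
                cloud-tpus.google.com/v2: 8

When a Pod or its Cloud TPU node fails, the Job controller reschedules the Pod, and a trainer that checkpoints to the model directory resumes from the last saved state rather than from scratch.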

What's next

  • Follow the Cloud TPU ResNet tutorial, which shows you how to train the TensorFlow ResNet-50 model using Cloud TPU and GKE.
  • Alternatively, follow the quick guide to setting up Cloud TPU with GKE.
  • Learn about best practices for using Cloud TPU for your machine learning tasks.