Services that can access TPUs

Your applications can access TPU nodes from containers, VM instances, or services on Google Cloud. To do so, the application must connect to your TPU node through your VPC network.

The following Google Cloud services can access TPU nodes. Choose the service whose characteristics best match your requirements.

Compute Engine

  • Cloud TPU on Compute Engine is a good starting place for users new to Cloud TPU and for experienced machine learning users who want to manage their own Cloud TPU services. It includes:
    • The gcloud command-line tool, which sets up your VM, TPU, and Cloud Storage resources.
    • A quickstart that guides you through training your first machine learning model.
    • Tutorials for image classification, object detection, and language translation models.
    • Tools for monitoring performance and resolving bottlenecks in TPU model processing.
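As a minimal sketch of that setup, the gcloud command-line tool can create the VM and the TPU node directly. The resource names, zone, machine type, accelerator type, and TensorFlow version below are illustrative placeholders; substitute values for your own project:

```shell
# Create a Compute Engine VM to drive the TPU (name and zone are examples).
gcloud compute instances create my-tpu-vm \
    --zone=us-central1-b \
    --machine-type=n1-standard-8 \
    --scopes=cloud-platform

# Create the Cloud TPU node in the same zone.
# --version selects the TensorFlow version the TPU node runs.
gcloud compute tpus create my-tpu \
    --zone=us-central1-b \
    --accelerator-type=v3-8 \
    --version=1.15
```

Once both resources exist, connect to the VM over SSH and point your training script at the TPU node by name or gRPC address.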

Kubernetes Engine

  • Cloud TPU on Google Kubernetes Engine offers:
    • Easy setup and management: When you use Cloud TPU, you need a Compute Engine VM to run your workload, and a Classless Inter-Domain Routing (CIDR) block for Cloud TPU. Google Kubernetes Engine sets up and manages the VM and the CIDR block for you.
    • Optimized cost: Google Kubernetes Engine scales your VMs automatically based on workloads and traffic. You pay only for Cloud TPU and the VM when you run workloads on them.
    • Flexible usage: Changing your hardware accelerator (CPU, GPU, or TPU) requires only a single line change in your Pod spec.
    • Scalability: Google Kubernetes Engine provides APIs (Job and Deployment) that can easily scale to hundreds of Pods and Cloud TPU nodes.
    • Fault tolerance: The Google Kubernetes Engine Job API, combined with the TensorFlow checkpoint mechanism, provides run-to-completion semantics. If a failure occurs on a VM instance or Cloud TPU node, your training job automatically restarts from the latest checkpoint.
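To illustrate the single-line accelerator change, a Pod spec requests a Cloud TPU through a resource limit. This is a sketch only; the Pod name, container image, and TensorFlow version are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-training-pod
  annotations:
    # Tells GKE which TensorFlow version to run on the TPU.
    tf-version.cloud-tpus.google.com: "1.15"
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: gcr.io/my-project/my-trainer:latest   # placeholder image
    command: ["python", "train.py"]
    resources:
      limits:
        # This is the single line to change when switching accelerators,
        # e.g. to a preemptible TPU type or an nvidia.com/gpu resource.
        cloud-tpus.google.com/v2: 8
```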

AI Platform

  • Cloud TPU on AI Platform is a good place to start if you have some ML experience and want to take advantage of the AI Platform managed services and APIs. AI Platform manages the following ML workflow stages:
    • Train an ML model on your data:
      • Training the model
      • Evaluating model accuracy
      • Tuning hyperparameters
    • Deploy your trained model.
    • Send prediction requests to your model:
      • Online prediction
      • Batch prediction
    • Monitor the predictions on an ongoing basis.
    • Manage your models and model versions.
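The training and prediction stages above can be driven with the gcloud command-line tool. The job name, region, bucket, package paths, and model name below are placeholders for your own values:

```shell
# Submit a training job that runs on a TPU (BASIC_TPU scale tier).
gcloud ai-platform jobs submit training my_tpu_job \
    --region=us-central1 \
    --staging-bucket=gs://my-bucket \
    --scale-tier=BASIC_TPU \
    --module-name=trainer.task \
    --package-path=trainer/

# After deploying the trained model, request online predictions:
gcloud ai-platform predict \
    --model=my_model \
    --json-instances=instances.json
```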