Using TPUs to train your model

Tensor Processing Units (TPUs) are Google's custom-developed ASICs used to accelerate machine-learning workloads. You can run your training jobs on AI Platform Training, using Cloud TPU. AI Platform Training provides a job management interface so that you don't need to manage the TPU yourself. Instead, you can use the AI Platform Training jobs API in the same way as you use it for training on a CPU or a GPU.

High-level TensorFlow APIs help you get your models running on the Cloud TPU hardware.

Setting up your Google Cloud environment

Configure your Google Cloud environment by working through the setup section of the getting-started guide.

Authorizing your Cloud TPU to access your project

Follow these steps to authorize the Cloud TPU service account associated with your Google Cloud project:

  1. Get your Cloud TPU service account name by calling projects.getConfig. Example:

    PROJECT_ID=PROJECT_ID
    
    curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
        https://ml.googleapis.com/v1/projects/$PROJECT_ID:getConfig
    
  2. Save the values of the serviceAccountProject and tpuServiceAccount fields returned by the API.

  3. Initialize the Cloud TPU service account:

    curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
      -H "Content-Type: application/json" -d '{}'  \
      https://serviceusage.googleapis.com/v1beta1/projects/<serviceAccountProject>/services/tpu.googleapis.com:generateServiceIdentity
    

Now add the Cloud TPU service account as a member in your project, with the role Cloud ML Service Agent. Complete the following steps in the Google Cloud console or by using the gcloud CLI:

Console

  1. Log in to the Google Cloud console and choose the project in which you're using the TPU.
  2. Choose IAM & Admin > IAM.
  3. Click the Add button to add a member to the project.
  4. Enter the TPU service account in the Members text box.
  5. Click the Roles dropdown list.
  6. Select the Cloud ML Service Agent role (Service Agents > Cloud ML Service Agent).

gcloud

  1. Set environment variables containing your project ID and the Cloud TPU service account:

    PROJECT_ID=PROJECT_ID
    SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com
    
  2. Grant the ml.serviceAgent role to the Cloud TPU service account:

    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
    

For more details about granting roles to service accounts, see the IAM documentation.
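
Optionally, you can verify the binding by listing the project's IAM policy filtered to the TPU service account. This check is a sketch that assumes the PROJECT_ID and SVC_ACCOUNT variables set in the gcloud steps above:

gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:$SVC_ACCOUNT" \
    --format="table(bindings.role)"

If the binding succeeded, roles/ml.serviceAgent appears in the output.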

Example: Training a sample MNIST model

This section shows you how to train a sample MNIST model using a TPU and runtime version 2.11. The example job uses the predefined BASIC_TPU scale tier for your machine configuration. Later sections of the guide show you how to set up a custom configuration.

This example assumes you are using a Bash shell with the gcloud CLI installed. Run the following commands to get the code and submit your training job to AI Platform Training:

  1. Download the code for TensorFlow's reference models and navigate to the directory with the sample code:

    git clone https://github.com/tensorflow/models.git \
      --branch=v2.11.0 \
      --depth=1
    
    cd models
    
  2. Create a setup.py file in the models directory. This file ensures that the gcloud ai-platform jobs submit training command includes all of the necessary subpackages within the models/official directory when it creates a tarball of your training code. It also ensures that AI Platform Training installs TensorFlow Datasets as a dependency when it runs the training job, because this training code relies on TensorFlow Datasets to load the MNIST data.

    To create the setup.py file, run the following command in your shell:

    cat << END > setup.py
    from setuptools import find_packages
    from setuptools import setup
    
    setup(
        name='official',
        install_requires=[
            'tensorflow-datasets~=3.1',
            'tensorflow-model-optimization>=0.4.1'
        ],
        packages=find_packages()
    )
    END
    
  3. Submit your training job using the gcloud ai-platform jobs submit training command:

    gcloud ai-platform jobs submit training tpu_mnist_1 \
      --staging-bucket=gs://BUCKET_NAME \
      --package-path=official \
      --module-name=official.vision.image_classification.mnist_main \
      --runtime-version=2.11 \
      --python-version=3.7 \
      --scale-tier=BASIC_TPU \
      --region=us-central1 \
      -- \
      --distribution_strategy=tpu \
      --data_dir=gs://tfds-data/datasets \
      --model_dir=gs://BUCKET_NAME/tpu_mnist_1_output
    

    Replace BUCKET_NAME with the name of a Cloud Storage bucket in your Google Cloud project. The gcloud CLI uploads your packaged training code to this bucket, and AI Platform Training saves training output in the bucket.

  4. Monitor your training job. When the job has completed, you can view its output in the gs://BUCKET_NAME/tpu_mnist_1_output directory.
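
    For example, you can check the job's status and stream its logs with the gcloud CLI, and then list the output directory once the job finishes. Replace BUCKET_NAME as before:

    gcloud ai-platform jobs describe tpu_mnist_1
    gcloud ai-platform jobs stream-logs tpu_mnist_1
    gsutil ls gs://BUCKET_NAME/tpu_mnist_1_output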

More about training a model on Cloud TPU

This section tells you more about configuring a job and training a model on AI Platform Training with Cloud TPU.

Specifying a region that offers TPUs

You need to run your job in a region where TPUs are available. The following regions currently provide access to TPUs:

  • us-central1
  • europe-west4

To fully understand the available regions for AI Platform Training services, including model training and online/batch prediction, read the guide to regions.

TensorFlow and AI Platform Training versioning

AI Platform Training runtime versions 1.15, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, and 2.11 are available for training your models on Cloud TPU. For the corresponding TensorFlow versions, see AI Platform Training runtime versions.

The versioning policy is the same as for Cloud TPU. In your training job request, make sure to specify a runtime version that is available for TPUs and matches the TensorFlow version used in your training code.
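
For example, if your code targets the 2.11 runtime, a small guard at the start of training can surface a mismatch early. This is an optional sketch; the expected version string is illustrative:

import tensorflow as tf

# Fail fast if the AI Platform Training runtime provides a different
# TensorFlow version than the one this training code was written for.
assert tf.__version__.startswith('2.11'), tf.__version__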

Connecting to the TPU gRPC server

In your TensorFlow program, use TPUClusterResolver to connect with the TPU gRPC server running on the TPU VM.

The TensorFlow guide to using TPUs shows how to use TPUClusterResolver with the TPUStrategy distribution strategy.

However, you must make one important change when you use TPUClusterResolver for code that runs on AI Platform Training: Do not provide any arguments when you construct the TPUClusterResolver instance. When the tpu, zone, and project keyword arguments are all set to their default value of None, AI Platform Training automatically provides the cluster resolver with the necessary connection details through environment variables.

The following TensorFlow 2 example shows how to initialize a cluster resolver and a distribution strategy for training on AI Platform Training:

import tensorflow as tf

# With no arguments, TPUClusterResolver picks up the connection details that
# AI Platform Training supplies through environment variables.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

Using TPUs in TensorFlow code

To make use of the TPUs on a machine, use TensorFlow 2's TPUStrategy API. The TensorFlow guide to using TPUs shows how to do this.
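
As a minimal sketch, you can build and compile a Keras model inside the strategy's scope so that its variables are placed on the TPU. This continues from the resolver and strategy created in the previous snippet; the model architecture and train_dataset (a tf.data.Dataset) are illustrative and not part of any reference sample:

# Build and compile the model inside the TPUStrategy scope so that its
# variables and the training computation run on the TPU.
with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10),
  ])
  model.compile(
      optimizer='adam',
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=['accuracy'])

model.fit(train_dataset, epochs=5)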

To train with TPUs in TensorFlow 1, you can use the TPUEstimator API instead. The Cloud TPU guide to the TPUEstimator API shows how to do this.

The Cloud TPU documentation also provides a list of low-level TensorFlow operations available on Cloud TPU.

Using TPUs in PyTorch code

To make use of a TPU when you use a pre-built PyTorch container, use the torch_xla package. Learn how to use torch_xla for TPU training in the PyTorch documentation. For more examples of using torch_xla, see the tutorials in the PyTorch XLA GitHub repository.

Note that when you train using a TPU on AI Platform Training, you are using a single XLA device, not multiple XLA devices.
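
For illustration, a single-device training loop with torch_xla might look like the following sketch. MyModel and train_loader stand in for your own module and DataLoader; they are not defined here:

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

# Acquire the single XLA device that the TPU worker exposes to the job.
device = xm.xla_device()

model = MyModel().to(device)  # MyModel is a placeholder for your own nn.Module.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()
for inputs, targets in train_loader:  # train_loader is a placeholder DataLoader.
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss.backward()
    # optimizer_step applies the update and forces the pending XLA graph to
    # execute; barrier=True is used here because there is a single device.
    xm.optimizer_step(optimizer, barrier=True)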

See also the following section on this page about configuring your training job for PyTorch and TPU.

Configuring a custom TPU machine

A TPU training job runs on a two-VM configuration. One VM (the master) runs your Python code. The master drives the TensorFlow server running on a TPU worker.

To use a TPU with AI Platform Training, configure your training job to access a TPU-enabled machine in one of three ways:

  • Use the BASIC_TPU scale tier. You can use this method to access TPU v2 accelerators.
  • Use a cloud_tpu worker and a legacy machine type for the master VM. You can use this method to access TPU v2 accelerators.
  • Use a cloud_tpu worker and a Compute Engine machine type for the master VM. You can use this method to access TPU v2 or TPU v3 accelerators. TPU v3 accelerators are available in beta.

Basic TPU-enabled machine

Set the scale tier to BASIC_TPU to get a master VM and a TPU VM including one TPU with eight TPU v2 cores, as you did when running the previous example.
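
If you prefer to keep this setting in a configuration file rather than passing --scale-tier on the command line, a minimal config.yaml for this case looks like the following sketch:

trainingInput:
  scaleTier: BASIC_TPU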

TPU worker in a legacy machine type configuration

Alternatively, you can set up a custom machine configuration if you need more computing resources on the master VM:

  • Set the scale tier to CUSTOM.
  • Configure the master VM to use a legacy machine type that suits your job requirements.
  • Set workerType to cloud_tpu, to get a TPU VM including one Cloud TPU with eight TPU v2 cores.
  • Set workerCount to 1.
  • Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following example shows a config.yaml file that uses this type of configuration:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: cloud_tpu
  workerCount: 1
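
To submit a job with a configuration file like this one, you can pass it to the training command with the --config flag. The job name, bucket, and trainer package names below are placeholders; adjust them to your own project:

gcloud ai-platform jobs submit training JOB_NAME \
  --staging-bucket=gs://BUCKET_NAME \
  --package-path=trainer \
  --module-name=trainer.task \
  --runtime-version=2.11 \
  --python-version=3.7 \
  --region=us-central1 \
  --config=config.yaml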

TPU worker in a Compute Engine machine type configuration

You can also set up a custom machine configuration with a Compute Engine machine type for your master VM and an acceleratorConfig attached to your TPU VM.

You can use this type of configuration to set up a TPU worker with eight TPU v2 cores (similar to a configuration without an acceleratorConfig) or a TPU worker with eight TPU v3 cores (beta). Read more about the difference between TPU v2 and TPU v3 accelerators.

Using a Compute Engine machine type also provides more flexibility for configuring your master VM:

  • Set the scale tier to CUSTOM.
  • Configure the master VM to use a Compute Engine machine type that suits your job requirements.
  • Set workerType to cloud_tpu.
  • Add a workerConfig with an acceleratorConfig field. Inside that acceleratorConfig, set type to TPU_V2 or TPU_V3 and count to 8. You may not attach any other number of TPU cores.
  • Set workerCount to 1.
  • Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following example shows a config.yaml file that uses this type of configuration:

TPU v2

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V2
      count: 8

TPU v3 (beta)

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V3
      count: 8

Using TPU Pods

A TPU Pod is a collection of TPU devices connected by dedicated high-speed network interfaces. A TPU Pod can have up to 2,048 TPU cores, allowing you to distribute the processing load across multiple TPUs.

To use TPU Pods, you must first file a quota increase request.

The following example config.yaml files show how to use TPU Pods:

TPU v2 Pods

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V2_POD
      count: 128

TPU v3 Pods

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V3_POD
      count: 32

There are limitations on the number of Pod cores that can be used for each TPU type. Available configurations:

TPU Pod type    Number of Pod cores available to use
TPU_V2_POD      32, 128, 256, 512
TPU_V3_POD      32, 128, 256

For more details on how to make full use of TPU Pod cores, see the Cloud TPU documentation about TPU Pods.

Using a pre-built PyTorch container on a TPU worker

If you want to perform PyTorch training with a TPU, then you must specify the tpuTfVersion field in your training job's trainingInput. Set the tpuTfVersion to match the version of the pre-built PyTorch container that you are using for training.

AI Platform Training supports training with TPUs for the following pre-built PyTorch containers:

Container image URI                                 tpuTfVersion
gcr.io/cloud-ml-public/training/pytorch-xla.1-11    pytorch-1.11
gcr.io/cloud-ml-public/training/pytorch-xla.1-10    pytorch-1.10
gcr.io/cloud-ml-public/training/pytorch-xla.1-9     pytorch-1.9
gcr.io/cloud-ml-public/training/pytorch-xla.1-7     pytorch-1.7
gcr.io/cloud-ml-public/training/pytorch-xla.1-6     pytorch-1.6

For example, to train using the PyTorch 1.11 pre-built container, you might use the following config.yaml file to configure training:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-public/training/pytorch-xla.1-11
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/cloud-ml-public/training/pytorch-xla.1-11
    tpuTfVersion: pytorch-1.11
    acceleratorConfig:
      type: TPU_V2
      count: 8

See also the previous section on this page about Using TPUs in PyTorch code.

Using a custom container on a TPU worker

If you want to run a custom container on your TPU worker instead of using one of the AI Platform Training runtime versions that support TPUs, you must specify an additional configuration field when you submit your training job. Set the tpuTfVersion to a runtime version that includes the version of TensorFlow that your container uses. You must specify a runtime version currently supported for training with TPUs.

Because you are configuring your job to use a custom container, AI Platform Training doesn't use this runtime version's environment when it runs your training job. However, AI Platform Training requires this field so it can properly prepare the TPU worker for the version of TensorFlow that your custom container uses.

The following example shows a config.yaml file with a similar TPU configuration to the one from the previous section, except in this case the master VM and the TPU worker each run different custom containers:

TPU v2

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-master-image-name:your-master-tag-name
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-worker-image-name:your-worker-tag-name
    tpuTfVersion: 2.11
    acceleratorConfig:
      type: TPU_V2
      count: 8

TPU v3 (beta)

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-master-image-name:your-master-tag-name
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-worker-image-name:your-worker-tag-name
    tpuTfVersion: 2.11
    acceleratorConfig:
      type: TPU_V3
      count: 8

If you use the gcloud beta ai-platform jobs submit training command to submit your training job, you can specify the tpuTfVersion API field with the --tpu-tf-version flag instead of in a config.yaml file.
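
For example, assuming the rest of the machine configuration stays in a config.yaml file (with the tpuTfVersion field removed), the submission might look like the following sketch, where JOB_NAME is a placeholder:

gcloud beta ai-platform jobs submit training JOB_NAME \
  --region=us-central1 \
  --config=config.yaml \
  --tpu-tf-version=2.11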

Using TPUClusterResolver after the TPU is provisioned

When you use a custom container, you must wait for the TPU to be provisioned before you can call TPUClusterResolver to use it. The following sample code shows how to handle the TPUClusterResolver logic:

import json
import os
import time

import tensorflow as tf


def wait_for_tpu_cluster_resolver_ready():
  """Waits for `TPUClusterResolver` to be ready and return it.

  Returns:
    A TPUClusterResolver if there is a TPU machine (in TPU_CONFIG). Otherwise,
    return None.
  Raises:
    RuntimeError: if failed to schedule TPU.
  """
  tpu_config_env = os.environ.get('TPU_CONFIG')
  if not tpu_config_env:
    tf.logging.info('Missing TPU_CONFIG, use CPU/GPU for training.')
    return None

  tpu_node = json.loads(tpu_config_env)
  tf.logging.info('Waiting for TPU to be ready: \n%s.', tpu_node)

  num_retries = 40
  for i in range(num_retries):
    try:
      tpu_cluster_resolver = (
          tf.contrib.cluster_resolver.TPUClusterResolver(
              tpu=[tpu_node['tpu_node_name']],
              zone=tpu_node['zone'],
              project=tpu_node['project'],
              job_name='worker'))
      tpu_cluster_resolver_dict = tpu_cluster_resolver.cluster_spec().as_dict()
      if 'worker' in tpu_cluster_resolver_dict:
        tf.logging.info('Found TPU worker: %s', tpu_cluster_resolver_dict)
        return tpu_cluster_resolver
    except Exception as e:
      if i < num_retries - 1:
        tf.logging.info('Still waiting for provisioning of TPU VM instance.')
      else:
        # Preserves the traceback.
        raise RuntimeError('Failed to schedule TPU: {}'.format(e))
    time.sleep(10)

  # Raise error when failed to get TPUClusterResolver after retry.
  raise RuntimeError('Failed to schedule TPU.')
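
A hypothetical caller might use the helper like this; the fallback logic is illustrative only:

# Wait for the TPU worker, then decide whether to train on TPU or CPU/GPU.
tpu_cluster_resolver = wait_for_tpu_cluster_resolver_ready()
if tpu_cluster_resolver is not None:
  master = tpu_cluster_resolver.master()  # gRPC address of the TPU worker.
else:
  master = ''  # No TPU_CONFIG present; fall back to CPU/GPU training.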

Learn more about distributed training with custom containers.

What's next