Using TPUs to Train your Model

Tensor Processing Units (TPUs) are Google's custom-developed ASICs used to accelerate machine learning workloads. You can run your training jobs on AI Platform using Cloud TPU. AI Platform provides a job management interface so that you don't need to manage the TPU yourself; instead, you can use the AI Platform jobs API in the same way as you use it for training on a CPU or a GPU.

High-level TensorFlow APIs help you get your models running on the Cloud TPU hardware.

Set up and test your GCP environment

Configure your GCP environment by working through the setup section of the getting-started guide.

Authorize your Cloud TPU to access your project

Follow these steps to authorize the Cloud TPU service account name associated with your GCP project:

  1. Get your Cloud TPU service account name by calling projects.getConfig. Example:

    curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
        https://ml.googleapis.com/v1/projects/<your-project-id>:getConfig
    
  2. Save the value of the tpuServiceAccount field returned by the API.
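
    For example, assuming the API returns pretty-printed JSON with the field on its
    own line, you can capture the value directly into a variable for the later steps:

    SVC_ACCOUNT=$(curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        https://ml.googleapis.com/v1/projects/<your-project-id>:getConfig \
        | grep tpuServiceAccount | cut -d'"' -f4)
    echo $SVC_ACCOUNT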

Now add the Cloud TPU service account as a member in your project, with the role Cloud ML Service Agent. Complete the following steps using either the Google Cloud Platform Console or the gcloud command-line tool:

Console

  1. Log in to the Google Cloud Platform Console and choose the project in which you're using the TPU.
  2. Choose IAM & Admin > IAM.
  3. Click the Add button to add a member to the project.
  4. Enter the TPU service account in the Members text box.
  5. Click the Roles dropdown list.
  6. Enable the Cloud ML Service Agent role (Service Management > Cloud ML Service Agent).

gcloud

  1. Set environment variables containing your project ID and the Cloud TPU service account:

    PROJECT_ID=your-project-id
    SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com
    
  2. Grant the ml.serviceAgent role to the Cloud TPU service account:

    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
    

For more details about granting roles to service accounts, see the Cloud IAM documentation.
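
To confirm that the binding was added, you can inspect the project's IAM policy; for example, the following lists the members that hold the ml.serviceAgent role (a quick check using standard gcloud flags):

    gcloud projects get-iam-policy $PROJECT_ID \
        --flatten="bindings[].members" \
        --filter="bindings.role:roles/ml.serviceAgent" \
        --format="value(bindings.members)"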

Run the sample ResNet-50 model

This section shows you how to train the reference TensorFlow ResNet-50 model, using a fake dataset provided at gs://cloud-tpu-test-datasets/fake_imagenet. The example job uses the predefined BASIC_TPU scale tier for your machine configuration. Later sections of the guide show you how to set up a custom configuration.

Run the following commands to get the code and submit your training job on AI Platform:

  1. Download the code for the reference model:

    mkdir tpu-demos && cd tpu-demos
    wget https://github.com/tensorflow/tpu/archive/r1.12.tar.gz
    tar -xzvf r1.12.tar.gz && rm r1.12.tar.gz
    
  2. Go to the official directory within the unzipped directory structure:

    cd tpu-r1.12/models/official/
    
  3. Edit ./resnet/resnet_main.py and change the code to use explicit relative imports when importing submodules. For example, change this:

    import resnet_model
    

    To this:

    from . import resnet_model
    

    Use the above pattern to change all other imports in the file, then save the file.

  4. Check for any submodule imports in other files in the sample and update them to use explicit relative imports too.
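
    For example, you can list all of the import statements in the sample and review
    which ones refer to the sample's own modules (leave standard-library and
    TensorFlow imports unchanged):

    grep -rn "^import \|^from " ./resnet/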

  5. Set up some environment variables:

    JOB_NAME=tpu_1
    STAGING_BUCKET=gs://my_bucket_for_staging
    REGION=us-central1
    DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
    OUTPUT_PATH=gs://my_bucket_for_model_output
    

    The following regions currently provide access to TPUs:

    • us-central1

  6. Submit your training job using the gcloud ai-platform jobs submit training command:

    gcloud ai-platform jobs submit training $JOB_NAME \
            --staging-bucket $STAGING_BUCKET \
            --runtime-version 1.12 \
            --scale-tier BASIC_TPU \
            --module-name resnet.resnet_main \
            --package-path resnet/ \
            --region $REGION \
            -- \
            --data_dir=$DATA_DIR \
            --model_dir=$OUTPUT_PATH
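
    After the job is submitted, you can check its status and stream its logs from the command line:

    gcloud ai-platform jobs describe $JOB_NAME
    gcloud ai-platform jobs stream-logs $JOB_NAME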
    

More about training a model on Cloud TPU

The earlier part of this guide shows you how to use the ResNet-50 sample code. This section tells you more about configuring a job and training a model on AI Platform with Cloud TPU.

Specifying a region that offers TPUs

You need to run your job in a region where TPUs are available. The following regions currently provide access to TPUs:

  • us-central1

To fully understand the available regions for AI Platform services, including model training and online/batch prediction, read the guide to regions.

TensorFlow and AI Platform versioning

AI Platform runtime versions 1.12 and 1.13 are available for training your models on Cloud TPU. See more about AI Platform runtime versions and the corresponding TensorFlow versions.

The versioning policy is the same as for Cloud TPU. In your training job request, make sure to specify a runtime version that is available for TPUs and matches the TensorFlow version used in your training code.

Connecting with the TPU gRPC server

In your TensorFlow program, use TPUClusterResolver to connect with the TPU gRPC server running on the TPU VM. TPUClusterResolver returns the IP address and port of the Cloud TPU.

The following example shows how the ResNet-50 sample code uses TPUClusterResolver:

import tensorflow as tf
# This snippet assumes tpu_config is imported from TensorFlow's contrib TPU
# package, as in the TF 1.x samples.
from tensorflow.contrib.tpu.python.tpu import tpu_config

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    FLAGS.tpu,
    zone=FLAGS.tpu_zone,
    project=FLAGS.gcp_project)

config = tpu_config.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=max(600, FLAGS.iterations_per_loop),
    tpu_config=tpu_config.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_cores,
        per_host_input_for_training=tpu_config.InputPipelineConfig.PER_HOST_V2))  # pylint: disable=line-too-long

Assigning ops to TPUs

To make use of the TPUs on a machine, you must use the TensorFlow TPUEstimator API, which inherits from the high-level TensorFlow Estimator API.

  • TPUEstimator handles many of the details of running on TPU devices, such as replicating inputs and models for each core, and returning to the host periodically to run hooks.
  • The high-level TensorFlow API also provides many other conveniences. In particular, the API takes care of saving and restoring model checkpoints, so that you can resume an interrupted training job at the point at which it stopped.
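
For illustration, a minimal sketch of wrapping a model function in TPUEstimator might look like the following. This is not the ResNet-50 sample's code; the model_fn, batch size, and learning rate are placeholders, and config is assumed to be the RunConfig shown earlier:

import tensorflow as tf
from tensorflow.contrib import tpu as contrib_tpu

def model_fn(features, labels, mode, params):
  # Build a trivial model; a real model_fn constructs the full network here.
  logits = tf.layers.dense(features, 10)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
  # CrossShardOptimizer aggregates gradients across all TPU cores.
  optimizer = contrib_tpu.CrossShardOptimizer(optimizer)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
  # On TPU, return a TPUEstimatorSpec rather than a regular EstimatorSpec.
  return contrib_tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

estimator = contrib_tpu.TPUEstimator(
    model_fn=model_fn,
    config=config,          # the RunConfig built with TPUClusterResolver above
    use_tpu=True,
    train_batch_size=1024)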

See the list of TensorFlow operations available on Cloud TPU.

Configuring a custom TPU machine

A TPU training job runs on a two-VM configuration. One VM (the master) runs your Python code. The master drives the TensorFlow server running on a TPU worker.

To use a TPU with AI Platform, configure your training job to access a TPU-enabled machine in one of three ways:

  • Use the BASIC_TPU scale tier.
  • Use a cloud_tpu worker and an AI Platform machine type for the master VM.
  • Use a cloud_tpu worker and a Compute Engine machine type for the master VM.

Basic TPU-enabled machine

Set the scale tier to BASIC_TPU to get a master VM and a TPU VM including one TPU, as you did when running the above sample.
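
If you prefer to keep the machine configuration in a file rather than passing --scale-tier on the command line, the equivalent config.yaml contains only the scale tier (a minimal sketch, assuming no other trainingInput overrides are needed):

trainingInput:
  scaleTier: BASIC_TPU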

TPU worker in an AI Platform machine type configuration

Alternatively, you can set up a custom machine configuration if you need more computing resources on the master VM:

  • Set the scale tier to CUSTOM.
  • Configure the master VM to use an AI Platform machine type that suits your job requirements.
  • Set workerType to cloud_tpu, to get a TPU VM including one Cloud TPU.
  • Set workerCount to 1.
  • Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following example shows a config.yaml that uses this type of configuration:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: cloud_tpu
  workerCount: 1
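
Assuming this file is saved as config.yaml, you can reference it when submitting the job and drop the --scale-tier flag; the other flags and environment variables follow the earlier ResNet-50 sample:

    gcloud ai-platform jobs submit training $JOB_NAME \
            --staging-bucket $STAGING_BUCKET \
            --runtime-version 1.12 \
            --config config.yaml \
            --module-name resnet.resnet_main \
            --package-path resnet/ \
            --region $REGION \
            -- \
            --data_dir=$DATA_DIR \
            --model_dir=$OUTPUT_PATH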

TPU worker in a Compute Engine machine type configuration

You can also set up a custom machine configuration with a Compute Engine machine type for your master VM and an acceleratorConfig attached to your TPU VM.

Currently, this provides the same TPU resources (eight TPU v2 cores) as a configuration without an acceleratorConfig. However, using a Compute Engine machine type may provide more flexibility for configuring your master VM. This type of training job configuration is in the Beta launch stage:

  • Set the scale tier to CUSTOM.
  • Configure the master VM to use a Compute Engine machine type that suits your job requirements.
  • Set workerType to cloud_tpu.
  • Add a workerConfig with an acceleratorConfig field. Inside that acceleratorConfig, set type to TPU_V2 and count to 8. You may not attach any other number of TPU v2 cores.
  • Set workerCount to 1.
  • Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following example shows a config.yaml that uses this type of configuration:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerConfig:
    acceleratorConfig:
      type: TPU_V2
      count: 8
  workerCount: 1
