Using TPUs to Train your Model

Tensor Processing Units (TPUs) are Google’s custom-developed ASICs used to accelerate machine-learning workloads. You can run your training jobs on Cloud Machine Learning Engine, using Cloud TPU. Cloud ML Engine provides a job management interface so that you don't need to manage the TPU yourself. Instead, you can use the Cloud ML Engine jobs API in the same way as you use it for training on a CPU or a GPU.

High-level TensorFlow APIs help you get your models running on the Cloud TPU hardware.

Set up and test your GCP environment

Configure your GCP environment by working through the setup section of the getting-started guide.

Authorize your Cloud TPU to access your project

Follow these steps to authorize the Cloud TPU service account associated with your GCP project:

  1. Get your Cloud TPU service account name by calling projects.getConfig. Example:

    curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
        https://ml.googleapis.com/v1/projects/<your-project-id>:getConfig
    
  2. Save the value of the tpuServiceAccount field returned by the API.
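
    The response is a JSON object; an abridged sketch (other fields elided, value shown is a placeholder) looks like this:

    {
      ...
      "tpuServiceAccount": "your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com"
    }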

Now add the Cloud TPU service account as a member in your project, with the role Cloud ML Service Agent. Complete the following steps in the Google Cloud Platform Console or using the gcloud command:

Console

  1. Log in to the Google Cloud Platform Console and choose the project in which you’re using the TPU.
  2. Choose IAM & Admin > IAM.
  3. Click the Add button to add a member to the project.
  4. Enter the TPU service account in the Members text box.
  5. Click the Roles dropdown list.
  6. Enable the Cloud ML Service Agent role (Service Management > Cloud ML Service Agent).

gcloud

  1. Set environment variables containing your project ID and the Cloud TPU service account:

    PROJECT_ID=your-project-id
    SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com
    
  2. Grant the ml.serviceAgent role to the Cloud TPU service account:

    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
    

For more details about granting roles to service accounts, see the Cloud IAM documentation.
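
To confirm that the binding took effect, you can list your project's IAM policy and check that the TPU service account appears with the roles/ml.serviceAgent role:

    gcloud projects get-iam-policy $PROJECT_ID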

Run the sample ResNet-50 model

This section shows you how to train the reference TensorFlow ResNet-50 model, using a fake dataset provided at gs://cloud-tpu-test-datasets/fake_imagenet. The example job uses the predefined BASIC_TPU scale tier for your machine configuration. Later sections of the guide show you how to set up a custom configuration.

Run the following commands to get the code and submit your training job on Cloud ML Engine:

  1. Download the code for the reference model:

    mkdir tpu-demos && cd tpu-demos
    wget https://github.com/tensorflow/tpu/archive/r1.9.tar.gz
    tar -xzvf r1.9.tar.gz && rm r1.9.tar.gz
    
  2. Go to the official directory within the extracted directory structure:

    cd tpu-r1.9/models/official/
    
  3. Edit ./resnet/resnet_main.py and change the code to use explicit relative imports when importing submodules. For example, change this:

    import resnet_model
    

    To this:

    from . import resnet_model
    

    Use the above pattern to change all other imports in the file, then save the file.

  4. Check for any submodule imports in other files in the sample and update them to use explicit relative imports too.

  5. Set up some environment variables:

    JOB_NAME=tpu_1
    STAGING_BUCKET=gs://my_bucket_for_staging
    REGION=us-central1
    DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
    OUTPUT_PATH=gs://my_bucket_for_model_output
    

    The following regions currently provide access to TPUs:

    • us-central1
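
    Note that the staging bucket and the output bucket must already exist. If you need to create them, one option (using the placeholder names set above) is gsutil:

    gsutil mb -l $REGION $STAGING_BUCKET
    gsutil mb -l $REGION $OUTPUT_PATH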

  6. Submit your training job using the gcloud ml-engine jobs submit training command:

    gcloud ml-engine jobs submit training $JOB_NAME \
            --staging-bucket $STAGING_BUCKET \
            --runtime-version 1.9 \
            --scale-tier BASIC_TPU \
            --module-name resnet.resnet_main \
            --package-path resnet/ \
            --region $REGION \
            -- \
            --data_dir=$DATA_DIR \
            --model_dir=$OUTPUT_PATH
    
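  7. Optionally, monitor the job while it runs. One convenient way is to stream the job's logs with the gcloud ml-engine jobs stream-logs command:

    gcloud ml-engine jobs stream-logs $JOB_NAME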

More about training a model on Cloud TPU

The earlier part of this guide shows you how to use the ResNet-50 sample code. This section tells you more about configuring a job and training a model on Cloud ML Engine with Cloud TPU.

Specifying a region that offers TPUs

You need to run your job in a region where TPUs are available. The following regions currently provide access to TPUs:

  • us-central1

To fully understand the available regions for Cloud ML Engine services, including model training and online/batch prediction, read the guide to regions.

TensorFlow and Cloud ML Engine versioning

Cloud ML Engine runtime versions 1.8 to 1.9 are available for training your models on Cloud TPU. See more about Cloud ML Engine runtime versions and the corresponding TensorFlow versions.

While this feature is in Beta, its versioning policy is the same as Cloud TPU's.

Configuring a custom TPU machine

A TPU training job runs on a two-VM configuration. One VM (the master) runs your Python code. The master drives the TensorFlow server running on a TPU worker.

To use a TPU with Cloud ML Engine, configure your training job to access a TPU-enabled machine.

  • You can set the scale tier to BASIC_TPU to get a master VM and a TPU VM that includes one Cloud TPU, as you did when running the sample above.
  • Alternatively, you can set up a custom machine configuration if you need more computing resources on the master VM (an example configuration follows this list):

    • Set the scale tier to CUSTOM.
    • Configure the machine type for your master to suit your job requirements.
    • Configure a worker task to use a cloud_tpu machine type, which gives you a TPU VM that includes one Cloud TPU.
    • Do not specify a parameter server when using a Cloud TPU; the service rejects the job request if parameterServerCount is greater than zero.
    • Set workerCount to 1.
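
For example, a minimal config.yaml for a custom TPU configuration might look like the following sketch (the masterType shown is only an illustration; pick one that suits your job):

    trainingInput:
      scaleTier: CUSTOM
      masterType: standard
      workerType: cloud_tpu
      workerCount: 1

You can pass this file to job submission with the --config flag of gcloud ml-engine jobs submit training.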

See more information about scale tiers and machine types.

Connecting with the TPU gRPC server

In your TensorFlow program, use TPUClusterResolver to connect to the TPU gRPC server running on the TPU VM. TPUClusterResolver returns the IP address and port of the Cloud TPU.

Example using TPUClusterResolver (the added imports assume the contrib namespace used in TensorFlow 1.9):

import tensorflow as tf
from tensorflow.contrib.tpu.python.tpu import tpu_config

# Resolve the address of the Cloud TPU's gRPC endpoint.
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    FLAGS.tpu,
    zone=FLAGS.tpu_zone,
    project=FLAGS.gcp_project)

# Point the estimator's RunConfig at the resolved TPU cluster.
config = tpu_config.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=max(600, FLAGS.iterations_per_loop),
    tpu_config=tpu_config.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_cores,
        per_host_input_for_training=tpu_config.InputPipelineConfig.PER_HOST_V2))

Assigning ops to TPUs

To make use of the TPUs on a machine, you must use the TensorFlow TPUEstimator API, which inherits from the high-level TensorFlow Estimator API.

  • TPUEstimator handles many of the details of running on TPU devices, such as replicating inputs and models for each core and returning control to the host periodically to run hooks (see the sketch after this list).
  • The high-level TensorFlow API also provides many other conveniences. In particular, the API takes care of saving and restoring model checkpoints, so that you can resume an interrupted training job at the point at which it stopped.
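
As an illustration, a minimal sketch of constructing and running a TPUEstimator with the RunConfig built above might look like this (my_model_fn, my_input_fn, and FLAGS.train_steps are placeholders you supply; the import assumes TensorFlow 1.9's contrib namespace):

from tensorflow.contrib.tpu.python.tpu import tpu_estimator

# model_fn must return a TPUEstimatorSpec when use_tpu=True.
estimator = tpu_estimator.TPUEstimator(
    model_fn=my_model_fn,
    config=config,                 # the tpu_config.RunConfig built above
    use_tpu=True,
    train_batch_size=1024)         # global batch size, sharded across TPU cores

estimator.train(input_fn=my_input_fn, max_steps=FLAGS.train_steps)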

See the list of TensorFlow operations available on Cloud TPU.
