Using TPUs to train your model

Tensor Processing Units (TPUs) are Google’s custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. You can run your training jobs on AI Platform using Cloud TPU. AI Platform provides a job management interface so that you don't need to manage the TPU yourself. Instead, you can use the AI Platform jobs API in the same way as you use it for training on a CPU or a GPU.

High-level TensorFlow APIs help you get your models running on the Cloud TPU hardware.

Set up and test your Google Cloud environment

Configure your Google Cloud environment by working through the setup section of the getting-started guide.

Authorize your Cloud TPU to access your project

Follow these steps to authorize the Cloud TPU service account associated with your Google Cloud project:

  1. Get your Cloud TPU service account name by calling projects.getConfig. Example:

    curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
        https://ml.googleapis.com/v1/projects/<your-project-id>:getConfig
    
  2. Save the value of the tpuServiceAccount field returned by the API.
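
If you prefer to script this step, you can make the same projects.getConfig call with the Google API client library for Python. The following is a minimal sketch; the library choice and credentials setup are assumptions beyond what this guide covers, and it requires the google-api-python-client package plus application-default credentials:

# Calls projects.getConfig through the AI Platform Training and Prediction API
# and prints the response; note the tpuServiceAccount value in the output.
# Replace your-project-id with your own project ID.
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
response = ml.projects().getConfig(name='projects/your-project-id').execute()
print(response)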

Now add the Cloud TPU service account as a member of your project, with the Cloud ML Service Agent role. Complete the following steps in the Google Cloud Console or by using the gcloud command-line tool:

Console

  1. Log in to the Google Cloud Console and choose the project in which you're using the TPU.
  2. Choose IAM & Admin > IAM.
  3. Click the Add button to add a member to the project.
  4. Enter the TPU service account in the Members text box.
  5. Click the Roles dropdown list.
  6. Select the Cloud ML Service Agent role (Service Management > Cloud ML Service Agent).

gcloud

  1. Set environment variables containing your project ID and the Cloud TPU service account:

    PROJECT_ID=your-project-id
    SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com
    
  2. Grant the ml.serviceAgent role to the Cloud TPU service account:

    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
    

For more details about granting roles to service accounts, see the Cloud IAM documentation.

Run the sample ResNet-50 model

This section shows you how to train the reference TensorFlow ResNet-50 model, using a fake dataset provided at gs://cloud-tpu-test-datasets/fake_imagenet. The example job uses the predefined BASIC_TPU scale tier for your machine configuration. Later sections of the guide show you how to set up a custom configuration.

Run the following commands to get the code and submit your training job on AI Platform:

  1. Download the code for the reference model:

    mkdir tpu-demos && cd tpu-demos
    wget https://github.com/tensorflow/tpu/archive/r1.14.tar.gz
    tar -xzvf r1.14.tar.gz && rm r1.14.tar.gz
    
  2. Go to the official directory within the extracted directory structure:

    cd tpu-r1.14/models/official/
    
  3. Edit ./resnet/resnet_main.py and change the code to use explicit relative imports when importing submodules. For example, change this:

    import resnet_model
    

    To this:

    from . import resnet_model
    

    Use the same pattern to change all the other submodule imports in the file, then save the file.

  4. Check for any submodule imports in other files in the sample and update them to use explicit relative imports too.

  5. Set up some environment variables:

    JOB_NAME=tpu_1
    STAGING_BUCKET=gs://my_bucket_for_staging
    REGION=us-central1
    DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
    OUTPUT_PATH=gs://my_bucket_for_model_output
    

    The following regions currently provide access to TPUs:

    • us-central1

  6. Submit your training job using the gcloud ai-platform jobs submit training command:

    gcloud ai-platform jobs submit training $JOB_NAME \
            --staging-bucket $STAGING_BUCKET \
            --runtime-version 1.14 \
            --scale-tier BASIC_TPU \
            --module-name resnet.resnet_main \
            --package-path resnet/ \
            --region $REGION \
            -- \
            --data_dir=$DATA_DIR \
            --model_dir=$OUTPUT_PATH
    

More about training a model on Cloud TPU

The earlier part of this guide shows you how to use the ResNet-50 sample code. This section tells you more about configuring a job and training a model on AI Platform with Cloud TPU.

Specifying a region that offers TPUs

You need to run your job in a region where TPUs are available. The following regions currently provide access to TPUs:

  • us-central1

To fully understand the available regions for AI Platform services, including model training and online/batch prediction, read the guide to regions.

TensorFlow and AI Platform versioning

AI Platform runtime versions 1.13 and 1.14 are available for training your models on Cloud TPU. For more information, see AI Platform runtime versions and the corresponding TensorFlow versions.

The versioning policy is the same as for Cloud TPU. In your training job request, make sure to specify a runtime version that is available for TPUs and matches the TensorFlow version used in your training code.

Connecting with the TPU gRPC server

In your TensorFlow program, you should use TPUClusterResolver to connect with the TPU gRPC server running on the TPU VM. The TPUClusterResolver returns the IP address and port of the Cloud TPU.

The following example shows how the ResNet-50 sample code uses TPUClusterResolver:

# Excerpt from the ResNet-50 sample. Here, tpu_config refers to TensorFlow's
# TPU configuration module; in TensorFlow 1.14 the same classes are available
# as tf.contrib.tpu.RunConfig, tf.contrib.tpu.TPUConfig, and
# tf.contrib.tpu.InputPipelineConfig.
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    FLAGS.tpu,
    zone=FLAGS.tpu_zone,
    project=FLAGS.gcp_project)

# The resolver is passed to the RunConfig so that TPUEstimator can locate and
# initialize the Cloud TPU before training starts.
config = tpu_config.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=max(600, FLAGS.iterations_per_loop),
    tpu_config=tpu_config.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_cores,
        per_host_input_for_training=tpu_config.InputPipelineConfig.PER_HOST_V2))

Assigning ops to TPUs

To make use of the TPUs on a machine, you must use the TensorFlow TPUEstimator API, which inherits from the high-level TensorFlow Estimator API.

  • TPUEstimator handles many of the details of running on TPU devices, such as replicating inputs and models for each core and returning to the host periodically to run hooks.
  • The high-level TensorFlow API also provides many other conveniences. In particular, the API takes care of saving and restoring model checkpoints, so that you can resume an interrupted training job at the point at which it stopped.

See the list of TensorFlow operations available on Cloud TPU.
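
To show how these pieces fit together, the following minimal sketch constructs a TPUEstimator with a TPU RunConfig like the one in the ResNet-50 excerpt above. The model_fn, input_fn, TPU name, and output path are placeholders for illustration, not code from the sample:

import tensorflow as tf  # TensorFlow 1.14, matching the runtime version used above


def model_fn(features, labels, mode, params):
    """Placeholder model_fn; covers only TRAIN mode for brevity."""
    logits = tf.layers.dense(features, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    # CrossShardOptimizer aggregates gradients across the TPU cores.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)


def input_fn(params):
    """Placeholder input_fn; TPUEstimator supplies params['batch_size']."""
    features = tf.random.uniform([1024, 64])
    labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset.repeat().batch(params['batch_size'], drop_remainder=True)


# On AI Platform the resolver arguments normally come from flags, as in the
# excerpt above; the TPU name and output path here are placeholders.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='your-tpu-name')
run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my_bucket_for_model_output/sketch',
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024)  # global batch size, sharded across the TPU cores

estimator.train(input_fn=input_fn, max_steps=1000)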

Configuring a custom TPU machine

A TPU training job runs on a two-VM configuration. One VM (the master) runs your Python code. The master drives the TensorFlow server running on a TPU worker.

To use a TPU with AI Platform, configure your training job to access a TPU-enabled machine in one of three ways:

  • Use the BASIC_TPU scale tier. You can use this method to access TPU v2 accelerators.
  • Use a cloud_tpu worker and a legacy machine type for the master VM. You can use this method to access TPU v2 accelerators.
  • Use a cloud_tpu worker and a Compute Engine machine type for the master VM. You can use this method to access TPU v2 or TPU v3 accelerators. TPU v3 accelerators are available in beta.

Basic TPU-enabled machine

Set the scale tier to BASIC_TPU to get a master VM and a TPU VM including one TPU with eight TPU v2 cores, as you did when running the previous sample.

TPU worker in a legacy machine type configuration

Alternatively, you can set up a custom machine configuration if you need more computing resources on the master VM:

  • Set the scale tier to CUSTOM.
  • Configure the master VM to use a legacy machine type that suits your job requirements.
  • Set workerType to cloud_tpu, to get a TPU VM including one Cloud TPU with eight TPU v2 cores.
  • Set workerCount to 1.
  • Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following example shows a config.yaml file that uses this type of configuration:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: cloud_tpu
  workerCount: 1

TPU worker in a Compute Engine machine type configuration

You can also set up a custom machine configuration with a Compute Engine machine type for your master VM and an acceleratorConfig attached to your TPU VM.

You can use this type of configuration to set up a TPU worker with eight TPU v2 cores (similar to a configuration without an acceleratorConfig) or a TPU worker with eight TPU v3 cores (beta). Read more about the difference between TPU v2 and TPU v3 accelerators.

Using a Compute Engine machine type also provides more flexibility for configuring your master VM:

  • Set the scale tier to CUSTOM.
  • Configure the master VM to use a Compute Engine machine type that suits your job requirements.
  • Set workerType to cloud_tpu.
  • Add a workerConfig with an acceleratorConfig field. Inside that acceleratorConfig, set type to TPU_V2 or TPU_V3 and count to 8. You may not attach any other number of TPU cores.
  • Set workerCount to 1.
  • Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following example shows a config.yaml file that uses this type of configuration:

TPU v2

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V2
      count: 8

TPU v3 (beta)

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V3
      count: 8
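
The fields under trainingInput in these files correspond to the same fields in the AI Platform training API, so you can also submit an equivalent job programmatically. The following is a minimal sketch using the google-api-python-client library (an assumption, as are the job name, bucket paths, and package URI shown); note that in the REST representation the int64 count fields are strings:

from googleapiclient import discovery  # requires google-api-python-client

ml = discovery.build('ml', 'v1')

# Programmatic equivalent of the TPU v2 config.yaml above, plus the fields
# that the gcloud command otherwise fills in. All names and paths are
# placeholders.
job = {
    'jobId': 'tpu_custom_1',
    'trainingInput': {
        'scaleTier': 'CUSTOM',
        'masterType': 'n1-highcpu-16',
        'workerType': 'cloud_tpu',
        'workerCount': '1',
        'workerConfig': {
            'acceleratorConfig': {'type': 'TPU_V2', 'count': '8'},
        },
        'region': 'us-central1',
        'runtimeVersion': '1.14',
        'pythonModule': 'resnet.resnet_main',
        'packageUris': ['gs://my_bucket_for_staging/packages/resnet-0.0.0.tar.gz'],
        'args': ['--data_dir=gs://cloud-tpu-test-datasets/fake_imagenet',
                 '--model_dir=gs://my_bucket_for_model_output'],
    },
}

ml.projects().jobs().create(parent='projects/your-project-id', body=job).execute()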

Using a custom container on a TPU worker

If you want to run a custom container on your TPU worker instead of using one of the AI Platform runtime versions that support TPUs, you must specify an additional configuration field when you submit your training job. Set the tpuTfVersion field to a runtime version that includes the version of TensorFlow that your container uses. The runtime version you specify must be one that is currently supported for training with TPUs.

Because you are configuring your job to use a custom container, AI Platform Training doesn't use this runtime version's environment when it runs your training job. However, AI Platform Training requires this field so it can properly prepare the TPU worker for the version of TensorFlow that your custom container uses.

The following example shows a config.yaml file with a similar TPU configuration to the one from the previous section, except in this case the master VM and the TPU worker each run different custom containers:

TPU v2

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/your-project-id/your-master-image-name:your-master-tag-name
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/your-project-id/your-worker-image-name:your-worker-tag-name
    tpuTfVersion: '1.14'
    acceleratorConfig:
      type: TPU_V2
      count: 8

TPU v3 (beta)

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/your-project-id/your-master-image-name:your-master-tag-name
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/your-project-id/your-worker-image-name:your-worker-tag-name
    tpuTfVersion: '1.14'
    acceleratorConfig:
      type: TPU_V3
      count: 8

If you use the gcloud beta ai-platform jobs submit training command to submit your training job, you can specify the tpuTfVersion API field with the --tpu-tf-version flag instead of in a config.yaml file.

Learn more about distributed training with custom containers.
