Tensor Processing Units (TPUs) are Google’s custom-developed ASICs used to
accelerate machine-learning workloads. You can run your training jobs on
Cloud Machine Learning Engine, using Cloud TPU. Cloud ML Engine provides a
job management interface so that you don't need to manage the TPU yourself.
Instead, you can use the Cloud ML Engine
jobs API in the
same way as you use it for training on a CPU or a GPU.
High-level TensorFlow APIs help you get your models running on the Cloud TPU hardware.
Set up and test your GCP environment
Configure your GCP environment by working through the setup section of the getting-started guide.
Authorize your Cloud TPU to access your project
Follow these steps to authorize the Cloud TPU service account name associated with your GCP project:
Get your Cloud TPU service account name by calling
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://ml.googleapis.com/v1/projects/<your-project-id>:getConfig
Save the value of the
tpuServiceAccount field returned by the API.
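If you want to capture that value in a shell variable rather than copying it by hand, a minimal sketch follows. It assumes the API returns JSON with a top-level tpuServiceAccount field; the sample response and account name below are hypothetical, so substitute the real output of the curl call:

```shell
# Hypothetical getConfig response; replace with the actual curl output.
RESPONSE='{"serviceAccount":"service-123@cloud-ml.google.com.iam.gserviceaccount.com","tpuServiceAccount":"service-123@cloud-tpu.iam.gserviceaccount.com"}'

# Extract the tpuServiceAccount field with Python's json module.
SVC_ACCOUNT=$(printf '%s' "$RESPONSE" | python -c 'import json, sys; print(json.load(sys.stdin)["tpuServiceAccount"])')
echo "$SVC_ACCOUNT"
```

If the field is nested differently in your response, adjust the key path accordingly.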
Now add the Cloud TPU service account as a member in your project,
with the role Cloud ML Service Agent. You can complete the following steps in the
Google Cloud Platform Console, or use the gcloud command-line tool:
- Log in to the Google Cloud Platform Console and choose the project in which you’re using the TPU.
- Choose IAM & Admin > IAM.
- Click the Add button to add a member to the project.
- Enter the TPU service account in the Members text box.
- Click the Roles dropdown list.
- Enable the Cloud ML Service Agent role (Service Management > Cloud ML Service Agent).
Alternatively, using the command line, set environment variables containing your project ID and the Cloud TPU service account (the tpuServiceAccount value you saved earlier):

PROJECT_ID=<your-project-id>
SVC_ACCOUNT=<your-tpu-service-account>

Then grant the ml.serviceAgent role to the Cloud TPU service account:

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
For more details about granting roles to service accounts, see the Cloud IAM documentation.
Run the sample ResNet-50 model
This section shows you how to train the reference
TensorFlow ResNet-50 model, using a fake dataset provided at
gs://cloud-tpu-test-datasets/fake_imagenet. The example job uses the
BASIC_TPU scale tier for your machine configuration.
Later sections of the guide show you how to set up a custom configuration.
Run the following commands to get the code and submit your training job on Cloud ML Engine:
Download the code for the reference model:
mkdir tpu-demos && cd tpu-demos
wget https://github.com/tensorflow/tpu/archive/r1.8.tar.gz
tar -xzvf r1.8.tar.gz && rm r1.8.tar.gz
Go to the
official directory within the unzipped directory structure:
Open ./resnet/resnet_main.py and change the code to use explicit relative imports when importing submodules. For example, change this:

import resnet_model

to this:

from . import resnet_model
Use the above pattern to change all other imports in the file, then save the file.
Check for any submodule imports in other files in the sample and update them to use explicit relative imports too.
Set up some environment variables:
JOB_NAME=tpu_1
STAGING_BUCKET=gs://my_bucket_for_staging
REGION=us-central1
DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
OUTPUT_PATH=gs://my_bucket_for_model_output
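Job names must be unique within a project, so if you resubmit you will need a new JOB_NAME each time. One common convention (a sketch, not something the API requires) is a timestamp suffix:

```shell
# Generate a unique job name such as tpu_20180601_120000.
JOB_NAME="tpu_$(date +%Y%m%d_%H%M%S)"
echo "$JOB_NAME"
```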
The following region currently provides access to TPUs:
- us-central1
Submit your training job using the
gcloud ml-engine jobs submit training command:
gcloud ml-engine jobs submit training $JOB_NAME \
  --staging-bucket $STAGING_BUCKET \
  --runtime-version 1.8 \
  --scale-tier BASIC_TPU \
  --module-name resnet.resnet_main \
  --package-path resnet/ \
  --region $REGION \
  -- \
  --data_dir=$DATA_DIR \
  --model_dir=$OUTPUT_PATH
More about training a model on Cloud TPU
The earlier part of this guide shows you how to use the ResNet-50 sample code. This section tells you more about configuring a job and training a model on Cloud ML Engine with Cloud TPU.
Specifying a region that offers TPUs
You need to run your job in a region where TPUs are available. The following region currently provides access to TPUs:
- us-central1
To fully understand the available regions for Cloud ML Engine services, including model training and online/batch prediction, read the guide to regions.
TensorFlow and Cloud ML Engine versioning
Cloud ML Engine runtime versions 1.7 to 1.8 are available for training your models on Cloud TPU. See more about Cloud ML Engine runtime versions and the corresponding TensorFlow versions.
While in Beta, the versioning policy is the same as for Cloud TPU.
Configuring a custom TPU machine
A TPU training job runs on a two-VM configuration. One VM (the master) runs your Python code. The master drives the TensorFlow server running on a TPU worker.
To use a TPU with Cloud ML Engine, configure your training job to access a TPU-enabled machine.
- You can set the scale tier to
BASIC_TPU to get a master VM and a TPU VM including one TPU, as you did when running the sample above.
Alternatively, you can set up a custom machine configuration if you need more computing resources on the master VM:
- Set the scale tier to CUSTOM.
- Configure the machine type for your master to suit your job requirements.
- Configure a worker task to use a
cloud_tpu machine type, to get a TPU VM including one Cloud TPU.
- Note that you must not specify a parameter server when using a
Cloud TPU. The service rejects the job request if
parameterServerCount is greater than zero.
- workerCount must be 1.
See more information about scale tiers and machine types.
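Putting the steps above together, a custom configuration along these lines could be saved as a YAML file and passed to the job with the --config flag. The masterType value here is illustrative; choose one that suits your job:

```yaml
# config.yaml -- hypothetical custom machine configuration for a TPU job
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m   # illustrative master machine type
  workerType: cloud_tpu         # TPU VM including one Cloud TPU
  workerCount: 1                # must be exactly 1 for Cloud TPU
  parameterServerCount: 0       # parameter servers are not allowed with TPUs
```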
Connecting with the TPU gRPC server
If you are using TensorFlow 1.8 or later, you should use TPUClusterResolver to connect with the TPU gRPC server running on the TPU VM. The TPUClusterResolver returns the IP address and port of the Cloud TPU.
Example using TPUClusterResolver:
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    FLAGS.tpu,
    zone=FLAGS.tpu_zone,
    project=FLAGS.gcp_project)

config = tpu_config.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=FLAGS.model_dir,
    save_checkpoints_steps=max(600, FLAGS.iterations_per_loop),
    tpu_config=tpu_config.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_cores,
        per_host_input_for_training=tpu_config.InputPipelineConfig.PER_HOST_V2))
If you're using TensorFlow 1.7, Cloud ML Engine automatically resolves the IP address and port of the TPU gRPC server running on the TPU VM, and passes that information to your program in a master flag. You must define a master flag in your code and pass its value to RunConfig.

Example using the master flag:
from tensorflow.contrib.tpu.python.tpu import tpu_config

config = tpu_config.RunConfig(
    master=FLAGS.master,
    model_dir=FLAGS.model_dir,
    tpu_config=tpu_config.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_cores,
        per_host_input_for_training=tpu_config.InputPipelineConfig.PER_HOST_V2))
You can read more about distributed TensorFlow and the master RPC service in the TensorFlow documentation.
Assigning ops to TPUs
- TPUEstimator handles many of the details of running on TPU devices, such as replicating inputs and models for each core, and returning to host periodically to run hooks.
- The high-level TensorFlow API also provides many other conveniences. In particular, the API takes care of saving and restoring model checkpoints, so that you can resume an interrupted training job at the point at which it stopped.
See the list of TensorFlow operations available on Cloud TPU.