Tensor Processing Units (TPUs) are Google's custom-developed ASICs used to accelerate machine-learning workloads. You can run your training jobs on AI Platform Training using Cloud TPU. AI Platform Training provides a job management interface so that you don't need to manage the TPU yourself. Instead, you can use the AI Platform Training jobs API in the same way as you use it for training on a CPU or a GPU. High-level TensorFlow APIs help you get your models running on the Cloud TPU hardware.
Setting up your Google Cloud environment
Configure your Google Cloud environment by working through the setup section of the getting-started guide.
Authorizing your Cloud TPU to access your project
Follow these steps to authorize the Cloud TPU service account name associated with your Google Cloud project:
Get your Cloud TPU service account name by calling projects.getConfig. Example:

PROJECT_ID=PROJECT_ID
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  https://ml.googleapis.com/v1/projects/$PROJECT_ID:getConfig
Save the values of the serviceAccountProject and tpuServiceAccount fields returned by the API.

Initialize the Cloud TPU service account:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" -d '{}' \
  https://serviceusage.googleapis.com/v1beta1/projects/<serviceAccountProject>/services/tpu.googleapis.com:generateServiceIdentity
Now add the Cloud TPU service account as a member in your project, with the role Cloud ML Service Agent. Complete the following steps in the Google Cloud console or using the gcloud command:
Console
- Log in to the Google Cloud console and choose the project in which you're using the TPU.
- Choose IAM & Admin > IAM.
- Click the Add button to add a member to the project.
- Enter the TPU service account in the Members text box.
- Click the Roles dropdown list.
- Select the Cloud ML Service Agent role (Service Agents > Cloud ML Service Agent).
gcloud
Set environment variables containing your project ID and the Cloud TPU service account:

PROJECT_ID=PROJECT_ID
SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com
Grant the ml.serviceAgent role to the Cloud TPU service account:

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
For more details about granting roles to service accounts, see the IAM documentation.
Example: Training a sample MNIST model
This section shows you how to train a sample MNIST model using a TPU and runtime version 2.11. The example job uses the predefined BASIC_TPU scale tier for your machine configuration. Later sections of the guide show you how to set up a custom configuration.
This example assumes you are using a Bash shell with the gcloud CLI installed. Run the following commands to get the code and submit your training job to AI Platform Training:
Download the code for TensorFlow's reference models and navigate to the directory with the sample code:

git clone https://github.com/tensorflow/models.git \
  --branch=v2.11.0 \
  --depth=1
cd models
Create a setup.py file in the models directory. This ensures that the gcloud ai-platform jobs submit training command includes all the necessary subpackages within the models/official directory when it creates a tarball of your training code, and it ensures that AI Platform Training installs TensorFlow Datasets as a dependency when it runs the training job. This training code relies on TensorFlow Datasets to load the MNIST data.

To create the setup.py file, run the following command in your shell:

cat << END > setup.py
from setuptools import find_packages
from setuptools import setup

setup(
    name='official',
    install_requires=[
        'tensorflow-datasets~=3.1',
        'tensorflow-model-optimization>=0.4.1'
    ],
    packages=find_packages()
)
END
Submit your training job using the gcloud ai-platform jobs submit training command:

gcloud ai-platform jobs submit training tpu_mnist_1 \
  --staging-bucket=gs://BUCKET_NAME \
  --package-path=official \
  --module-name=official.vision.image_classification.mnist_main \
  --runtime-version=2.11 \
  --python-version=3.7 \
  --scale-tier=BASIC_TPU \
  --region=us-central1 \
  -- \
  --distribution_strategy=tpu \
  --data_dir=gs://tfds-data/datasets \
  --model_dir=gs://BUCKET_NAME/tpu_mnist_1_output
Replace BUCKET_NAME with the name of a Cloud Storage bucket in your Google Cloud project. The gcloud CLI uploads your packaged training code to this bucket, and AI Platform Training saves training output in the bucket.
Monitor your training job. When the job has completed, you can view its output in the gs://BUCKET_NAME/tpu_mnist_1_output directory.
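As a quick way to monitor the job from your shell, you can use the gcloud CLI; the commands below assume the job name tpu_mnist_1 used in the previous step:

gcloud ai-platform jobs describe tpu_mnist_1
gcloud ai-platform jobs stream-logs tpu_mnist_1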
More about training a model on Cloud TPU
This section tells you more about configuring a job and training a model on AI Platform Training with Cloud TPU.
Specifying a region that offers TPUs
You need to run your job in a region where TPUs are available. The following regions currently provide access to TPUs:
us-central1
europe-west4
To fully understand the available regions for AI Platform Training services, including model training and online/batch prediction, read the guide to regions.
TensorFlow and AI Platform Training versioning
AI Platform Training runtime versions 1.15, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, and 2.11 are available for training your models on Cloud TPU. See more about AI Platform Training runtime versions and the corresponding TensorFlow versions.
The versioning policy is the same as for Cloud TPU. In your training job request, make sure to specify a runtime version that is available for TPUs and matches the TensorFlow version used in your training code.
Connecting to the TPU gRPC server
In your TensorFlow program, use TPUClusterResolver to connect with the TPU gRPC server running on the TPU VM. The TensorFlow guide to using TPUs shows how to use TPUClusterResolver with the TPUStrategy distribution strategy.
However, you must make one important change when you use TPUClusterResolver for code that runs on AI Platform Training: Do not provide any arguments when you construct the TPUClusterResolver instance. When the tpu, zone, and project keyword arguments are all set to their default value of None, AI Platform Training automatically provides the cluster resolver with the necessary connection details through environment variables.
The following TensorFlow 2 example shows how to initialize a cluster resolver and a distribution strategy for training on AI Platform Training:
import tensorflow as tf
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
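Continuing from the initialization above, build and compile your model inside the strategy's scope so that its variables are placed and replicated on the TPU cores. The following is a minimal sketch; the model architecture is a placeholder, not part of the sample above:

# Build and compile the model inside the strategy scope so that its
# variables are replicated across the eight TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])

# model.fit(...) can then be called as usual; input data should live in a
# location the TPU can read, such as Cloud Storage.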
Using TPUs in TensorFlow code
To make use of the TPUs on a machine, use TensorFlow 2's TPUStrategy API. The TensorFlow guide to using TPUs shows how to do this.

To train with TPUs in TensorFlow 1, you can use the TPUEstimator API instead. The Cloud TPU guide to the TPUEstimator API shows how to do this.
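For orientation, a TF 1.x TPUEstimator setup typically looks like the following minimal sketch; model_fn, the bucket path, and the batch size are placeholders, not values prescribed by this guide:

import tensorflow as tf

# Construct the resolver with no arguments, as described above, so that
# AI Platform Training supplies the connection details.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()

run_config = tf.estimator.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://BUCKET_NAME/output',  # placeholder bucket
    tpu_config=tf.estimator.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.estimator.tpu.TPUEstimator(
    model_fn=model_fn,  # placeholder: your TPU-compatible model function
    config=run_config,
    use_tpu=True,
    train_batch_size=1024)  # placeholder batch size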
The Cloud TPU documentation also provides a list of low-level TensorFlow operations available on Cloud TPU.
Using TPUs in PyTorch code
To make use of a TPU when you use a pre-built PyTorch
container, use the
torch_xla
package. Learn how to use torch_xla
for TPU in training in the
PyTorch documentation. For more examples
of using torch_xla
, see the tutorials in the PyTorch XLA GitHub
repository
Note that when you train using a TPU on AI Platform Training, you are using a single XLA device, not multiple XLA devices.
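For orientation, a single-device torch_xla training step typically looks like the following minimal sketch; the model, data loader, and loss function are placeholders, not part of this guide:

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()        # the single XLA device backed by the TPU
model = MyModel().to(device)    # MyModel is a placeholder nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for data, target in train_loader:   # train_loader is a placeholder DataLoader
    optimizer.zero_grad()
    output = model(data.to(device))
    loss = F.nll_loss(output, target.to(device))
    loss.backward()
    # optimizer_step executes the pending XLA graph; barrier=True is needed
    # when not using torch_xla's parallel loaders.
    xm.optimizer_step(optimizer, barrier=True)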
See also the following section on this page about configuring your training job for PyTorch and TPU.
Configuring a custom TPU machine
A TPU training job runs on a two-VM configuration. One VM (the master) runs your Python code. The master drives the TensorFlow server running on a TPU worker.
To use a TPU with AI Platform Training, configure your training job to access a TPU-enabled machine in one of three ways:
- Use the BASIC_TPU scale tier. You can use this method to access TPU v2 accelerators.
- Use a cloud_tpu worker and a legacy machine type for the master VM. You can use this method to access TPU v2 accelerators.
- Use a cloud_tpu worker and a Compute Engine machine type for the master VM. You can use this method to access TPU v2 or TPU v3 accelerators. TPU v3 accelerators are available in beta.
Basic TPU-enabled machine
Set the scale tier to BASIC_TPU to get a master VM and a TPU VM including one TPU with eight TPU v2 cores, as you did when running the previous example.
TPU worker in a legacy machine type configuration
Alternatively, you can set up a custom machine configuration if you need more computing resources on the master VM:
- Set the scale tier to CUSTOM.
- Configure the master VM to use a legacy machine type that suits your job requirements.
- Set workerType to cloud_tpu to get a TPU VM including one Cloud TPU with eight TPU v2 cores.
- Set workerCount to 1.
- Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.
The following example shows a config.yaml file that uses this type of configuration:
trainingInput:
scaleTier: CUSTOM
masterType: complex_model_m
workerType: cloud_tpu
workerCount: 1
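You can then pass this file to the job submission command with the --config flag. The following sketch assumes a placeholder job name and a generic trainer package rather than the MNIST sample above:

gcloud ai-platform jobs submit training JOB_NAME \
  --staging-bucket=gs://BUCKET_NAME \
  --package-path=trainer \
  --module-name=trainer.task \
  --runtime-version=2.11 \
  --python-version=3.7 \
  --region=us-central1 \
  --config=config.yaml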
TPU worker in a Compute Engine machine type configuration
You can also set up a custom machine configuration with a Compute Engine machine type for your master VM and an acceleratorConfig attached to your TPU VM.

You can use this type of configuration to set up a TPU worker with eight TPU v2 cores (similar to a configuration without an acceleratorConfig) or a TPU worker with eight TPU v3 cores (beta). Read more about the difference between TPU v2 and TPU v3 accelerators.
Using a Compute Engine machine type also provides more flexibility for configuring your master VM:
- Set the scale tier to CUSTOM.
- Configure the master VM to use a Compute Engine machine type that suits your job requirements.
- Set workerType to cloud_tpu.
- Add a workerConfig with an acceleratorConfig field. Inside that acceleratorConfig, set type to TPU_V2 or TPU_V3 and count to 8. You may not attach any other number of TPU cores.
- Set workerCount to 1.
- Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.
The following example shows a config.yaml file that uses this type of configuration:
TPU v2
trainingInput:
scaleTier: CUSTOM
masterType: n1-highcpu-16
workerType: cloud_tpu
workerCount: 1
workerConfig:
acceleratorConfig:
type: TPU_V2
count: 8
TPU v3 (beta)
trainingInput:
scaleTier: CUSTOM
masterType: n1-highcpu-16
workerType: cloud_tpu
workerCount: 1
workerConfig:
acceleratorConfig:
type: TPU_V3
count: 8
Using TPU Pods
A TPU Pod is a collection of TPU devices connected by dedicated high-speed network interfaces. A TPU Pod can have up to 2,048 TPU cores, allowing you to distribute the processing load across multiple TPUs.
To use TPU Pods, you must first file a quota increase request.
The following example config.yaml files show how to use TPU Pods:
TPU v2 Pods
trainingInput:
scaleTier: CUSTOM
masterType: n1-highcpu-16
workerType: cloud_tpu
workerCount: 1
workerConfig:
acceleratorConfig:
type: TPU_V2_POD
count: 128
TPU v3 Pods
trainingInput:
scaleTier: CUSTOM
masterType: n1-highcpu-16
workerType: cloud_tpu
workerCount: 1
workerConfig:
acceleratorConfig:
type: TPU_V3_POD
count: 32
The number of Pod cores you can use is limited for each TPU type. The following configurations are available:
TPU Pod Type | Number of Pod cores available to use
---|---
TPU_V2_POD | 32, 128, 256, 512
TPU_V3_POD | 32, 128, 256
For more details on how to make full use of TPU Pod cores, see the Cloud TPU documentation about TPU Pods.
Using a pre-built PyTorch container on a TPU worker
If you want to perform PyTorch training with a TPU, then you must specify the tpuTfVersion field in your training job's trainingInput. Set the tpuTfVersion to match the version of the pre-built PyTorch container that you are using for training.
AI Platform Training supports training with TPUs for the following pre-built PyTorch containers:
Container image URI | tpuTfVersion
---|---
gcr.io/cloud-ml-public/training/pytorch-xla.1-11 | pytorch-1.11
gcr.io/cloud-ml-public/training/pytorch-xla.1-10 | pytorch-1.10
gcr.io/cloud-ml-public/training/pytorch-xla.1-9 | pytorch-1.9
gcr.io/cloud-ml-public/training/pytorch-xla.1-7 | pytorch-1.7
gcr.io/cloud-ml-public/training/pytorch-xla.1-6 | pytorch-1.6
For example, to train using the PyTorch 1.11 pre-built container, you might use the following config.yaml file to configure training:
trainingInput:
scaleTier: CUSTOM
masterType: n1-highcpu-16
masterConfig:
imageUri: gcr.io/cloud-ml-public/training/pytorch-xla.1-11
workerType: cloud_tpu
workerCount: 1
workerConfig:
imageUri: gcr.io/cloud-ml-public/training/pytorch-xla.1-11
tpuTfVersion: pytorch-1.11
acceleratorConfig:
type: TPU_V2
count: 8
See also the previous section on this page about Using TPUs in PyTorch code.
Using a custom container on a TPU worker
If you want to run a custom container on your TPU worker instead of using one of the AI Platform Training runtime versions that support TPUs, you must specify an additional configuration field when you submit your training job. Set the tpuTfVersion to a runtime version that includes the version of TensorFlow that your container uses. You must specify a runtime version currently supported for training with TPUs.
Because you are configuring your job to use a custom container, AI Platform Training doesn't use this runtime version's environment when it runs your training job. However, AI Platform Training requires this field so it can properly prepare the TPU worker for the version of TensorFlow that your custom container uses.
The following example shows a config.yaml file with a similar TPU configuration to the one from the previous section, except in this case the master VM and the TPU worker each run different custom containers:
TPU v2
trainingInput:
scaleTier: CUSTOM
masterType: n1-highcpu-16
masterConfig:
imageUri: gcr.io/YOUR_PROJECT_ID/your-master-image-name:your-master-tag-name
workerType: cloud_tpu
workerCount: 1
workerConfig:
imageUri: gcr.io/YOUR_PROJECT_ID/your-worker-image-name:your-worker-tag-name
tpuTfVersion: 2.11
acceleratorConfig:
type: TPU_V2
count: 8
TPU v3 (beta)
trainingInput:
scaleTier: CUSTOM
masterType: n1-highcpu-16
masterConfig:
imageUri: gcr.io/YOUR_PROJECT_ID/your-master-image-name:your-master-tag-name
workerType: cloud_tpu
workerCount: 1
workerConfig:
imageUri: gcr.io/YOUR_PROJECT_ID/your-worker-image-name:your-worker-tag-name
tpuTfVersion: 2.11
acceleratorConfig:
type: TPU_V3
count: 8
If you use the gcloud beta ai-platform jobs submit training command to submit your training job, you can specify the tpuTfVersion API field with the --tpu-tf-version flag instead of in a config.yaml file.
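For example, assuming a config.yaml like the one above but with the tpuTfVersion field omitted, the submission might look like this (JOB_NAME and the bucket are placeholders):

gcloud beta ai-platform jobs submit training JOB_NAME \
  --staging-bucket=gs://BUCKET_NAME \
  --region=us-central1 \
  --config=config.yaml \
  --tpu-tf-version=2.11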
Using TPUClusterResolver after the TPU is provisioned
When using a custom container, you must wait for the TPU to be provisioned before you can call TPUClusterResolver to use it. The following sample code shows how to handle the TPUClusterResolver logic:
import json
import os
import time

import tensorflow as tf


def wait_for_tpu_cluster_resolver_ready():
  """Waits for `TPUClusterResolver` to be ready and return it.

  Returns:
    A TPUClusterResolver if there is TPU machine (in TPU_CONFIG). Otherwise,
    return None.
  Raises:
    RuntimeError: if failed to schedule TPU.
  """
  tpu_config_env = os.environ.get('TPU_CONFIG')
  if not tpu_config_env:
    tf.logging.info('Missing TPU_CONFIG, use CPU/GPU for training.')
    return None

  tpu_node = json.loads(tpu_config_env)
  tf.logging.info('Waiting for TPU to be ready: \n%s.', tpu_node)

  num_retries = 40
  for i in range(num_retries):
    try:
      tpu_cluster_resolver = (
          tf.contrib.cluster_resolver.TPUClusterResolver(
              tpu=[tpu_node['tpu_node_name']],
              zone=tpu_node['zone'],
              project=tpu_node['project'],
              job_name='worker'))
      tpu_cluster_resolver_dict = tpu_cluster_resolver.cluster_spec().as_dict()
      if 'worker' in tpu_cluster_resolver_dict:
        tf.logging.info('Found TPU worker: %s', tpu_cluster_resolver_dict)
        return tpu_cluster_resolver
    except Exception as e:
      if i < num_retries - 1:
        tf.logging.info('Still waiting for provisioning of TPU VM instance.')
      else:
        # Preserves the traceback.
        raise RuntimeError('Failed to schedule TPU: {}'.format(e))
    time.sleep(10)

  # Raise error when failed to get TPUClusterResolver after retry.
  raise RuntimeError('Failed to schedule TPU.')
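A minimal sketch of how a trainer entry point might use this helper; the two training functions are placeholders, not part of the sample:

def main():
  resolver = wait_for_tpu_cluster_resolver_ready()
  if resolver is None:
    # No TPU_CONFIG in the environment: fall back to CPU/GPU training.
    train_without_tpu()  # placeholder for your non-TPU code path
  else:
    # Pass the ready resolver to your TPU-aware setup, for example a
    # TF 1.x RunConfig: tf.estimator.tpu.RunConfig(cluster=resolver, ...)
    train_with_tpu(resolver)  # placeholder for your TPU code path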
Learn more about distributed training with custom containers.
What's next
- Learn more about training models on AI Platform Training.
- Learn about hyperparameter tuning on AI Platform Training, with special attention to the details for hyperparameter tuning with Cloud TPU.
- Explore additional reference models for Cloud TPU.
- Optimize your models for Cloud TPU by following the Cloud TPU best practices.
- See the Cloud TPU troubleshooting and FAQ for help diagnosing and solving problems.