Troubleshooting and FAQ

This guide provides troubleshooting help for users who want to run their own TensorFlow models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the quickstart or the MNIST tutorial.


The recommended strategy for running TensorFlow models on the Cloud TPU is to use the TPUEstimator API. If you are already using TensorFlow's Estimator API, switching to TPUEstimator typically only requires changing a few lines of code. The recommended way of loading data into TPUEstimator is with the Dataset API. See the ResNet tutorial for a real-world example of how to use the Dataset and TPUEstimator.

After you convert your model to TPUEstimator, it is recommended to make sure it works with the flag use_tpu=False, which causes TensorFlow to fall back to the normal Estimator API and not use any code related to the TPU. Therefore, any issues encountered in running models when use_tpu=False are not related to the TPU and are out-of-scope of this guide. Please see the TensorFlow documentation for general help with TensorFlow.

Ideally, once a model can be run successfully using TPUEstimator and use_tpu=False, running it on the TPU is simply a matter of setting use_tpu=True and pointing master to a TPU server URL (typically through the use of a cluster resolver). However, because TensorFlow models can be very complex and the TPU uses an entirely new execution engine, it is possible to run into issues that are specific to the TPU. The issues fall into these five broad categories, with links to the relevant section in this guide:

  1. The training script is not able to connect to the TPU server at all.

  2. The TPU returns an error when attempting to execute the model.

  3. The model does not fit into TPU memory.

  4. The model can run on the TPU, but the training speed is not as fast as expected.

  5. The model can run on the TPU, but the accuracy of the TPU-trained model is worse than a CPU/GPU-trained baseline.

Additionally, this guide contains a FAQ about general functionality available on TPUs.

For more specialized help porting particular types of neural networks to the TPU, see the Cloud TPU Tutorials.

Trouble connecting to the TPU server

When running a model on the TPU, you must pass a remote TPU server URL to the master parameter in RunConfig. Under the hood, TensorFlow creates a remote tf.Session with this server. This section provides troubleshooting for situations where TensorFlow hangs or prints an error when connecting to the TPU server. Note that the TPU graph compilation step can take a long time for large models, so let the script execute for at least 5 minutes before concluding that it has hung.

The first step is to verify whether the issue is with the server itself, or with your TensorFlow training pipeline. To do this, run the MNIST tutorial using your TPU server URL and verify that it works correctly. If there are still connection issues with the MNIST tutorial, this confirms that it is an issue with the TPU server. In this case:

  1. Run the following command to list the available TPUs:

    (vm)$ gcloud compute tpus list

    You may need to also set your zone and project, as shown in the MNIST tutorial. This prints output such as:

    demo-tpu   us-central1-b  v2-8        default  READY

  2. Verify that you are passing the correct value to --tpu_name (demo-tpu in the above example), and that this TPU is listed as READY. Also make sure that your zone and project have been set with:

    (vm)$ gcloud config set project your-project-name
    (vm)$ gcloud config set compute/zone us-central1-b

  3. If your TPU is not listed as READY or you are still having trouble connecting, manually restart the server with gcloud compute tpus reset $TPU_NAME. In the above example $TPU_NAME is demo-tpu. This may take several minutes.

  4. Re-run the above ... tpus list command and wait for the TPU to be in the READY state. This may take several minutes.

  5. Try to run the MNIST tutorial again.

  6. If you are still having trouble running the MNIST tutorial, ask for help from TPU Support.
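
The READY check in the steps above can also be scripted. The following is a minimal sketch, assuming the whitespace-separated column layout of the sample output shown earlier (TPU name first, status last). The helper name tpu_status is hypothetical, and real gcloud output may include a header row or extra columns.

```python
def tpu_status(listing, tpu_name):
    """Return the status column for tpu_name, or None if it is not listed.

    Assumes each row is whitespace-separated with the TPU name in the first
    column and the status in the last column, as in the sample output.
    """
    for line in listing.splitlines():
        columns = line.split()
        if columns and columns[0] == tpu_name:
            return columns[-1]
    return None


sample = "demo-tpu   us-central1-b  v2-8        default  READY"
print(tpu_status(sample, "demo-tpu"))   # READY
print(tpu_status(sample, "other-tpu"))  # None
```

A status other than READY, or a result of None, suggests checking the TPU name, zone, and project before resetting the server.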

If the MNIST example runs correctly but your model still hangs, then the issue is likely with your training pipeline. First, make sure that your model is using the TPUEstimator API, since this not only handles the complex processing pipeline, but also allows effortless switching between TPU and non-TPU execution with the use_tpu flag. Please see the TPU tutorials for several examples of how to use TPUEstimator. Once your model is using the TPUEstimator API, please verify that it runs correctly when use_tpu=False is set. If your model does not run correctly when use_tpu=False is set, the issue is unrelated to the TPU.

Debugging common errors

Cannot use local filesystem

Error Message

InvalidArgumentError: Unimplemented: File system scheme '[local]' not implemented


All input files and the model directory must use a cloud storage bucket path (gs://bucket-name/...), and this bucket must be accessible from the TPU server. Note that all data processing and model checkpointing is performed on the TPU server, not the local machine. For information on how to properly configure cloud storage for use with the TPU, see the guide Connecting to Cloud Storage Buckets.

Unsupported data type

Error Message

TypeError: DataType is not a supported TPU infeed type.


Currently, only the tf.float32, tf.int32, tf.bfloat16, and tf.bool data types are supported on the TPU. Other common data types, such as tf.uint8, tf.string, and tf.int64, must be converted to one of the supported data types during data pre-processing (that is, in the input_fn of TPUEstimator). As an example, this code snippet from the MNIST tutorial converts an image tensor stored as a tf.uint8 byte sequence to a tf.float32 tensor:

image = tf.decode_raw(image, tf.uint8)
image = tf.cast(image, tf.float32)
image = tf.reshape(image, [784])

This snippet converts a label tensor stored as tf.int64 to a tf.int32 tensor:

label = tf.cast(label, tf.int32)
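
Before adding casts, it can help to list which input features need converting and whether integer values survive the narrowing to 32 bits. The sketch below is plain Python, not TensorFlow code. The helper names and the string dtype names are assumptions made for illustration; only the set of supported types comes from this guide.

```python
# Data types this guide lists as supported on the TPU.
SUPPORTED_TPU_DTYPES = {"float32", "int32", "bfloat16", "bool"}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def unsupported_features(feature_dtypes):
    """Return the sorted names of features whose dtype needs converting."""
    return sorted(name for name, dtype in feature_dtypes.items()
                  if dtype not in SUPPORTED_TPU_DTYPES)


def fits_in_int32(values):
    """True if every integer in values is representable as an int32."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values)


features = {"image": "uint8", "label": "int64", "weight": "float32"}
print(unsupported_features(features))  # ['image', 'label']
print(fits_in_int32([0, 9, 2**31]))    # False
```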

Dynamic shapes not supported

Error Message

ValueError: shape [Shape] must have a fixed size for dimension d that is known at graph construction time.


To execute a model on the TPU, TensorFlow compiles the model using the XLA framework. While this compilation step significantly improves training speed and memory usage, the shapes (dimension sizes) of all tensors in the graph must be static, that is, their values must be known at graph compilation time. If any shapes cannot be determined at compile time, TPU compilation fails with an error like the one above.

One common op that returns a dynamic shape is dataset.batch(batch_size), since the number of samples remaining in a stream might be less than the batch size. Therefore, when training on the TPU, use tf.contrib.data.batch_and_drop_remainder. This potentially drops the last few samples from a file to ensure that every batch has a static shape of batch_size. For example:

dataset = ...
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))
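
The effect of dropping the remainder on batch shapes can be seen with a small plain-Python sketch. It does not use the Dataset API, and the function name batch_sizes is hypothetical. It only counts how many samples each batch would hold.

```python
def batch_sizes(num_samples, batch_size, drop_remainder):
    """Sizes of the batches produced from num_samples elements."""
    full_batches, leftover = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    if leftover and not drop_remainder:
        # The final partial batch is what makes the batch dimension dynamic.
        sizes.append(leftover)
    return sizes


print(batch_sizes(10, 4, drop_remainder=False))  # [4, 4, 2]
print(batch_sizes(10, 4, drop_remainder=True))   # [4, 4]
```

With the remainder dropped, every batch has the same size, so the batch dimension is known at graph compilation time.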

Unavailable TensorFlow op

Error Message

NotFoundError: No registered 'OpName' OpKernel for XLA_TPU_JIT devices compatible with node


The model uses a TensorFlow op which is not currently available on the TPU.

For a list of ops available on the TPU, along with plans for future support and suggestions for workarounds, please see the guide to available TensorFlow Ops.

Out-of-memory error message

Error Message

ResourceExhaustedError: Ran out of memory in memory space hbm; used: YYY; limit: 7.48G.


Each Cloud TPU is made of eight TPU cores, which each have 8GB of RAM (or HBM, High-Bandwidth Memory). This memory is used to store the weight (variable) tensors, as well as intermediate result tensors needed for gradient computation. If the model is too large to fit into TPU RAM, the initialization fails and the above error message is printed. See the section on reducing memory usage for more help.

Not using CrossShardOptimizer

Error Message

ValueError: CrossShardOptimizer must be used for model training on TPUs.


When defining a model using the TensorFlow Python API, the vast majority of code written by the user does not need to be specialized for the TPU. The most significant exception is the optimizer, which must be wrapped in tf.contrib.tpu.CrossShardOptimizer() as shown below:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
if FLAGS.use_tpu:
  optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
train_op = optimizer.minimize(loss, tf.train.get_global_step())

Each Cloud TPU is made of 8 TPU cores, which are independent processing units. For each training step (i.e., weight update), each TPU core runs the forward pass and gradient computation on an independent mini-batch of data, and then all of the cores exchange gradients with one another. In most cases, this is mathematically equivalent to computing the gradients on one large batch, although there are some caveats explained in Understanding Data Sharding.

CrossShardOptimizer is the op responsible for this gradient exchange. By default, CrossShardOptimizer computes gradients of the mean loss across the cores, but it can be configured to compute the sum loss by passing reduction=losses.Reduction.SUM.
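
The equivalence between averaging per-core gradients and computing one large-batch gradient can be checked numerically. The sketch below is plain Python and illustrates only the arithmetic, not how CrossShardOptimizer is implemented. The one-parameter model y = w * x, the squared-error loss, and the function names are assumptions chosen for illustration.

```python
import math


def mean_gradient(w, samples):
    """Gradient with respect to w of the mean squared error of y = w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in samples) / len(samples)


def sharded_gradient(w, samples, num_shards):
    """Average of the per-shard mean gradients, using equal-sized shards."""
    shard_size = len(samples) // num_shards
    shards = [samples[i * shard_size:(i + 1) * shard_size]
              for i in range(num_shards)]
    return sum(mean_gradient(w, shard) for shard in shards) / num_shards


samples = [(float(i), 3.0 * i + 1.0) for i in range(16)]
full = mean_gradient(0.5, samples)
sharded = sharded_gradient(0.5, samples, num_shards=8)
print(math.isclose(full, sharded))  # True
```

The match relies on the loss being a mean of independent per-sample terms and on all shards having the same size.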

Unable to connect to TPU server

Error Message

An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session is created.


This error is printed when TensorFlow cannot connect to the TPU server URL that is passed to master. For help, see the section on trouble connecting to the TPU server.

Errors in the middle of training

If a model cannot be executed successfully on the TPU, any errors related to this are designed to be caught during initialization. Therefore, it is rare for a model to fail in the middle of training. If this does happen, the most likely cause is an issue in the data pre-processing function. For example, when using the Dataset API, you typically need to call dataset = dataset.repeat(), otherwise the training fails after making one pass through the data. Dynamic execution ops like tf.while_loop() can also fail in a way that depends on the input data. There is also the rare possibility of spurious hardware or network failures.

Problems stopping execution

If TensorFlow encounters an error during TPU execution, the script sometimes seems to hang rather than exit to the shell. If this happens, hit CTRL+\ on the keyboard to trigger a SIGQUIT, which causes Python to exit immediately.

Similarly, hitting CTRL+C during TPU execution does not shut down TensorFlow immediately, but instead waits until the end of the current iteration loop to exit cleanly. Hitting CTRL+\ causes Python to exit immediately.

If you have any trouble re-connecting to the TPU server after exiting in this manner, then manually reset the TPU server with the command gcloud compute tpus reset $TPU_SERVER_NAME, where $TPU_SERVER_NAME is taken from the first column of the gcloud compute tpus list command.

Reducing memory usage

If you encounter an out-of-memory error when executing your model on the TPU, you must take steps to reduce the model's memory usage. This section describes several root causes of memory issues and provides guidelines for fixing them.

Large number of model weights

Possible Cause of Memory Issue

Each float32 model weight requires 4 bytes. These weights are replicated on each TPU core. Therefore, a model with hundreds of millions of weights is likely to be too large to fit on the TPU.

How to Reduce Memory Usage

  1. Certain optimizers require extra memory per weight to store update statistics. Notably, AdamOptimizer and AdadeltaOptimizer both require an extra 8 bytes per weight. AdagradOptimizer and MomentumOptimizer require an extra 4 bytes per weight. The standard GradientDescentOptimizer requires no extra storage, although it may not perform as well as other optimizers in terms of final model accuracy. The experimental AdafactorOptimizer requires almost no extra memory and performs as well as the baseline Adam optimizer when training Transformer models.
  2. If the majority of weights are word embeddings, techniques such as WordPiece have been shown to substantially reduce vocabulary size while increasing accuracy across a variety of tasks.
  3. An upcoming release of TensorFlow will have experimental support for 16-bit floating point weights and gradients, which will reduce the memory requirements by half.
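
The per-weight costs listed above can be turned into a rough estimate. The sketch below is plain Python and covers only the weights and optimizer statistics, replicated per core. It ignores gradients, activations, and padding, so it is a lower bound at best. The dictionary keys and the function name are hypothetical labels.

```python
# Extra bytes per weight for optimizer statistics, as listed above.
OPTIMIZER_EXTRA_BYTES = {
    "sgd": 0,        # GradientDescentOptimizer
    "momentum": 4,   # MomentumOptimizer
    "adagrad": 4,    # AdagradOptimizer
    "adam": 8,       # AdamOptimizer
    "adadelta": 8,   # AdadeltaOptimizer
}


def weight_memory_bytes(num_weights, optimizer, bytes_per_weight=4):
    """Bytes per core for float32 weights plus optimizer statistics."""
    return num_weights * (bytes_per_weight + OPTIMIZER_EXTRA_BYTES[optimizer])


for name in ("sgd", "momentum", "adam"):
    gib = weight_memory_bytes(300_000_000, name) / 2**30
    print(f"{name}: {gib:.2f} GiB per core")
```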

Excessive tensor padding

Possible Cause of Memory Issue

Tensors in TPU memory are padded, that is, the TPU rounds up the sizes of tensors stored in memory to perform computations more efficiently. This padding happens transparently at the hardware level and does not affect results. However, in certain cases the padding can result in significantly increased memory use and execution time.

How to Reduce Memory Usage

The TPU software attempts to lay out tensors in memory to maximize computational efficiency and minimize padding. This memory layout process is complex; however, for the best results the model should obey the following rule of thumb. To minimize memory overhead and maximize computational efficiency, one of the following must be true:

  • The total batch size should be a multiple of 64 (8 per TPU core), and feature dimensions should be a multiple of 128, or

  • The total batch size should be a multiple of 1024 (128 per TPU core), and feature dimensions should be a multiple of 8.

Using a batch size of 1024 and feature dimensions that are a multiple of 128 results in the best efficiency, although this may not be possible for all models. For clarity, "feature dimension" refers to the hidden size of a fully-connected layer or the number of output channels in a convolution. Not all layers can conform to this rule, especially the first and last layers of the network. This is fine, and it is expected that most models require some amount of padding.
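
The rule of thumb above can be checked mechanically. The sketch below is plain Python. It assumes, as a simplification, that a dimension is padded by rounding it up to the next multiple. The actual memory layout chosen by the TPU software is more complex, and the function names are hypothetical.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to value."""
    return -(-value // multiple) * multiple


def follows_padding_rule(total_batch, feature_dim):
    """True if one of the two rules of thumb above holds."""
    return ((total_batch % 64 == 0 and feature_dim % 128 == 0) or
            (total_batch % 1024 == 0 and feature_dim % 8 == 0))


def padded_fraction(dim, multiple):
    """Fraction of the padded dimension that is padding (simplified model)."""
    padded = round_up(dim, multiple)
    return (padded - dim) / padded


print(follows_padding_rule(1024, 128))  # True
print(follows_padding_rule(64, 100))    # False
print(padded_fraction(100, 128))        # 0.21875
```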

Batch size too large

Possible Cause of Memory Issue

When training a neural network on a CPU, GPU, or TPU, the memory use comes from two places:

  1. Storing the weights, the weight gradients, and optimizer-specific statistics such as momentum. The memory use is directly proportional to the number of weights in the model, but not the batch size.
  2. Storing intermediate activations from the forward pass necessary to compute the backward pass. The memory use is directly proportional to the batch size, layer sizes, and number of layers.

Therefore, the memory required by a model is largely dependent on the batch size.

How to Reduce Memory Usage

Try to slowly reduce the batch size until it fits in memory, making sure that the total batch size is a multiple of 64 (the per-core batch size should be a multiple of 8). Keep in mind that larger batch sizes are more efficient on the TPU. A total batch size of 1024 (128 per core) is generally a good starting point.
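
The search described above can be sketched as a simple loop. The predicate fits is hypothetical: in practice it would stand for a short trial run that reports whether the model initialized without an out-of-memory error. The starting point of 1024 and the step of 64 follow the guidance in this section.

```python
def largest_fitting_batch_size(fits, start=1024, step=64):
    """Largest multiple of `step` not exceeding `start` for which fits() is True.

    Returns None if no candidate fits.
    """
    batch_size = start - start % step
    while batch_size >= step:
        if fits(batch_size):
            return batch_size
        batch_size -= step
    return None


# Stand-in predicate: pretend any batch above 300 samples runs out of memory.
print(largest_fitting_batch_size(lambda b: b <= 300))  # 256
```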

Model too large

Possible Cause of Memory Issue

The memory required by a model is highly dependent on the number of operators in the graph (that is, layers in the network). This storage requirement is separate from the number of weights. For example, computing the gradient of an operator like tf.nn.conv2d() may increase memory use, in addition to any memory used to store weights.

The TPU engine attempts to strategically re-compute certain operators to fit the model in memory (called rematerialization, similar to gradient checkpointing), but it is not always able to do this.

How to Reduce Memory Usage

If the model cannot be run on the TPU even with a small batch size (for example, 64), try reducing the number of layers or the layer sizes. An upcoming release of TensorFlow will support "model parallelism" on the TPU, which will allow significantly larger models to be run on Cloud TPU by running different parts of the model on different TPU cores.

Improving training speed

If your model is able to run successfully on the TPU, but the training speed is less than expected, this section outlines several potential ways to improve the speed.

Not using all TPU cores

Description of Performance Issue

Each Cloud TPU contains 8 separate TPU cores, which operate as independent processing units. The TPU is not fully utilized unless all 8 cores are used.

How to Know if Your Model is Affected

If your model does not explicitly specify the num_shards parameter of TPUConfig to be 8, TensorFlow does not utilize all of the TPU cores.

How to Mitigate

When performing full-scale TPU training, always set num_shards to 8. See the MNIST tutorial or the code snippet below for an example:

tf.flags.DEFINE_integer("num_shards", 8, "Number of shards (TPU cores).")

run_config = tf.contrib.tpu.RunConfig(
  tpu_config=tf.contrib.tpu.TPUConfig(num_shards=FLAGS.num_shards, ...

estimator = tf.contrib.tpu.TPUEstimator(
    config=run_config, ...

Setting num_shards to 1 can sometimes be useful for debugging model accuracy differences, so we recommend using a command line flag with a default value of 8.

Too few iterations per loop

Description of Performance Issue

The iterations_per_loop parameter to TPUConfig controls how many batches of data are sent to the TPU in a single "training loop." Each training loop requires significant communication between the local machine and the TPU server, so if iterations_per_loop is too small, it can substantially slow down training.

How to Know if Your Model is Affected

If the logging message Enqueue next (X) batch(es) of data to infeed is printed very frequently (for example, every 3 seconds), then your training might have significant overhead from the training loop.

How to Mitigate

Set iterations_per_loop to a larger value. In the MNIST tutorial, this is controlled by the --iterations flag. As long as the Enqueue next (X) batch(es) of data to infeed message is not printed more than a few times a minute, then the current value should be sufficient. Note that iterations_per_loop can be set to a very large value, with the only downside being that logging messages and checkpointing can only occur at the end of a loop.
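
A rough lower bound for iterations_per_loop can be derived from the measured time per training step. The sketch below is plain Python. Reading "a few times a minute" as three loops per minute is an assumption, the function name is hypothetical, and the estimate ignores the per-loop communication overhead itself.

```python
import math


def min_iterations_per_loop(step_seconds, max_loops_per_minute=3):
    """Smallest iterations_per_loop that keeps loops at or below the target rate."""
    seconds_per_loop = 60.0 / max_loops_per_minute
    return math.ceil(seconds_per_loop / step_seconds)


# A step that takes 30 ms needs at least this many iterations per loop.
print(min_iterations_per_loop(0.03))  # 667
```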

Input processing bottleneck

Description of Performance Issue

While the TPU is training on a particular chunk of data, the input processing function prepares the next chunk of data on the CPU. Thus, if the input function takes less time than the model function, the cost of input processing is effectively zero. However, an input function that takes longer than the model function creates a bottleneck.
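
Because input processing and TPU computation overlap, the time per step is governed by the slower of the two. The sketch below is plain Python and expresses that reasoning under the idealized assumption of perfect overlap. The function names and timings are illustrative.

```python
def step_time(input_seconds, model_seconds):
    """Per-step time when input processing fully overlaps TPU compute."""
    return max(input_seconds, model_seconds)


def is_input_bound(input_seconds, model_seconds):
    """True if the input function, not the model, limits training speed."""
    return input_seconds > model_seconds


print(step_time(0.02, 0.05), is_input_bound(0.02, 0.05))  # 0.05 False
print(step_time(0.08, 0.05), is_input_bound(0.08, 0.05))  # 0.08 True
```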

How to Know if Your Model is Affected

Follow the instructions in the Cloud TPU Tools: Input Pipeline Analyzer guide to view the input pipeline analysis in TensorBoard.


The input pipeline analysis page displays a clear summary which shows if your model is bottlenecked by input processing. The same page also shows per-op execution time, which allows you to pinpoint problematic ops.

How to Mitigate

There are several possible mitigations when loading data with the Dataset API:

  1. Store your data as a collection of tf.train.Example structures in TFRecord files, and load them with TFRecordDataset. See the Dataset API tutorial or the ResNet tutorial for examples.
  2. Use dataset.cache() and/or dataset.prefetch() to buffer the input data. This prevents sporadic slowdowns in file access from creating a bottleneck.
  3. Specify the num_parallel_calls parameter of the dataset.map() function to enable multi-threaded map() ops.
  4. Perform expensive data pre-processing offline as a one time cost, rather than incurring the cost through every epoch of every training.

All input processing is performed on CPUs located on the TPU server, not on the local machine, so the speed of the local machine is not a factor.

Too many non-matrix multiplication ops

Description of Performance Issue

The Cloud TPU can perform matrix multiplications and convolutions at incredibly high speeds. The majority of other TensorFlow ops do have efficient implementations on the TPU, but these are not the TPU's primary strength relative to other hardware. Therefore, a model should be dominated by matrix multiplications or convolutions to fully take advantage of the TPU.

How to Know if Your Model is Affected

The guide Cloud TPU Tools: Op Profile describes how to generate a performance profile for your model broken down by op type. In general, the vast majority of modern neural network architectures are dominated by matrix multiplications and convolutions.

How to Mitigate

If the lack of matrix multiplications in your model was primarily motivated by training speed issues on other hardware, you are encouraged to re-benchmark those models on the TPU, which may deliver better speed. If the lack of matrix multiplications is a fundamental property of the model, then the TPU might not be the optimal hardware choice.

Excessive tensor padding

Description of Performance Issue

The TPU pads tensors in memory so that the TPU can use its computational units efficiently. The padding can increase usage of both memory and memory bandwidth. See the section on tensor padding for help understanding and fixing tensor padding issues.

Batch size too small

Description of Performance Issue

As a general rule, using larger batch sizes results in greater training speed on the TPU, in terms of samples/second.

How to Know if Your Model is Affected

The batch size of any model should always be at least 64 (8 per TPU core), since the TPU always pads the tensors to this size. The ideal batch size when training on the TPU is 1024 (128 per TPU core), since this eliminates inefficiencies related to memory transfer and padding.

How to Mitigate

It is recommended to use the largest batch size which fits into memory and is a multiple of 64. The easiest way to achieve this is to start with 1024, and if this causes an out-of-memory error, reduce the batch size until the model runs successfully. Changing the batch size of a model may require adjusting other hyperparameters, such as the learning rate, to achieve the same model accuracy, but this must be evaluated on a case-by-case basis.

Layer sizes too small

Description of Performance Issue

Even when a model is dominated by matrix multiplications or convolutions, the TPU may not run at full efficiency if the input tensors are small. When compared to other hardware, the TPU runs most efficiently when both the batch size and layer sizes are large (for example, dimension >= 512).

How to Know if Your Model is Affected

As a general rule, layer sizes smaller than 128 achieve poor efficiency on the TPU, since 128 is the native dimension of the TPU matrix multiplication unit. For fully-connected layers, a minimum hidden size of 512 is recommended in order to achieve high efficiency. Note that convolutional layers typically do not need to be as large as fully connected layers to achieve an equal efficiency level. For example, a 3 × 3 convolution of size 256 achieves similar (high) efficiency compared to a fully-connected layer of size 2048, since 3 × 3 × 256 = 2304.

How to Mitigate

If the primary motivation for small layer sizes in your model is training speed, you are encouraged to re-benchmark your models with larger layers on the TPU. For example, increasing the output size of a layer from 256 to 512 may only increase the training time by 20% even though the model is performing 2x the computation.
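
The two figures quoted in this section can be reproduced with simple arithmetic. The sketch below is plain Python with hypothetical function names. It restates the 3 × 3 × 256 comparison and shows what "2x the computation for 20% more time" means for useful work per second, taking the 20% figure from the example above as given.

```python
def conv_contraction_size(kernel_height, kernel_width, channels):
    """Size of the dimension a convolution sums over for each output value."""
    return kernel_height * kernel_width * channels


def relative_throughput(compute_ratio, time_ratio):
    """Useful work per second relative to the baseline layer size."""
    return compute_ratio / time_ratio


print(conv_contraction_size(3, 3, 256))         # 2304
print(round(relative_throughput(2.0, 1.2), 2))  # 1.67
```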

Op-level model profiling

It is often useful to measure op-level execution time and memory usage in order to identify performance bottlenecks. For instructions on how to do this, see the guide Cloud TPU Tools: Trace Viewer.

Debugging drops in model accuracy

One of the goals of the Cloud TPU ecosystem is that any model that is currently being trained on a CPU or GPU achieves a very similar accuracy when it is trained on the TPU, with perhaps minor adjustments to hyperparameters like the batch size and learning rate. Occasionally, however, users can observe a degradation in accuracy when training models on the TPU. Debugging such issues can be extremely frustrating due to the random nature of neural network training. This section provides guidance on how to pinpoint the root cause of any drops in model accuracy when porting a model to the TPU.

Understanding data sharding (data parallelism)

One of TensorFlow's primary goals is that each op should produce nearly identical results whether it is executed on the CPU, GPU, or TPU. There are certain exceptions to this, such as random ops. In general, if you find any significant difference between the output of non-random ops on the TPU and CPU, report it as a bug to TPU Support.

However, for the training pipeline as a whole, there is a significant difference between training on the CPU/GPU and on the TPU. When using TPUEstimator with use_tpu=False, TensorFlow falls back to its standard execution engine, which trains with one batch per step. When training on the actual TPU, however, TensorFlow performs data sharding, also known as "data parallelism with synchronous SGD". The reason is that each Cloud TPU is made of 8 TPU cores which operate as independent processing units. So, for each step in the training, each TPU core is passed a batch of data and computes the weight gradients; the cores then exchange the gradients with one another and compute the weight update. By default, the loss is averaged across the cores, but it can instead be summed by changing the reduction parameter of CrossShardOptimizer.

If the total loss of the model can be computed as the average (or sum) of independent per-sample losses, then this procedure is mathematically equivalent to training on a single large batch. The most common op which is not independent per-sample is batch normalization, which runs over each per-core batch separately. For example, if the total batch size is 128, then the per-core batch size is 16, and each of the 8 cores performs batch norm over its own 16 samples. In some cases, performing batch normalization over small batches (for example, less than 32) has been found to cause degradations in accuracy. In the ideal scenario, the total batch size when training on the TPU can be large (for example, 256 to 1024), so batches of that size are not a major issue. However, if such a batch size is too large to fit into memory, the effect of sharding must be evaluated on a case-by-case basis.
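
To see why batch normalization is not independent per sample, compare the statistics of the full batch with those each core would compute on its own shard. The sketch below is plain Python using the statistics module, not TensorFlow's batch normalization. The activation values and function names are made up for illustration.

```python
import statistics


def batch_norm_stats(values):
    """Mean and population variance used to normalize one batch."""
    return statistics.fmean(values), statistics.pvariance(values)


def per_core_stats(values, num_cores):
    """Statistics each core computes over its own equal-sized shard."""
    size = len(values) // num_cores
    return [batch_norm_stats(values[i * size:(i + 1) * size])
            for i in range(num_cores)]


activations = [float(i) for i in range(16)]
print(batch_norm_stats(activations))                 # (7.5, 21.25)
print(per_core_stats(activations, num_cores=8)[0])   # (0.5, 0.25)
```

Each core normalizes with its own mean and variance, so the normalized activations differ from those of a single large batch, and the difference grows as the per-core batch shrinks.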

Because of the complexities introduced by sharding, the first step in debugging drops in model accuracy is to run a deterministic, single-core TPU training, and compare it to a model trained on the CPU/GPU. Generally, this can be done quickly as it does not require training a model to convergence.

Deterministic training

One reason why it is difficult to debug differences in model accuracy is that TensorFlow uses different weight initialization and data shuffling each time a model is trained. It is beneficial to modify the training procedure to be deterministic, so that multiple runs produce nearly identical models. This section demonstrates how to run the MNIST tutorial deterministically:

  1. Generate an initial checkpoint file by running for a single step on the CPU. The step is used to achieve deterministic weight initialization. This can also be achieved by seeding the variable initializers, but that is more difficult.
# Run training for 1 step to create an initial checkpoint.
python \
  --use_tpu=False \
  --data_dir=${STORAGE_BUCKET}/data/ \
  --model_dir=${STORAGE_BUCKET}/init_output \
  --random_seed=12345 \
  2. Modify any data shuffling functions in your input function to use a random seed. This has already been done in the MNIST tutorial. This works for the input data processing ops because those always run on the CPU. Random ops in the model function may not be deterministic between the TPU and CPU. For example, to seed the shuffle:
# In the flag definitions
tf.flags.DEFINE_integer("random_seed", None, "Random seed for training")

# In the input_fn
if FLAGS.random_seed is not None:
  # buffer_size is a required argument; keep your existing shuffle buffer size.
  dataset = dataset.shuffle(buffer_size, seed=FLAGS.random_seed)
  3. Run the same model twice on the CPU, to verify that the training is deterministic. Note that the training must be run for a reasonable number of steps (for example, 1000) but it does not need to be run to convergence, as this can be very slow on the CPU.

    Since the CPU training is compared to a single-core TPU training, use a batch size that can fit on a single TPU core (typically, the full batch size divided by 8). TensorFlow does not guarantee bit-for-bit determinism between runs, but the loss should be very close:
# Copy the initial weights
gsutil mkdir ${STORAGE_BUCKET}/cpu_output_1
gsutil cp -f ${STORAGE_BUCKET}/init_output/* ${STORAGE_BUCKET}/cpu_output_1
gsutil mkdir ${STORAGE_BUCKET}/cpu_output_2
gsutil cp -f ${STORAGE_BUCKET}/init_output/* ${STORAGE_BUCKET}/cpu_output_2

# Run 1
python \
  --use_tpu=False \
  --data_dir=${STORAGE_BUCKET}/data/ \
  --model_dir=${STORAGE_BUCKET}/cpu_output_1 \
  --batch_size=128 \
  --random_seed=12345 \
  --train_steps=2000 \

# Output 1
accuracy = 0.9910644, global_step = 1000, loss = 0.025323588

# Run 2
python \
  --use_tpu=False \
  --data_dir=${STORAGE_BUCKET}/data/ \
  --model_dir=${STORAGE_BUCKET}/cpu_output_2 \
  --batch_size=128 \
  --random_seed=12345 \
  --train_steps=2000 \

# Output 2
accuracy = 0.9910644, global_step = 1000, loss = 0.025323414
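
The role of the shuffle seed in the steps above can be illustrated without TensorFlow. The sketch below uses Python's random module as a stand-in for the dataset shuffle. The function name is hypothetical, and the seed value matches the --random_seed flag used in the commands above.

```python
import random


def shuffled_order(num_samples, seed):
    """Order in which samples are visited after a seeded shuffle."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    return order


first_run = shuffled_order(10, seed=12345)
second_run = shuffled_order(10, seed=12345)
print(first_run == second_run)  # True
```

With a fixed seed, both runs visit the samples in the same order, which removes data shuffling as a source of run-to-run differences.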

Single-core TPU training

Once you can run the MNIST tutorial deterministically, the next step is to replicate the CPU-trained results on the TPU, using a single TPU core to pinpoint whether the issue is related to data sharding or to the TPU execution engine itself.

Here's how to execute single-core training and evaluation on the MNIST tutorial:

# Use the same weight initialization as the CPU
gsutil cp -f ${STORAGE_BUCKET}/init_output/* ${STORAGE_BUCKET}/tpu_output

# Run training for 1000 steps
python \
    --use_tpu=True \
    --master=$GRPC_SERVER \
    --train_file=${STORAGE_BUCKET}/data/train.tfrecords \
    --model_dir=${STORAGE_BUCKET}/tpu_output \
    --random_seed=12345 \
    --batch_size=128 \
    --num_shards=1 \
    --train_steps=1000 \

  accuracy = 0.9910644, global_step = 1000, loss = 0.02514153

The loss will not exactly match the CPU-trained model, but it should be close. If it isn't close for your model, this might indicate that you have found a bug in the TPU execution engine. Before submitting a bug report to TPU Support, double check the following:

  1. You are passing num_shards=1 to TPUConfig.

  2. You do not have any random ops in your model function, and any random ops in your input function are being seeded correctly.

  3. You are using the same initial checkpoint file for the CPU and TPU training.
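
Whether two losses count as "close" needs a concrete tolerance. The sketch below is plain Python and compares the loss values reported in this section. The 2% relative tolerance and the function name are assumptions chosen for illustration, not a threshold defined by TensorFlow or Cloud TPU.

```python
import math


def losses_match(loss_a, loss_b, rel_tol=0.02):
    """True if the two losses agree within the given relative tolerance."""
    return math.isclose(loss_a, loss_b, rel_tol=rel_tol)


cpu_run_1, cpu_run_2, tpu_run = 0.025323588, 0.025323414, 0.02514153
print(losses_match(cpu_run_1, cpu_run_2))  # True
print(losses_match(cpu_run_1, tpu_run))    # True
```

A suitable tolerance depends on the model. The gap between the two CPU runs gives a sense of the run-to-run noise to expect before comparing against the TPU.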

Debugging multi-core TPU training

If your model does achieve the same loss on the CPU and single-core TPU, then the issue is likely one of the following:

(a) The degradation is due to the natural random variance when training neural models with different initializations.

(b) The degradation is due to an issue related to data sharding on the TPU.

To determine whether (a) is the issue, it might be useful to re-train the full model on the CPU/GPU and multi-core TPU using the same weight initialization, as above.

If you are confident that the drop in accuracy is statistically significant, then the most likely issues related to data sharding are:

  1. If your model computes the loss as the sum of per-sample errors, you probably want to pass reduction=losses.Reduction.SUM to CrossShardOptimizer. By default, CrossShardOptimizer computes the mean of the losses, rather than the sum.
  2. If your model uses batch normalization, a total batch size less than 256 (for example, less than 32 per core) might reduce accuracy.
  3. If your model has a batch-wise loss function, then this will be affected by sharding. Such loss functions are typically quite specialized. For example, Karras et al. 2017 uses a batch discriminator when training a generative adversarial network.

Available Functionality FAQ

Can I use the TPU for inference?

Yes, TPUs can be used for both training and inference. For example, the ResNet tutorial performs periodic evaluation during the training loop. For model serving, there are a few caveats to be aware of. In particular, the TPU software stack is currently optimized for throughput, not latency. Executing inference on a single batch of input and waiting for the result currently has an overhead of at least 10 ms, which can be problematic for low-latency serving.

This overhead will be reduced significantly in upcoming TensorFlow releases.

Are there any built-in TensorFlow ops that are not available on the TPU?

A small number of built-in TensorFlow ops are not currently available on the TPU. See the guide to available TensorFlow Ops, which details the current workarounds.

How can I write a custom op for the TPU?

TensorFlow ops that run on the TPU are implemented in XLA HLO, which is a language for defining high-level tensor ops using a small set of low-level functions. XLA is included in TensorFlow's open source release, so it is technically possible to write your op in HLO. The majority of existing implementations can be found in the tf2xla directory. However, this only allows for execution of a limited set of tensor ops on the TPU, not arbitrary C++ or Python code. Most common tensor ops that can be implemented in HLO have already been written. An upcoming release of TensorFlow will support the ability to efficiently execute standard CPU ops during TPU training/inference.

Can I use placeholders and feed dictionaries with a TPU?

Although this usage pattern is technically available on the TPU, we strongly recommend against using it, as it uses only a single TPU core and results in excessive overhead. Instead, to create a training pipeline, use the TPUEstimator API and the Dataset API. See the ResNet tutorial for an example of how to create a simple training loop with TPUEstimator and Dataset.

Can I train a reinforcement learning (RL) model with a TPU?

Reinforcement learning covers a wide array of techniques, some of which currently are not compatible with the software abstractions for TPUs. Some reinforcement learning configurations require executing a black-box "simulation environment" using a CPU as part of the training loop. Our experience is that these cannot keep up with the TPU and result in significant inefficiencies. Future releases of TensorFlow will include abstractions to make "off-policy" reinforcement learning easier.

Can I use word embeddings with a TPU?

Yes, the TPU supports tf.nn.embedding_lookup() because it is just a wrapper around tf.gather(), which has an implementation on the TPU. However, the TPU does not support tf.nn.embedding_lookup_sparse(). Note that the input id tensor to tf.nn.embedding_lookup() must have a static shape during training (that is, the batch size and sequence length must be the same for every batch). This is a more general restriction on all tensors when using the TPU.
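To make the "wrapper around a gather" idea concrete, here is a minimal plain-Python sketch of an embedding lookup over a batch of ids with a fixed shape. It does not call TensorFlow. The table contents, the ids, and the helper name gather are invented for illustration.

```python
def gather(table, ids):
    """Return the rows of `table` selected by `ids` (a list of row indices)."""
    return [table[i] for i in ids]


# A made-up embedding table: 4 vocabulary entries, embedding size 2.
embedding_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]

# Every batch has the same shape: batch size 2, sequence length 3.
batch_ids = [[3, 1, 0], [2, 2, 0]]

# The lookup is one gather per sequence; the result has shape [2, 3, 2].
embedded = [gather(embedding_table, row) for row in batch_ids]
```

Because every row of batch_ids has the same length, the output shape is known before any data is seen, which is the property the static-shape restriction asks for.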

Can I use variable-length sequences with a TPU?

There are several methods for representing variable-length sequences in TensorFlow, including padding, tf.while_loop(), inferred tensor dimensions, and bucketing. Unfortunately, the current TPU execution engine only supports a subset of these. Variable-length sequences must be implemented using tf.while_loop(), tf.nn.dynamic_rnn(), bucketing, padding, or sequence concatenation.
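Padding and bucketing can be sketched in plain Python as follows. This is a minimal example under assumed choices (bucket lengths of 4 and 8, a pad id of 0, truncation of over-long sequences), and the helper names pad_to_length and bucket_and_pad are hypothetical rather than part of any TensorFlow API.

```python
def pad_to_length(sequence, length, pad_id=0):
    """Pad (or truncate) `sequence` to exactly `length` items."""
    return list(sequence[:length]) + [pad_id] * max(0, length - len(sequence))


def bucket_and_pad(sequences, bucket_lengths, pad_id=0):
    """Group sequences by the smallest bucket that fits them, then pad.

    Sequences longer than the largest bucket are truncated to it.
    Returns a dict mapping bucket length -> list of padded sequences.
    """
    buckets = {length: [] for length in bucket_lengths}
    largest = max(bucket_lengths)
    for seq in sequences:
        fitting = [b for b in sorted(bucket_lengths) if b >= len(seq)]
        target = fitting[0] if fitting else largest
        buckets[target].append(pad_to_length(seq, target, pad_id))
    return buckets


# Made-up token id sequences of lengths 2, 5, 1, and 9.
sentences = [[5, 8], [3, 9, 4, 7, 2], [6], [1, 2, 3, 4, 5, 6, 7, 8, 9]]
batches = bucket_and_pad(sentences, bucket_lengths=[4, 8])
```

Every sequence within a bucket ends up with the same length, so each bucket can be turned into tensors with a fixed shape. The trade-off is that more buckets waste less padding but introduce more distinct shapes.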

Can I train a Recurrent Neural Network (RNN) on a TPU?

In certain configurations, tf.nn.static_rnn() and tf.nn.dynamic_rnn() are compatible with the current TPU execution engine. More generally, the TPU supports both tf.while_loop() and TensorArray, which are used to implement tf.nn.dynamic_rnn(). Specialized toolkits such as cuDNN are not supported on the TPU, as they contain GPU-specific code. Using tf.while_loop() on the TPU requires specifying an upper bound on the number of loop iterations so that the TPU execution engine can statically determine the memory usage.
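The idea of a loop with a fixed upper bound can be illustrated without TensorFlow. In the sketch below, a toy recurrence (a running sum, standing in for an RNN cell) always runs for max_steps iterations, and a mask keeps the state unchanged once the true sequence length has been reached. The function name and the values are made up. In TensorFlow itself, tf.while_loop() accepts a maximum_iterations argument that serves as the upper bound.

```python
def bounded_running_sum(sequence, length, max_steps):
    """Accumulate the first `length` items of `sequence` in exactly `max_steps` steps.

    `sequence` must already be padded to `max_steps` items.
    """
    state = 0.0
    for step in range(max_steps):            # iteration count is fixed up front
        mask = 1.0 if step < length else 0.0
        candidate = state + sequence[step]   # the toy "cell" update
        # Keep the new state only for real (unpadded) steps.
        state = mask * candidate + (1.0 - mask) * state
    return state


# A sequence of true length 3, padded to 5 steps.
padded = [2.0, 3.0, 4.0, 0.0, 0.0]
total = bounded_running_sum(padded, length=3, max_steps=5)  # 9.0
```

Because the loop always executes max_steps iterations, the amount of work and memory is known ahead of time, regardless of the actual sequence length.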

Can I train a generative adversarial network (GAN) with a TPU?

Training GANs typically requires frequently alternating between training the generator and training the discriminator. The current TPU execution engine only supports a single execution graph. Alternating between graphs requires a complete re-compilation, which can take 30 seconds or more. This limitation will be improved in an upcoming TensorFlow release.

One potential workaround is to always compute the sum of losses for both the generator and discriminator, but multiply these losses by two input tensors g_w and d_w. In batches where the generator should be trained, you can pass in g_w=1.0 and d_w=0.0, and vice-versa for batches where the discriminator should be trained.
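The weighting arithmetic of this workaround can be sketched in plain Python. The loss values and the strict even/odd alternation are assumptions made for the example, and the helper names combined_loss and step_weights are hypothetical. A real model would feed g_w and d_w as input tensors alongside each batch.

```python
def combined_loss(generator_loss, discriminator_loss, g_w, d_w):
    """Single scalar loss; the weights select which network a step trains."""
    return g_w * generator_loss + d_w * discriminator_loss


def step_weights(step):
    """Return (g_w, d_w): even steps train the generator, odd the discriminator."""
    return (1.0, 0.0) if step % 2 == 0 else (0.0, 1.0)


g_loss, d_loss = 0.7, 0.3  # made-up per-batch loss values

# The same expression is evaluated on every step; only the weights change.
losses = [combined_loss(g_loss, d_loss, *step_weights(s)) for s in range(4)]
```

Since the expression for the loss never changes, the execution graph stays the same from step to step, which is what avoids the re-compilation described above.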

Can I train a multi-task learning model with a TPU?

If the tasks can be represented as one large graph with an aggregate loss function, then no special support is needed for multi-task learning. However, the TPU execution engine currently only supports a single execution graph. Therefore, it is not possible to quickly alternate between multiple execution graphs which share variables but have different structure. Changing execution graphs requires re-running the graph compilation step, which can take 30 seconds or more.

Does the TPU support eager mode?

No, eager mode uses a new dynamic execution engine, while the TPU uses XLA, which performs static compilation of the execution graph.

Does the TPU support model parallelism?

Model parallelism (or executing non-identical TPU programs on the multiple cores within a single TPU device) is not currently supported on the TPU, but will be supported in an upcoming TensorFlow release.

How can I inspect the actual value of intermediate tensors on the TPU, as with tf.Print or tfdbg?

This capability is currently not supported on the TPU. The suggested pattern for development on the TPU is to implement the model using the TPUEstimator framework, which allows for effortless transition between the TPU and CPU/GPU with the use_tpu flag. You are encouraged to debug your models on the CPU/GPU using the standard TensorFlow tools, and then switch to the TPU when your model is ready for full-scale training.

My training scheme is too complex or specialized for the TPUEstimator API. Is there a lower-level API that I can use?

TPUEstimator is the primary framework for TPU training on a Cloud TPU. However, TPUEstimator wraps the tpu API, which is part of open source TensorFlow, so it is technically possible (but unsupported) to use the low-level tpu API directly. If your training pipeline requires frequent communication between the TPU and CPU, or requires frequently changing the execution graph, your computation cannot run efficiently on the TPU. Upcoming releases of TensorFlow will improve both capabilities.
