Using Cloud TPU Tools

This document describes how to set up the profiling tools, capture a profile, and use it to identify and analyze program performance on Cloud TPU in TensorBoard. It also describes how to continuously monitor your TPU job on the command line (see Monitoring your job).


Before you can use the Cloud TPU profiling tools described in this guide, you must complete the setup steps described in the following sections.

Using Cloud TPU tools in TensorBoard

TensorBoard is a suite of tools designed to present TensorFlow data visually. We have provided a set of Cloud TPU profiling tools that you can access from TensorBoard after you install the Cloud TPU profiler plugin. The plugin supports performance visualization for Cloud TPU nodes of all sizes.

The Cloud TPU tool selector becomes available under the Profile tab on the TensorBoard menu bar only after you have collected trace information from a running TensorFlow model.

The following sections contain instructions on how to set up your compute environment, run your model and capture a Cloud TPU profile, and run TensorBoard from a VM command line so you can use the tools. For details on how to invoke TensorBoard from your code, see the TensorBoard programming guides.

Creating Cloud TPU resources

The quick start instructions for Cloud TPU describe how to create a Compute Engine VM and Cloud TPU resources.

If you plan to continue with the procedures in this guide immediately after you create your resources, do not perform the clean up section of those instructions. When you have finished running your model and are done using your resources, follow the clean up instructions step to avoid incurring unwanted charges.

Running cloud_tpu_profiler

Installing cloud-tpu-profiler 1.15.0rc1 provides a Cloud TPU profiling plugin for TensorBoard and a script, capture_tpu_profile. You can run the script either to capture a profile that can be viewed in TensorBoard or to monitor your TPU jobs on the command line (see Monitoring your job).

In TensorBoard, the Profile tab appears after you run a model and then capture profile (trace) information while the model is running.

To check your profiler version, use pip. If you do not have the latest version, use the second command to install it:

(vm)$ pip freeze | grep cloud-tpu-profiler
(vm)$ pip install --upgrade "cloud-tpu-profiler>=1.15.0rc1"

You must also set the PATH environment variable as follows:

(vm)$ export PATH="$PATH:`python -m site --user-base`/bin"

About capture_tpu_profile

When you use capture_tpu_profile to capture a profile, a .tracetable file is saved to your Google Cloud Storage bucket. The file contains a large number of trace events that can be viewed in both trace viewer and streaming trace viewer in TensorBoard.

You capture a profile, or trace data, by running your model, executing capture_tpu_profile, and then starting up TensorBoard before the model stops running. For example:

capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR}

The parameters are set up when you run the model and have the following meanings:

  • --tpu=$TPU_NAME - This is the name assigned when you created your Cloud TPU. If you used ctpu up to start your Compute Engine VM and Cloud TPU, it defaults to your username. If you used gcloud or the Cloud Console to set up your VM and TPU, it is the name you specified when you created them.
  • --logdir=${MODEL_DIR} - This is a Cloud Storage location where your model and checkpoints are stored. For example, if you are running the MNIST model, the model directory might be defined as:
    export MODEL_DIR=${STORAGE_BUCKET}/mnist

As the model runs, the Cloud TPU service account creates a directory (object) and writes data to your Cloud Storage bucket. You must set permissions for that account on the bucket before you run your model.

By default, capture_tpu_profile captures a 2-second trace. You can set the trace duration with the duration_ms command-line option or in your program when you run your model.
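For example, to capture a longer trace you can pass duration_ms explicitly; the 10-second value here is illustrative:

```shell
# Capture a 10-second trace instead of the 2-second default.
capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR} --duration_ms=10000
```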

Capturing a profile

The steps in this section describe how to capture a profile by running your model, executing capture_tpu_profile, and then starting up TensorBoard before the model stops running. Running TensorBoard as described here provides access to all of the Profile tools except for streaming trace viewer.

If you prefer to monitor your TPU jobs on the command line, see Monitoring your job.

The MNIST tutorial is used as the model in this example.

  1. Run ctpu up.

  2. Go to the Cloud Console Dashboard (Home) then select Compute Engine > TPUs.

  3. Click on the Cloud TPU you created.

  4. Locate the Cloud TPU service account name and copy it.

  5. In the Resources panel on the Dashboard click on Cloud Storage. The storage bucket list appears or you are prompted to create a bucket. If you haven't yet created a storage bucket, click CREATE BUCKET at the top of the page and create one in the same region as your TPU.

  6. Click the checkbox of the storage bucket you want to use. Be sure the bucket is in the same region in which you created your Cloud TPU.

  7. With your storage bucket selected in the list, select Show Info Panel, and then select Edit bucket permissions.

  8. Paste your TPU service account name into the add members field for that bucket.

  9. Select Storage Legacy and then select the Storage Legacy Bucket Reader, Writer, and Owner permissions:


  10. In your VM shell, use pip to check your TensorBoard version.

    (vm)$ pip freeze | grep tensorboard
  11. If your TensorBoard or TensorFlow version is lower than 2.1, upgrade to the latest version.

    (vm)$ pip install --upgrade "tensorboard>=2.1"
    (vm)$ pip freeze | grep tensorflow
    (vm)$ pip install --upgrade "tensorflow>=2.1"
  12. If you have not already done so, set the PATH environment variable:

    (vm)$ export PATH="$PATH:`python -m site --user-base`/bin" 
  13. Follow the MNIST tutorial to set up and execute an MNIST training job, for example:

    (vm)$ python /usr/share/models/official/mnist/mnist_tpu.py \
      --tpu=$TPU_NAME \
      --data_dir=${STORAGE_BUCKET}/data \
      --model_dir=${MODEL_DIR} \
      --use_tpu=True \
      --iterations=500
  14. While the job is running, open a 2nd Cloud Shell and run ctpu up, specifying your non-default TPU name if you used one. After you log in to your VM, set up the following environment variables as shown in previous steps: STORAGE_BUCKET, MODEL_DIR.

  15. While the job is running, open a 3rd Cloud Shell and ssh to your VM (replace $vm in the command with your VM name).

    gcloud compute ssh $vm --ssh-flag=-L6006:localhost:6006 

    After you log in, set up the following environment variables as shown in previous steps: TPU_NAME, STORAGE_BUCKET, MODEL_DIR.

  16. In the 2nd Cloud Shell, run capture_tpu_profile to capture a .tracetable file.

    (vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR} 
  17. In the 3rd Cloud Shell, run TensorBoard and point it to the model directory:

    (vm)$ tensorboard --logdir=${MODEL_DIR} & 
  18. Click the Web preview button in the 3rd Cloud Shell and open port 6006 to view the TensorBoard output.


TensorBoard provides a number of visualizations, or graphs, of your model and its performance. Use the graphs together with the profiling tools to fine tune your models and improve their performance on Cloud TPU.

XLA structure graph

During model compilation, before the model is run, TensorFlow generates an Accelerated Linear Algebra (XLA) graph that will be run on the Cloud TPU. The data for the graph is stored in the MODEL_DIR directory. You can view this graph without running capture_tpu_profile.

To view a model's XLA structure graph, select the Graphs tab in TensorBoard. The default selection for Color is Structure.


A single node in the structure graph represents an XLA instruction. For example, a TensorFlow add op named x/y/z that is mapped (or lowered) to XLA appears as x/y/z/add in the graph.

An XLA graph displays information on how a Cloud TPU will execute a particular model. The graph also provides the shapes of inputs and outputs for various operations. Once you capture a profile of your model, you can use the XLA graph along with trace viewer or streaming trace viewer to gain insight into where most of the time is being spent.


  • Some nodes do not have TensorFlow namespaces because not all XLA instructions (such as those injected by the XLA compiler) have corresponding TensorFlow operations.
  • The TensorFlow program structure is incorporated in the XLA graph where possible. However, because the XLA program running on Cloud TPU is highly optimized, its graph structure might be quite different from that of the original TensorFlow program.
  • A special XLA instruction called fusion can merge multiple instructions from different TensorFlow operations into a single computation. The TensorFlow operation corresponding to the root instruction in the fusion is used as the namespace of the fusion operation.

TPU compatibility graph

The Graphs tab includes a compatibility checker module which checks for and displays TensorFlow ops that can potentially cause issues when a model is run.

To view a model's TPU compatibility graph, select the Graphs tab in TensorBoard and then select the TPU Compatibility option. The graph presents the compatible (valid) operations in green and the incompatible (invalid) operations in red.


A given node can display both colors, each as a percentage of the Cloud TPU-compatible operations for that node. See Interpreting compatibility results for an example.

The compatibility summary panel displayed to the right of the graph shows the percentage of all Cloud TPU-compatible operations, their attributes and a list of incompatible operations for a selected node.

Click on any operation in the graph to display its attributes in the summary panel.


Note that the compatibility checker does not assess any operations that are explicitly assigned to a non-TPU device using manual device placement. In addition, the checker does not actually compile the model for execution, so interpret the results as an estimate of compatibility.

To prepare a model for the compatibility check:

  • Configure your model to write the model graph data to a file by setting the model_dir property of the tf.estimator API or the TPUEstimator.
  • Remove any manual assignments in your code to GPUs or CPUs for operations that you intend to run on the Cloud TPU. When used with the TPU Compatibility option, the compatibility checker skips any operation explicitly assigned to a non-TPU device.
Interpreting compatibility results

The following diagram from the Abalone model shows a sample compatibility summary with several unavailable operations noted:


No manual device placement was specified for this model so all operations were checked, even those that should always be run on the CPU, such as the "save" and "report_uninitialized_variables" operations shown.

Two AssignAdd operations in root_mean_squared_error are potential problems.

You can see from the source code that root_mean_squared_error is used only as an additional evaluation metric:

    # Calculate root mean squared error as additional eval metric
    eval_metric_ops = {
        "rmse": tf.metrics.root_mean_squared_error(
            tf.cast(labels, tf.float64), predictions)
    }

Unless it occurs inside a training loop, this operation is normally run on the CPU so the error report can be disregarded. In conclusion, the model is ready to be run on a Cloud TPU.


The Profile tab, created when you ran capture_tpu_profile, appears in TensorBoard only after you have captured some model data. Once data is available, clicking on the Profile tab presents a selection of tools to help with performance analysis:

Profile overview page

The overview page (overview_page), available under Profile, provides a top-level view of how your model performed during a capture run. The page shows an aggregated overview for all of your TPUs, as well as an overall input pipeline analysis. You can select individual TPUs in the Host dropdown.

The page displays data in the following panels:


  • Performance summary

    • Step time averaged over all sampled steps
    • Percentage of time the Host was idle
    • Percentage of time the TPU was idle
    • Percentage utilization of the TPU matrix units
  • Step-time graph. Displays a graph of device step time (in milliseconds) over all the steps sampled. The blue area corresponds to the portion of the step time the TPUs were sitting idle waiting for input data from the host. The orange area shows how much of the time the Cloud TPU was actually working.

  • Top 10 TensorFlow operations on TPU. Displays the TensorFlow operations that consumed the most time. Clicking the Show table button displays a table like the following:


    Each row displays an operation's self time (as the percentage of time taken by all operations), cumulative time, category, name, and the FLOPS rate achieved.

  • Run environment

    • Number of hosts used
    • Type of TPU used
    • Number of TPU cores
    • Training batch size
  • Recommendation for next steps. Reports when a model is input bound and whenever issues with Cloud TPU occur. Suggests tools you can use to locate performance bottlenecks.

Input pipeline analyzer

The input pipeline analyzer provides insights into your performance results. The tool displays performance results from the input_pipeline.json file that is collected by the capture_tpu_profile tool.

The tool tells you immediately whether your program is input bound and can walk you through device- and host-side analysis to debug whatever stage(s) of the pipeline are creating bottlenecks.

See the guidance on input pipeline performance for deeper insight into optimizing pipeline performance.

Input pipeline

When a TensorFlow program reads data from a file, the read begins at the top of the TensorFlow graph in a pipelined manner. The read process is divided into multiple data processing stages connected in series, where the output of one stage is the input to the next. This system of reading is called the input pipeline.

A typical pipeline for reading records from files has the following stages:

  1. File reading
  2. File preprocessing (optional)
  3. File transfer from the host machine to the device
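The staged, pipelined structure can be sketched in plain Python; this is illustrative only, not the TensorFlow implementation (in TensorFlow these stages are typically built with the tf.data API), and the record contents are made up:

```python
# Illustrative sketch of a three-stage input pipeline: each stage
# consumes the previous stage's output, so stages can overlap in time.

def read_files(filenames):
    # Stage 1: file reading (fake one record per "file" here).
    for name in filenames:
        yield f"raw record from {name}"

def preprocess(records):
    # Stage 2: optional preprocessing (stands in for decoding, augmentation, etc.).
    for record in records:
        yield record.upper()

def transfer_to_device(records):
    # Stage 3: transfer from the host to the device (simulated as collecting).
    return list(records)

batch = transfer_to_device(preprocess(read_files(["a.tfrecord", "b.tfrecord"])))
```

Because the stages are chained generators, a downstream stage can begin consuming records before an upstream stage has finished producing all of them, which is the property a well-tuned input pipeline exploits.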

An inefficient input pipeline can severely slow down your application. An application is considered input bound when it spends a significant portion of time in the input pipeline. Use the input pipeline analyzer to understand where the input pipeline is inefficient.

Input pipeline dashboard

To open the input pipeline analyzer, select Profile, then select input_pipeline_analyzer from the Tools dropdown.

The dashboard contains three sections:


  1. Summary. Summarizes the overall input pipeline with information on whether your application is input bound and, if so, by how much.
  2. Device-side analysis. Displays detailed, device-side analysis results, including the device step-time and the range of device time spent waiting for input data across cores at each step.
  3. Host-side analysis. Shows a detailed analysis on the host side, including a breakdown of input processing time on the host.
Input pipeline summary

Section 1 reports if your program is input bound by presenting the percentage of device time spent on waiting for input from the host. If you are using a standard input pipeline that has been instrumented, the tool reports where most of the input processing time is spent. For example:


Device-side analysis

Section 2 details the device-side analysis, providing insights on time spent on the device versus on the host and how much device time was spent waiting for input data from the host.


  1. Step time plotted against step number. Displays a graph of device step time (in milliseconds) over all the steps sampled. The blue area corresponds to the portion of the step time the Cloud TPUs sat idle waiting for input data from the host. The orange area shows how much of the time the Cloud TPU was actually working.
  2. Step time statistics. Reports the average, standard deviation, and range ([minimum, maximum]) of the device step time.
  3. Device time across cores spent waiting for input data, by step number. Displays a line chart showing the amount of device time (expressed as a percentage of total device step time) spent waiting for input data processing. Since the fraction of time spent varies from core to core, the range of fractions for each core is also plotted for each step. Since the time a step takes is determined by the slowest core, you want the range to be as small as possible.
  4. Fraction of time waiting for input data. Reports the average, standard deviation, and range ([minimum, maximum]) of the fraction of time spent on the device waiting for input data, normalized to the total device step time.
Host-side analysis

Section 3 shows the details of host-side analysis, reporting a breakdown of the input processing time (the time spent on Dataset API operations) on the host into several categories:

  • Reading data from files on demand. Time spent on reading data from files without caching, prefetching, and interleaving.
  • Reading data from files in advance. Time spent reading files, including caching, prefetching, and interleaving.
  • Data preprocessing. Time spent on preprocessing operations, such as image decompression.
  • Enqueuing data to be transferred to device. Time spent putting data into an infeed queue before transferring the data to the device.


To see the statistics for individual input operations and their categories broken down by execution time, click the "Show Input Op statistics" button.

A source data table like the following appears:


Each table entry contains the following information:

  1. Input Op. Shows the TensorFlow op name of the input operation.
  2. Count. Shows the total number of instances for the operation executed during the profiling period.
  3. Total Time (in ms). Shows the cumulative sum of time spent on each of those instances.
  4. Total Time %. Shows the total time spent on an operation as a fraction of the total time spent in input processing.
  5. Total Self Time (in ms). Shows the cumulative sum of the self time spent on each of those instances. Self time measures the time spent inside the function body, excluding the time spent in the functions it calls. For example, Iterator::PaddedBatch::Filter::ForeverRepeat::Map is called by Iterator::PaddedBatch::Filter, so its total self time is excluded from the total self time of the latter.
  6. Total Self Time %. Shows the total self time as a fraction of the total time spent on input processing.
  7. Category. Shows the processing category of the input operation.
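The self-time relationship described above can be illustrated with a small calculation. The timings here are hypothetical, not real profiler output:

```python
# Hypothetical nested input ops with total times in ms; an op's self
# time is its total time minus the total time of the op it calls.
total_ms = {
    "Iterator::PaddedBatch::Filter": 120.0,
    "Iterator::PaddedBatch::Filter::ForeverRepeat::Map": 80.0,
}

# Filter's self time excludes the time spent inside the Map it calls.
filter_self_ms = (total_ms["Iterator::PaddedBatch::Filter"]
                  - total_ms["Iterator::PaddedBatch::Filter::ForeverRepeat::Map"])
```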

Op profile

Op profile (op_profile) is a Cloud TPU tool that displays the performance statistics of XLA operations executed during a profiling period. Op profile shows:

  • How well your application uses the Cloud TPU, shown as the percentage of time spent on operations by category and as TPU FLOPS utilization
  • The most time-consuming operations. Those operations are potential targets for optimization.
  • Details of individual operations, including shape, padding and expressions that use the operation.

You can use op profile to find good targets for optimization. For example, if your model achieves only 5% of the TPU peak FLOPS, you can use the tool to identify which XLA operations are taking the longest time to execute and how many TPU FLOPS they consume.

Using op profile

During profile collection, capture_tpu_profile also collects an op_profile.json file that contains performance statistics of XLA operations.

You can view the data from op_profile in TensorBoard by clicking on the Profile tab at the top of the screen and then selecting op_profile from the Tools dropdown. You will see a display like this:


  1. Overview section. Shows the percentage used of the Cloud TPU computational potential and provides suggestions for optimization.
  2. Control panel. Contains a settings slider that controls the number of ops displayed in the Op table and a toggle that sets the Op table to list the ops that comprise the top 90% of the total execution time.
  3. Op table. Lists the top TensorFlow operation categories associated with the XLA ops by the total amount of time, expressed as a percentage of Cloud TPU usage, that all operations in the category took to execute.
  4. Op details cards. Cards that appear when you hover over an op in the table, showing details including the FLOPS utilization, the expression in which the op is used, and the op layout (fit).
XLA Op table

The Op table lists XLA operation categories in order from the highest to lowest percentage of Cloud TPU usage. Initially, the table shows the percentage of time taken, the op category name, the associated TensorFlow op name, and the percentage of FLOPS utilization for the category. To display (or hide) the 10 most time-consuming XLA operations for a category, click the triangle next to the category name in the table.


  1. Time. Shows the total percentage of time spent by all the operations in that category. You can click to expand the entry and see the breakdown of time spent by each individual operation.
  2. Horizontal Bar. Shows the time distribution across categories.
  3. Top10 Ops. The toggle next to a category's name displays/hides the top 10 time-consuming operations within the category. If a fusion operation entry is displayed in the operations list, you can expand it to see the non-fusion, elementwise operations it contains.
  4. TensorFlow Op. Shows the TensorFlow op name associated with the XLA operation.
  5. FLOPS. Shows the FLOPS utilization, which is the measured number of FLOPS expressed as a percentage of the Cloud TPU peak FLOPS. The higher the FLOPS utilization percentage, the faster operations run. The table cell is color coded: green for high FLOPS utilization (good) and red for low FLOPS utilization (bad).
Op details cards

When you hover over a table entry, a card appears on the left displaying details about the XLA op or the operation category. A typical card looks like this:


  1. Name. Shows the highlighted XLA operation name.
  2. Category. Shows the operation category.
  3. FLOPS utilization. Displays FLOPS utilization as a percentage of total FLOPS possible.
  4. Expression. Shows the XLA expression containing the operation.
  5. Memory Utilization. Displays the peak memory used by your program as a percentage of total possible.
  6. Layout (Convolution operations only.) Shows the shape and layout of a tensor, including whether the shape of the tensor is an exact fit for the matrix units and how the matrix is padded.
Interpreting results

For convolution operations, TPU FLOPS utilization can be low due to one or both of the following reasons:

  • padding (matrix units are partially used)
  • convolution op is memory bound

This section interprets some numbers from a different model in which FLOPS utilization was low. In this example, output fusion and convolution dominated the execution time, and there was a long tail of vector or scalar operations that had very low FLOPS.

One optimization strategy for this type of profile is to transform the vector or scalar operations to convolution operations.

In the following example, %convolution.399 shows lower FLOPS and memory utilization than %convolution.340 in the previous example.


Examine the layout and note that batch size 16 is being padded to 128 and feature size 3 is being padded to 8, which indicates that only 5% of the matrix units are being effectively used. (The calculation for this instance of percent utilization is to multiply the batch times the feature (16 by 3) then divide the result by the padding, 128 and then by 8.) Compare the FLOPS in this example to the %convolution.340 in the previous example which has an exact fit to the matrix.
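The utilization arithmetic above can be checked directly, using the values from the example (batch 16 padded to 128, feature 3 padded to 8):

```python
# Effective matrix-unit utilization = (actual batch * actual feature)
# / (padded batch * padded feature).
batch, feature = 16, 3
padded_batch, padded_feature = 128, 8

utilization = (batch * feature) / (padded_batch * padded_feature)
# 48 / 1024 = 0.046875, i.e. roughly 5% of the matrix units.
```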

Pod viewer

The Pod viewer tool provides performance visualizations for every core in a Pod and displays the status of the communications channels across the cores in a Pod. Pod viewer can identify and highlight potential bottlenecks and areas that need optimization. The tool works for full Pods and all v2 and v3 Pod slices.

To display the Pod viewer tool:

  1. Select Profile from the menu button at the top right side of the TensorBoard window.
  2. Click the Tools menu on the left side of the window and select pod_viewer.

The Pod viewer user interface includes:

  1. A step slider, which allows you to select which step you want to examine.
  2. A topology graph, which interactively visualizes your TPU cores in the whole TPU system.
  3. A communication links chart, which visualizes the send and receive (recv) channels in the topology graph.
  4. A latency of send and recv channels bar chart. Hovering over a bar in this chart activates the communication links in the communication links chart. A channel details card appears on the left-hand bar, providing detailed information of the channel, such as the size of data transferred, latency, and bandwidth.
  5. A step breakdown chart, which visualizes a breakdown of a step for all cores. This can be used to track system bottlenecks and whether a particular core is slowing down the system.

Step slider


Use the slider to select a step. The rest of the tool displays statistics, such as step breakdown and communication links, for that step.

Topology graph

The topology graph is organized hierarchically by host, chip, and core. The smallest rectangles are TPU cores. Two cores together indicate a TPU chip and four chips together indicate a host.


The topology graph is also a heatmap, color coded by the percentage of time a particular breakdown (for example, High flops compute, infeed, send, etc.) takes in the selected step. The bar just below the topology graph (shown in the following graphic) shows a color coding for core and chip usage. The color of the cores shows the utilization, ranging from yellow to blue. For High flops compute, larger numbers (darker colors) indicate more time spent doing compute. For all other breakdowns, smaller numbers (lighter colors) indicate smaller wait times. Potential problem areas, or hotspots, are indicated when a core is darker than the others.

Click on the pulldown menu selector next to the system name (circled in the diagram) to choose the particular type of breakdown you want to examine.

Hover your mouse over any of the small rectangles (single cores) to display a tooltip showing the core's position in the system, its global chip ID, and its host name. The tooltip also includes the duration of the selected breakdown category, for example High flops, and its utilization percentage out of a step.


Communication channels

This tool helps visualize send and recv links if your model uses them to communicate between cores. When your model contains send and recv ops, you can use a channel ID selector to select a channel ID. A link from the source (src) core to the destination (dst) core represents the communication channel. It is rendered on the topology graph when you hover your mouse over the bars on the chart showing the latency of send and recv channels.


A card appears on the left-hand bar giving you more details about the communication channel. A typical card looks like this:


  1. Data Transferred, which shows the data transferred over the send and recv channel in mebibytes (MiB).
  2. Latency, which shows the duration, in microseconds, from the start of the send event to the end of the recv-done event.
  3. BW, which shows the amount of data transferred, in gibibytes (GiB), from the source core to the destination core over the duration of the transfer.
  4. Send Delay, which is the duration from the beginning of the recv-done to the beginning of send in microseconds. If the recv-done op starts after the beginning of the send op, the delay is zero.
  5. Hlo Names, which displays the XLA hlo ops names associated with this channel. These hlo names are associated with the statistics displayed in other TensorBoard tools such as op_profile and memory_viewer.
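Assuming the BW field relates to the other card fields roughly as data transferred divided by latency, a hypothetical channel illustrates the unit conversions (MiB and microseconds in, GiB per second out; the values are made up):

```python
# Hypothetical channel: 256 MiB transferred with 50,000 us latency.
data_mib = 256.0
latency_us = 50_000.0

data_gib = data_mib / 1024.0          # MiB -> GiB
latency_s = latency_us / 1_000_000.0  # us -> s
bandwidth_gib_per_s = data_gib / latency_s
# 0.25 GiB over 0.05 s gives a bandwidth of 5 GiB/s.
```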

Latency of send and recv channels chart

This chart provides details about the send and recv communication channels. Hovering over a bar in this chart displays the send and recv links on the topology graph above.


Step breakdown chart

This chart provides details for each training or evaluation step.

The x-axis is the global chip ID and the y-axis is the time in microseconds. From this chart, you can see where the time is used in a particular training step, where any bottlenecks are, and whether there is a load imbalance across all chips.


A card appears on the left-hand bar giving you more details about the step breakdown. A typical card looks like this:


The fields in the card specify the following:

  1. High Flops Compute, which is the time spent on convolution or output fusion operations (ops).
  2. Low flops compute, which is calculated by deducting all other breakdowns from the total duration.
  3. Infeed, which is the time the TPU spends waiting on the host.
  4. Outfeed, which is the time the host spends waiting on output from the TPU.
  5. AllReduce sync, which is the portion of time spent on CrossReplicaSum ops waiting to synchronize with other cores. The CrossReplicaSum op computes the sum across replicas.
  6. AllReduce compute, which is the actual compute time spent on CrossReplicaSum ops.
  7. Chip to chip send ops, which is the time spent on send operations.
  8. Chip to chip recv-done ops, which is the time spent on recv operations.
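The "Low flops compute" derivation described above (total duration minus all other breakdowns) can be sketched as follows, with hypothetical per-step durations in microseconds:

```python
# Hypothetical step breakdown (us). Low-flops compute is whatever part
# of the step duration the other categories do not account for.
step_duration_us = 10_000.0
breakdown_us = {
    "high_flops_compute": 6_000.0,
    "infeed": 1_500.0,
    "outfeed": 500.0,
    "all_reduce_sync": 400.0,
    "all_reduce_compute": 300.0,
    "send": 200.0,
    "recv_done": 100.0,
}

low_flops_compute_us = step_duration_us - sum(breakdown_us.values())
```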

Trace viewer

Trace viewer is a Cloud TPU performance analysis tool available under Profile. The tool uses the Chrome trace event profiling viewer so it only works in the Chrome browser.

Trace viewer displays a timeline that shows:

  • Durations for the operations that were executed by your TensorFlow model.
  • Which part of the system (TPU or host machine) executed an operation. Typically, the host machine executes infeed operations, which preprocess training data and transfer it to the TPU, while the TPU executes the actual model training.

Trace viewer allows you to identify performance problems in your model, then take steps to resolve them. For example, at a high level, you can identify whether infeed or model training is taking the majority of the time. Drilling down, you can identify which TensorFlow operations are taking the longest to execute.

Note that trace viewer is limited to 1M events per Cloud TPU. If you need to assess more events, use the streaming trace viewer instead.

Trace viewer interface

To open trace viewer, go to TensorBoard and click on the Profile tab at the top of the screen. The viewer appears displaying your most recent run:


This screen contains the following main elements (marked with numbers above):

  1. Runs dropdown. Contains all of the runs for which you've captured trace information. The default view is your most recent run, but you can open the dropdown to select a different run.
  2. Tools dropdown. Selects different profiling tools.
  3. Host dropdown. Selects a host that contains a Cloud TPU set.
  4. Timeline pane. Shows operations that Cloud TPU and the host machine executed over time.
  5. Details pane. Shows additional information for operations selected in the Timeline pane.

Here's a closer look at the timeline pane:


The Timeline pane contains the following elements:

  1. Top bar. Contains various auxiliary controls.
  2. Time axis. Shows time relative to the beginning of the trace.
  3. Section and track labels. Each section contains multiple tracks and has a triangle on the left that you can click to expand and collapse the section. There is one section for every processing element in the system.
  4. Tool selector. Contains various tools for interacting with the trace viewer.
  5. Events. These show the time during which an operation was executed or the duration of meta-events, such as training steps.
  6. Vertical tab bar. This does not have a useful purpose for Cloud TPU. The bar is part of the general purpose trace viewer tool provided by Chrome that is used for a variety of performance analysis tasks.

Sections and tracks

Trace viewer contains the following sections:

  • One section for each TPU node, labeled with the number of the TPU chip and the TPU node within the chip (for example, "Chip 2: TPU Core 1"). Each TPU node section contains the following tracks:
    • Step. Shows the duration of the training steps that were running on the TPU.
    • TensorFlow Ops. Shows TensorFlow operations executed on the TPU.
    • XLA Ops. Shows XLA operations that ran on the TPU. (Each TensorFlow operation is translated into one or several XLA operations. The XLA compiler translates the XLA operations into code that runs on the TPU.)
  • One section for threads running on the host machine's CPU, labeled "Host Threads". The section contains one track for each CPU thread. Note: You can ignore the information displayed alongside the section labels.

Timeline tool selector

You can interact with the timeline view using the timeline tool selector in TensorBoard. You can click on a timeline tool or use the following keyboard shortcuts to activate and highlight a tool. To move the timeline tool selector, click in the dotted area at the top and then drag the selector to where you want it.

Use the timeline tools as follows:

Selection tool
Click on an event to select it or drag to select multiple events. Additional information about the selected event or events (name, start time, and duration) will be displayed in the details pane.

Pan tool
Drag to pan the timeline view horizontally and vertically.

Zoom tool
Drag up to zoom in or drag down to zoom out along the horizontal (time) axis. The horizontal position of the mouse cursor determines the center around which the zoom takes place.

Note: The zoom tool has a known bug where zoom remains active if you release the mouse button while the mouse cursor is outside the timeline view. If this happens to you, just click briefly on the timeline view to stop zooming.

Timing tool
Drag horizontally to mark a time interval. The length of the interval appears on the time axis. To adjust the interval, drag its ends. To clear the interval, click anywhere inside the timeline view.

Note that the interval remains marked if you select one of the other tools.

Events within the timeline are displayed in different colors; the colors themselves have no specific meaning.

Timeline top bar

The top bar of the Timeline pane contains several auxiliary controls:


  1. Metadata display. Not used for TPUs.
  2. View Options. Not used for TPUs.
  3. Search box. Enter text to search for all events whose name contains the text. Click the arrow buttons to the right of the search box to move forwards and backwards through the matching events, selecting each event in turn.
  4. Console button. Not used for TPUs.
  5. Help button. Click to display a help summary.

Keyboard shortcuts

Following are the keyboard shortcuts you can use in trace viewer. Click the help button (?) in the top bar to see more keyboard shortcuts.

    w Zoom in
    s Zoom out
    a Pan left
    d Pan right
    f Zoom to selected event(s)
    m Mark time interval for selected event(s)
    1 Activate selection tool
    2 Activate pan tool
    3 Activate zoom tool
    4 Activate timing tool

The f shortcut can be highly useful. Try selecting a step and pressing f to zoom into the step quickly.

Characteristic events

Following are some of the event types that can be very useful when analyzing TPU performance.


  • InfeedDequeueTuple. This TensorFlow operation runs on a TPU and receives input data coming from the host. When infeed takes a long time, it can mean that the TensorFlow operations which preprocess the data on the host machine cannot keep up with the TPU data consumption rate. You can see corresponding events in the host traces called InfeedEnqueueTuple. To view a more detailed input-pipeline analysis, use the Input Pipeline Analyzer tool.

  • CrossReplicaSum. This TensorFlow operation runs on a TPU and computes a sum across replicas. Because each replica corresponds to a different TPU node, the operation must wait for all TPU nodes to be finished with a step. If this operation is taking a long time, it might not mean that the summing operation itself is slow but that a TPU node is waiting for another TPU node with a slow data infeed.


  • Dataset Ops. Trace viewer visualizes dataset operations performed when data is loaded using the Dataset API. The Iterator::Filter::Batch::ForeverRepeat::Memory event in the example corresponds to one such compiled dataset operation. Use trace viewer to examine the loading operations as you work through debugging and mitigating input pipeline bottlenecks.


  • Prefetch Threads. Using dataset.prefetch() to buffer input data can prevent sporadic slowdowns in file access that create bottlenecks in the input pipeline.
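
The benefit of a prefetch thread can be sketched in plain Python. This is a conceptual illustration only, not the tf.data implementation: a background thread fills a bounded queue (standing in for dataset.prefetch()) so that the consumer (standing in for the TPU) rarely waits on input. The function names are hypothetical.

```python
import queue
import threading
import time

def producer(q, n_batches, load_time=0.005):
    # Stands in for the host-side input pipeline: reading and
    # preprocessing one batch takes `load_time` seconds.
    for i in range(n_batches):
        time.sleep(load_time)
        q.put(i)
    q.put(None)  # Sentinel: no more data.

def train_with_prefetch(n_batches=20, buffer_size=4):
    # The bounded queue plays the role of dataset.prefetch(buffer_size):
    # the producer thread runs ahead while the consumer computes.
    q = queue.Queue(maxsize=buffer_size)
    threading.Thread(target=producer, args=(q, n_batches), daemon=True).start()
    steps = 0
    while True:
        batch = q.get()    # Usually returns immediately: data is buffered.
        if batch is None:
            break
        time.sleep(0.005)  # Stands in for one training step on the TPU.
        steps += 1
    return steps
```

Because loading and training overlap, the wall time per step approaches the slower of the two stages rather than their sum, which is the same effect dataset.prefetch() has on an input pipeline.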

What can go wrong

Here are some potential issues to be aware of when using trace viewer:

  • Event display limit. Trace viewer displays a maximum of 1 million events. If you captured more events, only the earliest 1 million events are displayed; later events are dropped. To capture more TPU events, you can use the --include_dataset_ops=False flag to explicitly require capture_tpu_profile to exclude the dataset ops.
  • Very long events. Events that begin before a capture starts or that end after a capture is finished are not visible in trace viewer. Consequently, very long events can be missed.
  • When to start trace capture. Be sure to start trace capture after you know the Cloud TPU is running. If you start before then, you may see only a few events or no events at all in trace viewer. You can increase the profile time using the --duration_ms flag and you can set automatic retries using the --num_tracing_attempts flag. For example:

    (vm)$ capture_tpu_profile --tpu=$TPU_NAME \
    --logdir=${MODEL_DIR} --duration_ms=60000 --num_tracing_attempts=10

Memory viewer

Memory viewer allows you to visualize the peak memory usage for your program, and memory usage trends over the program's lifetime.

The memory viewer UI looks like this:


  1. Host dropdown. Selects a TPU host and the XLA High Level Optimizer (HLO) module to visualize.
  2. Memory overview. Displays peak memory allocation and size without padding.
  3. Working space chart. Displays peak memory use and a plot of memory usage trends over the program's lifetime.
  4. Buffer charts. Two charts that display buffer allocation at the point of peak memory usage, as indicated by the vertical line in the working space plot. Hovering over a buffer in one of the buffer charts displays the buffer's lifetime bar in the working space chart and a details card on the left.
  5. Buffer allocation details card. Displays allocation details for a buffer.

Memory overview panel

The memory overview (top) panel shows the module name and the peak memory allocation, which is set when the total buffer allocation size reaches its maximum. The unpadded peak allocation size is also shown for comparison.


Working space chart

This chart displays peak memory use and a plot of memory usage trends over the program's lifetime. The vertical line drawn from the top to the bottom of the plot marks peak memory utilization for the program; this point determines whether or not a program can fit into the available global memory space.


Each point on the overlying line plot represents a "program point" in XLA's HLO program as scheduled by the compiler. The line provides a sense of the spikiness leading to and from the peak usage.

Interaction with buffer chart elements

When you hover over a buffer displayed in one of the buffer charts below the working space chart, a horizontal lifetime line for that buffer appears in the working space chart. The horizontal line is the same color as the highlighted buffer.


The thickness of the horizontal line indicates the magnitude of the buffer size relative to the peak memory allocation. The line length corresponds to the lifetime of the buffer, starting at the point in the program where buffer space was allocated and ending where the space was freed.

Buffer charts

Two charts show the breakdown of memory usage at the peak usage point (indicated by the vertical line in the plot above the charts).


  • By Program Order. Displays the buffers from left to right in the order in which they were active during program execution. Buffers active for the longest time are on the left side of the chart.

  • By Size. Displays the buffers that were active during program execution in descending size order. Buffers that had the largest impact at the point of peak memory usage are on the left.

Buffer allocation details card

When you hover over a buffer displayed in one of the buffer charts, a buffer allocation details card appears (in addition to the lifetime line displayed in the working chart). A typical details card looks like this:


  1. Name. Name of the XLA operation.
  2. Category. Operation category.
  3. Size. Size of the buffer allocation (including padding).
  4. Unpadded size. Size of the buffer allocation without padding.
  5. Expansion. Relative magnitude of padded buffer size versus the unpadded size.
  6. Extra memory. Indicates how much extra memory is used for padding.
  7. Shape. Describes the rank, size, and data type of the N-dimensional array.
  8. TensorFlow op name. Shows the name of the TensorFlow operation associated with the buffer allocation.
  9. Allocation type. Indicates buffer allocation category. Types are: Parameter, Output, Thread-local, and Temporary (for example, buffer allocation within a fusion).
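
The Expansion and Extra memory fields on the card follow directly from the padded and unpadded sizes. As a hedged sketch (assuming the common TPU tiling rule for float32 tensors of padding the last dimension to a multiple of 128 and the second-to-last to a multiple of 8; the helper names are hypothetical), the numbers can be reproduced like this:

```python
import math

def padded_shape(shape):
    # Assumed TPU tiling for float32: pad the minor-most dimension to a
    # multiple of 128 and the second minor-most to a multiple of 8.
    if len(shape) < 2:
        return list(shape)
    padded = list(shape)
    padded[-1] = math.ceil(shape[-1] / 128) * 128
    padded[-2] = math.ceil(shape[-2] / 8) * 8
    return padded

def buffer_card(shape, bytes_per_elem=4):
    # Recomputes the size-related fields of a buffer details card.
    unpadded = math.prod(shape) * bytes_per_elem
    padded = math.prod(padded_shape(shape)) * bytes_per_elem
    return {
        "size": padded,                     # Size (including padding)
        "unpadded_size": unpadded,          # Unpadded size
        "expansion": padded / unpadded,     # Expansion ratio
        "extra_memory": padded - unpadded,  # Extra memory used for padding
    }
```

For example, a [100, 200] float32 buffer would pad to [104, 256] under this assumption, so a large fraction of its allocated memory is padding; reshaping tensors toward these tile boundaries is one way to reduce the Extra memory shown on the card.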

"Out of memory" errors

If you run a model and get an "out of memory" error, use the following command to capture a memory profile and view it in the memory viewer. Make sure to set an appropriate --duration_ms so that the profiling period overlaps with your program compilation time. The output can help you understand what caused the error:

(vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR} --duration_ms=60000

Streaming trace viewer

Streaming trace viewer (trace_viewer) is a Cloud TPU performance analysis tool, available for TensorFlow 2.1 or later, that provides dynamic trace renderings. The tool uses the Chrome trace event profiling viewer so it works only in the Chrome browser.

When you use capture_tpu_profile 1.15.0rc1 to capture a profile, a .tracetable file is saved to your Google Cloud Storage bucket. The file contains a large number of trace events that can be viewed in both trace viewer and streaming trace viewer.

Using streaming trace viewer

To use the streaming trace viewer, trace_viewer, you must shut down your existing TensorBoard session and then relaunch TensorBoard using the IP address of the TPU you want to examine. Streaming trace viewer requires TensorBoard to make a Google Remote Procedure Call (GRPC) to an IP address for the Cloud TPU. The GRPC channel is not encrypted.

To find the IP address for a Cloud TPU host on the Cloud Console, open the TPUs page and look at the displayed table for the name of the Cloud TPU whose trace you want to view.


The Internal IP column for each Cloud TPU contains an IP address, [TPU_IP].


In your VM, run TensorBoard as follows:

(vm)$ tensorboard --logdir=${MODEL_DIR} --master_tpu_unsecure_channel=[TPU_IP]

The trace_viewer tool appears in the Tools dropdown list.


In the timeline, you can zoom in and out to see trace events load dynamically into your browser.


Monitoring your Cloud TPU job

This section describes how to use capture_tpu_profile to capture a single profile or continuously monitor your Cloud TPU job on the command-line interface in real time. By setting the --monitoring_level option to 0 (the default), 1, or 2, you get a single profile, basic monitoring, or detailed monitoring, respectively.

  1. Open a new Cloud Shell and ssh to your VM (replace $vm in the command with your VM name):

    gcloud compute ssh $vm --ssh-flag=-L6006:localhost:6006
  2. In the new Cloud Shell, run capture_tpu_profile with the --monitoring_level flag set to either 1 or 2, such as:

    (vm)$ capture_tpu_profile --tpu=$TPU_NAME  --monitoring_level=1

Setting monitoring_level=1 produces output similar to the following:

    TPU type: TPU v2
    Utilization of TPU Matrix Units is (higher is better): 10.7%

Setting monitoring_level=2 displays more detailed information:

    TPU type: TPU v2
    Number of TPU Cores: 8
    TPU idle time (lower is better): 0.091%
    Utilization of TPU Matrix Units is (higher is better): 10.7%
    Step time: 1.95 ms (avg), 1.90 ms (min), 2.00 ms (max)
    Infeed percentage: 87.5% (avg), 87.2% (min), 87.8% (max)

Monitoring flags

  • --tpu (required) specifies the name of the Cloud TPU you want to monitor.
  • --monitoring_level. Changes the behavior of capture_tpu_profile from producing a single profile to basic or detailed continuous monitoring. There are three available levels:
    • Level 0 (the default): produces a single profile, then exits.
    • Level 1: shows the TPU version and TPU utilization.
    • Level 2: shows the TPU utilization, TPU idle time, and number of TPU cores used. Also provides min, avg, and max step times along with the infeed percentage contribution.
  • --duration_ms (optional; default is 1000ms) specifies how long to profile the TPU host during each cycle. Generally, this should be long enough to capture at least one training step's worth of data. One second captures a training step in most models, but if your model's step time is very large, you can set the value to 2x the step time (in ms).
  • --num_queries specifies how many cycles to run capture_tpu_profile. To continuously monitor your TPU job, set the value to a high number. To quickly check your model's step time, set the value to a low number.
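
Choosing these flag values reduces to simple arithmetic. The following hypothetical helpers (not part of capture_tpu_profile) encode the rules of thumb above: profile for at least 2x the step time, and note that total monitoring time is the per-cycle duration times the number of queries.

```python
def suggest_duration_ms(step_time_ms, default_ms=1000):
    # Rule of thumb from above: profile for at least 2x the step time,
    # never less than the 1000 ms default.
    return max(default_ms, 2 * step_time_ms)

def total_monitoring_ms(duration_ms, num_queries):
    # capture_tpu_profile runs `num_queries` cycles of `duration_ms` each.
    return duration_ms * num_queries
```

For example, a model with an 800 ms step time would call for a duration of at least 1600 ms, and 60 queries at the default 1000 ms duration keeps monitoring running for about a minute.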