Profile your model with Cloud TPU tools

Profiling enables you to optimize your model's training performance on Cloud TPUs. To profile your model, use TensorBoard and the Cloud TPU TensorBoard plug-in.

For more information about using TensorBoard with one of the supported frameworks, see the following documents:

Prerequisites to profiling a training script

Before you use the TPU profiling tools, you need to:

  1. Start a model training session

    1. Set up a v4-8 TPU to train a model. The profiling procedure described in this document uses a ResNet model, but you can use another model provided it trains on a v4 TPU.
    2. In your TPU VM, add a line to the training script that starts the profiler server.

      For the ResNet training, the training script is at: /usr/share/tpu/tensorflow/resnet50_keras/

      At the top of the file, add the following import:

      import tensorflow.compat.v2 as tf2

      Right before the script starts the training loop, add the highlighted line:

      if __name__ == '__main__':
        tf2.profiler.experimental.server.start(6000)

      This starts up the TensorFlow profiler server on your TPU VM when the training starts.

    3. Start the model training.

      Run your training script and wait until you see output indicating your model is actively training. What this looks like depends on your code and model. Look for output like Epoch 1/100. Alternatively, you can navigate to the Cloud TPU page in the Google Cloud console, select your TPU and view the CPU utilization graph. While this does not show TPU utilization, it is a good indication that the TPU is training your model.

Start profiling the model training

While the model is training, open a separate terminal window or Cloud Shell. Use the following steps to begin profiling the model training.

  1. In the new window or shell, SSH into your TPU VM with port forwarding

    gcloud compute tpus tpu-vm ssh your-vm --zone=us-central2-b --ssh-flag="-4 -L 9001:localhost:9001"

    This allows your local browser to communicate with the TensorBoard server running on your TPU VM.

  2. Install TensorFlow requirements

    TensorBoard is installed by default on Cloud TPU VMs as part of TensorFlow, although you can also install TensorFlow manually. Either way, some additional dependencies may be required. Install them on your TPU VM by running:

    pip3 install -r /usr/share/tpu/models/official/requirements.txt
  3. Install the Cloud TPU TensorBoard Plugin

    From the TPU VM, run the following commands:

     pip3 install --upgrade "cloud-tpu-profiler>=2.3.0"
     pip3 install tensorflow
     pip3 install tensorflow-plugin-profile
  4. Start the TensorBoard server

    Create a log directory (logdir) on the TPU VM where TensorBoard can write profiling data, then run TensorBoard and specify the log directory with the --logdir flag. For example:

    mkdir log-directory
    TPU_LOAD_LIBRARY=0 tensorboard --logdir log-directory --port 9001

TensorBoard starts a web server and displays its URL:

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.3.0 at http://localhost:9001 (Press CTRL+C to quit)

Open a web browser and go to the URL displayed in the TensorBoard output. Select Profile from the dropdown menu in the upper right of the TensorBoard page. The list of available profiling tools is shown under the Tools dropdown in the left sidebar.


Capture a profile on TPU VMs

  1. Select the CAPTURE PROFILE button
  2. Select the IP address radio button
  3. Type HOSTNAME:6000 in the text box
  4. Select the CAPTURE button
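
As an alternative to the buttons above, a capture can also be requested from Python with the TensorFlow profiler client. The following is a minimal sketch, assuming the profiler server is listening on port 6000 as in the steps above. The helper names `profiler_service_addr` and `capture_profile`, the host value, and the 2000 ms duration are illustrative assumptions, not part of the documented procedure.

```python
def profiler_service_addr(host: str, port: int = 6000) -> str:
    """Build the gRPC address of the profiler server started by the training script."""
    return f"grpc://{host}:{port}"


def capture_profile(host: str, logdir: str, duration_ms: int = 2000) -> None:
    """Ask a running profiler server for a profile and write it under logdir."""
    # Imported lazily so the address helper works without TensorFlow installed.
    import tensorflow as tf

    tf.profiler.experimental.client.trace(
        service_addr=profiler_service_addr(host),
        logdir=logdir,
        duration_ms=duration_ms,
    )


# "localhost" is a placeholder; substitute the HOSTNAME of your TPU VM.
addr = profiler_service_addr("localhost")
print(addr)
```

Calling `capture_profile("localhost", "log-directory")` while training is running should write a profile that TensorBoard can pick up from the same log directory.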


View profile data with TensorBoard

After you capture a profile, TensorBoard displays the overview_page. The list of profiling tools you can use is displayed in the left pane.



The Profile tab is displayed after you have captured some model data. You may need to click the refresh button on the TensorBoard page. Once data is available, clicking the Profile tab presents a selection of tools to help with performance analysis. You can use any of the following tools to profile your model.

Profile overview page

The overview page (overview_page), available under Profile, provides a top-level view of how your model performed during a capture run. The page shows an aggregated overview for all your TPUs, as well as an overall input pipeline analysis. To view individual TPUs, select them in the Host dropdown.

The page displays data in the following panels:


  • Performance summary

    • FLOPS Utilization - The percentage utilization of the TPU matrix units
  • Top 10 TensorFlow operations on TPU. Displays the TensorFlow operations that consumed the most time:

    Each row displays an operation's self time (as the percentage of time taken by all operations), cumulative time, category, name, and the FLOPS rate achieved.

  • Run environment

    • Number of hosts used
    • Type of TPU used
    • Number of TPU cores

Input pipeline analyzer

The input pipeline analyzer provides insights into your performance results. The tool displays performance results from the input_pipeline.json file that is collected by the capture_tpu_profile tool.

The tool tells you immediately whether your program is input bound and can walk you through device and host-side analysis to debug whatever stage(s) of the pipeline are creating bottlenecks.

See the guidance on input pipeline performance for deeper insight into optimizing pipeline performance.

Input pipeline

When a TensorFlow program reads data from a file, the read begins at the top of the TensorFlow graph and proceeds in a pipelined manner. The read process is divided into multiple data processing stages connected in series, where the output of one stage is the input to the next one. This system of reading is called the input pipeline.

A typical pipeline for reading records from files has the following stages:

  1. File reading
  2. File preprocessing (optional)
  3. File transfer from the host machine to the device

An inefficient input pipeline can severely slow down your application. An application is considered input bound when it spends a significant portion of time in its input pipeline. Use the Input pipeline analyzer to understand where the input pipeline is inefficient.
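
The three stages listed above can be sketched as chained Python generators, where each stage consumes the output of the previous one. This illustrates the pipelining idea only and is not how TensorFlow implements it; the function names and the in-memory "files" are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_records(files: Iterable[List[str]]) -> Iterator[str]:
    """Stage 1: file reading. Each 'file' here is just a list of records."""
    for records in files:
        yield from records


def preprocess(records: Iterable[str]) -> Iterator[str]:
    """Stage 2: optional preprocessing (here: trim whitespace and upper-case)."""
    for record in records:
        yield record.strip().upper()


def transfer_to_device(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Stage 3: group records into batches, standing in for host-to-device transfer."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # Emit the final, possibly smaller, batch.


files = [[" a ", "b"], ["c "]]
batches = list(transfer_to_device(preprocess(read_records(files)), batch_size=2))
print(batches)  # [['A', 'B'], ['C']]
```

Because the stages run in series, a slow stage anywhere in the chain delays every batch that reaches the device.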

Input pipeline dashboard

To open the input pipeline analyzer, select Profile, then select input_pipeline_analyzer from the Tools dropdown.

The dashboard shows device-side and host-side analysis details.

Device-side analysis - Shows details on device step times.

  • Device step time statistics
  • % of device step time waiting for input data
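
As a rough sketch of how the second statistic can be read, the percentage of device step time spent waiting for input is the wait time divided by the total step time. The 20% cutoff below is an arbitrary illustrative threshold, not the value the analyzer uses, and the function names are hypothetical.

```python
def percent_waiting_for_input(wait_ms: float, step_ms: float) -> float:
    """Return the share of a device step spent waiting for input, in percent."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    return 100.0 * wait_ms / step_ms


def looks_input_bound(wait_ms: float, step_ms: float,
                      threshold_percent: float = 20.0) -> bool:
    """Flag a step as input bound when the wait share exceeds the (assumed) threshold."""
    return percent_waiting_for_input(wait_ms, step_ms) > threshold_percent


# Hypothetical step: 120 ms in total, 30 ms of it waiting for input data.
pct = percent_waiting_for_input(wait_ms=30.0, step_ms=120.0)
print(pct)  # 25.0
```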

Host-side analysis

This section shows the details of host-side analysis, reporting the input processing time (the time spent on Dataset API operations) on the host, broken into several categories:

  • Enqueuing data to be transferred to device. Time spent putting data into an infeed queue before transferring the data to the device.
  • Data preprocessing. Time spent on preprocessing operations, such as image decompression.
  • Reading data from files in advance. Time spent reading files, including caching, prefetching, and interleaving.
  • Reading data from files on demand. Time spent on reading data from files without caching, prefetching, and interleaving.
  • Other data reading or processing. Time spent on other input related operations not using


To see the statistics for individual input operations and their categories broken down by execution time, expand the "Show Input Op statistics" section.

A source data table like the following appears:


Each table entry contains the following information:

  1. Input Op. Shows the TensorFlow op name of the input operation.
  2. Count. Shows the total number of instances of the operation executed during the profiling period.
  3. Total Time (in ms). Shows the cumulative sum of time spent on each of the operation instances.
  4. Total Time %. Shows the total time spent on an operation as a fraction of the total time spent in input processing.
  5. Total Self Time (in ms). Shows the accumulated time over all instances of the function. The self time measures the time spent inside the function body, excluding the time spent in the functions it calls. For example, Iterator::PaddedBatch::Filter::ForeverRepeat::Map is called by Iterator::PaddedBatch::Filter, so its total self time is excluded from the total self time of the latter.
  6. Total Self Time %. Shows the total self time as a fraction of the total time spent on input processing.
  7. Category. Shows the processing category of the input operation.
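
The relationship between total time and self time can be sketched as follows: an op's self time is its total time minus the total time of the ops it calls. The op names follow the example above, but the timing numbers are invented for illustration and the helper name is hypothetical.

```python
from typing import List


def self_time_ms(total_ms: float, callee_total_ms: List[float]) -> float:
    """Self time = total time of the op minus the total time of the ops it calls."""
    return total_ms - sum(callee_total_ms)


# Hypothetical timings, in milliseconds.
filter_total = 50.0  # Iterator::PaddedBatch::Filter
map_total = 35.0     # Iterator::PaddedBatch::Filter::ForeverRepeat::Map

filter_self = self_time_ms(filter_total, [map_total])
print(filter_self)  # 15.0
```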

Op profile

Op profile is a Cloud TPU tool that displays the performance statistics of XLA operations executed during a profiling period. The op profile shows:

  • How well your application uses the Cloud TPU, shown as the percentage of time spent on operations by category and as TPU FLOPS utilization.
  • The most time-consuming operations. Those operations are potential targets for optimization.
  • Details of individual operations, including shape, padding and expressions that use the operation.

You can use op profile to find good targets for optimization. For example, if your model achieves only 5% of the TPU peak FLOPS, you can use the tool to identify which XLA operations are taking the longest time to execute and how many TPU FLOPS they consume.

Using op profile

During profile collection, capture_tpu_profile also creates an op_profile.json file that contains performance statistics of XLA operations.

You can view the data from op_profile in TensorBoard by clicking on the Profile tab at the top of the screen and then selecting op_profile from the Tools dropdown. You will see a display like this:


  1. Overview section. Shows Cloud TPU utilization and provides suggestions for optimization.
  2. Control panel. Contains controls that allow you to set the number of operations displayed in the table, which operations are displayed, and how they are sorted.
  3. Op table. A table that lists the top TensorFlow operation categories associated with the XLA ops. These operations are sorted by percentage of Cloud TPU usage.
  4. Op details cards. Details about the op that appear when you hover over an op in the table. These include the FLOPS utilization, the expression in which the op is used, and the op layout (fit).

XLA Op table

The Op table lists XLA operation categories in order from the highest to lowest percentage of Cloud TPU usage. The table shows the percentage of time taken, the op category name, the associated TensorFlow op name, and the percentage of FLOPS utilization for the category. To display (or hide) the 10 most time-consuming XLA operations for a category, click the triangle next to the category name in the table.


  1. Time. Shows the total percentage of time spent by all the operations in that category. You can click to expand the entry and see the breakdown of time spent by each individual operation.
  2. Top10 Ops. The toggle next to a category's name displays/hides the top 10 time-consuming operations within the category. If a fusion operation entry is displayed in the operations list, you can expand it to see the non-fusion, elementwise operations it contains.
  3. TensorFlow Op. Shows the TensorFlow op name associated with the XLA operation.
  4. FLOPS. Shows the FLOPS utilization, which is the measured number of FLOPS expressed as a percentage of the Cloud TPU peak FLOPS. The higher the FLOPS utilization percentage, the faster operations run. The table cell is color coded: green for high FLOPS utilization (good) and red for low FLOPS utilization (bad).
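
The FLOPS utilization defined in the last item is a simple ratio. The sketch below shows the arithmetic only; the measured and peak values are made-up numbers, not the specification of any TPU generation, and the function name is hypothetical.

```python
def flops_utilization_percent(measured_flops: float, peak_flops: float) -> float:
    """Measured FLOPS expressed as a percentage of the device's peak FLOPS."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return 100.0 * measured_flops / peak_flops


# Hypothetical values: 5e12 measured against a 1e14 peak.
util = flops_utilization_percent(measured_flops=5.0e12, peak_flops=1.0e14)
print(util)  # 5.0
```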

Op details cards

When you select a table entry, a card appears displaying details about the XLA op or the operation category. A typical card looks like this:


  • Name and Category. Shows the highlighted XLA operation name and category.
  • FLOPS utilization. Displays FLOPS utilization as a percentage of total FLOPS possible.
  • Expression. Shows the XLA expression containing the operation.
  • Memory Utilization. Displays the percentage of peak memory usage by your program.
  • Layout (convolution operations only). Shows the shape and layout of a tensor, including whether the shape of the tensor is an exact fit for the matrix units and how the matrix is padded.

Interpreting results

For convolution operations, TPU FLOPS utilization can be low due to one or both of the following reasons:

  • padding (matrix units are partially used)
  • convolution op is memory bound

This section gives an interpretation of some numbers from a model in which FLOPS utilization was low. In this example, output fusion and convolution dominated the execution time, and there was a long tail of vector or scalar operations that had very low FLOPS.

One optimization strategy for this type of profile is to transform the vector or scalar operations to convolution operations.

In the following example, %convolution.399 shows lower FLOPS and memory utilization than %convolution.340 in the previous example.


Examine the layout and note that the batch size (16) is being padded to 128 and the feature size (3) is being padded to 8, which indicates that only about 5% of the matrix units are being effectively used. (For this instance, utilization is the unpadded size divided by the padded size: (16 × 3) / (128 × 8) ≈ 4.7%.) Compare the FLOPS in this example to %convolution.340 in the previous example, which has an exact fit to the matrix.
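
The 5% figure can be checked with a few lines of arithmetic. The sketch below assumes that utilization is the ratio of unpadded to padded elements across the batch and feature dimensions; the function and parameter names are illustrative.

```python
def matrix_unit_utilization(batch_size: int, padded_batch_size: int,
                            feature_size: int, padded_feature_size: int) -> float:
    """Fraction of the padded tile that holds real (unpadded) data."""
    real_elements = batch_size * feature_size
    padded_elements = padded_batch_size * padded_feature_size
    return real_elements / padded_elements


# Values from the text: batch 16 padded to 128, feature size 3 padded to 8.
util = matrix_unit_utilization(16, 128, 3, 8)
print(f"{util:.1%}")  # 4.7%
```

An exact fit, such as a batch of 128 with a feature size of 8, gives a ratio of 1.0 under the same assumption.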

Trace viewer

Trace viewer is a Cloud TPU performance analysis tool available under Profile. The tool uses the Chrome trace event profiling viewer so it only works in the Chrome browser.

Trace viewer displays a timeline that shows:

  • Durations for the operations that were executed by your TensorFlow model.
  • Which part of the system (TPU or host machine) executed an operation. Typically, the host machine executes infeed operations, which preprocess training data and transfer it to the TPU, whereas the TPU executes the actual model training.

Trace viewer allows you to identify performance problems in your model, then take steps to resolve them. For example, at a high level, you can identify whether infeed or model training is taking the majority of the time. Drilling down, you can identify which TensorFlow operations are taking the longest to execute.

Note that trace viewer is limited to 1M events per Cloud TPU. If you need to assess more events, use the streaming trace viewer instead.

Trace viewer interface

To open trace viewer, go to TensorBoard, click on the Profile tab at the top of the screen, and choose trace_viewer from the Tools dropdown. The viewer appears displaying your most recent run:


This screen contains the following main elements (marked with numbers above):

  1. Runs dropdown. Contains all of the runs for which you've captured trace information. The default view is your most recent run, but you can open the dropdown to select a different run.
  2. Tools dropdown. Selects different profiling tools.
  3. Host dropdown. Selects a host that contains a Cloud TPU set.
  4. Timeline pane. Shows operations that Cloud TPU and the host machine executed over time.
  5. Details pane. Shows additional information for operations selected in the Timeline pane.

Here's a closer look at the timeline pane:


The Timeline pane contains the following elements:

  1. Top bar. Contains various auxiliary controls.
  2. Time axis. Shows time relative to the beginning of the trace.
  3. Section and track labels. Each section contains multiple tracks and has a triangle on the left that you can click to expand and collapse the section. There is one section for every processing element in the system.
  4. Tool selector. Contains various tools for interacting with the trace viewer.
  5. Events. These show the time during which an operation was executed or the duration of meta-events, such as training steps.
  6. Vertical tab bar. This does not have a useful purpose for Cloud TPU. The bar is part of the general purpose trace viewer tool provided by Chrome that is used for a variety of performance analysis tasks.

Sections and tracks

Trace viewer contains the following sections:

  • One section for each TPU node, labeled with the number of the TPU chip and the TPU node within the chip (for example, "Chip 2: TPU Core 1"). Each TPU node section contains the following tracks:
    • Step. Shows the duration of the training steps that were running on the TPU.
    • TensorFlow Ops. Shows TensorFlow operations executed on the TPU.
    • XLA Ops. Shows XLA operations that ran on the TPU. (Each TensorFlow operation is translated into one or several XLA operations. The XLA compiler translates the XLA operations into code that runs on the TPU.)
  • One section for threads running on the host machine's CPU, labeled "Host Threads". The section contains one track for each CPU thread. Note: You can ignore the information displayed alongside the section labels.

Timeline tool selector

You can interact with the timeline view using the timeline tool selector in TensorBoard. You can click on a timeline tool or use the following keyboard shortcuts to activate and highlight a tool. To move the timeline tool selector, click in the dotted area at the top and then drag the selector to where you want it.

Use the timeline tools as follows:

Selection tool
Click on an event to select it or drag to select multiple events. Additional information about the selected event or events (name, start time, and duration) will be displayed in the details pane.

Pan tool
Drag to pan the timeline view horizontally and vertically.

Zoom tool
Drag up to zoom in or drag down to zoom out along the horizontal (time) axis. The horizontal position of the mouse cursor determines the center around which the zoom takes place.

Note: The zoom tool has a known bug where zoom remains active if you release the mouse button while the mouse cursor is outside the timeline view. If this happens to you, just click briefly on the timeline view to stop zooming.

Timing tool
Drag horizontally to mark a time interval. The length of the interval appears on the time axis. To adjust the interval, drag its ends. To clear the interval, click anywhere inside the timeline view.

Note that the interval remains marked if you select one of the other tools.

Memory viewer

Memory viewer allows you to visualize the peak memory usage for your program, and memory usage trends over the program's lifetime.

The memory viewer UI looks like this:


  1. Host dropdown. Selects a TPU host and XLA High Level Optimizer (HLO) modules to visualize.
  2. Memory overview. Displays peak memory allocation and size without padding.
  3. Working space chart. Displays peak memory use and a plot of memory usage trends over the program's lifetime. Hovering over a buffer in one of the buffer charts adds an annotation for the buffer lifetime and the buffer details card.
  4. Buffer charts. Two charts that display buffer allocation at the point of peak memory usage, as indicated by the vertical line in the working space plot. Hovering over a buffer in one of the buffer charts displays the buffer's lifetime bar in the working space chart and a details card.
  5. Buffer allocation details card. Displays allocation details for a buffer.

Memory overview panel

The memory overview (top) panel shows you the module name and the peak memory allocation set when the total buffer allocation size reaches the maximum. The unpadded peak allocation size is also shown for comparison.


Working space chart

This chart displays peak memory use and a plot of memory usage trends over the program's lifetime. The line drawn from top to bottom of the plot indicates peak memory utilization for the program. This point determines whether or not a program can fit into the available global memory space.


Each point on the overlying line plot represents a "program point" in XLA's HLO program as scheduled by the compiler. The line provides a sense of change in memory usage leading to and from the peak usage.
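
To make the idea of program points concrete, the sketch below sums the sizes of all live buffers at each point and picks the point with the highest total. It is a simplified model under stated assumptions, not the memory viewer's implementation; the buffer names, lifetimes, and sizes are hypothetical.

```python
from typing import List, Tuple

# (name, first_program_point, last_program_point, size_bytes)
Buffer = Tuple[str, int, int, int]


def memory_by_program_point(buffers: List[Buffer], num_points: int) -> List[int]:
    """Total bytes held by live buffers at each program point."""
    usage = [0] * num_points
    for _name, start, end, size in buffers:
        for point in range(start, end + 1):
            usage[point] += size
    return usage


buffers = [
    ("params", 0, 4, 100),
    ("activations", 1, 3, 300),
    ("temp", 2, 2, 50),
]
usage = memory_by_program_point(buffers, num_points=5)
peak_point = max(range(len(usage)), key=usage.__getitem__)

print(usage)       # [100, 400, 450, 400, 100]
print(peak_point)  # 2
```

In this model, the peak at program point 2 corresponds to the vertical line in the chart, and the values on either side show how usage rises toward and falls away from it.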

Interaction with buffer chart elements

When you hover over a buffer displayed in one of the buffer charts below the working space chart, a horizontal lifetime line for that buffer appears in the working space chart. The horizontal line is the same color as the highlighted buffer.


The thickness of the horizontal line indicates the size of the buffer relative to the peak memory allocation. The line length corresponds to the life of the buffer, starting at the point in the program where buffer space was allocated and ending where the space was freed.

Buffer charts

Two charts show the breakdown of memory usage at the peak usage point (indicated by the vertical line in the plot above the charts).


  • By Program Order. Displays the buffers from left to right in the order in which they were active during program execution. Buffers active for the longest time are on the left side of the chart.

  • By Size. Displays the buffers that were active during program execution in order of decreasing size. Buffers that have the largest impact at the point of peak memory usage are on the left.

Buffer allocation details card

When you hover over a buffer displayed in one of the buffer charts, a buffer allocation details card appears (in addition to the lifetime line displayed in the working space chart). A typical details card looks like this:


  1. Name - Name of the XLA operation.
  2. Category - The operation category.
  3. Size - The size of the buffer allocation (including padding).
  4. Unpadded size - The size of the buffer allocation without padding.
  5. Expansion - The relative magnitude of padded buffer size versus the unpadded size.
  6. Extra memory - Indicates how much extra memory is used for padding.
  7. Shape - Describes the rank, size, and data type of the N-dimensional array.
  8. TensorFlow op name - Shows the name of the TensorFlow operation associated with the buffer allocation.
  9. Allocation type - Indicates the buffer allocation category: Parameter, Output, Thread-local, or Temporary (for example, buffer allocation within a fusion).
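
The Expansion and Extra memory fields can be related to the two size fields with simple arithmetic. The sketch below reflects one reading of the card (expansion as padded size divided by unpadded size, extra memory as their difference); the function name and byte counts are hypothetical.

```python
from typing import Tuple


def padding_stats(padded_bytes: int, unpadded_bytes: int) -> Tuple[float, int]:
    """Return (expansion ratio, extra bytes used for padding) for one buffer."""
    if unpadded_bytes <= 0:
        raise ValueError("unpadded_bytes must be positive")
    expansion = padded_bytes / unpadded_bytes
    extra_bytes = padded_bytes - unpadded_bytes
    return expansion, extra_bytes


# Hypothetical buffer: 1024 bytes of data padded out to 4096 bytes.
expansion, extra_bytes = padding_stats(padded_bytes=4096, unpadded_bytes=1024)
print(expansion, extra_bytes)  # 4.0 3072
```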

Out of memory errors

If you run a model and get an "out of memory" error, use the following command to capture a memory profile and view it in the memory viewer. Make sure to set an appropriate duration_ms so that the profiling period overlaps with your program compilation time. The output can help you understand what caused the error:

  (vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=log-directory --duration_ms=60000