Cloud TPU Tools

This document describes how to set up the Cloud TPU profiling tools, capture a profile, and use it to identify and analyze the performance of your program on Cloud TPU in TensorBoard. It also describes how to continuously monitor your TPU job on the command line (see Monitoring your Cloud TPU job).

Prerequisites

Before you can use the Cloud TPU profiling tools described in this guide, you must complete the following:

  1. Creating Cloud TPU resources
  2. Installing cloud_tpu_profiler
  3. Capturing a profile

For Cloud TPU job monitoring, only sections 1 and 2 are required.

Using Cloud TPU tools in TensorBoard

TensorBoard is a suite of tools designed to present TensorFlow data visually. We have provided a set of Cloud TPU profiling tools that you can access from TensorBoard after you install the Cloud TPU profiler plugin. The plugin supports performance visualization for an individual Cloud TPU and for a full pod.

The Cloud TPU tool selector becomes available under the Profile tab on the TensorBoard menu bar only after you have collected trace information from a running TensorFlow model.

The following sections contain instructions on how to set up your compute environment, run your model and capture a Cloud TPU profile, and run TensorBoard from a VM command line so you can use the tools. For details on how to invoke TensorBoard from your code, see the TensorBoard programming guides.

Creating Cloud TPU resources

The quick start instructions for Cloud TPU describe how to create a Compute Engine VM and Cloud TPU resources.

If you plan to continue with the procedures in this guide immediately after you create your resources, do not perform the clean-up section of those instructions. When you have finished running your model and are done using your resources, follow the clean-up instructions to avoid incurring unwanted charges.

Installing cloud_tpu_profiler

Installing cloud-tpu-profiler 1.11 or later provides a Cloud TPU profiling plugin for TensorBoard and a script, capture_tpu_profile. You can run the script either to capture a profile that can be viewed in TensorBoard or to monitor your TPU jobs on the command line (see Monitoring your Cloud TPU job).

In TensorBoard, the Profile tab appears only after you run a model and then capture profile (trace) information while the model is running.

To check your profiler version, use pip. If you do not have the latest version, use the second command to install it:

(vm)$ pip freeze | grep cloud-tpu-profiler
(vm)$ pip install --upgrade "cloud-tpu-profiler>=1.11"

You also must set the PATH environment variable as follows:

(vm)$ export PATH="$PATH:`python -m site --user-base`/bin"

About capture_tpu_profile

When you use capture_tpu_profile to capture a profile, a .tracetable file is saved to your Google Cloud Storage bucket. The file contains a large number of trace events that can be viewed in both trace viewer and streaming trace viewer in TensorBoard.

You capture a profile, or trace data, by running your model, executing capture_tpu_profile, and then starting up TensorBoard before the model stops running. For example:

(vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR}

As the model runs, the Cloud TPU service account creates a directory (object) and writes data to your Cloud Storage bucket. You must set permissions for that account on the bucket before you run your model.

By default, capture_tpu_profile captures a 2-second trace. You can set the trace duration with the duration_ms command-line option or in your program when you run your model.

Capturing a profile

The steps in this section describe how to capture a profile by running your model, executing capture_tpu_profile, and then starting up TensorBoard before the model stops running. Running TensorBoard as described here provides access to all of the Profile tools except for streaming trace viewer.

If you prefer to monitor your TPU jobs on the command line, see Monitoring your Cloud TPU job.

The MNIST tutorial is used as the model in this example.

  1. Run ctpu up.

  2. Go to the Cloud Console > TPUs page and click the TPU you created.

  3. Locate the service account name for the Cloud TPU and copy it, for example:

    service-11111111118@cloud-tpu.iam.gserviceaccount.com

  4. In the list of buckets, select the bucket you want to use, select Show Info Panel, and then select Edit bucket permissions.

  5. Paste your service account name into the add members field for that bucket and select the following permissions:

    image

  6. In your VM shell, use pip to check your TensorBoard version.

    (vm)$ pip freeze | grep tensorboard
    
  7. If your TensorBoard version is lower than 1.11, upgrade it to the latest version:

    (vm)$ pip install --upgrade "tensorboard>=1.11"

  8. If you have not already done so, set the PATH environment variable:

    (vm)$ export PATH="$PATH:python -m site --user-base/bin" 

  9. Follow the MNIST tutorial to set up and execute an MNIST training job, for example:

    (vm)$ export STORAGE_BUCKET=gs://[YOUR-BUCKET-NAME]
    (vm)$ python /usr/share/models/official/mnist/mnist_tpu.py \
      --tpu=$TPU_NAME \
      --data_dir=${STORAGE_BUCKET}/data \
      --model_dir=${STORAGE_BUCKET}/output \
      --use_tpu=True \
      --iterations=500 \
      --train_steps=5000

  10. While the job is running, open a new tab and ssh to your VM (replace $vm in the command with your VM name).

    gcloud compute ssh $vm --ssh-flag=-L6006:localhost:6006 

  11. In the new tab, set up a model directory environment variable.

    (vm)$ export MODEL_DIR=gs://[YOUR-BUCKET-NAME]/output 

  12. In the new tab, run capture_tpu_profile.

    (vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR} 

  13. In the new tab, run TensorBoard and point it to the model directory:

    (vm)$ tensorboard --logdir=${MODEL_DIR} & 

  14. In the toolbar at the top of the new tab, select the Web Preview icon and change the port number to 6006.

  15. Select the Web Preview icon again and then select Preview on Port 6006.

Graphs

TensorBoard provides a number of visualizations, or graphs, of your model and its performance. Use the graphs together with the profiling tools to fine tune your models and improve their performance on Cloud TPU.

XLA structure graph

During model compilation, before the model is run, TensorFlow generates an XLA graph that will run on the Cloud TPU. The data for the graph is stored in the model_dir directory. You can view this graph without running capture_tpu_profile.

To view a model's XLA structure graph, select the Graphs tab in TensorBoard. The default selection for Color is Structure.

image

A single node in the structure graph represents an XLA instruction. For example, a TensorFlow add op named x/y/z that is mapped (lowered) to XLA appears as x/y/z/add in the graph.
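
As a rough sketch of that mapping (hypothetical names; TensorFlow 1.x graph mode assumed), an op created inside nested name scopes keeps those scopes in its lowered XLA node name:

    import tensorflow as tf

    a = tf.constant(1.0)
    b = tf.constant(2.0)
    with tf.name_scope('x'):
        with tf.name_scope('y'):
            # TensorFlow op name: x/y/z; the corresponding XLA instruction
            # shows up as x/y/z/add in the structure graph.
            z = tf.add(a, b, name='z')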

An XLA graph displays information on how a Cloud TPU will execute a particular model. The graph also provides the shapes of inputs and outputs for various operations. Once you capture a profile of your model, you can use the XLA graph along with trace viewer or streaming trace viewer to gain insight into where most of the time is being spent.

Notes:

  • Some nodes do not have TensorFlow namespaces because not all XLA instructions (such as those injected by the XLA compiler) have corresponding TensorFlow operations.
  • The TensorFlow program structure is incorporated in the XLA graph where possible. However, because the XLA program running on Cloud TPU is highly optimized, its graph structure might be quite different from that of the original TensorFlow program.
  • A special XLA instruction called fusion can merge multiple instructions from different TensorFlow operations into a single computation. The TensorFlow operation corresponding to the root instruction in the fusion is used as the namespace of the fusion operation.

TPU compatibility graph

The Graphs tab includes a compatibility checker module which checks for and displays TensorFlow ops that can potentially cause issues when a model is run.

To view a model's TPU compatibility graph, select the Graphs tab in TensorBoard and then select the TPU Compatibility option. The graph presents compatible operations in green and incompatible operations in red:

image

A given node can display both colors, each shown as a percentage of that node's operations that are or are not Cloud TPU compatible. See Interpreting compatibility results for an example.

The compatibility summary panel displayed to the right of the graph shows the percentage of all Cloud TPU-compatible operations, their attributes and a list of incompatible operations for a selected node.

Click on any operation in the graph to display its attributes in the summary panel.

image

Note that compatibility checker does not assess any operations that are explicitly assigned to a non-TPU device using manual device placement. In addition, the checker does not actually compile the model for execution, so be sure to interpret the results as an estimate of compatibility.

Prerequisites
  • Configure your model to write its graph data to a file by setting the model_dir property of the tf.estimator API or TPUEstimator (see the sketch after this list).
  • Remove any manual assignments in your code to GPUs or CPUs for operations that you intend to run on the Cloud TPU. When used with the TPU Compatibility option, the compatibility checker skips any operation explicitly assigned to a non-TPU device.
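
For illustration, here is a minimal sketch of wiring model_dir through TPUEstimator (TensorFlow 1.x tf.contrib.tpu API; the model_fn body, bucket path, iterations, and batch size are placeholders, not values from this guide):

    import tensorflow as tf

    def model_fn(features, labels, mode, params):
        # Build the model here and return a tf.contrib.tpu.TPUEstimatorSpec.
        raise NotImplementedError

    run_config = tf.contrib.tpu.RunConfig(
        model_dir='gs://[YOUR-BUCKET-NAME]/output',  # graph data is written here
        tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=500))

    estimator = tf.contrib.tpu.TPUEstimator(
        model_fn=model_fn,
        config=run_config,
        use_tpu=True,
        train_batch_size=1024)
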
Interpreting compatibility results

The following diagram from the Abalone model shows a sample compatibility summary with several unavailable operations noted:

image

No manual device placement was specified for this model so all operations were checked, even those that should always be run on the CPU, such as the "save" and "report_uninitialized_variables" operations shown.

Two AssignAdd operations in root_mean_squared_error are potential problems.

You can see from the source code that root_mean_squared_error is used only as an additional evaluation metric:

    # Calculate root mean squared error as additional eval metric
    eval_metric_ops = {
        "rmse": tf.metrics.root_mean_squared_error(
            tf.cast(labels, tf.float64), predictions)
    }

Unless it occurs inside a training loop, this operation is normally run on the CPU so the error report can be disregarded. In conclusion, the model is ready to be run on a Cloud TPU.

Profile

The Profile tab, created when you ran capture_tpu_profile, appears in TensorBoard only after you have captured some model data. Once data is available, clicking on the Profile tab presents a selection of tools to help with performance analysis:

Profile overview page

The overview page (overview_page), available under Profile, provides a top-level view of how your model performed during a capture run. The page shows an aggregated overview for all the TPUs, as well as an overall input pipeline analysis. You can select individual TPUs from the Host dropdown.

The page displays data in the following panels:

image

  • Performance summary

    • Step time averaged over all sampled steps
    • Percentage of time the Host was idle
    • Percentage of time the TPU was idle
    • Percentage utilization of the TPU matrix units
  • Step-time graph. Displays a graph of device step time (in milliseconds) over all the steps sampled. The blue area corresponds to the portion of the step time the TPUs were sitting idle waiting for input data from the host. The orange area shows how much of the time the Cloud TPU was actually working.

  • Top 10 TensorFlow operations on TPU. Displays the TensorFlow operations that consumed the most time. Clicking the Show table button displays a table like the following:

    image

    Each row displays an operation's self time (as the percentage of time taken by all operations), cumulative time, category, name, and the FLOPS rate achieved.

  • Run environment

    • Number of hosts used
    • Type of TPU used
    • Number of TPU cores
    • Training batch size
  • Recommendation for next steps. Reports when a model is input bound and whenever issues with Cloud TPU occur. Suggests tools you can use to locate the performance bottlenecks.

Input pipeline analyzer

The input pipeline analyzer provides insights into your performance results. The tool displays performance results from the input_pipeline.json file that is collected by the capture_tpu_profile tool.

The tool tells you immediately whether your program is input bound and can walk you through device- and host-side analysis to debug whatever stage(s) of the pipeline are creating bottlenecks.

See the guidance on input pipeline performance for deeper insight into optimizing pipeline performance.

Input pipeline

When a TensorFlow program reads data from a file it begins at the top of the TensorFlow graph in a pipelined manner. The read process is divided into multiple data processing stages connected in series, where the output of one stage is the input to the next one. This system of reading is called the input pipeline.

A typical pipeline for reading records from files has the following stages:

  1. File reading
  2. File preprocessing (optional)
  3. File transfer from the host machine to the device

An inefficient input pipeline can severely slow down your application. An application is considered input bound when it spends a significant portion of its time in the input pipeline. Use the input pipeline analyzer to understand where the input pipeline is inefficient.
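
As a rough sketch of how those stages map onto the tf.data API (TensorFlow 1.x; the file pattern, parsing function, and batch size are placeholders, not values from this guide):

    import tensorflow as tf

    def parse_example(serialized):
        # Placeholder preprocessing; the real feature spec depends on your model.
        features = tf.parse_single_example(
            serialized, {'image': tf.FixedLenFeature([], tf.string)})
        return tf.decode_raw(features['image'], tf.uint8)

    filenames = tf.data.Dataset.list_files('gs://[YOUR-BUCKET-NAME]/data/*.tfrecord')
    dataset = filenames.interleave(tf.data.TFRecordDataset, cycle_length=4)  # 1. file reading
    dataset = dataset.map(parse_example, num_parallel_calls=8)               # 2. preprocessing
    dataset = dataset.batch(1024)
    dataset = dataset.prefetch(2)  # 3. overlap host work with transfer to the device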

Input pipeline dashboard

To open the input pipeline analyzer, select Profile, then select input_pipeline_analyzer from the Tools dropdown.

The dashboard contains three sections:

image

  1. Summary. Summarizes the overall input pipeline with information on whether your application is input bound and, if so, by how much.
  2. Device-side analysis. Displays detailed, device-side analysis results, including the device step-time and the range of device time spent waiting for input data across cores at each step.
  3. Host-side analysis. Shows a detailed analysis on the host side, including a breakdown of input processing time on the host.
Input pipeline summary

Section 1 reports if your program is input bound by presenting the percentage of device time spent on waiting for input from the host. If you are using a standard input pipeline that has been instrumented, the tool reports where most of the input processing time is spent. For example:

image

Device-side analysis

Section 2 details the device-side analysis, providing insights on time spent on the device versus on the host and how much device time was spent waiting for input data from the host.

image

  1. Step time plotted against step number. Displays a graph of device step time (in milliseconds) over all the steps sampled. The blue area corresponds to the part of the step time the Cloud TPUs sat idle waiting for input data from the host. The orange area shows how much of the time the Cloud TPU was actually working.
  2. Step time statistics. Reports the average, standard deviation, and range ([minimum, maximum]) of the device step time.
  3. Device time across cores spent waiting for input data, by step number. Displays a line chart showing the amount of device time (expressed as a percentage of total device step time) spent waiting for input data processing. Since the fraction of time spent varies from core to core, the range of fractions for each core is also plotted for each step. Since the time a step takes is determined by the slowest core, you want the range to be as small as possible.
  4. Fraction of time waiting for input data. Reports the average, standard deviation, and the range ([minimum, maximum]) of the fraction of time spent on the device waiting for input data, normalized to the total device step time.
Host-side analysis

Section 3 shows the details of host-side analysis, reporting a breakdown of the input processing time (the time spent on Dataset API operations) on the host into several categories:

  • Reading data from files on demand. Time spent on reading data from files without caching, prefetching, and interleaving.
  • Reading data from files in advance. Time spent reading files, including caching, prefetching, and interleaving.
  • Data preprocessing. Time spent on preprocessing operations, such as image decompression.
  • Enqueuing data to be transferred to device. Time spent putting data into an infeed queue before transferring the data to the device.

image

To see the statistics for individual input operations and their categories broken down by execution time, click the "Show Input Op statistics" button.

A source data table like the following appears:

image

Each table entry contains the following information:

  1. Input Op. Shows the TensorFlow op name of the input operation.
  2. Count. Shows the total number of instances for the operation executed during the profiling period.
  3. Total Time (in ms). Shows the cumulative sum of time spent on each of those instances.
  4. Total Time %. Shows the total time spent on an operation as a fraction of the total time spent in input processing.
  5. Total Self Time (in ms). Shows the cumulative sum of the self time spent on each of those instances. Self time measures the time spent inside the function body, excluding the time spent in the functions it calls. For example, Iterator::PaddedBatch::Filter::ForeverRepeat::Map is called by Iterator::PaddedBatch::Filter, so its total self time is excluded from the total self time of the latter.
  6. Total Self Time %. Shows the total self time as a fraction of the total time spent on input processing.
  7. Category. Shows the processing category of the input operation.

Op profile

Op profile (op_profile) is a Cloud TPU tool that displays the performance statistics of XLA operations executed during a profiling period. Op profile shows:

  • How well your application uses the Cloud TPU as a percentage of time spent on operations by category and of TPU FLOPS utilization
  • The most time-consuming operations. Those operations are potential targets for optimization.
  • Details of individual operations, including shape, padding and expressions that use the operation.

You can use op profile to find good targets for optimization. For example, if your model achieves only 5% of the TPU peak FLOPS, you can use the tool to identify which XLA operations are taking the longest time to execute and how many TPU FLOPS they consume.

Using op profile

During profile collection, capture_tpu_profile also collects an op_profile.json file that contains performance statistics of XLA operations.

You can view the data from op_profile in TensorBoard by clicking on the Profile tab at the top of the screen and then selecting op_profile from the Tools dropdown. You will see a display like this:

image

  1. Overview section. Shows the percentage used of the Cloud TPU computational potential and provides suggestions for optimization.
  2. Control panel. Contains a settings slider that controls the number of ops displayed in the Op table and a toggle that sets the Op table to list the ops that comprise the top 90% of the total execution time.
  3. Op table. Lists the top TensorFlow operation categories associated with the XLA ops by the total amount of time, expressed as a percentage of Cloud TPU usage, that all operations in the category took to execute.
  4. Op details cards. Show details about an op when you hover over it in the table, including the FLOPS utilization, the expression in which the op is used, and the op layout (fit).
XLA Op table

The Op table lists XLA operation categories in order from the highest to lowest percentage of Cloud TPU usage. Initially, the table shows the percentage of time taken, the op category name, the associated TensorFlow op name, and the percentage of FLOPS utilization for the category. To display (or hide) the 10 most time-consuming XLA operations for a category, click the triangle next to the category name in the table.

image

  1. Time. Shows the total percentage of time spent by all the operations in that category. You can click to expand the entry and see the breakdown of time spent by each individual operation.
  2. Horizontal Bar. Shows the time distribution across categories.
  3. Top 10 Ops. The toggle next to a category's name displays or hides the 10 most time-consuming operations within the category. If a fusion operation entry is displayed in the operations list, you can expand it to see the non-fusion, elementwise operations it contains.
  4. TensorFlow Op. Shows the TensorFlow op name associated with the XLA operation.
  5. FLOPS. Shows the FLOPS utilization, which is the measured number of FLOPS expressed as a percentage of the Cloud TPU peak FLOPS. The higher the FLOPS utilization percentage, the faster operations run. The table cell is color coded: green for high FLOPS utilization (good) and red for low FLOPS utilization (bad).
Op details cards

When you hover over a table entry, a card appears on the left displaying details about the XLA op or the operation category. A typical card looks like this:

image

  1. Name. Shows the highlighted XLA operation name.
  2. Category. Shows the operation category.
  3. FLOPS utilization. Displays FLOPS utilization as a percentage of total FLOPS possible.
  4. Expression. Shows the XLA expression containing the operation.
  5. Memory Utilization. Displays the peak memory used by your program as a percentage of total possible.
  6. Layout. (Convolution operations only.) Shows the shape and layout of a tensor, including whether the shape of the tensor is an exact fit for the matrix units and how the matrix is padded.
Interpreting results

For convolution operations, TPU FLOPS utilization can be low due to one or both of the following reasons:

  • padding (matrix units are partially used)
  • convolution op is memory bound

This section gives an interpretation of some numbers from a different model in which FLOPS utilization was low. In this example, output fusion and convolution dominated the execution time, and there was a long tail of vector or scalar operations that had very low FLOPS.

One optimization strategy for this type of profile is to transform the vector or scalar operations to convolution operations.

In the following example, %convolution.399 shows lower FLOPS and memory utilization than %convolution.340 in the previous example.

image

Examine the layout and note that the batch size of 16 is being padded to 128 and the feature size of 3 is being padded to 8, which indicates that only about 5% of the matrix units are being effectively used. (To calculate utilization for this instance, multiply the batch size by the feature size, 16 × 3, and divide the result by the padded sizes, 128 × 8.) Compare the FLOPS in this example to %convolution.340 in the previous example, which has an exact fit to the matrix.
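
A quick back-of-the-envelope check of that figure:

    # Effective matrix-unit utilization when a batch of 16 is padded to 128
    # and a feature size of 3 is padded to 8 (values from the example above).
    utilization = (16 * 3) / (128.0 * 8)   # = 0.046875
    print('{:.1%}'.format(utilization))    # -> 4.7%, roughly the 5% noted above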

Trace viewer

Trace viewer is a Cloud TPU performance analysis tool available under Profile. The tool uses the Chrome trace event profiling viewer so it only works in the Chrome browser.

Trace Viewer displays a timeline that shows:

  • Durations for the operations that were executed by your TensorFlow model.
  • Which part of the system (TPU or host machine) executed an operation. Typically, the host machine executes infeed operations, which preprocess training data and transfer it to the TPU, whereas the TPU executes the actual model training.

Trace viewer allows you to identify performance problems in your model, then take steps to resolve them. For example, at a high level, you can identify whether infeed or model training is taking the majority of the time. Drilling down, you can identify which TensorFlow operations are taking the longest to execute.

Note that trace viewer is limited to 1M events per Cloud TPU. If you need to assess more events, use the streaming trace viewer instead.

Trace viewer interface

To open Trace Viewer, go to TensorBoard and click on the Profile tab at the top of the screen. The viewer appears displaying your most recent run:

image

This screen contains the following main elements (marked with numbers above):

  1. Runs dropdown. Contains all of the runs for which you've captured trace information. The default view is your most recent run, but you can open the dropdown to select a different run.
  2. Tools dropdown. Selects different profiling tools.
  3. Host dropdown. Selects a host that contains a Cloud TPU set.
  4. Timeline pane. Shows operations that Cloud TPU and the host machine executed over time.
  5. Details pane. Shows additional information for operations selected in the Timeline pane.

Here's a closer look at the timeline pane:

image

The Timeline pane contains the following elements:

  1. Top bar. Contains various auxiliary controls.
  2. Time axis. Shows time relative to the beginning of the trace.
  3. Section and track labels. Each section contains multiple tracks and has a triangle on the left that you can click to expand and collapse the section. There is one section for every processing element in the system.
  4. Tool selector. Contains various tools for interacting with the trace viewer.
  5. Events. These show the time during which an operation was executed or the duration of meta-events, such as training steps.
  6. Vertical tab bar. This does not have a useful purpose for Cloud TPU. The bar is part of the general purpose trace viewer tool provided by Chrome that is used for a variety of performance analysis tasks.
Sections and tracks

Trace viewer contains the following sections:

  • One section for each TPU node, labeled with the number of the TPU chip and the TPU node within the chip (for example, "Chip 2: TPU Core 1"). Each TPU node section contains the following tracks:
    • Step. Shows the duration of the training steps that were running on the TPU.
    • TensorFlow Ops. Shows TensorFlow operations executed on the TPU.
    • XLA Ops. Shows XLA operations that ran on the TPU. (Each TensorFlow operation is translated into one or several XLA operations. The XLA compiler translates the XLA operations into code that runs on the TPU.)
  • One section for threads running on the host machine's CPU, labeled "Host Threads". The section contains one track for each CPU thread. Note: You can ignore the information displayed alongside the section labels.
Timeline tool selector

You can interact with the timeline view using the timeline tool selector in TensorBoard. You can click on a timeline tool or use the following keyboard shortcuts to activate and highlight a tool. To move the timeline tool selector, click in the dotted area at the top and then drag the selector to where you want it.

Use the timeline tools as follows:

Selection tool
Click on an event to select it or drag to select multiple events. Additional information about the selected event or events (name, start time, and duration) will be displayed in the details pane.

Pan tool
Drag to pan the timeline view horizontally and vertically.

Zoom tool
Drag up to zoom in or drag down to zoom out along the horizontal (time) axis. The horizontal position of the mouse cursor determines the center around which the zoom takes place.

Note: The zoom tool has a known bug where zoom remains active if you release the mouse button while the mouse cursor is outside the timeline view. If this happens to you, just click briefly on the timeline view to stop zooming.

Timing tool
Drag horizontally to mark a time interval. The length of the interval appears on the time axis. To adjust the interval, drag its ends. To clear the interval, click anywhere inside the timeline view.

Note that the interval remains marked if you select one of the other tools.
Events

Events within the timeline are displayed in different colors; the colors themselves have no specific meaning.

Timeline top bar

The top bar of the Timeline pane contains several auxiliary controls:

image

  1. Metadata display. Not used for TPUs.
  2. View Options. Not used for TPUs.
  3. Search box. Enter text to search for all events whose name contains the text. Click the arrow buttons to the right of the search box to move forwards and backwards through the matching events, selecting each event in turn.
  4. Console button. Not used for TPUs.
  5. Help button. Click to display a help summary.
Keyboard shortcuts

Following are the keyboard shortcuts you can use in trace viewer. Click the help button (?) in the top bar to see more keyboard shortcuts.

    w Zoom in
    s Zoom out
    a Pan left
    d Pan right
    f Zoom to selected event(s)
    m Mark time interval for selected event(s)
    1 Activate selection tool
    2 Activate pan tool
    3 Activate zoom tool
    4 Activate timing tool

The f shortcut can be highly useful. Try selecting a step and pressing f to zoom into the step quickly.

Characteristic events

Following are some of the event types that can be very useful when analyzing TPU performance.

image

  • InfeedDequeueTuple. This TensorFlow operation runs on a TPU and receives input data coming from the host. When infeed takes a long time, it can mean that the TensorFlow operations which preprocess the data on the host machine cannot keep up with the TPU data consumption rate. You can see corresponding events in the host traces called InfeedEnqueueTuple. To view a more detailed input-pipeline analysis, use the Input Pipeline Analyzer tool.

  • CrossReplicaSum. This TensorFlow operation runs on a TPU and computes a sum across replicas. Because each replica corresponds to a different TPU node, the operation must wait for all TPU nodes to be finished with a step. If this operation is taking a long time, it might not mean that the summing operation itself is slow but that a TPU node is waiting for another TPU node with a slow data infeed.

image

  • Dataset Ops. Trace viewer visualizes dataset operations performed when data is loaded using the Dataset API. The Iterator::Filter::Batch::ForeverRepeat::Memory in the example is compiled and it corresponds to the dataset.map() operation. Use trace viewer to examine the loading operations as you work through debugging and mitigating input pipeline bottlenecks.

image

  • Prefetch Threads. Using dataset.prefetch() to buffer input data can prevent sporadic slowdowns in file access that create bottlenecks in the input pipeline.
What can go wrong

Here are some potential issues to be aware of when using trace viewer:

  • Event display limit. Trace viewer displays a maximum of 1 million events. If you captured more events, only the earliest 1 million events are displayed; later events are dropped. To capture more TPU events, you can use the --include_dataset_ops=False flag to explicitly require capture_tpu_profile to exclude the dataset ops.
  • Very long events. Events that begin before a capture starts or that end after a capture is finished are not visible in trace viewer. Consequently, very long events can be missed.
  • When to start trace capture. Be sure to start trace capture after you know the Cloud TPU is running. If you start before then, you may see only a few events or no events at all in trace viewer. You can increase the profile time using the --duration_ms flag and you can set automatic retries using the --num_tracing_attempts flag. For example:

    (vm)$ capture_tpu_profile --tpu=$TPU_NAME \
      --logdir=${MODEL_DIR} --duration_ms=60000 --num_tracing_attempts=10
    

Memory viewer

Memory viewer allows you to visualize the peak memory usage for your program, and memory usage trends over the program's lifetime.

The memory viewer UI looks like this:

image

  1. Host dropdown. Selects a TPU host and the XLA High Level Optimizer (HLO) modules to visualize.
  2. Memory overview. Displays peak memory allocation and size without padding.
  3. Working space chart. Displays peak memory use and a plot of memory usage trends over the program's lifetime. Hovering over a buffer in one of the buffer charts adds an annotation for the buffer lifetime and the buffer details card.
  4. Buffer charts. Two charts that display buffer allocation at the point of peak memory usage, as indicated by the vertical line in the working space plot. Hovering over a buffer in one of the buffer charts displays the buffer's lifetime bar in the working space chart and a details card on the left.
  5. Buffer allocation details card. Displays allocation details for a buffer.
Memory overview panel

The memory overview (top) panel shows you the module name and the peak memory allocation set when the total buffer allocation size reaches the maximum. The unpadded peak allocation size is also shown for comparison.

image

Working space chart

This chart displays peak memory use and a plot of memory usage trends over the program's lifetime. The line drawn from top to bottom of the plot indicates peak memory utilization for the program. This point determines whether or not a program can fit into the available global memory space.

image

Each point on the overlying line plot represents a "program point" in XLA's HLO program as scheduled by the compiler. The line provides a sense of the spikiness leading to and from the peak usage.

Interaction with buffer chart elements

When you hover over a buffer displayed in one of the buffer charts below the working space chart, a horizontal lifetime line for that buffer appears in the working space chart. The horizontal line is the same color as the highlighted buffer.

image

The thickness of the horizontal line indicates the relative magnitude of the buffer size compared to the peak memory allocation. The length of the line corresponds to the lifetime of the buffer, starting at the point in the program where the buffer space was allocated and ending where the space was freed.

Buffer charts

Two charts show the breakdown of memory usage at the peak usage point (indicated by the vertical line in the plot above the charts).

image

  • By Program Order. Displays the buffers from left to right in the order in which they were active during program execution. Buffers active for the longest time are on the left side of the chart.

  • By Size. Displays the buffers that were active during program execution in descending size order. Buffers that had the largest impact at the point of peak memory usage are on the left.

Buffer allocation details card

When you hover over a buffer displayed in one of the buffer charts, a buffer allocation details card appears (in addition to the lifetime line displayed in the working chart). A typical details card looks like this:

image

  1. Name. Name of the XLA operation.
  2. Category. Operation category.
  3. Size. Size of the buffer allocation (including padding).
  4. Unpadded size. Size of the buffer allocation without padding.
  5. Expansion. Relative magnitude of padded buffer size versus the unpadded size.
  6. Extra memory. Indicates how much extra memory is used for padding.
  7. Shape. Describes the rank, size, and data type of the N-dimensional array.
  8. TensorFlow op name. Shows the name of the TensorFlow operation associated with the buffer allocation.
  9. Allocation type. Indicates buffer allocation category. Types are: Parameter, Output, Thread-local, and Temporary (for example, buffer allocation within a fusion).
"Out of memory" errors

If you run a model and get an "out of memory" error, use the following command to capture a memory profile and view it in the memory viewer. Make sure to set an appropriate duration_ms so that the profiling period overlaps with your program's compilation time. The output can help you understand what caused the error:

(vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR} --duration_ms=60000

Streaming trace viewer

Streaming trace viewer (trace_viewer@) is a Cloud TPU performance analysis tool, available for TensorFlow 1.11 or later, that provides dynamic trace renderings. The tool uses the Chrome trace event profiling viewer so it works only in the Chrome browser.

When you use capture_tpu_profile 1.11 to capture a profile, a .tracetable file is saved to your Google Cloud Storage bucket. The file contains a large number of trace events that can be viewed in both trace viewer and streaming trace viewer.

Using streaming trace viewer

To use the streaming trace viewer, trace_viewer@, you must shut down your existing TensorBoard session and then relaunch TensorBoard using the IP address of the TPU you want to examine. Streaming trace viewer requires TensorBoard to make a Google Remote Procedure Call (GRPC) connection to an IP address for the Cloud TPU. The GRPC channel is not encrypted.

To find the IP address for a Cloud TPU host on the GCP Console, open the TPUs page and look at the displayed table for the name of the Cloud TPU whose trace you want to view.

image

The Internal IP column for each Cloud TPU contains an IP address, [TPU_IP].

image

In your VM, run TensorBoard as follows:

(vm)$ tensorboard --logdir=${MODEL_DIR} --master_tpu_unsecure_channel=[TPU_IP]

The trace_viewer@ tool appears in the Tools dropdown list.

image

In the timeline, you can zoom in and out to see trace events load dynamically into your browser.

image

Monitoring your Cloud TPU job

This section describes how to use capture_tpu_profile to continuously monitor your Cloud TPU job on the command line in real time. You monitor your job by running capture_tpu_profile with the --monitoring_level option set. You can specify how metrics display, either continuously or for a short period of time.

  1. Open a new tab and ssh to your VM (replace $vm in the command with your VM name):

    gcloud compute ssh $vm --ssh-flag=-L6006:localhost:6006
    
  2. In the new tab, run capture_tpu_profile with the --monitoring_level flag set to either 1 or 2, such as:

    (vm)$ capture_tpu_profile --tpu=$TPU_NAME  --monitoring_level=1
    

Setting monitoring_level=1 produces output similar to the following:

    TPU type: TPU v1
    Utilization of TPU Matrix Units is (higher is better): 10.7%

Setting monitoring_level=2 displays more detailed information:

    TPU type: TPU v2
    Number of TPU Cores: 8
    TPU idle time (lower is better): 0.091%
    Utilization of TPU Matrix Units is (higher is better): 10.7%
    Step time: 1.95 ms (avg), 1.90 ms (min), 2.00 ms (max)
    Infeed percentage: 87.5% (avg), 87.2% (min), 87.8% (max)

Monitoring flags

  • --tpu (required) specifies the name of the Cloud TPU you want to monitor.
  • --monitoring_level (required). When set, changes the behavior of capture_tpu_profile from profiling to continuous monitoring. There are two available options: Level 1: Captures only device performance counters and shows only TPU utilization. Level 2: Shows the TPU utilization, TPU idle time, and number of TPU cores used. Also provides min, avg, and max step times along with the infeed percentage contribution.
  • --duration_ms (optional; default is 1000ms) specifies how long to profile the TPU host during each cycle. Generally, this should be long enough to capture at least one training step worth of data. 1 second captures a training step in most models but if your model step time is very large, you can set the value to 2x step_time (in ms).
  • --num_queries specifies how many cycles to run capture_tpu_profile. To continuously monitor your TPU job, set the value to a high number. To quickly check your model's step time, set the value to a low number.