Cloud TPU Tools

This document covers how to run various Cloud TPU tools.

Prerequisites

Before you can use the tools in this guide, you must complete the following prerequisites:

Create a Compute Engine VM and a TPU resource

Before you use the tools in this guide, create a Compute Engine VM instance using the tf-1-8 image in the ml-images image family. Also, create a Cloud TPU resource. See the Cloud TPU quickstart for instructions.

Set up TensorBoard

The computations you'll use TensorFlow for, such as training a massive deep neural network, can be complex and confusing. To make TensorFlow programs easier to understand, debug, and optimize, we've included a suite of visualization tools called TensorBoard. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data, such as images, that pass through it. For more details, follow the tutorials available at tensorflow.org.

To use TensorBoard with TensorFlow and Cloud TPU, there are a few things you need to do.

  1. Connect to your Compute Engine VM using the gcloud compute ssh command, with port forwarding for TensorBoard. Replace tpu-demo-vm with the name of your VM:

    gcloud compute ssh tpu-demo-vm -- -L 6006:localhost:6006
    
  2. Make sure that TensorBoard is installed on your system:

    (vm)$ pip freeze | grep tensorboard
    (vm)$ sudo pip install --upgrade "tensorboard>=1.8"
    
  3. Set a model_dir directory. Estimators (and TPU Estimators) handle the heavy lifting of integrating with TensorBoard for you; however, they need some configuration. When you build your estimator, provide a path to a directory in your Cloud Storage bucket to save metadata about your model.

    (vm)$ export STORAGE_BUCKET=gs://[YOUR-BUCKET-NAME]
    (vm)$ export model_dir=${STORAGE_BUCKET}/output
    
  4. Follow the MNIST tutorial to set up and execute an MNIST training:

    (vm)$ python /usr/share/models/official/mnist/mnist_tpu.py \
          --tpu=$TPU_NAME \
          --data_dir=${STORAGE_BUCKET}/data \
          --model_dir=$model_dir \
          --use_tpu=True \
          --iterations=500 \
          --train_steps=1000
    
  5. Run TensorBoard and point to your model directory:

    (vm)$ tensorboard --logdir=${model_dir}
    
  6. Open TensorBoard in your browser from your local workstation. Navigate to http://localhost:6006 in your web browser to view TensorBoard. You should be able to see the XLA graph of your model under the Graph tab. Because you haven't captured a TPU profile, you will not be able to see the profiler tools yet.
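Step 3 above says to provide a model directory when you build your estimator. As a hedged sketch only (TensorFlow 1.x-era contrib API; my_model_fn, run_config, and the batch size are placeholders, not part of this guide), that wiring might look like:

```python
# Sketch only -- assumes the TF 1.x tf.contrib.tpu API. my_model_fn is a
# placeholder for a model function you have defined elsewhere, and
# run_config is a previously built tf.contrib.tpu.RunConfig.
import tensorflow as tf

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,                        # placeholder model function
    model_dir="gs://[YOUR-BUCKET-NAME]/output",  # same path as $model_dir
    use_tpu=True,
    train_batch_size=1024,                       # illustrative batch size
    config=run_config,                           # your TPU RunConfig
)
```

Because model_dir points at Cloud Storage, checkpoints and TensorBoard summaries land where both TensorBoard and capture_tpu_profile can read them.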

Install Cloud TPU Profiler

You need cloud-tpu-profiler 1.5.2 installed on your system to capture a TPU profile:

(vm)$ pip freeze | grep cloud-tpu-profiler
(vm)$ sudo pip install --upgrade "cloud-tpu-profiler==1.5.2"

Capturing Trace Information

Before you can use the tools, you need to capture trace information while your model is running. Capture a trace by executing the following command on your VM:

(vm)$ capture_tpu_profile --tpu_name=$TPU_NAME --logdir=${model_dir}

By default, this captures a 2-second trace. You can set the trace duration with the --duration_ms command-line option. Open TensorBoard in your local browser again; a Profile tab should now appear.

Overview Page

The Overview Page provides a top-level view of how the workload performed as it ran on the TPU. The page displays data in the following panels:

image

  • Performance Summary, which includes:

    • Step time averaged over all sampled steps
    • Percentage of idle Host time
    • Percentage of idle TPU time
    • Percentage utilization of the TPU matrix units
  • Step-time Graph, which plots a graph of step time (in milliseconds) over all the steps sampled. The blue area corresponds to the part of the step time waiting for input data from the host; the orange area corresponds to the compute time.

  • Top 10 TensorFlow operations on TPU, which shows the TensorFlow operations executed on the TPU that consumed the most time. Clicking the "Show" button displays a table similar to the following:

    image

    Each row shows an operation's self time (as a percentage of the time taken by all operations), cumulative time, category, name, and achieved FLOP rate.

  • Run Environment, which includes:

    • Number of hosts used
    • Type of TPU used
    • Number of TPU cores
    • Training batch size
  • Recommendation for Next Steps, which reports whether the workload is input bound and, if so, suggests tools you can use to reduce the bottleneck, depending on whether the issue is the input time, the TPU time, or both.

XLA graphs

When a model is compiled, the compiler also generates a graph that represents the XLA (Accelerated Linear Algebra) program that runs on the TPU devices. The graph is dumped to the model_dir directory and can be found in the "Graphs" tab in TensorBoard.

A node in the graph represents an XLA instruction. If an XLA instruction (for example, add) is lowered from a TensorFlow op (for example, x/y/z), it is shown as x/y/z/add in the graph.

The XLA graph can give you more information on how a TPU is executing a particular model and what the shapes of inputs and outputs of the different operations are. In conjunction with Trace Viewer, this can give you insight into where most of the runtime is spent.

image

Notes:

  • Not all XLA instructions have corresponding TensorFlow operations (for example, instructions injected by the XLA compiler), so some nodes don't have TensorFlow namespaces.
  • The TensorFlow program structure is incorporated in the XLA graph where possible. However, the XLA program running on TPU devices is highly optimized, so the graph structure might be quite different from the original TensorFlow program.
  • There is a special XLA instruction called fusion. This instruction can merge multiple instructions from different TensorFlow operations into a single computation. The TensorFlow operation corresponding to the root instruction in the fusion is used as the namespace of the entire fusion operation.

TPU Compatibility Checker

The TensorBoard graph viewer includes the TPU Compatibility Checker, a tool that calls out TensorFlow ops that may be problematic when compiling a model for TPU use. The tool works by scanning the model's TensorFlow graph for ops that are not currently available on the TPU.

Prerequisites

  • Make sure that you have configured your model to write the model graph to a file. If you are using the tf.estimator API, you do this by setting the model_dir property of your Estimator.
  • The TPU Compatibility Checker does not check any operations that have been explicitly assigned to a non-TPU device using manual device placement. For example, operations which are explicitly assigned to a GPU are skipped. If you have such assignments, remove the manual placements for any operations that you intend to run on the TPU.

Using the TPU Compatibility Checker

To access the TPU Compatibility Checker, open the Graphs tab in TensorBoard. Note that you can upload and check any model's graph by clicking the Choose File button. In the configuration pane on the left, go to the Color section and select the TPU Compatibility option:

image

The model graph now appears like this (using the Abalone tutorial as an example):

image

TPU-compatible operations are colored green; TPU-incompatible operations are colored red. Graph nodes that contain both compatible and incompatible operations show both colors in proportion to the percentage of compatible and incompatible operations contained in them.

The right-hand side of the screen displays a compatibility summary:

image

The percentage at the top shows what fraction of all operations is TPU compatible. Below that is a list of operations that are not compatible. Clicking one of these operations selects it in the main graph view, expanding nodes as necessary to make the operation visible.

Interpreting the Results

The Abalone example shows how to interpret the results of the TPU Compatibility Checker. Here again is the compatibility summary:

image

For the purpose of illustration, we show a theoretical compatibility summary that has several unavailable ops.

When this model was run, no manual device placement was specified. As a result, the Compatibility Checker checked all operations, even those that should always be run on the CPU. The various "save" and "report_uninitialized_variables" operations definitely fall in this category.

This leaves three operations that are potentially an issue: the GradientDescent operation and the two AssignAdd operations in root_mean_squared_error.

Let's look at the GradientDescent node:

image

The incompatible operation is an AssignAdd that updates the global step count. This operation would typically be run on the CPU, so it's not a concern.

Moving on to root_mean_squared_error, we see from the source code that it is only used as an additional evaluation metric:

# Calculate root mean squared error as additional eval metric
eval_metric_ops = {
    "rmse": tf.metrics.root_mean_squared_error(
        tf.cast(labels, tf.float64), predictions)
}

Because this operation is not part of the training loop, it can also be run on the CPU and is therefore not a concern either. In conclusion, this model is ready to be run on a TPU.

Trace Viewer

Trace Viewer is a performance analysis tool integrated into TensorBoard. Trace Viewer contains a timeline that shows:

  • How long the various operations in your TensorFlow model took to execute.
  • Which part of the system (TPU or host machine) executed an operation. Typically, the host machine executes infeed operations, which preprocess training data and transfer it to the TPU, whereas the TPU executes the actual model training.

Trace Viewer allows you to identify performance problems in your model, then take steps to resolve them. For example, at a high level, you can identify whether infeed or model training is taking the majority of the time. Drilling down, you can identify which TensorFlow operations are taking the longest to execute.

User Interface Overview

To open Trace Viewer, go to TensorBoard and click on the "Profile" tab at the top of the screen. You will see something like this:

image

This screen contains the following main elements (marked with numbers above):

  1. A Runs dropdown, which contains all of the runs for which you've captured trace information. By default, Trace Viewer opens the most recent run. You can open the dropdown to select a different run.
  2. A Tools dropdown, which selects different profiling tools.
  3. A Timeline pane, which shows the operations that the TPUs and host machine executed over time.
  4. A Details pane, which shows additional information for operations selected in the Timeline pane.

Here's a closer look at the Timeline pane:

image

The Timeline pane contains the following elements:

  1. A top bar, which contains various auxiliary controls.
  2. A time axis, which shows time relative to the beginning of the trace.
  3. Section and track labels. Each section contains multiple tracks and has a triangle on the left that you can click to expand and collapse the section. There is one section for every processing element in the system. Sections and tracks will be explained in more detail below.
  4. A tool selector, which contains various tools for interacting with the Trace Viewer.
  5. Events. These show the time during which an operation was executed or the duration of meta-events, such as training steps.
  6. A vertical tab bar. This does not have a useful purpose for TPUs. It exists because Trace Viewer is a general-purpose tool provided by Chrome that is used for a variety of performance analysis tasks.

We'll discuss sections, tracks, and events next, since this is where you'll spend most of your time.

Sections and Tracks

Trace Viewer contains the following sections:

  • One section for each TPU node, labeled with the number of the TPU chip and the TPU node within the chip (e.g. "Chip 2: TPU Core 1"). Each TPU node section contains the following tracks:
    • Step: This track shows the duration of the training steps that were running on the TPU.
    • TensorFlow Ops: TensorFlow operations executed on the TPU.
    • XLA Ops: XLA operations. Each TensorFlow operation is translated into one or several XLA operations. The XLA compiler then translates these XLA operations into code that runs on the TPU.
  • An additional section for threads running on the host machine's CPU, labeled "Host Threads". This section contains one track for each CPU thread.

    Note: Some other information is displayed alongside the section labels (e.g. "n-e93653ba-w-0", "pid 49"). This exists only for internal reasons and can be ignored.

Tool Selector

The tool selector contains tools that you can use to interact with the timeline view. Click a tool to make it active, or use the keyboard shortcuts described below. The currently active tool is highlighted. You can move the tool selector around the screen by clicking and dragging the dotted area at the top.

Here is what the individual tools do:

Selection tool
Click on an event to select it. Click and drag to select multiple events. Additional information about the selected event or events (name, start time, and duration) will be displayed in the details pane.

Pan tool
Click and drag to pan the timeline view horizontally and vertically.

Zoom tool
Click and drag up to zoom in or down to zoom out along the horizontal (time) axis. The horizontal position of the mouse cursor determines the center around which the zoom takes place.

Note: The zoom tool has a known bug where zoom remains active if you release the mouse button while the mouse cursor is outside the timeline view. If this happens to you, just click briefly on the timeline view to stop zooming.

Timing tool
Click and drag horizontally to mark a time interval. The length of the interval appears on the time axis. To adjust the interval, drag its ends. To clear the interval, click anywhere inside the timeline view.

Note that the interval remains marked if you select one of the other tools.

Events

Events have different colors to make it easier to distinguish them visually. The colors themselves have no specific meaning.

Top Bar (Timeline pane)

The top bar of the Timeline pane contains several auxiliary controls:

image

  1. Metadata display: Not used for TPUs.
  2. View Options: Not used for TPUs.
  3. Search box: Enter text to search for all events whose name contains the text. Click the arrow buttons to the right of the search box to move forwards and backwards through the matching events, selecting each event in turn.
  4. Console button: Not used for TPUs.
  5. Help button: Click to display a quick help summary.

Keyboard Shortcuts

Here are some keyboard shortcuts you can use in Trace Viewer. Click the help button in the top bar to see more keyboard shortcuts.

w Zoom in
s Zoom out
a Pan left
d Pan right
f Zoom to selected event(s)
m Mark time interval for selected event(s)
1 Activate selection tool
2 Activate pan tool
3 Activate zoom tool
4 Activate timing tool

The f shortcut in particular can be highly useful. Try selecting a step and pressing f to zoom into it.

Characteristic Events

Here are some event types that will be of interest when analyzing TPU performance.

image

  • InfeedDequeueTuple: This TensorFlow operation runs on the TPU and receives input data from the host. If it takes a long time, the TensorFlow operations that preprocess data on the host machine may not be able to keep up with the rate at which the TPU can consume the data. You can see corresponding events, called InfeedEnqueueTuple, in the host traces. For a more detailed analysis of the input pipeline, use the Input Pipeline Analyzer tool.

  • CrossReplicaSum: This TensorFlow operation runs on the TPU and computes a sum across replicas. Because each replica corresponds to a different TPU node, this operation needs to wait for all TPU nodes to be done with a step. If you see a lot of time being spent in this operation, it typically doesn't mean that the sum itself is slow but that the TPU node is waiting for some other TPU node. This often happens because other TPU nodes were delayed by a slow data infeed.

image

  • Dataset Ops: When you load data with the Dataset API, Trace Viewer visualizes the dataset operations. The Iterator::Filter::Batch::ForeverRepeat::Memory event in the example corresponds to the dataset.map() operation. Examining these operations in Trace Viewer is very helpful for debugging and mitigating input pipeline bottlenecks.

image

  • Prefetch Threads: Use dataset.prefetch() to buffer the input data. This technique prevents sporadic slowdowns in file access from creating a bottleneck in the input pipeline. Prefetch threads show up in Trace Viewer when dataset.prefetch() operations are captured.
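The buffering idea behind prefetching can be sketched in plain Python. This is a toy illustration of the concept only, not how tf.data implements dataset.prefetch():

```python
# Toy sketch of prefetching: a background thread fills a bounded queue so
# that a slow producer (e.g. file reads) overlaps with the consumer (the
# accelerator). Invented for illustration; not the tf.data implementation.
import queue
import threading

def prefetch(generator, buffer_size=2):
    q = queue.Queue(maxsize=buffer_size)  # bounded, like a prefetch buffer
    sentinel = object()                   # marks end of the input stream

    def producer():
        for item in generator:
            q.put(item)                   # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

batches = list(prefetch(iter(range(5))))
print(batches)  # [0, 1, 2, 3, 4]
```

The consumer sees the same elements in the same order; the only change is that production and consumption overlap in time.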

What Can Go Wrong

Here are some potential "gotchas" to be aware of when using Trace Viewer:

  • Event display limit: Trace Viewer displays a maximum of 1 million events. If you capture more events, only the earliest 1 million events are displayed; later events are dropped. To capture more TPU events, you can explicitly ask capture_tpu_profile to exclude dataset ops by passing the --include_dataset_ops=False flag.
  • Very long events: If an event began before the capture started or ended after the capture finished, it won't be visible in Trace Viewer. This means that very long events can be missed.
  • When to start trace capture: If you start trace capture too early, the TPU may still be starting up and you may see only a few events or no events at all. You can add the --duration_ms flag and/or the --num_tracing_attempts flag to increase the profiling duration and automatically retry trace collection when there is no trace event collected:

    (vm)$ capture_tpu_profile --tpu_name=$TPU_NAME \
          --logdir=${model_dir} --duration_ms=60000 --num_tracing_attempts=10
    

Op Profile

TensorBoard also contains the Op Profile, a tool that displays the performance statistics of XLA operations executed during the profiling period. Op Profile shows:

  • How your application uses the TPU. The TPU FLOPS utilization reported is defined as the measured number of floating point operations per second (FLOPS) normalized to the peak FLOPS available on the TPU.
  • The most time consuming operations. Those operations are potential targets for optimization.
  • Details of individual operations, including the shape, padding and expression.

Op Profile provides insights into how well your model uses the TPU and helps you find good targets for optimization. For example, if your model achieves only 5% of the TPU's peak FLOPS, you can drill down and identify which XLA operations take the longest to execute and how much TPU FLOPS they consume.
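The utilization figure is just a ratio of measured to peak throughput. A minimal sketch of that arithmetic, using invented numbers rather than real device specs:

```python
# FLOPS utilization = measured FLOPS / peak FLOPS of the device.
# Numbers below are hypothetical: an op sustaining 9.0 TFLOPS on a
# device with a 180 TFLOPS peak.
measured_tflops = 9.0
peak_tflops = 180.0

utilization = measured_tflops / peak_tflops
print(f"{utilization:.0%}")  # 5%
```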

Using the Op Profile

While collecting a profile, capture_tpu_profile also collects an op_profile.json file that contains performance statistics for XLA operations. To open Op Profile, go to TensorBoard and click the Profile tab at the top of the screen. Select op_profile from the Tools dropdown. You will see something like this:

image

  1. An overview section, which shows the overall TPU utilization and the operation that consumed the most time during the profiling period. The tool also tells you how well that operation uses the computational potential of the chip and gives suggestions for optimization.
  2. A control panel. You can select how many ops to show for each XLA category by sliding the bar on the left. You can also toggle the button on the right to list only ops within the 90th percentile of the total execution time.
  3. An op table, which lists XLA operations by category, sorted by time spent in descending order.
  4. Op details cards. When you hover over a table entry, a card appears showing more details about the operation, for example, the FLOPS utilization, XLA expression and the layout.

Op Table

Each entry in the table contains multiple columns and has a triangle on the left that you can click to expand and collapse the entry. There is one entry for each operation category. For each category, the table shows the time, operation category name, the name of the associated TensorFlow op and its FLOPS utilization.

image

  1. Time, which shows the total percentage of the time spent by all the operations in that category. You can click to expand the entry and see the breakdown of the time spent by each individual operation.
  2. Horizontal Bar, which visualizes the time distribution across categories.
  3. Top 10 Ops. When you expand a category, the 10 operations that consumed the most time are listed. You can further expand a fusion op entry to see the non-fusion elementwise operations it contains.
  4. TensorFlow Op, which shows the TensorFlow op name associated with the XLA operation.
  5. FLOPS, which shows the FLOPS utilization, that is, the measured FLOPS normalized to the peak FLOPS of the device. Higher FLOPS utilization is better because it means operations run faster. The table cell is color coded: green for high FLOPS utilization (good) and red for low FLOPS utilization (bad).

Op Details Cards

When you hover over a table entry, a card appears on the left telling you more details about the XLA op or the operation category. A typical card looks like this:

image

  1. Name, which shows the XLA operation name.
  2. Category, which shows the operation category.
  3. FLOPS utilization, which includes the value and a color coded progress bar.
  4. Expression, which shows the XLA expression of the op.
  5. Layout (optional), which shows the shape and layout of a tensor. Note that layout is shown only for convolution operations. The tool also shows whether the shape of the tensor is an exact fit for the matrix units and how it is padded.

Interpreting the Results

For illustration, this section gives a quick interpretation of the numbers shown in the above example. Overall, the model reaches 34% of the highest FLOPS the device can achieve. Output fusion and convolution dominate the execution time, and there is also a long tail of vector and scalar operations with very low FLOPS utilization. One optimization strategy is to transform those vector and scalar operations into convolution operations.

For convolution ops, the TPU FLOPS utilization can also be low due to the following reasons:

  • Padding: the matrix units are only partially used.
  • The convolution op is memory bound.

In the following example, %convolution.11 shows lower FLOPS utilization than %convolution.193 in the previous example.

image

Taking a closer look at its layout, there is padding from 64 to 128, which indicates that only half of the matrix units are effectively used. Therefore, compared to the previous case, which was an exact fit, the FLOPS utilization is almost halved.
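The padding arithmetic above can be sketched as follows. The 128-wide matrix unit and the dimension of 64 come from the discussion; the round-up-to-multiple rule is an assumption of this sketch:

```python
# Rough model of the padding effect: a dimension of 64 is padded up to the
# 128-wide matrix unit, so only half of the unit does useful work.
MATRIX_UNIT_WIDTH = 128  # assumed width of the TPU matrix unit

def padded(dim, unit=MATRIX_UNIT_WIDTH):
    # Round dim up to the next multiple of the matrix unit width.
    return -(-dim // unit) * unit

dim = 64
effective_fraction = dim / padded(dim)
print(effective_fraction)  # 0.5
```

An exact fit (dim a multiple of 128) gives an effective fraction of 1.0, which matches the "almost halved" comparison in the text.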

Input Pipeline Analyzer

TensorBoard provides a powerful tool to analyze the TensorFlow input pipeline. When a TensorFlow program reads data from files, the data passes through the beginning of the TensorFlow graph in a pipelined manner: the read process is divided into multiple data processing stages connected in series, where the output of one stage is the input of the next. This process of reading files is called the input pipeline.

A typical pipeline for reading records from files has the following stages:

  1. File reading
  2. File preprocessing (optional)
  3. Transfer of the data from the host machine to the device

An inefficient input pipeline can severely slow down your application. An application is input bound when it spends a significant portion of its time in the input pipeline. This tool presents an in-depth analysis of your input pipeline's performance based on the performance data collected. At a high level, the tool tells you whether your program is input bound. If it is, the tool can walk you through the device-side and host-side analysis to debug which stage of the pipeline is the bottleneck.
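As a toy model of the headline number the tool reports (the timings below are invented, not real profiler output), the input-bound fraction is the share of device step time spent waiting for input:

```python
# Hypothetical per-step samples: (ms spent waiting for input, total device
# step ms). The tool reports the fraction of device time spent waiting.
samples = [
    (3.0, 10.0),
    (4.0, 10.0),
    (5.0, 10.0),
]

input_fraction = sum(w for w, _ in samples) / sum(t for _, t in samples)
print(f"{input_fraction:.0%}")  # 40%
```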

User Interface Overview

The Input Pipeline Analyzer tool reads the performance analysis results from an input_pipeline.json file that is also collected by capture_tpu_profile. To open the Input Pipeline Analyzer, select input_pipeline_analyzer from the Tools dropdown. The analysis contains three sections:

image

  1. Summary, which tells you the overall input pipeline analysis: whether your application is input bound and by how much.
  2. Device-side analysis, which shows you the detailed device-side analysis results, including the device step time and how much is spent waiting for the input data.
  3. Host-side analysis, which shows you the detailed analysis on the host side, including a breakdown of input processing time on the host, and a tabular view of details for each input operation.

How to Tell Whether Your Application is Input Bound

Section 1 is a summary of the overall analysis. It reports whether your TPU program is input bound and by how much (the percentage of device time spent waiting for input from the host). In addition, if you are using a standard input pipeline that has been instrumented, the tool reports where most of the input processing time is spent. For example:

image

Device-side Analysis

Section 2 shows the details of device-side analysis, which gives you insights on how much time is spent in the device versus in the host and how much device time is spent waiting for input data from the host.

image

  1. Step time plotted against step number, which plots a graph of device step time (in milliseconds) over all the steps sampled. The blue area corresponds to the part of the step time that is waiting for input data from the host while the orange area corresponds to the non-input time.
  2. Step time statistics, which reports the average, standard deviation and the range ([minimum, maximum]) of the device step time.
  3. Range of time waiting for input data, plotted against step number, which plots a line chart showing the fraction of device time spent waiting for input data processing (normalized to total device step time) over all the steps. Note that the fraction of time spent varies across TPU cores. Therefore, in addition to the fraction averaged across all cores, the range of the fractions across cores is also plotted for each step. Ideally, you want this range to be as small as possible, because the eventual step time of a particular step is determined by the slowest core.
  4. Fraction of time waiting for input data, which reports the average, standard deviation and the range ([minimum, maximum]) of the fraction of time spent in device waiting for the input data normalized to the total device step time.

Host-side Analysis

Section 3 shows the details of host-side analysis, which reports a breakdown of the input processing time (the time spent on the Dataset API operations) on the host into several categories:

  • Reading data from files on demand, which is the time spent on reading data from files without caching, prefetching, and interleaving.
  • Reading data from files in advance, including caching, prefetching, and interleaving.
  • Data preprocessing, for example, image decompression.
  • Enqueuing data to be transferred to the device, which TensorFlow typically does by putting the data into an infeed queue before transferring it to the device.

image

If you want to see the statistics of individual input operations and their categories in the execution time breakdown, you can click the button "Show Input Op statistics". You will see a table like this:

image

Each table entry contains the following information:

  1. Input Op, which shows the TensorFlow op name of the input operation.
  2. Count, which shows the total number of instances of the operation executed during the profiling period.
  3. Total Time, which shows the cumulative sum of the wall clock time spent on each of those instances.
  4. Total Time %, which shows the total time spent on that operation as a fraction of the total time spent in input processing.
  5. Total Self Time, which shows the cumulative sum of the self time spent on each of those instances. Self time measures the time spent inside the function body, excluding the time spent in the functions it calls. For example, Iterator::PaddedBatch::Filter::ForeverRepeat::Map is called by Iterator::PaddedBatch::Filter, so the time spent in the Map operation is excluded from the self time of the Filter operation.
  6. Total Self Time %, which shows the total self time as a fraction of the total time spent on input processing.
  7. Category, which corresponds to the categories defined above for the breakdown.
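The self-time relationship between the two ops named in item 5 can be sketched with invented timings (these numbers are illustrative, not from a real profile):

```python
# Toy illustration of total time vs. self time for nested input ops.
# A parent op's self time excludes the time spent in the op it calls.
total_time = {
    "Iterator::PaddedBatch::Filter": 50.0,                       # ms, incl. child
    "Iterator::PaddedBatch::Filter::ForeverRepeat::Map": 30.0,   # ms
}

# Self time of the parent = its total time minus its child's total time.
self_time_filter = (
    total_time["Iterator::PaddedBatch::Filter"]
    - total_time["Iterator::PaddedBatch::Filter::ForeverRepeat::Map"]
)
print(self_time_filter)  # 20.0
```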