Using Cloud TPU Tools

After getting a training script running on Cloud TPU, you can analyze the performance of your model on Cloud TPU using TensorBoard.

TensorBoard is a suite of tools designed to display TensorFlow metrics visually. Using the Cloud TPU Profiler TensorBoard plugin, you can access Cloud TPU profiling data in TensorBoard. The Cloud TPU Profiler plugin supports performance visualizations for Cloud TPU nodes of all sizes.

When you install the Cloud TPU Profiler plug-in, the capture_tpu_profile script is also installed. You use the capture_tpu_profile script to capture a profile, or to continuously monitor your Cloud TPU job (see Monitoring your job).

When you use the capture_tpu_profile script to capture a profile, two files (.trace and .tracetable) are saved to the Cloud Storage bucket you specify. The .trace file contains up to one million trace events that can be viewed in the trace viewer. The .tracetable file contains up to 2 GB of trace events that can be viewed in the streaming trace viewer.

Prerequisites

Before you can use the Cloud TPU profiling tools described in this guide, you must complete the following:

Creating Cloud TPU resources

You need a VM and TPU to follow along in this tutorial. Follow the instructions in the Cloud TPU EfficientNet tutorial up to the Train and evaluate the EfficientNet model with fake_imagenet section. Do not run the training script yet. The EfficientNet tutorial walks you through creating your VM and TPU, installing some libraries, and running the training script.

Installing the Cloud TPU Profiler TensorBoard Plug-in

Open up a new command prompt and connect to the VM you created in the previous section using this command:

    (vm)$ gcloud compute ssh efficientnet-tutorial --zone=us-central1-b

Use the following commands to install or upgrade the Cloud TPU Profiler TensorBoard Plug-in:

  (vm)$ pip3 install --upgrade "cloud-tpu-profiler>=2.3.0"
  (vm)$ sudo pip3 install --upgrade "tensorboard>=2.3"
  (vm)$ sudo pip3 install --upgrade "tensorflow>=2.3"

Create the following environment variables:

  (vm)$ export STORAGE_BUCKET=gs://bucket-name
  (vm)$ export MODEL_DIR=${STORAGE_BUCKET}/efficientnet-2x
  (vm)$ export PATH="$PATH:`python -m site --user-base`/bin"
  (vm)$ export TPU_NAME=efficientnet-tutorial

Capturing a profile

In your first command prompt, run the training script and wait until you see output indicating that your model is training. What this looks like depends on your code or model; look for output like Epoch 1/100. Alternatively, you can navigate to the Cloud TPU page, select your TPU, and view the CPU utilization graph. While this graph does not show TPU utilization, it is a good indication that the TPU is training your model.

To capture a profile, run the following command from your second command prompt:

  (vm)$ capture_tpu_profile --tpu=$TPU_NAME \
      --logdir=${MODEL_DIR}

As the model runs, the capture_tpu_profile script creates a directory (object) and writes profiling data to the Cloud Storage bucket you specify with the --logdir parameter. By default, capture_tpu_profile captures a 2-second trace. You can set the trace duration with the --duration_ms command-line option.
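
If you launch captures from a script rather than by hand, the command line can be assembled programmatically. The following is a minimal sketch, not part of the profiler itself: build_capture_command is a hypothetical helper, and it only uses flags named in this guide (--tpu, --logdir, --duration_ms, and --num_tracing_attempts). The TPU name and bucket path are the placeholder values used in this tutorial.

```python
import shlex


def build_capture_command(tpu_name, logdir, duration_ms=None,
                          num_tracing_attempts=None):
    """Return the capture_tpu_profile argument list (hypothetical helper)."""
    cmd = ["capture_tpu_profile", f"--tpu={tpu_name}", f"--logdir={logdir}"]
    # Optional flags are added only when the caller asks for them, so the
    # tool's own defaults (for example, the 2-second trace) still apply.
    if duration_ms is not None:
        cmd.append(f"--duration_ms={duration_ms}")
    if num_tracing_attempts is not None:
        cmd.append(f"--num_tracing_attempts={num_tracing_attempts}")
    return cmd


# Placeholder values taken from the environment variables in this guide.
cmd = build_capture_command(
    "efficientnet-tutorial",
    "gs://bucket-name/efficientnet-2x",
    duration_ms=60000,
)
print(shlex.join(cmd))
```

On the VM, the resulting list could be passed to subprocess.run(cmd, check=True) to start the capture.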

Open a new command prompt and connect to your VM using the following command:

  (vm)$ gcloud compute ssh efficientnet-tutorial \
  --zone=us-central1-b \
  --ssh-flag=-L6006:localhost:6006

Create the following environment variables:

  (vm)$ export STORAGE_BUCKET=gs://bucket-name
  (vm)$ export MODEL_DIR=${STORAGE_BUCKET}/efficientnet-2x

Run TensorBoard and point it to the directory that contains the profiling data:

  (vm)$ tensorboard --logdir=${MODEL_DIR}

TensorBoard starts a web server and displays its URL:

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.3.0 at http://localhost:6006/ (Press CTRL+C to quit)

Viewing profiling data in TensorBoard

Open a web browser and go to the URL displayed in the TensorBoard output. Make sure TensorBoard has fully loaded the profiling data by clicking the reload button in the upper-right corner of the TensorBoard page. By default, the TensorBoard page comes up with the Scalars tab selected.

image

Graphs

TensorBoard provides a number of visualizations, or graphs, of your model and its performance. Use the graphs together with the trace viewer or streaming trace viewer to fine tune your models and improve their performance on Cloud TPU.

TensorFlow graph

During model compilation, before the model is run, TensorFlow generates a graph that is run on the Cloud TPU. The data for the graph is stored in the MODEL_DIR directory in the storage bucket you specify with the --logdir parameter. You can view this graph without running capture_tpu_profile.

To view a model's TensorFlow graph, select the Graphs tab in TensorBoard. The default selection for Color is Structure.

image

A single node in the structure graph represents a TensorFlow operation.

TPU compatibility graph

The Graphs tab includes a compatibility checker module which checks for and displays TensorFlow ops that can potentially cause issues when a model is run.

To view a model's TPU compatibility graph, select the Graphs tab in TensorBoard and then select the TPU Compatibility option. The graph presents the compatible (valid) operations in green and the incompatible (invalid) operations in red.

image

A given node can display both colors, each as a percentage of the Cloud TPU compatibility operations for that node. See Interpreting compatibility results for an example.

The compatibility summary panel displayed to the right of the graph shows the percentage of all Cloud TPU-compatible operations, their attributes and a list of incompatible operations for a selected node.

Click on any operation in the graph to display its attributes in the summary panel.

image

Note that the compatibility checker does not assess any operations that are explicitly assigned to a non-TPU device using manual device placement. In addition, the checker does not actually compile the model for execution, so be sure to interpret the results as an estimate of compatibility.

Interpreting compatibility results

The following diagram from the Abalone model shows a sample compatibility summary with several unavailable operations noted:

image

No manual device placement was specified for this model so all operations were checked, even those that should always be run on the CPU, such as the "save" and "report_uninitialized_variables" operations shown.

Two AssignAdd operations in root_mean_squared_error are potential problems.

You can see from the source code that root_mean_squared_error is used only as an additional evaluation metric:

    # Calculate root mean squared error as additional eval metric
    eval_metric_ops = {
        "rmse": tf.metrics.root_mean_squared_error(
            tf.cast(labels, tf.float64), predictions)
    }

Unless it occurs inside a training loop, this operation is normally run on the CPU so the error report can be disregarded. In conclusion, the model is ready to be run on a Cloud TPU.

Profile

The Profile tab is displayed after you have captured some model data. You may need to click the reload button in the upper-right corner of the TensorBoard page. Once data is available, clicking on the Profile tab presents a selection of tools to help with performance analysis:

Profile overview page

The overview page (overview_page), available under Profile, provides a top-level view of how your model performed during a capture run. The page shows an aggregated overview for all the TPUs, as well as an overall input pipeline analysis. You can select individual TPUs in the Host dropdown.

The page displays data in the following panels:

image

  • Performance summary

    • Average Step Time - The step time averaged over all sampled steps
    • Host Idle Time - The percentage of time the Host was idle
    • TPU Idle Time - The percentage of time the TPU was idle
    • FLOPS Utilization - The percentage utilization of the TPU matrix units
    • Memory Bandwidth Utilization - The percentage of memory bandwidth used
  • Step-time graph. Displays a graph of device step time (in milliseconds) over all the steps sampled. The blue area corresponds to the portion of the step time the TPUs were sitting idle waiting for input data from the host. The red area shows how much of the step time the Cloud TPU was actually working.

  • Top 10 TensorFlow operations on TPU. Displays the TensorFlow operations that consumed the most time:

    Each row displays an operation's self time (as the percentage of time taken by all operations), cumulative time, category, name, and the FLOPS rate achieved.

  • Run environment

    • Number of hosts used
    • Type of TPU used
    • Number of TPU cores
    • Training batch size
  • Recommendation for next steps. Reports when a model is input bound and whenever issues with Cloud TPU occur. Suggests tools you can use to locate performance bottlenecks.

Input pipeline analyzer

The input pipeline analyzer provides insights into your performance results. The tool displays performance results from the input_pipeline.json file that is collected by the capture_tpu_profile tool.

The tool tells you immediately whether your program is input bound and can walk you through device and host-side analysis to debug whatever stage(s) of the pipeline are creating bottlenecks.

See the guidance on input pipeline performance for deeper insight into optimizing pipeline performance.

Input pipeline

When a TensorFlow program reads data from a file, it begins at the top of the TensorFlow graph in a pipelined manner. The read process is divided into multiple data processing stages connected in series, where the output of one stage is the input to the next one. This system of reading is called the input pipeline.

A typical pipeline for reading records from files has the following stages:

  1. File reading
  2. File preprocessing (optional)
  3. File transfer from the host machine to the device

An inefficient input pipeline can severely slow down your application. An application is considered input bound when it spends a significant portion of time in its input pipeline. Use the input pipeline analyzer to understand where the input pipeline is inefficient.
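
To make "input bound" concrete, the following sketch computes the share of each device step spent waiting for input and summarizes it with the same kind of statistics the analyzer reports (average, standard deviation, and range). All names and numbers here are hypothetical, and the 20% cutoff is purely illustrative; it is not a threshold taken from the profiler.

```python
from statistics import mean, pstdev


def input_wait_summary(step_times_ms, wait_times_ms):
    """Summarize the percentage of each device step spent waiting for input."""
    if not step_times_ms or len(step_times_ms) != len(wait_times_ms):
        raise ValueError("expected two non-empty lists of equal length")
    pct = [100.0 * wait / step
           for step, wait in zip(step_times_ms, wait_times_ms)]
    return {
        "average": mean(pct),
        "std_dev": pstdev(pct),
        "min": min(pct),
        "max": max(pct),
    }


# Hypothetical per-step measurements in milliseconds.
step_times_ms = [100.0, 120.0, 110.0, 100.0]
wait_times_ms = [40.0, 60.0, 55.0, 45.0]

summary = input_wait_summary(step_times_ms, wait_times_ms)

# Illustrative cutoff only; the profiler's own criterion is not stated here.
INPUT_BOUND_THRESHOLD_PCT = 20.0
is_input_bound = summary["average"] > INPUT_BOUND_THRESHOLD_PCT

print(summary)
print("input bound:", is_input_bound)
```

With these sample values the device waits for input during roughly half of every step, which is the kind of profile the input pipeline analyzer is meant to surface.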

Input pipeline dashboard

To open the input pipeline analyzer, select Profile, then select input_pipeline_analyzer from the Tools dropdown.

The dashboard contains three sections:

image

  1. Summary. Summarizes the overall input pipeline with information on whether your application is input bound and, if so, by how much.
  2. Device-side analysis. Displays detailed, device-side analysis results, including the device step-time and the range of device time spent waiting for input data across cores at each step.
  3. Host-side analysis. Shows a detailed analysis on the host side, including a breakdown of input processing time on the host.

Input pipeline summary

The first section reports if your program is input bound by presenting the percentage of device time spent on waiting for input from the host. If you are using a standard input pipeline that has been instrumented, the tool reports where most of the input processing time is spent. For example:

image

Device-side analysis

The second section details the device-side analysis, providing insights on time spent on the device versus on the host and how much device time was spent waiting for input data from the host.

image

  1. Device step time statistics. Reports the average, standard deviation, and range (minimum, maximum) of the device step time.
  2. Step time. Displays a graph of device step time (in milliseconds) over all the steps sampled. The blue area corresponds to the part of the step time Cloud TPUs sat idle waiting for input data from the host. The red area shows how much of the step time the Cloud TPU was actually working.
  3. Percentage of time waiting for input data. Reports the average, standard deviation and the range (minimum, maximum) of the fraction of time spent on a device waiting for the input data normalized to the total device step time.
  4. Range of device time across cores spent waiting for input data, by step number. Displays a line chart showing the amount of device time (expressed as a percentage of total device step time) spent waiting for input data processing. The fraction of time spent varies from core to core, so the range of fractions for each core is also plotted for each step. Since the time a step takes is determined by the slowest core, you want the range to be as small as possible.

Host-side analysis

Section 3 shows the details of the host-side analysis, reporting the input processing time (the time spent on Dataset API operations) on the host, broken into several categories:

  • Enqueuing data to be transferred to device. Time spent putting data into an infeed queue before transferring the data to the device.
  • Data preprocessing. Time spent on preprocessing operations, such as image decompression.
  • Reading data from files in advance. Time spent reading files, including caching, prefetching, and interleaving.
  • Reading data from files on demand. Time spent on reading data from files without caching, prefetching, and interleaving.
  • Other data reading or processing. Time spent on other input related operations not using tf.data.

image

To see the statistics for individual input operations and their categories broken down by execution time, expand the "Show Input Op statistics" section.

A source data table like the following appears:

image

Each table entry contains the following information:

  1. Input Op. Shows the TensorFlow op name of the input operation.
  2. Count. Shows the total number of instances of the operation executed during the profiling period.
  3. Total Time (in ms). Shows the cumulative sum of the time spent on each of the operation instances.
  4. Total Time %. Shows the total time spent on an operation as a fraction of the total time spent in input processing.
  5. Total Self Time (in ms). Shows the cumulative sum of the self time spent on each of those instances. The self time measures the time spent inside the function body, excluding the time spent in the functions it calls. For example, Iterator::PaddedBatch::Filter::ForeverRepeat::Map is called by Iterator::PaddedBatch::Filter, so its total self time is excluded from the total self time of the latter.
  6. Total Self Time %. Shows the total self time as a fraction of the total time spent on input processing.
  7. Category. Shows the processing category of the input operation.
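
The relationship between total time and self time can be sketched in a few lines. This is an illustration under stated assumptions, not the profiler's implementation: self_times is a hypothetical helper, the timings are invented, and the caller/callee relationship is the one given in the example above.

```python
def self_times(total_ms, children):
    """Self time = an op's total time minus the total time of ops it calls."""
    return {
        op: total_ms[op] - sum(total_ms[child] for child in children.get(op, []))
        for op in total_ms
    }


FILTER = "Iterator::PaddedBatch::Filter"
MAP = "Iterator::PaddedBatch::Filter::ForeverRepeat::Map"

# Hypothetical total times in milliseconds.
total_ms = {FILTER: 50.0, MAP: 30.0}
# Filter calls Map, as in the example above.
children = {FILTER: [MAP]}

result = self_times(total_ms, children)
for op, ms in result.items():
    print(f"{op}: {ms} ms self time")
```

Here Filter has 50 ms of total time but only 20 ms of self time, because the 30 ms spent in Map is attributed to Map.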

Op profile

Op profile is a Cloud TPU tool that displays the performance statistics of XLA operations executed during a profiling period. The op profile shows:

  • How well your application uses the Cloud TPU as a percentage of time spent on operations by category and of TPU FLOPS utilization.
  • The most time-consuming operations. Those operations are potential targets for optimization.
  • Details of individual operations, including shape, padding and expressions that use the operation.

You can use op profile to find good targets for optimization. For example, if your model achieves only 5% of the TPU peak FLOPS, you can use the tool to identify which XLA operations are taking the longest time to execute and how many TPU FLOPS they consume.
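
As a rough illustration of what a figure such as "5% of the TPU peak FLOPS" means, the sketch below expresses a measured FLOPS rate as a percentage of a peak rate. The function name and both numbers are hypothetical; they are not specifications of any Cloud TPU version.

```python
def flops_utilization_pct(measured_flops, peak_flops):
    """Return measured FLOPS as a percentage of the peak FLOPS."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return 100.0 * measured_flops / peak_flops


# Hypothetical rates, in floating-point operations per second.
measured = 2.0e12
peak = 40.0e12

utilization = flops_utilization_pct(measured, peak)
print(f"FLOPS utilization: {utilization:.1f}%")
```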

Using op profile

During profile collection, capture_tpu_profile also creates an op_profile.json file that contains performance statistics of XLA operations.

You can view the data from op_profile in TensorBoard by clicking on the Profile tab at the top of the screen and then selecting op_profile from the Tools dropdown. You will see a display like this:

image

  1. Overview section. Shows Cloud TPU utilization and provides suggestions for optimization.
  2. Control panel. Contains controls that allow you to set the number of operations displayed in the table, which operations are displayed, and how they are sorted.
  3. Op table. A table that lists the top TensorFlow operation categories associated with the XLA ops. These operations are sorted by percentage of Cloud TPU usage.
  4. Op details cards. Details about the op that appear when you hover over an op in the table. These include the FLOPS utilization, the expression in which the op is used, and the op layout (fit).

XLA Op table

The Op table lists XLA operation categories in order from the highest to lowest percentage of Cloud TPU usage. Initially, the table shows the percentage of time taken, the op category name, the associated TensorFlow op name, and the percentage of FLOPS utilization for the category. To display (or hide) the 10 most time-consuming XLA operations for a category, click the triangle next to the category name in the table.

image

  1. Time. Shows the total percentage of time spent by all the operations in that category. You can click to expand the entry and see the breakdown of time spent by each individual operation.
  2. Top10 Ops. The toggle next to a category's name displays/hides the top 10 time-consuming operations within the category. If a fusion operation entry is displayed in the operations list, you can expand it to see the non-fusion, elementwise operations it contains.
  3. TensorFlow Op. Shows the TensorFlow op name associated with the XLA operation.
  4. FLOPS. Shows the FLOPS utilization, which is the measured number of FLOPS expressed as a percentage of the Cloud TPU peak FLOPS. The higher the FLOPS utilization percentage, the faster operations run. The table cell is color coded: green for high FLOPS utilization (good) and red for low FLOPS utilization (bad).

Op details cards

When you select a table entry, a card appears on the left displaying details about the XLA op or the operation category. A typical card looks like this:

image

  • Name and Category. Shows the highlighted XLA operation name and category.
  • FLOPS utilization. Displays FLOPS utilization as a percentage of total FLOPS possible.
  • Expression. Shows the XLA expression containing the operation.
  • Memory Utilization. Displays the percentage of peak memory usage by your program.
  • Layout (convolution operations only). Shows the shape and layout of a tensor, including whether the shape of the tensor is an exact fit for the matrix units and how the matrix is padded.

Interpreting results

For convolution operations, TPU FLOPS utilization can be low due to one or both of the following reasons:

  • padding (matrix units are partially used)
  • convolution op is memory bound

This section gives an interpretation of some numbers from a different model in which FLOPS utilization was low. In this example, output fusion and convolution dominated the execution time and there was a long tail of vector or scalar operations that had very low FLOPS.

One optimization strategy for this type of profile is to transform the vector or scalar operations to convolution operations.

In the following example, %convolution.399 shows lower FLOPS and memory utilization than %convolution.340 in the previous example.

image

Examine the layout and note that batch size 16 is being padded to 128 and feature size 3 is being padded to 8, which indicates that only 5% of the matrix units are being effectively used. (The calculation for this instance of percent utilization is ((batch_time * num_of_features) / padding_size) / num_of_cores.) Compare the FLOPS in this example to the %convolution.340 in the previous example, which has an exact fit to the matrix.
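
The 5% figure can be checked with the numbers quoted above. The sketch below treats effective utilization as the fraction of the padded batch dimension that holds real data multiplied by the same fraction for the feature dimension; the function name is hypothetical and this is one reading of the calculation, not the tool's own code.

```python
def padded_utilization(batch_size, padded_batch, feature_size, padded_features):
    """Fraction of the padded tensor that holds real (unpadded) data."""
    return (batch_size / padded_batch) * (feature_size / padded_features)


# Batch size 16 padded to 128, feature size 3 padded to 8.
utilization = padded_utilization(16, 128, 3, 8)
print(f"{utilization:.1%}")  # prints 4.7%
```

The result, about 4.7%, rounds to the 5% stated in the text.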

Pod viewer

The Pod viewer tool provides performance visualizations for every core in a Pod and displays the status of the communications channels across the cores in a Pod. Pod viewer can identify and highlight potential bottlenecks and areas that need optimization. The tool works for full Pods and all v2 and v3 Pod slices.

To display the Pod viewer tool:

  1. Select Profile from the menu button at the top right side of the TensorBoard window.
  2. Click the Tools menu on the left side of the window and select pod_viewer.

The Pod viewer user interface includes:

  1. A step slider, which allows you to select which step you want to examine.
  2. A topology graph, which interactively visualizes your TPU cores in the whole TPU system.
  3. A communication links chart, which visualizes the send and receive (recv) channels in the topology graph.
  4. A latency of send and recv channels bar chart. Hovering over a bar in this chart activates the communication links in the communication links chart. A channel details card appears on the left-hand bar, providing detailed information of the channel, such as the size of data transferred, latency, and bandwidth.
  5. A step breakdown chart, which visualizes a breakdown of a step for all cores. This can be used to track system bottlenecks and whether a particular core is slowing down the system.

image

Step slider

Use the slider to select a step. The rest of the tool displays statistics, such as step breakdown and communication links, for that step.

Topology graph

The topology graph is organized hierarchically by host, chip, and core. The smallest rectangles are TPU cores. Two cores together indicate a TPU chip and four chips together indicate a host.

image

The topology graph is also a heatmap, color coded by the percentage of time a particular breakdown (for example, High flops compute, infeed, send, etc.) takes in the selected step. The bar just below the topology graph (shown in the following graphic) shows a color coding for core and chip usage. The color of the cores shows the utilization, ranging from yellow to blue. For High flops compute, larger numbers (darker color) indicate more time spent doing compute. For all other breakdowns, smaller numbers (lighter colors) indicate smaller wait times. Potential problem areas, or hotspots, are indicated when a core is darker than the others.

Click on the pulldown menu selector next to the system name (circled in the diagram) to choose the particular type of breakdown you want to examine.

Hover your mouse over any of the small rectangles (single cores) to display a tooltip showing the core's position in the system, its global chip ID, and its host name. The tooltip also includes the duration of the selected breakdown category, for example High flops, and its utilization percentage out of a step.

image

Communication channels

This tool helps visualize send and recv links if your model uses them to communicate between cores. When your model contains send and recv ops, you can use a channel ID selector to select a channel ID. A link between the source (src) core and the destination (dst) core represents the communication channel. To render the link on the topology graph, hover your mouse over the bars on the chart showing the latency of send and recv channels.

image

A card appears on the left-hand bar giving you more details about the communication channel. A typical card looks like this:

image

  1. Data Transferred, which shows the data transferred by the send and recv channel in mebibytes (MiB).
  2. Latency, which shows the duration, in microseconds, from the start of the send event to the end of the recv-done event.
  3. BW, which shows the amount of data transferred, in gibibytes (GiB), from the source core to the destination core in the duration of time.
  4. Send Delay, which is the duration from the beginning of the recv-done to the beginning of send in microseconds. If the recv-done op starts after the beginning of the send op, the delay is zero.
  5. Hlo Names, which displays the XLA HLO op names associated with this channel. These HLO names are associated with the statistics displayed in other TensorBoard tools such as op_profile and memory_viewer.
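
The card fields are related by simple arithmetic, sketched below from the definitions in the list. The function, its parameters, and the sample timestamps are hypothetical, and expressing BW as GiB per second is an assumption about how the amount of data and the duration are combined.

```python
def channel_stats(bytes_transferred, send_start_us, recv_done_start_us,
                  recv_done_end_us):
    """Derive channel card fields from event timestamps (in microseconds)."""
    # Latency: start of the send event to the end of the recv-done event.
    latency_us = recv_done_end_us - send_start_us
    if latency_us <= 0:
        raise ValueError("recv-done must end after send starts")
    # Send delay: beginning of recv-done to beginning of send, or zero when
    # recv-done starts after send.
    send_delay_us = max(0.0, send_start_us - recv_done_start_us)
    return {
        "data_transferred_mib": bytes_transferred / 2**20,
        "latency_us": latency_us,
        # Assumption: bandwidth expressed as GiB per second.
        "bw_gib_per_s": (bytes_transferred / 2**30) / (latency_us * 1e-6),
        "send_delay_us": send_delay_us,
    }


# Hypothetical channel: 4 MiB sent, recv-done began 20 us before send.
stats = channel_stats(
    bytes_transferred=4 * 2**20,
    send_start_us=100.0,
    recv_done_start_us=80.0,
    recv_done_end_us=600.0,
)
print(stats)
```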

Step breakdown chart

This chart provides details for each training or evaluation step.

The x-axis is the global chip ID and the y-axis is the time in microseconds. From this chart, you can see where the time is used in a particular training step, where any bottlenecks are, and whether there is a load imbalance across all chips.

image

A card appears on the left-hand bar giving you more details about the step breakdown. A typical card looks like this:

image

The fields in the card specify the following:

  1. High Flops Compute, which is the time spent on convolution or output fusion operations (ops).
  2. Low flops compute, which is calculated by deducting all other breakdowns from the total duration.
  3. Infeed, which is the time the TPU spends waiting on the host.
  4. Outfeed, which is the time the host spends waiting on output from the TPU.
  5. AllReduce sync, which is the portion of time spent on CrossReplicaSum ops that is waiting to synchronize with other cores. CrossReplicaSum ops compute the sum across replicas.
  6. AllReduce compute, which is the actual compute time spent on CrossReplicaSum ops.
  7. Chip to chip send ops, which is the time spent on send operations.
  8. Chip to chip recv-done ops, which is the time spent on recv operations.
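
Because Low flops compute is defined as whatever remains after the other categories are deducted, it can be derived from the rest of the card. The sketch below uses invented durations and hypothetical names to show the subtraction.

```python
def low_flops_compute_us(total_step_us, other_breakdowns_us):
    """Low flops compute = total step duration minus all other breakdowns."""
    return total_step_us - sum(other_breakdowns_us.values())


# Hypothetical durations for one core in one step, in microseconds.
other_breakdowns_us = {
    "high_flops_compute": 600.0,
    "infeed": 100.0,
    "outfeed": 20.0,
    "all_reduce_sync": 50.0,
    "all_reduce_compute": 30.0,
    "send": 10.0,
    "recv_done": 10.0,
}

low_flops = low_flops_compute_us(1000.0, other_breakdowns_us)
print(f"Low flops compute: {low_flops} us")
```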

Trace viewer

Trace viewer is a Cloud TPU performance analysis tool available under Profile. The tool uses the Chrome trace event profiling viewer so it only works in the Chrome browser.

Trace viewer displays a timeline that shows:

  • Durations for the operations that were executed by your TensorFlow model.
  • Which part of the system (TPU or host machine) executed an operation. Typically, the host machine executes infeed operations, which preprocess training data and transfer it to the TPU, whereas the TPU executes the actual model training.

Trace viewer allows you to identify performance problems in your model, then take steps to resolve them. For example, at a high level, you can identify whether infeed or model training is taking the majority of the time. Drilling down, you can identify which TensorFlow operations are taking the longest to execute.

Note that trace viewer is limited to 1M events per Cloud TPU. If you need to assess more events, use the streaming trace viewer instead.

Trace viewer interface

To open trace viewer, go to TensorBoard, click on the Profile tab at the top of the screen, and choose trace_viewer from the Tools dropdown. The viewer appears displaying your most recent run:

image

This screen contains the following main elements (marked with numbers above):

  1. Runs dropdown. Contains all of the runs for which you've captured trace information. The default view is your most recent run, but you can open the dropdown to select a different run.
  2. Tools dropdown. Selects different profiling tools.
  3. Host dropdown. Selects a host that contains a Cloud TPU set.
  4. Timeline pane. Shows operations that Cloud TPU and the host machine executed over time.
  5. Details pane. Shows additional information for operations selected in the Timeline pane.

Here's a closer look at the timeline pane:

image

The Timeline pane contains the following elements:

  1. Top bar. Contains various auxiliary controls.
  2. Time axis. Shows time relative to the beginning of the trace.
  3. Section and track labels. Each section contains multiple tracks and has a triangle on the left that you can click to expand and collapse the section. There is one section for every processing element in the system.
  4. Tool selector. Contains various tools for interacting with the trace viewer.
  5. Events. These show the time during which an operation was executed or the duration of meta-events, such as training steps.
  6. Vertical tab bar. This does not have a useful purpose for Cloud TPU. The bar is part of the general-purpose trace viewer tool provided by Chrome that is used for a variety of performance analysis tasks.

Sections and tracks

Trace viewer contains the following sections:

  • One section for each TPU node, labeled with the number of the TPU chip and the TPU node within the chip (for example, "Chip 2: TPU Core 1"). Each TPU node section contains the following tracks:
    • Step. Shows the duration of the training steps that were running on the TPU.
    • TensorFlow Ops. Shows TensorFlow operations executed on the TPU.
    • XLA Ops. Shows XLA operations that ran on the TPU. (Each TensorFlow operation is translated into one or several XLA operations. The XLA compiler translates the XLA operations into code that runs on the TPU.)
  • One section for threads running on the host machine's CPU, labeled "Host Threads". The section contains one track for each CPU thread. Note: You can ignore the information displayed alongside the section labels.

Timeline tool selector

You can interact with the timeline view using the timeline tool selector in TensorBoard. You can click on a timeline tool or use the following keyboard shortcuts to activate and highlight a tool. To move the timeline tool selector, click in the dotted area at the top and then drag the selector to where you want it.

Use the timeline tools as follows:

Selection tool
Click on an event to select it or drag to select multiple events. Additional information about the selected event or events (name, start time, and duration) will be displayed in the details pane.

Pan tool
Drag to pan the timeline view horizontally and vertically.

Zoom tool
Drag up to zoom in or drag down to zoom out along the horizontal (time) axis. The horizontal position of the mouse cursor determines the center around which the zoom takes place.

Note: The zoom tool has a known bug where zoom remains active if you release the mouse button while the mouse cursor is outside the timeline view. If this happens to you, just click briefly on the timeline view to stop zooming.

Timing tool
Drag horizontally to mark a time interval. The length of the interval appears on the time axis. To adjust the interval, drag its ends. To clear the interval, click anywhere inside the timeline view.

Note that the interval remains marked if you select one of the other tools.

Events

Events within the timeline are displayed in different colors; the colors themselves have no specific meaning.

Timeline top bar

The top bar of the Timeline pane contains several auxiliary controls:

image

  1. Metadata display. Not used for TPUs.
  2. View Options. Not used for TPUs.
  3. Search box. Enter text to search for all events whose name contains the text. Click the arrow buttons to the right of the search box to move forwards and backwards through the matching events, selecting each event in turn.
  4. Console button. Not used for TPUs.
  5. Help button. Click to display a help summary.

Keyboard shortcuts

Following are the keyboard shortcuts you can use in trace viewer. Click the help button (?) in the top bar to see more keyboard shortcuts.

    w Zoom in
    s Zoom out
    a Pan left
    d Pan right
    f Zoom to selected event(s)
    m Mark time interval for selected event(s)
    1 Activate selection tool
    2 Activate pan tool
    3 Activate zoom tool
    4 Activate timing tool

The f shortcut can be highly useful. Try selecting a step and pressing f to zoom into the step quickly.

Characteristic events

Following are some of the event types that can be very useful when analyzing TPU performance.

image

  • InfeedDequeueTuple. This TensorFlow operation runs on a TPU and receives input data coming from the host. When infeed takes a long time, it can mean that the TensorFlow operations which preprocess the data on the host machine cannot keep up with the TPU data consumption rate. You can see corresponding events in the host traces called InfeedEnqueueTuple. To view a more detailed input-pipeline analysis, use the Input Pipeline Analyzer tool.

  • CrossReplicaSum. This TensorFlow operation runs on a TPU and computes a sum across replicas. Because each replica corresponds to a different TPU node, the operation must wait for all TPU nodes to be finished with a step. If this operation is taking a long time, it might not mean that the summing operation itself is slow but that a TPU node is waiting for another TPU node with a slow data infeed.

image

  • Dataset Ops. Trace viewer visualizes dataset operations performed when data is loaded using the Dataset API. The Iterator::Filter::Batch::ForeverRepeat::Memory event in the example is compiled and corresponds to the dataset.map() operation. Use trace viewer to examine the loading operations as you debug and mitigate input pipeline bottlenecks.

image

  • Prefetch Threads. Using dataset.prefetch() to buffer input data can prevent sporadic slowdowns in file access from creating bottlenecks in the input pipeline.
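
The buffering idea behind prefetching can be illustrated without TensorFlow. The following is a minimal sketch in plain Python, not the tf.data implementation: a background thread fills a bounded queue while the consumer reads from it, so a slow read from the source can overlap with work on the consumer side. The prefetch helper and its buffer_size parameter are hypothetical names used only for this illustration.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the source


def prefetch(source, buffer_size):
    """Yield items from `source`, reading up to `buffer_size` items ahead.

    Hypothetical helper for illustration only; errors raised by `source`
    are not propagated to the consumer in this sketch.
    """
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in source:
            buf.put(item)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()  # blocks only while the buffer is empty
        if item is _DONE:
            return
        yield item


# Items arrive in their original order; only the timing of the reads changes.
batches = list(prefetch(range(5), buffer_size=2))
print(batches)
```

In a real input pipeline the same effect comes from calling dataset.prefetch() on the dataset, as described in the list above.
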

What can go wrong

Here are some potential issues to be aware of when using trace viewer:

  • Event display limit. Trace viewer displays a maximum of 1 million events. If you captured more events, only the earliest 1 million events are displayed; later events are dropped. To capture more TPU events, you can use the --include_dataset_ops=False flag to explicitly require capture_tpu_profile to exclude the dataset ops.
  • Very long events. Events that begin before a capture starts or that end after a capture is finished are not visible in trace viewer. Consequently, very long events can be missed.
  • When to start trace capture. Be sure to start trace capture after you know the Cloud TPU is running. If you start before then, you may see only a few events or no events at all in trace viewer. You can increase the profile time using the --duration_ms flag and you can set automatic retries using the --num_tracing_attempts flag. For example:

      (vm)$ capture_tpu_profile --tpu=$TPU_NAME \
        --logdir=${MODEL_DIR} --duration_ms=60000 --num_tracing_attempts=10
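
If you launch captures from a script, it can help to assemble the command in one place. The following is a small sketch, assuming only the flags mentioned in this section (--tpu, --logdir, --duration_ms, --num_tracing_attempts, and --include_dataset_ops). The capture_command helper, the TPU name my-tpu, and the bucket path gs://my-bucket/model are hypothetical placeholders.

```python
import shlex


def capture_command(tpu, logdir, duration_ms=60000,
                    num_tracing_attempts=10, include_dataset_ops=True):
    """Build an argument list for capture_tpu_profile (hypothetical helper)."""
    args = [
        "capture_tpu_profile",
        f"--tpu={tpu}",
        f"--logdir={logdir}",
        f"--duration_ms={duration_ms}",
        f"--num_tracing_attempts={num_tracing_attempts}",
    ]
    if not include_dataset_ops:
        # Leaves more room for TPU events under the event display limit.
        args.append("--include_dataset_ops=False")
    return args


# "my-tpu" and "gs://my-bucket/model" are placeholder values.
cmd = capture_command("my-tpu", "gs://my-bucket/model",
                      include_dataset_ops=False)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on the VM where capture_tpu_profile is installed.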
        

Memory viewer

Memory viewer allows you to visualize the peak memory usage for your program, and memory usage trends over the program's lifetime.

The memory viewer UI looks like this:

image

  1. Host dropdown. Selects a TPU host and the XLA High Level Optimizer (HLO) modules to visualize.
  2. Memory overview. Displays peak memory allocation and size without padding.
  3. Working space chart. Displays peak memory use and a plot of memory usage trends over the program's lifetime. Hovering over a buffer in one of the buffer charts adds an annotation for the buffer lifetime and the buffer details card.
  4. Buffer charts. Two charts that display buffer allocation at the point of peak memory usage, as indicated by the vertical line in the working space plot. Hovering over a buffer in one of the buffer charts displays the buffer's lifetime bar in the working space chart and a details card on the left.
  5. Buffer allocation details card. Displays allocation details for a buffer.

Memory overview panel

The memory overview (top) panel shows you the module name and the peak memory allocation set when the total buffer allocation size reaches the maximum. The unpadded peak allocation size is also shown for comparison.

image

Working space chart

This chart displays peak memory use and a plot of memory usage trends over the program's lifetime. The line drawn from top to bottom of the plot indicates peak memory utilization for the program. This point determines whether or not a program can fit into the available global memory space.

image

Each point on the overlying line plot represents a "program point" in XLA's HLO program as scheduled by the compiler. The line provides a sense of the spikiness leading to and from the peak usage.

Interaction with buffer chart elements

When you hover over a buffer displayed in one of the buffer charts below the working space chart, a horizontal lifetime line for that buffer appears in the working space chart. The horizontal line is the same color as the highlighted buffer.

image

The thickness of the horizontal line indicates the size of the buffer relative to the peak memory allocation. The line length corresponds to the life of the buffer, starting at the point in the program where buffer space was allocated and ending where the space was freed.

Buffer charts

Two charts show the breakdown of memory usage at the peak usage point (indicated by the vertical line in the plot above the charts).

image

  • By Program Order. Displays the buffers from left to right in the order in which they were active during program execution. Buffers active for the longest time are on the left side of the chart.

  • By Size. Displays the buffers that were active during program execution in descending size order. Buffers that had the largest impact at the point of peak memory usage are on the left.
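
The relationship between the working space chart and the two buffer charts can be sketched with a toy model. This is an illustration only, not XLA's buffer assignment: each buffer is assumed to have an allocation point, a free point, and a size, and "program order" is assumed to mean ordering by allocation point. The Buffer type and all buffer names and sizes are hypothetical.

```python
from typing import NamedTuple


class Buffer(NamedTuple):
    name: str
    start: int  # program point where the buffer is allocated
    end: int    # program point where the buffer is freed (exclusive)
    size: int   # size in bytes, including padding


def usage_per_point(buffers, num_points):
    """Total live buffer size at each program point (the working space plot)."""
    usage = [0] * num_points
    for b in buffers:
        for point in range(b.start, b.end):
            usage[point] += b.size
    return usage


# Hypothetical buffers for a program with six program points.
buffers = [
    Buffer("param", 0, 6, 400),
    Buffer("temp_a", 1, 3, 300),
    Buffer("temp_b", 2, 5, 500),
    Buffer("output", 4, 6, 200),
]

usage = usage_per_point(buffers, num_points=6)
peak_point = usage.index(max(usage))  # the vertical line in the plot

# Buffers live at the peak, arranged as in the two buffer charts.
live = [b for b in buffers if b.start <= peak_point < b.end]
by_program_order = sorted(live, key=lambda b: b.start)
by_size = sorted(live, key=lambda b: b.size, reverse=True)

print(usage, peak_point)
print([b.name for b in by_program_order])
print([b.name for b in by_size])
```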

Buffer allocation details card

When you hover over a buffer displayed in one of the buffer charts, a buffer allocation details card appears (in addition to the lifetime line displayed in the working space chart). A typical details card looks like this:

image

  1. Name. Name of the XLA operation.
  2. Category. Operation category.
  3. Size. Size of the buffer allocation (including padding).
  4. Unpadded size. Size of the buffer allocation without padding.
  5. Expansion. Relative magnitude of padded buffer size versus the unpadded size.
  6. Extra memory. Indicates how much extra memory is used for padding.
  7. Shape. Describes the rank, size, and data type of the N-dimensional array.
  8. TensorFlow op name. Shows the name of the TensorFlow operation associated with the buffer allocation.
  9. Allocation type. Indicates buffer allocation category. Types are: Parameter, Output, Thread-local, and Temporary (for example, buffer allocation within a fusion).
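
The padding fields on the card are related by simple arithmetic. The following sketch assumes that Expansion is the ratio of padded size to unpadded size and that Extra memory is their difference; the tool may round or format these values differently. The padding_stats helper and the byte counts are hypothetical.

```python
def padding_stats(padded_bytes, unpadded_bytes):
    """Return (expansion, extra_bytes) for one buffer allocation.

    Assumes expansion = padded / unpadded and extra = padded - unpadded.
    """
    expansion = padded_bytes / unpadded_bytes
    extra_bytes = padded_bytes - unpadded_bytes
    return expansion, extra_bytes


# Hypothetical buffer: 2048 bytes of data padded to 3072 bytes.
expansion, extra_bytes = padding_stats(3072, 2048)
print(f"expansion: {expansion:.2f}x, extra memory: {extra_bytes} bytes")
```
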
"Out of memory" errors

If you run a model and get an "out of memory" error, use the following command to capture a memory profile and view it in the memory viewer. Make sure to set an appropriate duration_ms so that the profiling period overlaps with your program compilation time. The output can help you understand what caused the error:

  (vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR} --duration_ms=60000
  

Streaming trace viewer

Streaming trace viewer (trace_viewer) is a Cloud TPU performance analysis tool, available for TensorFlow 2.1 or later, that provides dynamic trace renderings. The tool uses the Chrome trace event profiling viewer, so it works only in the Chrome browser.

When you use capture_tpu_profile 2.3.0 to capture a profile, a .tracetable file is saved to your Google Cloud storage bucket. The file contains a large number of trace events that can be viewed in both trace viewer and streaming trace viewer.

Using streaming trace viewer

To use the streaming trace viewer, trace_viewer, you must shut down your existing TensorBoard session and then relaunch TensorBoard using the IP address of the TPU you want to examine. Streaming trace viewer requires TensorBoard to make a gRPC (Google Remote Procedure Call) request to an IP address for the Cloud TPU. The gRPC channel is not encrypted.

You can find the IP address for a Cloud TPU host on the Cloud TPU page. Find your Cloud TPU and look in the Internal IP column for the IP address.

In your VM, run TensorBoard as follows, replacing tpu-ip with your TPU's IP address:

  (vm)$ tensorboard --logdir=${MODEL_DIR} \
    --master_tpu_unsecure_channel=tpu-ip

In TensorBoard, the trace_viewer tool appears in the Tools dropdown list.

image

In the timeline, you can zoom in and out to see trace events load dynamically into your browser.

image

Monitoring your Cloud TPU job

This section describes how to use capture_tpu_profile to capture a single profile or continuously monitor your Cloud TPU job on the command-line interface in real time. By setting the --monitoring_level option to 0 (the default), 1, or 2, you get a single profile, basic monitoring, or detailed monitoring, respectively.

Open a new Cloud Shell and ssh to your VM (replace vm-name in the command with your VM name):

  (vm)$ gcloud compute ssh vm-name \
  --ssh-flag=-L6006:localhost:6006

In the new Cloud Shell, run capture_tpu_profile with the --monitoring_level flag set to either 1 or 2, such as:

  (vm)$ capture_tpu_profile --tpu=$TPU_NAME \
   --monitoring_level=1

Setting monitoring_level=1 produces output similar to the following:

    TPU type: TPU v2
    Utilization of TPU Matrix Units is (higher is better): 10.7%

Setting monitoring_level=2 displays more detailed information:

    TPU type: TPU v2
    Number of TPU Cores: 8
    TPU idle time (lower is better): 0.091%
    Utilization of TPU Matrix Units is (higher is better): 10.7%
    Step time: 1.95 kms (avg), 1.90 kms (min), 2.00 kms (max)
    Infeed percentage: 87.5% (avg), 87.2% (min), 87.8% (max)
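
If you want to log these values over time, the monitoring output can be parsed with a few regular expressions. The following is a rough sketch that assumes the line format shown in the sample output above; the actual output may differ between tool versions. The parse_monitoring helper and its dictionary keys are hypothetical.

```python
import re

# Patterns assume the line format shown in the sample output above.
_IDLE = re.compile(r"TPU idle time.*?:\s*([0-9.]+)%")
_MXU = re.compile(r"Utilization of TPU Matrix Units.*?:\s*([0-9.]+)%")


def parse_monitoring(text):
    """Extract idle time and matrix unit utilization as floats.

    A value is None when the corresponding line is absent, for example
    the idle time line at monitoring level 1.
    """
    def find(pattern):
        match = pattern.search(text)
        return float(match.group(1)) if match else None

    return {"idle_pct": find(_IDLE), "mxu_utilization_pct": find(_MXU)}


sample = (
    "TPU type: TPU v2\n"
    "Number of TPU Cores: 8\n"
    "TPU idle time (lower is better): 0.091%\n"
    "Utilization of TPU Matrix Units is (higher is better): 10.7%\n"
)
metrics = parse_monitoring(sample)
print(metrics)
```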

Monitoring flags

  • --tpu (required) specifies the name of the Cloud TPU you want to monitor.
  • --monitoring_level changes the behavior of capture_tpu_profile from producing a single profile to basic or detailed continuous monitoring. There are three available levels:
      • Level 0 (the default): Produces a single profile, then exits.
      • Level 1: Shows TPU version and TPU utilization.
      • Level 2: Shows the TPU utilization, TPU idle time, and number of TPU cores used. Also provides min, avg, and max step times along with the infeed percentage contribution.
  • --duration_ms (optional, default is 1000ms) specifies how long to profile the TPU host during each cycle. Generally, this should be long enough to capture at least one training step worth of data. One second captures a training step in most models, but if your model's step time is very large, you can set the value to 2x step_time (in ms).
  • --num_queries specifies how many cycles to run capture_tpu_profile. To continuously monitor your TPU job, set the value to a high number. To quickly check your model's step time, set the value to a low number.
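
The flag guidance above can be turned into a small helper. The following is a sketch under stated assumptions: it uses twice the step time for --duration_ms but never goes below the 1000 ms default, which is one possible reading of the guidance rather than a documented rule. The monitoring_command helper, its default of 100 queries, and the TPU name my-tpu are hypothetical.

```python
def monitoring_command(tpu, step_time_ms, level=1, num_queries=100):
    """Build a capture_tpu_profile argument list for continuous monitoring."""
    if level not in (1, 2):
        raise ValueError("continuous monitoring uses level 1 or 2")
    # Assumption: profile for 2x the step time, but at least the 1000 ms default.
    duration_ms = max(1000, 2 * step_time_ms)
    return [
        "capture_tpu_profile",
        f"--tpu={tpu}",
        f"--monitoring_level={level}",
        f"--duration_ms={duration_ms}",
        f"--num_queries={num_queries}",
    ]


# A model with a 1950 ms step gets a 3900 ms profiling window per cycle.
print(" ".join(monitoring_command("my-tpu", 1950, level=2, num_queries=5)))
```

A low num_queries value suits a quick step-time check, while a high value keeps the monitor running for the length of the job.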