Setting up TensorBoard

This document describes how to set up and run TensorBoard for visualizing and analyzing program performance on Cloud TPU.

Overview

TensorBoard offers a suite of tools designed to present TensorFlow data visually. When used for monitoring, TensorBoard can help identify bottlenecks in processing and suggest ways to improve performance.

Prerequisites

The following instructions assume you have already set up your Cloud TPU in Cloud Shell and are ready to run your training application.

If you don't have a model ready to train, you can get started with the MNIST tutorial.

Install the Cloud TPU profiler

Install the current version of cloud-tpu-profiler (1.15.0rc1) on the VM where you are running your model. Installing the package creates the capture_tpu_profile script.
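The package is distributed through pip; a typical installation on the VM, pinned to the version named above, looks like the following:

```shell
# Install the Cloud TPU profiler package on the VM.
# Installing it makes the capture_tpu_profile script available.
(vm)$ pip install --upgrade "cloud-tpu-profiler==1.15.0rc1"
```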

Run TensorBoard

When you ran ctpu up to create your Compute Engine VM and Cloud TPU, the tool automatically set up port forwarding for the Cloud Shell environment to make TensorBoard available. You need to run TensorBoard in a new Cloud Shell, not the shell that's running your training application.

Follow these steps to run TensorBoard in a separate Cloud Shell:

  1. Open a second Cloud Shell to capture profiling data and to start TensorBoard.

  2. In the second Cloud Shell, run ctpu up to set some needed environment variables on the new shell:

    $ ctpu up --name=tpu-name --zone=your-zone
    

    Note: the --zone argument is required for ctpu up to find your Compute Engine VM.

    This should return output similar to the following:

    2018/08/02 12:53:12 VM already running.
    2018/08/02 12:53:12 TPU already running.
    About to ssh (with port forwarding enabled -- see docs for details)...
    

  3. In the second Cloud Shell, create environment variables for your Cloud Storage bucket and model directory. The model directory variable (MODEL_DIR) contains the name of the Cloud Storage directory where checkpoints, summaries, and TensorBoard output are stored during model training. For example, MODEL_DIR=${STORAGE_BUCKET}/model.

    (vm)$ export STORAGE_BUCKET=gs://your-bucket-name
    (vm)$ export MODEL_DIR=${STORAGE_BUCKET}/model-directory
    

Run the model, capture monitoring output, and display it in TensorBoard

There are two ways to view TensorBoard trace information: the static trace viewer and the streaming trace viewer. The static trace viewer is limited to 1 million events per Cloud TPU. If you need access to more events, use the streaming trace viewer. Both setups are shown below.

  1. In the first Cloud Shell, run your TensorFlow model training application. For example, if you're using the MNIST model, run mnist_tpu.py as described in the MNIST tutorial.
  2. Select the type of trace viewer you want to use: static trace viewer, or streaming trace viewer.
  3. Perform one of the following procedures:

    Static trace viewer

    1. In the second Cloud Shell, run the following TensorBoard command:

      (vm)$ tensorboard --logdir=${MODEL_DIR} &

    2. On the bar at the top right-hand side of the Cloud Shell, click the Web preview button and open port 8080 to view the TensorBoard output. The TensorBoard UI will appear as a tab in your browser.
    3. Do one of the following to capture the profile:
    • If you are running TensorBoard 1.15 or greater, click the PROFILE link at the top of the TensorBoard UI, then click the CAPTURE PROFILE button at the top of the TensorBoard window. A detail menu appears where you can specify how to capture the TPU output: by IP address or by TPU name.

      Enter the IP address or the TPU name to start capturing trace data, which is then displayed in TensorBoard. See the Cloud TPU tools guide for more information about changing the defaults for the Profiling Duration and Trace dataset ops values.

    • To capture a profile from the command line instead of using the CAPTURE PROFILE button, in the second Cloud Shell, run the following command:
      (vm)$ capture_tpu_profile --tpu=tpu-name --logdir=${MODEL_DIR}
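If the default capture window is too short for the behavior you want to inspect, capture_tpu_profile accepts a duration flag. A sketch, assuming the --duration_ms flag available in recent cloud-tpu-profiler releases:

```shell
# Capture roughly 10 seconds of trace data instead of the default duration.
# --duration_ms is assumed here; check capture_tpu_profile --help for your version.
(vm)$ capture_tpu_profile --tpu=tpu-name --logdir=${MODEL_DIR} --duration_ms=10000
```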
      

    Streaming trace viewer

    For streaming trace viewer, copy the IP address of your TPU host from the Google Cloud Console before running the TensorBoard command.

    1. In the Cloud Console navigation sidebar, select Compute Engine > TPUs and copy the Internal IP address for your Cloud TPU. This is the value to specify for the --master_tpu_unsecure_channel flag in the TensorBoard command.
    2. Run the following TensorBoard command:

      (vm)$ tensorboard --logdir=${MODEL_DIR} --master_tpu_unsecure_channel=tpu-ip-address &

    3. On the bar at the top right-hand side of the Cloud Shell, click the Web preview button and open port 8080 to view the TensorBoard output. The TensorBoard UI will appear as a tab in your browser.
    4. To capture streaming trace viewer output, in the second Cloud Shell, run the following capture_tpu_profile command:

      (vm)$ capture_tpu_profile --tpu=tpu-name --logdir=${MODEL_DIR}
      

      This starts capturing profile data and displays it in TensorBoard.
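To confirm that a capture actually wrote profile data to your model directory, you can list its contents with gsutil. The plugins/profile subdirectory used below is where the TensorBoard profiler conventionally stores trace files; treat the exact layout as an assumption that may vary by version:

```shell
# Recursively list profiler output written under the model directory.
# Each capture appears as a timestamped run under plugins/profile.
(vm)$ gsutil ls -r ${MODEL_DIR}/plugins/profile
```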

What's next