This document describes how to set up and run TensorBoard for visualizing and analyzing program performance on Cloud TPU.
Overview
TensorBoard offers a suite of tools designed to present TensorFlow data visually. When used for monitoring, TensorBoard can help identify bottlenecks in processing and suggest ways to improve performance.
Prerequisites
The following instructions assume you have already set up your Cloud TPU in Cloud Shell and are ready to run your training application.
If you don't have a model ready to train, you can get started with the MNIST tutorial.
Install the Cloud TPU profiler
Install the current version of cloud-tpu-profiler 2.3.0 on the VM where you are running your model. Installing the package creates the capture_tpu_profile script.
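The installation command itself isn't shown above; assuming pip is available on the VM, installing the pinned package usually looks like this:

```shell
# Install the Cloud TPU profiler package on the VM; this makes the
# capture_tpu_profile script available for the profiling steps below.
(vm)$ pip install --upgrade "cloud-tpu-profiler==2.3.0"
```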
Run TensorBoard
When you ran ctpu up
to create your Compute Engine VM and
Cloud TPU, the tool automatically set up port forwarding for the
Cloud Shell environment to make TensorBoard available. You need to run
TensorBoard in a new Cloud Shell, not in the shell that's running your
training application.
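As an aside, the forwarding that ctpu sets up can be approximated manually with an SSH tunnel. This is a hypothetical sketch, with the VM name and zone as placeholders, not a command from the original guide:

```shell
# Hypothetical: open an SSH tunnel so local port 8080 reaches the VM,
# which is where the Cloud Shell Web preview expects to find TensorBoard.
$ gcloud compute ssh your-vm-name --zone=your-zone -- -L 8080:localhost:8080
```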
Follow these steps to run TensorBoard in a separate Cloud Shell:
Open a second Cloud Shell to capture profiling data and to start TensorBoard.
In the second Cloud Shell, run ctpu up to set the needed environment variables in the new shell:

$ ctpu up --name=tpu-name --zone=your-zone

Note: the --zone argument is necessary for ctpu up to correctly find your Compute Engine VM.

This should return output similar to the following:

2018/08/02 12:53:12 VM already running.
2018/08/02 12:53:12 TPU already running.
About to ssh (with port forwarding enabled -- see docs for details)...
In the second Cloud Shell, create environment variables for your Cloud Storage bucket and model directory. The model directory variable (MODEL_DIR) contains the name of the Cloud Storage directory where checkpoints, summaries, and TensorBoard output are stored during model training, for example MODEL_DIR=${STORAGE_BUCKET}/model.

(vm)$ export STORAGE_BUCKET=gs://your-bucket-name
(vm)$ export MODEL_DIR=${STORAGE_BUCKET}/model-directory
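As an optional sanity check (assuming the gsutil CLI is installed on the VM, as it is on most Cloud TPU VM images), you can confirm the bucket is reachable before training:

```shell
# Print the variables and list the bucket contents; gsutil exits with
# an error if the bucket doesn't exist or isn't accessible.
(vm)$ echo ${STORAGE_BUCKET} ${MODEL_DIR}
(vm)$ gsutil ls ${STORAGE_BUCKET}
```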
Run the model, capture monitoring output, and display it in TensorBoard
There are two ways to view TensorBoard trace information: the static trace viewer and the streaming trace viewer. The static trace viewer is limited to 1 million events per Cloud TPU; if you need access to more events, use the streaming trace viewer. Both setups are shown below.
- In the first Cloud Shell, run your TensorFlow model training application. For example, if you're using the MNIST model, run mnist_tpu.py as described in the MNIST tutorial.
- Select the type of trace viewer you want to use: static trace viewer or streaming trace viewer.
- Perform one of the following procedures:

Static trace viewer

- In the second Cloud Shell, run the following TensorBoard command:

(vm)$ tensorboard --logdir=${MODEL_DIR} &

- On the bar at the top right-hand side of the Cloud Shell, click the Web preview button and open port 8080 to view the TensorBoard output. The TensorBoard UI appears as a tab in your browser.

Streaming trace viewer

- In the Cloud Console navigation sidebar, select Compute Engine > TPUs and copy the Internal IP address for your Cloud TPU. This is the value you specify for --master_tpu_unsecure_channel in the TensorBoard command.
- In the second Cloud Shell, run the following TensorBoard command:

(vm)$ tensorboard --logdir=${MODEL_DIR} --master_tpu_unsecure_channel=tpu-ip-address &

- On the bar at the top right-hand side of the Cloud Shell, click the Web preview button and open port 8080 to view the TensorBoard output. The TensorBoard UI appears as a tab in your browser.

Do one of the following to capture the profile:

- If you are running TensorBoard 1.15 or greater, click the PROFILE link at the top of the TensorBoard UI, then click the CAPTURE PROFILE button at the top of the TensorBoard window. A dialog appears where you can specify how to capture the TPU output: by IP address or by TPU name. Enter the IP address or the TPU name to start capturing trace data, which is then displayed in TensorBoard. See the Cloud TPU tools guide for more information about changing the defaults for the Profiling Duration and Trace dataset ops values.
- To capture a profile from the command line instead of using the CAPTURE PROFILE button, run the following command in the second Cloud Shell:

(vm)$ capture_tpu_profile --tpu=tpu-name --logdir=${MODEL_DIR}

This starts capturing profile data and displaying it in TensorBoard.
What's next
- Explore the Cloud TPU profiling tools in TensorBoard.