Monitoring Cloud TPU Nodes

This guide explains how to use Google Cloud Monitoring to monitor your Cloud TPU Nodes. Google Cloud Monitoring automatically collects metrics and logs from your Cloud TPU and its host Compute Engine VM. You can use this data to monitor the health of both.

Metrics track a numerical quantity over time, for example CPU utilization, network usage, or MXU utilization. Logs capture events at a specific point in time. Log entries are written by your own code, Google Cloud services, third-party applications, and the Google Cloud infrastructure. You can generate metrics from the data in a log entry by creating a log-based metric, and you can set alert policies based on metric values or log entries.

This guide discusses Google Cloud Monitoring and shows you how to:

View Cloud TPU metrics
Set up Cloud TPU metrics alert policies
Query Cloud TPU logs
Create log-based metrics for setting up alerts and visualizing dashboards

Prerequisites

This document assumes some basic knowledge of Google Cloud Monitoring. You must create a Compute Engine VM and Cloud TPU resources before you can generate and work with Google Cloud Monitoring data. See the Cloud TPU Quickstart for more details.

Metrics

Google Cloud metrics are automatically generated by Compute Engine VMs and the Cloud TPU runtime. The following metrics are generated by Cloud TPU Nodes:

cpu/utilization
memory/usage
network/received_bytes_count
network/sent_bytes_count
tpu/mxu/utilization
tpu/tensorcore/idle_duration

CPU utilization

The cpu/utilization metric tracks the current CPU utilization on the Cloud TPU worker, represented as a percentage. Values are typically between 0.0 and 100.0, but might exceed 100.0. This metric is sampled every 60 seconds. It may take up to 180 seconds between the time a value is generated and when it's displayed.

Memory usage

The memory/usage metric tracks the memory currently being used by the Cloud TPU VM in bytes. This metric is sampled every 60 seconds. It may take up to 180 seconds between the time a value is generated and when it's displayed.

Network received bytes count

The network/received_bytes_count metric tracks the number of cumulative bytes of data the Cloud TPU VM received over the network at a point in time. It may take up to 180 seconds between the time a value is generated and when it's displayed.

Network sent bytes count

The network/sent_bytes_count metric tracks the number of cumulative bytes the Cloud TPU VM sent over the network at a point in time. It may take up to 180 seconds between the time a value is generated and when it's displayed.
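Because both network metrics are described as cumulative byte counts, a throughput figure comes from the difference between consecutive samples. The following is a minimal sketch of that arithmetic, assuming samples are available as (timestamp, cumulative bytes) pairs. The function name, the reset handling, and the sample values are illustrative and are not part of the Cloud Monitoring API.

```python
from typing import List, Tuple


def bytes_per_second(samples: List[Tuple[float, int]]) -> List[float]:
    """Convert cumulative byte counts into per-second rates.

    samples: (timestamp_in_seconds, cumulative_bytes) pairs sorted by time.
    Returns one rate per consecutive pair of samples.
    """
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            raise ValueError("timestamps must be strictly increasing")
        delta = b1 - b0
        if delta < 0:
            # Assumption: a decreasing counter means it was reset
            # (for example after a restart), so count from zero again.
            delta = b1
        rates.append(delta / elapsed)
    return rates


# Hypothetical samples taken 60 seconds apart.
received = [(0, 1_000_000), (60, 1_600_000), (120, 1_900_000)]
print(bytes_per_second(received))  # [10000.0, 5000.0]
```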

TensorCore idle duration

The tpu/tensorcore/idle_duration metric tracks the number of seconds each TPU chip's TensorCore has been idle. This metric is available for each chip on all TPUs in use. If a TensorCore is in use, the idle duration value is reset to zero. When the TensorCore is no longer in use, the idle duration value starts to increase.

The following graph shows the tpu/tensorcore/idle_duration metric for a v2-8 Cloud TPU VM which has one worker. Each worker has four chips. In this example, all four chips have the same values for tpu/tensorcore/idle_duration, so the graphs are superimposed on each other.
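The reset-and-accumulate behavior described above can be modeled in a few lines. This is only a sketch of the semantics for a single chip, not the runtime's implementation. The 60-second sampling interval is an assumption borrowed from the other metrics, and the function name is hypothetical.

```python
from typing import Iterable, List


def idle_durations(busy_flags: Iterable[bool], interval_s: int = 60) -> List[int]:
    """Model idle_duration for one chip sampled every interval_s seconds.

    busy_flags: True where the TensorCore was in use at a sample.
    """
    durations = []
    idle = 0
    for busy in busy_flags:
        # In use: reset to zero. Idle: keep accumulating.
        idle = 0 if busy else idle + interval_s
        durations.append(idle)
    return durations


# Idle for two samples, busy for two, then idle for three.
print(idle_durations([False, False, True, True, False, False, False]))
# [60, 120, 0, 0, 60, 120, 180]
```

A chart of this output has the sawtooth shape you would expect from the metric: a flat line at zero while the chip is busy and a steady climb while it is idle.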

MXU utilization

The tpu/mxu/utilization metric tracks the current MXU utilization on the TPU worker, represented as a percentage. Values are typically between 0.0 and 100.0. This metric is sampled every 60 seconds. It may take up to 180 seconds between the time a value is generated and when it's displayed.

For a complete list of metrics generated by Cloud TPU, see Cloud TPU metrics.

Viewing metrics

You can view metrics using the Metrics Explorer in the Google Cloud console.

In the Metrics Explorer, click SELECT A METRIC and search for Cloud TPU Worker. If Show only active resources and metrics is on, only metrics that are currently being generated will be displayed. Click Cloud TPU Worker to display the available metrics.

You can also access metrics using curl HTTP calls:

Use the Try it! button in the projects.timeSeries.query documentation to retrieve the value for a metric within the specified timeframe.

Fill in the name in the following format: projects/{project-name}
Add a query to the Request body section. The following is a sample query for retrieving the idle duration metric for the specified zone for the last 5 minutes: "fetch tpu_worker | filter zone = 'us-central2-b' | metric tpu.googleapis.com/tpu/tensorcore/idle_duration | within 5m"
Click Execute to retrieve the results of the HTTP POST message

The Monitoring Query Language reference document has more information on how to customize this query.
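If you prefer to script the request rather than use the Try it! button, the same query can be assembled as follows. This is a sketch that only builds the URL and JSON body for projects.timeSeries.query. It does not send the request, and authentication is omitted. The project name my-project and the helper name are placeholders.

```python
import json


def build_timeseries_query(project: str, zone: str, window: str = "5m") -> dict:
    """Assemble the URL and JSON body for a projects.timeSeries.query call."""
    mql = (
        "fetch tpu_worker"
        f" | filter zone = '{zone}'"
        " | metric tpu.googleapis.com/tpu/tensorcore/idle_duration"
        f" | within {window}"
    )
    return {
        "url": f"https://monitoring.googleapis.com/v3/projects/{project}/timeSeries:query",
        "body": json.dumps({"query": mql}),
    }


# Placeholder project; the zone matches the sample query above.
request = build_timeseries_query("my-project", "us-central2-b")
print(request["url"])
print(request["body"])
```

The body would be sent as an HTTP POST with an OAuth 2.0 access token in the Authorization header.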

You can create alert policies that tell Google Cloud Monitoring to send an alert when a condition is met.

Creating alerts

The steps in this section show an example of how to add an alert policy for the TensorCore Idle Duration metric. Whenever this metric exceeds 24 hours, Cloud Monitoring sends an email to the registered email address.

Go to the Monitoring console
In the navigation pane click Alerting
Click EDIT NOTIFICATION CHANNELS
Under Email, click ADD NEW
Type an email address, a display name, and click SAVE
Click CREATE POLICY
Click SELECT A METRIC, select Tensorcore Idle Duration and click APPLY
Click NEXT and then Threshold
For Alert trigger, select Any time series violates
For Threshold Position, select Above threshold
For Threshold Value, type 86400 (24 hours expressed in seconds, the unit in which this metric is reported)
Click NEXT
Under Notification Channels select your email notification channel and click OK
Type a name for the alert policy
Click NEXT and then CREATE POLICY

When the TensorCore Idle Duration goes over 24 hours, an email is sent to the email address you specified.
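The condition configured above can be read as a simple comparison across all time series. The sketch below mirrors the Any time series violates and Above threshold choices. It is a simplification that ignores evaluation windows, and the chip names, values, and threshold are made up for illustration.

```python
from typing import Dict, List


def violating_series(latest_by_series: Dict[str, float], threshold: float) -> List[str]:
    """Return the series whose latest value is above the threshold.

    With "Any time series violates", the condition is met as soon as
    this list is non-empty.
    """
    return sorted(name for name, value in latest_by_series.items() if value > threshold)


# Hypothetical latest idle_duration values per chip, in the metric's units.
latest = {"chip-0": 120.0, "chip-1": 5000.0, "chip-2": 0.0, "chip-3": 1000.0}
offenders = violating_series(latest, threshold=1000.0)
print(offenders)        # ['chip-1']
print(bool(offenders))  # True
```

A value exactly equal to the threshold does not count as above it, which is why chip-3 is not reported.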

Logging

Log entries are written by Google Cloud services, third-party services, ML frameworks, or your code. You can view logs using the Logs Explorer or the Logs API. For more information about Google Cloud logging, see Google Cloud Logging.

In the Logs Explorer, you can select a resource type:

Cloud TPU Worker -> Zone -> Node ID
Audited Resource -> Cloud TPU -> API (google.cloud.tpu.v1.Tpu.CreateNode, google.cloud.tpu.v1.Tpu.DeleteNode, google.cloud.tpu.v1.Tpu.UpdateNode)

Cloud TPU Worker logs contain information about a specific Cloud TPU worker in a specific zone, for example the amount of memory available on the Cloud TPU worker (system_available_memory_GiB).

Audited Resource logs contain information about when a specific Cloud TPU API was called and who made the call, for example calls to CreateNode, UpdateNode, and DeleteNode.

ML frameworks can generate logs to stdout and stderr. This logging is controlled by environment variables that your training script reads.

Your code can write logs to Google Cloud Logging. For more information, see Write standard logs and Write structured logs.
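One common way to produce structured logs from a training script is to emit one JSON object per line. The following sketch uses only the Python standard library. The severity and message keys follow the field names typically used for structured logs. Whether lines written to stdout are collected into Google Cloud Logging depends on your environment and is an assumption here. The class name, helper name, and extra fields are illustrative.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Extra fields passed via extra={"fields": {...}} are merged in.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger("training")
log.info("step finished", extra={"fields": {"step": 100, "loss": 0.25}})
```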

Viewing Cloud TPU logs

Go to the Google Cloud Logs Explorer
Click the Resource drop-down
Click Cloud TPU Worker
Select a zone
Select the Cloud TPU you're interested in
Click Apply. Logs are displayed in the query results

To view Audited Resource logs:

Go to the Google Cloud Logs Explorer
Click the Resource drop-down
Click Audited Resource and then Cloud TPU
Choose the Cloud TPU API you're interested in. The relevant API names begin with google.cloud.tpu.v1.Tpu
Click Apply. Logs are displayed in the query results

Query Google Cloud Logs

When you view logs in the Google Cloud console, the page performs a default query. You can view the query by selecting the Show query toggle switch. You can modify the default query or create a new one. For more information, see Build Queries in the Logs Explorer.

Understanding the log output for Audited Resource logs

Click any log entry to expand it, and you will find a field called protoPayload. Expand protoPayload and you will see a number of subfields:

logName: the name of the log
protoPayload -> @type: the type of the log
resourceName: the name of your Cloud TPU
methodName: the name of the method called (audit logs only)
request -> @type: the request type
request -> node: details about the Cloud TPU node
request -> node_id: the name of the TPU
severity: the severity of the log
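If you export audit log entries as JSON, the fields listed above can be pulled out with a small helper. This sketch assumes the usual log entry layout, in which logName and severity sit at the top level and the remaining fields sit under protoPayload. The sample entry is hypothetical and heavily trimmed.

```python
from typing import Any, Dict


def summarize_audit_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the fields listed above out of one Audited Resource log entry."""
    payload = entry.get("protoPayload", {})
    request = payload.get("request", {})
    return {
        "logName": entry.get("logName"),
        "severity": entry.get("severity"),
        "method": payload.get("methodName"),
        "resource": payload.get("resourceName"),
        "node_id": request.get("node_id"),
    }


# Hypothetical entry; real entries carry many more fields.
entry = {
    "logName": "projects/my-project/logs/cloudaudit.googleapis.com%2Factivity",
    "severity": "NOTICE",
    "protoPayload": {
        "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
        "methodName": "google.cloud.tpu.v1.Tpu.CreateNode",
        "resourceName": "projects/my-project/locations/us-central2-b/nodes/my-tpu",
        "request": {"node_id": "my-tpu"},
    },
}
print(summarize_audit_entry(entry))
```

Missing fields come back as None rather than raising an error, which is convenient when entries from different methods carry different request contents.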

Understanding the log output for Cloud TPU Worker logs

Click any log entry to expand it, and you will find a field called jsonPayload. Expand jsonPayload and you will see a number of subfields:

accelerator_type: the accelerator type
consumer_project: the project where the Cloud TPU lives
evententry_timestamp: the time when the log was generated
system_available_memory_GiB: the available memory on the Cloud TPU worker (0 to 350 GiB)

Creating log-based metrics

This section describes how to create log-based metrics used for setting up monitoring dashboards and alerts. For information about programmatically creating log-based metrics, see Creating log-based metrics programmatically using the Cloud Logging REST API.

The following example uses the system_available_memory_GiB subfield to demonstrate how to create a log-based metric for monitoring Cloud TPU worker available memory.

Navigate to the Logs Explorer

In the query box, enter the following query to extract all log entries that have system_available_memory_GiB defined for the primary Cloud TPU worker:

resource.type=tpu_worker
resource.labels.project_id=your-project
resource.labels.zone=your-tpu-zone
resource.labels.node_id=your-tpu-name
resource.labels.worker_id=0
logName=projects/your-project/logs/tpu.googleapis.com%2Fruntime_monitor
jsonPayload.system_available_memory_GiB:*

Click Create metric to display the Metric Editor
Under Metric Type, choose Distribution
Type a name, optional description, and unit of measurement for your metric. Enter "matrix_unit_utilization_percent" and "MXU utilization" in the Name and Description fields, respectively
The filter is pre-populated with the query that you entered in the Logs Explorer
Click CREATE METRIC
Click Explore Metrics to view your new metric. It may take a few minutes before your metrics are displayed
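The query used in the steps above can also be generated from parameters, which helps when you monitor several TPUs. This is a minimal sketch; the helper name and the argument values are placeholders to replace with your own.

```python
def tpu_memory_filter(project: str, zone: str, node_id: str, worker_id: int = 0) -> str:
    """Build the Logs Explorer query used in the steps above."""
    lines = [
        "resource.type=tpu_worker",
        f"resource.labels.project_id={project}",
        f"resource.labels.zone={zone}",
        f"resource.labels.node_id={node_id}",
        f"resource.labels.worker_id={worker_id}",
        f"logName=projects/{project}/logs/tpu.googleapis.com%2Fruntime_monitor",
        "jsonPayload.system_available_memory_GiB:*",
    ]
    # One comparison per line, matching the layout of the query above.
    return "\n".join(lines)


print(tpu_memory_filter("my-project", "us-central2-b", "my-tpu"))
```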

Creating log-based metrics programmatically using the Cloud Logging REST API

You can also create log-based metrics through the Cloud Logging API. For more information, see Creating a distribution metric.
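As a rough illustration, a request body for a distribution metric on the available memory field might be assembled as follows. This sketch only builds the JSON. It would be sent as an authenticated POST to https://logging.googleapis.com/v2/projects/{project}/metrics. The metric name, description, filter, and bucket layout are assumptions chosen for this example, so check the API reference before relying on them.

```python
import json


def distribution_metric_body(name: str, description: str, log_filter: str, field: str) -> dict:
    """Assemble a LogMetric request body for a distribution metric."""
    return {
        "name": name,
        "description": description,
        "filter": log_filter,
        # Pull the numeric value for the distribution out of each entry.
        "valueExtractor": f"EXTRACT({field})",
        "metricDescriptor": {"metricKind": "DELTA", "valueType": "DISTRIBUTION"},
        # Illustrative buckets: 35 finite buckets of width 10 starting at 0.
        "bucketOptions": {
            "linearBuckets": {"numFiniteBuckets": 35, "width": 10, "offset": 0}
        },
    }


body = distribution_metric_body(
    name="tpu_available_memory",  # hypothetical metric name
    description="Available memory on the Cloud TPU worker",
    log_filter="resource.type=tpu_worker\njsonPayload.system_available_memory_GiB:*",
    field="jsonPayload.system_available_memory_GiB",
)
print(json.dumps(body, indent=2))
```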

Creating dashboards and alerts using log-based metrics

Dashboards are useful for visualizing metrics (expect ~2 minute delay); alerts are helpful for sending notifications when errors occur. For more information, see Manage custom dashboards and Create metric-based alert policies.

Creating dashboards

To create a dashboard in Cloud Monitoring for the Tensorcore idle duration metric:

Go to the Monitoring console
In the navigation pane, click Dashboards
Click CREATE DASHBOARD and then Add Chart
Choose the chart type that you want to add. For this example, choose Line
Type a title for the dashboard
Click the button underneath Resource & Metric
Scroll down the list of resources/metrics and select Cloud TPU Worker -> Tpu -> Tensorcore idle duration
Click Apply
To filter the dashboard contents, click CREATE DASHBOARD FILTERS
In the Label field, set project_id to your project
Click ADD and set zone to the zone where you created your TPU
Add another filter for node_id and specify your Cloud TPU name