Cloud TPU Monitoring with Stackdriver
This guide explains how to use Stackdriver to monitor your Cloud TPU. Your Cloud TPU automatically collects logs and metrics of the Cloud TPU runtime binary (for example, Cloud TPU runtime CPU usage and MXU utilization) and saves them in Stackdriver. The collected logs and metrics can be used to debug resource usage problems. You can set alerts for when TPU utilization drops below a rolling average to catch performance problems as they happen. Alerts can also be set to notify you when training halts.
This guide familiarizes you with Stackdriver logs and shows you how to:
Query these logs
Create log-based metrics for setting up alerts and visualizing dashboards
Prerequisites
This document assumes some basic knowledge of Stackdriver logging. You must have a Compute Engine VM and Cloud TPU resources created before you can begin generating and working with logs. See the quickstart for more details.
Do not follow the clean up instructions in the quickstart for your framework until you have finished running your model and no longer need the resources. Once you are done, run the clean up step to avoid incurring unwanted charges.
Logging
Stackdriver logging is automatically performed by Cloud TPU and can incur charges. For more information about logging charges, see Logging Charges.
Stackdriver logging scales linearly as you add Cloud TPU Pods. You can reduce the number of logs ingested, or disable Stackdriver logging, by excluding logs. For more information, see Logs exclusions.
Locating the monitoring logs in Stackdriver
The monitoring logs discussed in this guide are present in a special log entity called runtime_monitor. To locate them:
Go to the Google Cloud operations suite Logging > Logs (Logs Explorer) page in the Google Cloud console.
Select an existing Google Cloud project at the top of the page.
The Audited Resource selector menu lets you choose resources, logs, and log level severities to display. Click the Audited Resource selector menu, scroll down and hover over TPU Worker. Select the zone and then the name (node_id) of the Cloud TPU for which you want to see logs.
In the log drop-down menu, select runtime_monitor. Click the OK button.
Using Stackdriver advanced query
You can use Stackdriver advanced queries to quickly identify your requested monitoring logs.
To use advanced logs queries in the Logs Explorer:
Go to the Google Cloud operations suite Logging > Logs (Logs Explorer) page in the Google Cloud console.
Select an existing Google Cloud project at the top of the page, or create a new project.
In the search query box, click the drop-down menu arrow and select Convert to advanced filter.
In the advanced logs query box, enter the following script, then click the Submit Filter button:
resource.type=tpu_worker
resource.labels.project_id=your-project
resource.labels.zone=your-tpu-zone
resource.labels.node_id=your-tpu-name
logName=projects/your-project/logs/tpu.googleapis.com%2Fruntime_monitor
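If you run this query for several Cloud TPUs, it can help to generate the filter text rather than edit it by hand. The following is a minimal sketch, not an official helper: the function name build_runtime_monitor_filter is hypothetical, and the clauses are joined with an explicit AND, the same form used by the metric filters later in this guide.

```python
def build_runtime_monitor_filter(project_id: str, zone: str, node_id: str) -> str:
    """Return a query that selects the runtime_monitor log of one Cloud TPU.

    Hypothetical helper: it only assembles text and does not call any API.
    """
    log_name = f"projects/{project_id}/logs/tpu.googleapis.com%2Fruntime_monitor"
    clauses = [
        "resource.type=tpu_worker",
        f"resource.labels.project_id={project_id}",
        f"resource.labels.zone={zone}",
        f"resource.labels.node_id={node_id}",
        f"logName={log_name}",
    ]
    # Join the comparisons with an explicit AND so the result fits on one line.
    return " AND ".join(clauses)


# Placeholder values; substitute your own project, zone, and TPU name.
print(build_runtime_monitor_filter("your-project", "your-tpu-zone", "your-tpu-name"))
```

Paste the printed text into the advanced logs query box in the same way as the query above.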
Understanding the log output
Click on any log entry to expand it, and you will find a field called jsonPayload. This is where the monitoring logs reside. Click to expand it, and you will see a number of subfields. The following summarizes the important subfields.
evententry_timestamp: the timestamp when the current log entry was created.
uid: the ID of your Cloud TPU.
logTime: the timestamp of the raw runtime log from which the log entry is generated.
checkpoint_succeeded: whether or not the checkpoint is successfully saved.
training_completed: whether or not the training process has been marked as completed.
compilation_succeeded: whether or not the compilation has succeeded.
compilation_timed_out: whether or not the compilation has timed out.
execute_succeeded: whether or not the execution has succeeded.
execute_timed_out: whether or not the execution has timed out.
eager_started: whether or not the runtime is adopting the eager mode.
framework: the runtime framework, i.e., tensorflow or pytorch.
runtime_cpu_perc: the Cloud TPU runtime CPU usage percentage. This is a numeric value with a range of 0~5,000.
runtime_used_MiB: the Cloud TPU runtime memory usage in MiB. This is a numeric value with a range of 0~350,000.
system_available_memory_GiB: the remaining system available memory in GiB. This is a numeric value with a range of 0~350.
matrix_unit_utilization_percent: the Cloud TPU MXU utilization percentage. This is a numeric value with a range of 0~100.
Depending on the origin of the log entries, not all subfields are present at once. For example, log entries with subfield system_available_memory_GiB will not have subfields such as matrix_unit_utilization_percent present.
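Because each entry carries only some of the subfields, code that reads exported entries should treat every numeric subfield as optional. The sketch below illustrates one way to do that, assuming a jsonPayload has already been loaded into a Python dict. The function name numeric_fields and the sample payloads are made up for illustration; the ranges are the ones listed above.

```python
# Documented value ranges for the numeric subfields listed above.
NUMERIC_RANGES = {
    "runtime_cpu_perc": (0, 5000),
    "runtime_used_MiB": (0, 350000),
    "system_available_memory_GiB": (0, 350),
    "matrix_unit_utilization_percent": (0, 100),
}


def numeric_fields(json_payload: dict) -> dict:
    """Return the numeric subfields present in one jsonPayload."""
    found = {}
    for name, (low, high) in NUMERIC_RANGES.items():
        if name not in json_payload:
            continue  # not every entry carries every subfield
        value = float(json_payload[name])
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the documented range {low}~{high}")
        found[name] = value
    return found


# Hypothetical payloads, invented for this example.
memory_entry = {"uid": "example-uid", "system_available_memory_GiB": 312.5}
mxu_entry = {"uid": "example-uid", "matrix_unit_utilization_percent": 41.0}
print(numeric_fields(memory_entry))
print(numeric_fields(mxu_entry))
```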
Creating log-based metrics
This section describes how to create log-based metrics used for setting up monitoring dashboards and alerts. See also Creating log-based metrics programmatically using the Stackdriver REST API.
The following example uses the matrix_unit_utilization_percent subfield to demonstrate the procedures needed to create a log-based metric for monitoring Cloud TPU Matrix Multiplication Unit (MXU) utilization.
In the advanced query box, enter the following query script to extract all log entries that have matrix_unit_utilization_percent defined for the primary Cloud TPU worker:
resource.type=tpu_worker
resource.labels.project_id=your-project
resource.labels.zone=your-tpu-zone
resource.labels.node_id=your-tpu-name
resource.labels.worker_id=0
logName=projects/your-project/logs/tpu.googleapis.com%2Fruntime_monitor
jsonPayload.matrix_unit_utilization_percent:*
Click the CREATE METRIC button. In the Metric Editor sidebar that opens on the right-hand side, enter "matrix_unit_utilization_percent" in the Name field and "MXU utilization" in the Description field.
Click the Type drop-down menu and select Distribution. The Distribution type is appropriate for displaying numeric metrics.
In the Field name, enter "jsonPayload.matrix_unit_utilization_percent".
Click More. In the Histogram buckets section, change the Type drop-down menu to Linear. Enter 0 in Start value, 200 in Number of buckets, and 0.5 in Bucket width. This creates 200 buckets in the range of 0~100 with a bucket width of 0.5.
Click the Create Metric button at the bottom of the sidebar to finish metric creation.
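To build intuition for what this metric records, the steps above can be imitated locally. This is a rough sketch under stated assumptions: the entries are dicts shaped like exported log entries (resource.labels and jsonPayload keys), the function name bucket_counts is hypothetical, and out-of-range values are clamped into the first or last bucket, which simplifies how a real distribution metric treats them.

```python
from collections import Counter

FIELD = "matrix_unit_utilization_percent"


def bucket_counts(entries, start=0.0, width=0.5, num_buckets=200):
    """Count MXU utilization values per linear bucket for the primary worker."""
    counts = Counter()
    for entry in entries:
        labels = entry.get("resource", {}).get("labels", {})
        payload = entry.get("jsonPayload", {})
        if labels.get("worker_id") != "0":
            continue  # keep only the primary Cloud TPU worker
        if FIELD not in payload:
            continue  # same effect as the ":*" presence test in the query
        value = float(payload[FIELD])
        index = int((value - start) // width)
        # Simplification: clamp instead of using separate overflow buckets.
        index = min(max(index, 0), num_buckets - 1)
        counts[index] += 1
    return counts


# Hypothetical entries, invented for this example.
sample_entries = [
    {"resource": {"labels": {"worker_id": "0"}}, "jsonPayload": {FIELD: 41.0}},
    {"resource": {"labels": {"worker_id": "1"}}, "jsonPayload": {FIELD: 50.0}},
    {"resource": {"labels": {"worker_id": "0"}}, "jsonPayload": {"runtime_cpu_perc": 120}},
]
print(bucket_counts(sample_entries))
```

Only the first sample entry is counted: the second belongs to another worker and the third has no MXU utilization value.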
To display monitoring results accurately, your bucket range definition must be appropriate. Use the range values specified for the numeric fields in Understanding the log output. One technique is to always use Linear as Type and 200 as Number of buckets, and then figure out Bucket width based on the metric range.
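The bucket width in that technique is simply the metric range divided by the number of buckets. A small sketch of the arithmetic follows, using the ranges listed in Understanding the log output; the function name linear_bucket_width is hypothetical.

```python
def linear_bucket_width(range_min: float, range_max: float, num_buckets: int = 200) -> float:
    """Return the width that spreads num_buckets evenly over [range_min, range_max]."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    if range_max <= range_min:
        raise ValueError("range_max must be greater than range_min")
    return (range_max - range_min) / num_buckets


# Upper bounds taken from the documented ranges of the numeric subfields.
upper_bounds = {
    "matrix_unit_utilization_percent": 100,
    "system_available_memory_GiB": 350,
    "runtime_cpu_perc": 5000,
    "runtime_used_MiB": 350000,
}
for name, upper in upper_bounds.items():
    print(name, linear_bucket_width(0, upper))
```

For matrix_unit_utilization_percent this gives 0.5, the value entered in the Metric Editor steps.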
Creating log-based metrics programmatically using the Stackdriver REST API
You can create these metrics through the Stackdriver REST API programmatically. The prerequisite for using this API is a JSON file containing the definition of all required metrics. See the instructions for Creating a distribution metric to learn how to define a log-based metric using JSON. The following example declares two metrics within one JSON file:
[
  {
    "name": "system_available_memory_GiB",
    "description": "System available memory.",
    "filter": "resource.type=tpu_worker AND resource.labels.project_id=your-project AND jsonPayload.system_available_memory_GiB:*",
    "valueExtractor": "EXTRACT(jsonPayload.system_available_memory_GiB)",
    "bucketOptions": {
      "linearBuckets": {
        "numFiniteBuckets": 200,
        "width": 1.75,
        "offset": 0
      }
    },
    "metricDescriptor": {
      "metricKind": "DELTA",
      "valueType": "DISTRIBUTION"
    }
  },
  {
    "name": "matrix_unit_utilization_percent",
    "description": "MXU utilization.",
    "filter": "resource.type=tpu_worker AND resource.labels.project_id=your-project AND jsonPayload.matrix_unit_utilization_percent:*",
    "valueExtractor": "EXTRACT(jsonPayload.matrix_unit_utilization_percent)",
    "bucketOptions": {
      "linearBuckets": {
        "numFiniteBuckets": 200,
        "width": 0.5,
        "offset": 0
      }
    },
    "metricDescriptor": {
      "metricKind": "DELTA",
      "valueType": "DISTRIBUTION"
    }
  }
]
To simplify metric creation, we provide a JSON file with a few predefined metrics in this link, which contains the following:
matrix_unit_utilization_percent: MXU utilization percentage.
system_available_memory_GiB: system available memory in GiB.
runtime_used_MiB: the Cloud TPU runtime memory usage in MiB.
runtime_cpu_perc: the Cloud TPU runtime CPU usage percentage.
training_completed: number of completed training events.
compilation_succeeded: number of successful compilation events.
The Stackdriver REST API only takes one JSON object as the input at a time. Hence, we need a tool to break down the list in the JSON file to individual JSON objects before passing them to the REST API. We recommend using the open source tool jq.
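If jq is not available, the same splitting step can be approximated with a few lines of Python. This is a sketch only: the function name split_metric_definitions is hypothetical, and it assumes the file holds a JSON array of metric definitions as in the example above.

```python
import json


def split_metric_definitions(json_text: str) -> list:
    """Split a JSON array into one compact JSON string per metric definition."""
    definitions = json.loads(json_text)
    if not isinstance(definitions, list):
        raise ValueError("expected a JSON array of metric definitions")
    # Compact separators keep each object on a single line.
    return [json.dumps(item, separators=(",", ":")) for item in definitions]


# A shortened, hypothetical input; real definitions carry more fields.
example = '[{"name": "training_completed"}, {"name": "compilation_succeeded"}]'
for line in split_metric_definitions(example):
    print(line)
```

Each printed line is one JSON object that can be sent as the body of a single request.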
If you have already created metrics with the same names, either change the names of the metric definitions in the JSON file or remove the existing metrics before calling the Stackdriver REST API. To remove existing metrics, first delete all alerts that are set on them; see this guide for instructions on removing existing alert policies. To remove the existing metrics and create new ones automatically, follow these steps:
Open the downloaded JSON file, then replace your-project with your project.
Remove all existing metrics:
jq -c '.[] | .name' your-json-file-name | \
  xargs -I % curl -X DELETE "https://logging.googleapis.com/v2/projects/your-project/metrics/%" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Accept: application/json"
Create new metrics:
jq -c '.[]' your-json-file-name | \
  xargs -0 -d '\n' -I % curl -X POST "https://logging.googleapis.com/v2/projects/your-project/metrics" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Accept: application/json" -H "Content-Type: application/json" -d %
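Before running the delete and create commands, it can be useful to preview exactly which requests they will issue. The sketch below only builds the request list and sends nothing; the function name plan_metric_requests is hypothetical, and the URLs follow the pattern used in the curl commands above. Sending the requests would additionally need the Authorization header shown in those commands.

```python
import json

API_ROOT = "https://logging.googleapis.com/v2"


def plan_metric_requests(project_id: str, definitions: list) -> list:
    """Return (method, url, body) tuples: delete each metric by name, then create it."""
    base = f"{API_ROOT}/projects/{project_id}/metrics"
    deletes = [("DELETE", f"{base}/{item['name']}", None) for item in definitions]
    creates = [("POST", base, json.dumps(item)) for item in definitions]
    return deletes + creates


# A shortened, hypothetical definition; real definitions carry more fields.
for method, url, body in plan_metric_requests("your-project", [{"name": "runtime_cpu_perc"}]):
    print(method, url, body)
```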
Creating dashboards and alerts using log-based metrics
Once you have created log-based metrics, you can create dashboards and alerts in Stackdriver Monitoring. Dashboards are useful for visualizing metrics (expect a delay of about 2 minutes); alerts notify you when things go wrong.
Creating dashboards
The steps in this section show an example of creating a dashboard in Stackdriver Monitoring for the matrix_unit_utilization_percent metric.
Go to the Stackdriver Monitoring Console.
Move the cursor to Dashboards and click Create Dashboard in the prompted menu.
Click the Add Chart button on the top right corner.
In the prompted new page, enter "TPU runtime MXU utilization" in the Chart Title text input box.
In the Find resource type and metric field, enter "matrix_unit_utilization_percent". Stackdriver automatically loads the created metric logging.googleapis.com/user/matrix_unit_utilization_percent. Select the found metric. Note that the Resource type field should be automatically filled with TPU Worker.
In the Filter field, set project_id to your project, zone to your TPU zone, and node_id to your TPU name.
Change the Aggregator to none. Then click the SAVE button at the bottom of the page.
Creating alerts
The steps in this section show how to add an alert policy for the matrix_unit_utilization_percent metric, which tracks Cloud TPU MXU utilization. Whenever MXU utilization falls below 5% for more than 1 hour, Stackdriver sends an email to the registered email address. If the MXU utilization of the Cloud TPU rises above 5% again, Stackdriver sends a notification that the alert has been resolved.
Go to the Stackdriver Monitoring Console.
Move the cursor to Alerting and click Create a Policy in the prompted menu.
In the prompted Create New Alerting Policy page, click the Add Condition button.
Enter "TPU runtime low MXU utilization alert" in the Condition field.
In the Find resource type and metric field, enter "matrix_unit_utilization_percent". Stackdriver automatically loads the created metric logging.googleapis.com/user/matrix_unit_utilization_percent. Select the found metric. Note that the Resource type field should be automatically filled with TPU Worker.
In the Configuration section, change Condition to is below, enter 5 in the Threshold field, and select 1 hour in the For drop-down menu. Click the Save button.
In the Notification Channel Type drop-down menu, select Email and enter your email address in the Email address text box. Click the Add Notification Channel button.
In the Documentation field, you can enter any information that will help you identify the problem when the alert fires.
In the text input box beneath Name this policy, enter "TPU runtime low MXU utilization alert". Click the Save button at the bottom.
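The condition configured in these steps (below 5 for 1 hour) can be checked offline against recorded utilization values, for example to judge whether the threshold suits your workload. The following is an approximation for illustration only, not the evaluation logic Stackdriver uses: the function name low_utilization_alert and the sample data are hypothetical, and samples are assumed to be sorted by time.

```python
from datetime import datetime, timedelta


def low_utilization_alert(samples, threshold=5.0, duration=timedelta(hours=1)):
    """Return True if the most recent samples stayed below threshold for at least duration.

    samples is a list of (timestamp, percent) tuples sorted by timestamp.
    """
    below_since = None
    for timestamp, value in samples:
        if value < threshold:
            if below_since is None:
                below_since = timestamp  # start of the current low stretch
        else:
            below_since = None  # utilization recovered, reset the stretch
    if below_since is None:
        return False
    return samples[-1][0] - below_since >= duration


# Hypothetical readings every 10 minutes, all at 2% utilization.
start = datetime(2024, 1, 1, 0, 0)
stalled = [(start + timedelta(minutes=10 * i), 2.0) for i in range(7)]
recovered = stalled + [(start + timedelta(minutes=70), 40.0)]
print(low_utilization_alert(stalled))
print(low_utilization_alert(recovered))
```

The first call reports an alert because utilization stayed low for the full hour; the second does not because the final reading is back above the threshold.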