Cloud TPU Monitoring with Stackdriver

This guide explains how to use Stackdriver to monitor your Cloud TPU. Your Cloud TPU automatically collects logs and metrics of the Cloud TPU runtime binary (for example, Cloud TPU runtime CPU usage, MXU utilization) and saves them in Stackdriver. This guide familiarizes you with these logs and shows you how to:

  • Query these logs

  • Create log-based metrics for setting up alerts and visualizing dashboards

Prerequisites

This document assumes some basic knowledge of Stackdriver logging. You must have a Compute Engine VM and Cloud TPU resources created before you can begin generating and working with logs. See the quickstart for more details.

Do not follow the clean-up instructions in the quickstart until you have finished running your model and no longer need the resources. Clean up afterwards to avoid incurring unwanted charges.

Logging

Stackdriver logging is automatically performed by Cloud TPU and can incur charges. For more information about logging charges, see Logging Charges.

Stackdriver logging scales linearly as you add Cloud TPU Pods. You can reduce the number of logs ingested, or disable Stackdriver logging, by excluding logs. For more information, see Logs exclusions.

Locating the monitoring logs in Stackdriver

The monitoring logs discussed in this guide are present in a special log entity called runtime_monitor. To locate them:

  1. Go to the Logging > Logs (Logs Viewer) page of Google Cloud's operations suite in the Cloud Console:

    Go to the Logs Viewer page

  2. Select an existing Google Cloud project at the top of the page.

  3. The Audited Resource selector menu lets you choose resources, logs, and log level severities to display. Click the Audited Resource selector menu, scroll down and hover over TPU Worker. Select the zone and then the name (node_id) of the Cloud TPU for which you want to see logs.

  4. In the log drop-down menu, select runtime_monitor. Click the OK button.


Using Stackdriver advanced query

You can use Stackdriver advanced queries to quickly locate the monitoring logs you need.

To use advanced logs queries in the Logs Viewer:

  1. Go to the Logging > Logs (Logs Viewer) page of Google Cloud's operations suite in the Cloud Console:

    Go to the Logs Viewer page

  2. Select an existing Google Cloud project at the top of the page, or create a new project.

  3. In the search query box, click the drop-down menu and select Convert to advanced filter.

  4. In the advanced logs query box, enter the following query, then click the Submit Filter button:

    resource.type=tpu_worker
    resource.labels.project_id=your-project
    resource.labels.zone=your-tpu-zone
    resource.labels.node_id=your-tpu-name
    logName=projects/your-project/logs/tpu.googleapis.com%2Fruntime_monitor
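
If you run this query for several Cloud TPUs, it can help to generate the query text rather than edit it by hand. The following is a minimal sketch that assembles the same five clauses shown above. The function name build_runtime_monitor_filter is hypothetical and is not part of any Google Cloud library.

```python
def build_runtime_monitor_filter(project_id, zone, node_id):
    """Return an advanced logs query for the runtime_monitor log of one Cloud TPU.

    The clauses mirror the query in this guide. Clauses on separate lines
    are combined with an implicit AND.
    """
    clauses = [
        "resource.type=tpu_worker",
        f"resource.labels.project_id={project_id}",
        f"resource.labels.zone={zone}",
        f"resource.labels.node_id={node_id}",
        # "%2F" is the URL-encoded "/" in the log ID tpu.googleapis.com/runtime_monitor.
        f"logName=projects/{project_id}/logs/tpu.googleapis.com%2Fruntime_monitor",
    ]
    return "\n".join(clauses)


# Placeholder values, as in the guide. Replace them with your own.
query = build_runtime_monitor_filter("your-project", "your-tpu-zone", "your-tpu-name")
print(query)
```

Paste the printed text into the advanced logs query box.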
    


Understanding the log output

Click on any log entry to expand it, and you will find a field called jsonPayload. This is where the monitoring logs reside. Click to expand it, and you will see a number of subfields. The following summarizes the important subfields.

  • evententry_timestamp: the timestamp when the current log entry was created.

  • uid: the id of your Cloud TPU.

  • logTime: the timestamp of the raw runtime log from which the log entry is generated.

  • checkpoint_succeeded: whether or not the checkpoint is successfully saved.

  • training_completed: whether or not the training process has been marked as completed.

  • compilation_succeeded: whether or not the compilation has succeeded.

  • compilation_timed_out: whether or not the compilation has timed out.

  • execute_succeeded: whether or not the execution has succeeded.

  • execute_timed_out: whether or not the execution has timed out.

  • eager_started: whether or not the runtime is adopting the eager mode.

  • framework: the runtime framework, i.e., tensorflow or pytorch.

  • runtime_cpu_perc: the Cloud TPU runtime CPU usage percentage. This is a numeric value with a range of 0 to 5,000.

  • runtime_used_MiB: the Cloud TPU runtime memory usage in MiB. This is a numeric value with a range of 0 to 350,000.

  • system_available_memory_GiB: the remaining system available memory in GiB. This is a numeric value with a range of 0 to 350.

  • matrix_unit_utilization_percent: the Cloud TPU MXU utilization percentage. This is a numeric value with a range of 0 to 100.

Depending on the origin of the log entries, not all subfields are present at once. For example, log entries with subfield system_available_memory_GiB will not have subfields such as matrix_unit_utilization_percent present.
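
Because each entry carries only some of the subfields, code that processes exported log entries should check which fields are present before reading them. The following is a minimal sketch of that check. It assumes jsonPayload has already been parsed into a Python dict and that the numeric subfields arrive as numbers. The sample payload is invented for illustration and is not real log output.

```python
# Documented upper bounds of the numeric subfields (the lower bound is 0).
NUMERIC_FIELD_MAX = {
    "runtime_cpu_perc": 5000,
    "runtime_used_MiB": 350000,
    "system_available_memory_GiB": 350,
    "matrix_unit_utilization_percent": 100,
}


def numeric_fields(json_payload):
    """Return only the documented numeric subfields present in this entry."""
    return {
        name: json_payload[name]
        for name in NUMERIC_FIELD_MAX
        if name in json_payload
    }


def out_of_range(json_payload):
    """Return the names of numeric subfields outside their documented range."""
    return [
        name
        for name, value in numeric_fields(json_payload).items()
        if not 0 <= value <= NUMERIC_FIELD_MAX[name]
    ]


# Hypothetical payload: an MXU entry, so the memory fields are absent.
sample_payload = {"uid": "example-uid", "matrix_unit_utilization_percent": 42.5}
print(numeric_fields(sample_payload))
print(out_of_range(sample_payload))
```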

Creating log-based metrics

This section describes how to create log-based metrics used for setting up monitoring dashboards and alerts. See also Creating log-based metrics programmatically using the Stackdriver REST API.

The following example uses the matrix_unit_utilization_percent subfield to demonstrate the procedures needed to create a log-based metric for monitoring Cloud TPU Matrix Multiplication Unit (MXU) utilization.

  1. In the advanced query box, enter the following query script to extract all log entries that have matrix_unit_utilization_percent defined for the master Cloud TPU worker:

    resource.type=tpu_worker
    resource.labels.project_id=your-project
    resource.labels.zone=your-tpu-zone
    resource.labels.node_id=your-tpu-name
    resource.labels.worker_id=0
    logName=projects/your-project/logs/tpu.googleapis.com%2Fruntime_monitor
    jsonPayload.matrix_unit_utilization_percent:*
    
  2. Click the CREATE METRIC button. In the Metric Editor sidebar that opens on the right-hand side, enter "matrix_unit_utilization_percent" in the Name field and "MXU utilization" in the Description field.

  3. Click the Type drop-down menu and select Distribution. The Distribution type is appropriate for displaying numeric metrics.

  4. In the Field name, enter "jsonPayload.matrix_unit_utilization_percent".

  5. Click More. In the Histogram buckets section, change the Type drop-down menu to Linear. Enter 0 in Start value, 200 in Number of buckets, and 0.5 in Bucket width. This creates 200 buckets of width 0.5 that cover the range 0 to 100.

  6. Click the Create Metric button at the bottom of the sidebar to finish metric creation.


To display monitoring results accurately, your bucket range definition must be appropriate. Use the range values specified for the numeric fields in Understanding the log output. One technique is to always use Linear as Type and 200 as Number of buckets, and then figure out Bucket width based on the metric range.
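
The bucket width for that technique is the metric range divided by the number of buckets. The following is a small sketch of the arithmetic, using the ranges listed in Understanding the log output. The helper name linear_bucket_width is hypothetical.

```python
def linear_bucket_width(upper_bound, num_buckets=200, start=0):
    """Return the width that makes num_buckets linear buckets span start..upper_bound."""
    return (upper_bound - start) / num_buckets


# Upper bounds taken from the documented ranges of the numeric subfields.
RANGES = {
    "runtime_cpu_perc": 5000,
    "runtime_used_MiB": 350000,
    "system_available_memory_GiB": 350,
    "matrix_unit_utilization_percent": 100,
}

for name, upper_bound in RANGES.items():
    print(f"{name}: bucket width {linear_bucket_width(upper_bound)}")
```

For matrix_unit_utilization_percent this gives the 0.5 used in the steps above.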

Creating log-based metrics programmatically using the Stackdriver REST API

You can create these metrics through the Stackdriver REST API programmatically. The prerequisite for using this API is a JSON file containing the definition of all required metrics. See the instructions for Creating a distribution metric to learn how to define a log-based metric using JSON. The following example declares two metrics within one JSON file:

[
  {
    "name": "system_available_memory_GiB",
    "description": "System available memory.",
    "filter": "resource.type=tpu_worker AND resource.labels.project_id=your-project AND jsonPayload.system_available_memory_GiB:*",
    "valueExtractor": "EXTRACT(jsonPayload.system_available_memory_GiB)",
    "bucketOptions": {
      "linearBuckets": {
        "numFiniteBuckets": 200,
        "width": 1.75,
        "offset": 0
      }
    },
    "metricDescriptor": {
      "metricKind": "DELTA",
      "valueType": "DISTRIBUTION"
    }
  },
  {
    "name": "matrix_unit_utilization_percent",
    "description": "MXU utilization.",
    "filter": "resource.type=tpu_worker AND resource.labels.project_id=your-project AND jsonPayload.matrix_unit_utilization_percent:*",
    "valueExtractor": "EXTRACT(jsonPayload.matrix_unit_utilization_percent)",
    "bucketOptions": {
      "linearBuckets": {
        "numFiniteBuckets": 200,
        "width": 0.5,
        "offset": 0
      }
    },
    "metricDescriptor": {
      "metricKind": "DELTA",
      "valueType": "DISTRIBUTION"
    }
  }
]
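
The two definitions differ only in name, description, and bucket width, so the file can also be generated. The following is a minimal sketch that reproduces the JSON above from those three values. The function name distribution_metric is hypothetical, and the output has the same structure as the example, not necessarily the same formatting.

```python
import json


def distribution_metric(name, description, project_id, width, num_buckets=200):
    """Build one distribution metric definition for a jsonPayload subfield."""
    field = f"jsonPayload.{name}"
    return {
        "name": name,
        "description": description,
        # ":*" matches entries in which the subfield is defined.
        "filter": (
            "resource.type=tpu_worker"
            f" AND resource.labels.project_id={project_id}"
            f" AND {field}:*"
        ),
        "valueExtractor": f"EXTRACT({field})",
        "bucketOptions": {
            "linearBuckets": {
                "numFiniteBuckets": num_buckets,
                "width": width,
                "offset": 0,
            }
        },
        "metricDescriptor": {"metricKind": "DELTA", "valueType": "DISTRIBUTION"},
    }


metrics = [
    distribution_metric(
        "system_available_memory_GiB", "System available memory.", "your-project", 1.75
    ),
    distribution_metric(
        "matrix_unit_utilization_percent", "MXU utilization.", "your-project", 0.5
    ),
]
print(json.dumps(metrics, indent=2))
```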

To simplify metric creation, we provide a JSON file with a few predefined metrics in this link, which contains the following:

  1. matrix_unit_utilization_percent: MXU utilization percentage.

  2. system_available_memory_GiB: system available memory in GiB.

  3. runtime_used_MiB: the Cloud TPU runtime memory usage in MiB.

  4. runtime_cpu_perc: the Cloud TPU runtime CPU usage percentage.

  5. training_completed: number of completed training events.

  6. compilation_succeeded: number of successful compilation events.

The Stackdriver REST API only takes one JSON object as the input at a time. Hence, we need a tool to break down the list in the JSON file to individual JSON objects before passing them to the REST API. We recommend using the open source tool jq.
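
If jq is not available, the same split can be done with a few lines of Python. The following sketch mimics the jq call used in the steps below: it turns a JSON array into one compact JSON object per line. The function name split_metric_list and the two-element example input are illustrative only.

```python
import json


def split_metric_list(json_text):
    """Return each element of a JSON array as a compact, single-line JSON string."""
    definitions = json.loads(json_text)
    if not isinstance(definitions, list):
        raise ValueError("expected a JSON array of metric definitions")
    # separators=(",", ":") drops the optional spaces, like jq's -c option.
    return [json.dumps(d, separators=(",", ":")) for d in definitions]


# Tiny stand-in for the contents of the metrics JSON file.
example = '[{"name": "a"}, {"name": "b"}]'
for line in split_metric_list(example):
    print(line)
```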

If you have already created metrics with the same names, either rename the metric definitions in the JSON file or remove the existing metrics before calling the Stackdriver REST API. Before you can remove an existing metric, you must delete all alerts that are based on it; see this guide for instructions on removing existing alert policies. To remove the existing metrics and create new ones automatically, follow these steps:

  1. Open the downloaded JSON file and replace your-project with your project ID.

  2. Remove all existing metrics:

    # Print each metric name unquoted (-r), then send one DELETE request per name.
    jq -r '.[].name' your-json-file-name | \
    xargs -I % curl -X DELETE "https://logging.googleapis.com/v2/projects/your-project/metrics/%" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Accept: application/json"
    
  3. Create new metrics:

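    # Emit one compact JSON object per line, then POST each line as a request body.
    # -d '\n' (GNU xargs) splits on newlines only, so quotes inside the JSON are preserved.
    jq -c '.[]' your-json-file-name | \
    xargs -d '\n' -I % curl -X POST "https://logging.googleapis.com/v2/projects/your-project/metrics" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Accept: application/json" -H "Content-Type: application/json" -d %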
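
The same two steps can be expressed with the Python standard library. The following sketch only builds the DELETE and POST requests that the commands above send and does not contact the API. The function name build_requests and the "example-token" value are placeholders. In real use the token would come from gcloud auth print-access-token.

```python
import json
import urllib.request

# Endpoint used by the curl commands in this guide.
METRICS_URL = "https://logging.googleapis.com/v2/projects/{project}/metrics"


def build_requests(definitions, project_id, access_token):
    """Return one (delete_request, create_request) pair per metric definition."""
    base = METRICS_URL.format(project=project_id)
    auth = {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
    }
    pairs = []
    for definition in definitions:
        # DELETE .../metrics/<name> removes an existing metric of the same name.
        delete = urllib.request.Request(
            f"{base}/{definition['name']}", headers=auth, method="DELETE"
        )
        # POST .../metrics creates the metric from one JSON object.
        create = urllib.request.Request(
            base,
            data=json.dumps(definition).encode("utf-8"),
            headers={**auth, "Content-Type": "application/json"},
            method="POST",
        )
        pairs.append((delete, create))
    return pairs


pairs = build_requests(
    [{"name": "matrix_unit_utilization_percent"}], "your-project", "example-token"
)
for delete, create in pairs:
    print(delete.get_method(), delete.full_url)
    print(create.get_method(), create.full_url)
```

To send a request, pass it to urllib.request.urlopen. The sketch leaves that out so that it can be run without credentials.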
    

Creating dashboards and alerts using log-based metrics

Once you have created log-based metrics, you can create dashboards and alerts in Stackdriver Monitoring. Dashboards are useful for visualizing metrics (expect a delay of about 2 minutes); alerts notify you when something goes wrong.

Creating dashboards

The steps in this section show an example of creating a dashboard in Stackdriver Monitoring for the matrix_unit_utilization_percent metric.

  1. Go to the Stackdriver Monitoring Console.

    Go to Stackdriver Monitoring

  2. Hover over Dashboards and click Create Dashboard in the menu that appears.

  3. Click the Add Chart button on the top right corner.

  4. On the page that opens, enter "TPU runtime MXU utilization" in the Chart Title text box.

  5. In the Find resource type and metric field, enter "matrix_unit_utilization_percent". Stackdriver automatically loads the created metric logging.googleapis.com/user/matrix_unit_utilization_percent. Select that metric. The Resource type field should be filled in automatically with TPU Worker.

  6. In the Filter field, set project_id to your project, zone to your TPU zone, and node_id to your TPU name.

  7. Change the Aggregator to none. Then click the SAVE button at the bottom of the page.

Creating alerts

The steps in this section show an example of how to add an alert policy for the matrix_unit_utilization_percent metric, that is, a policy tied to Cloud TPU MXU utilization. Whenever MXU utilization falls below 5% for more than 1 hour, Stackdriver sends an email to the registered email address. If the MXU utilization of the Cloud TPU rises above 5% again, Stackdriver sends a notification that the alert has been resolved.

  1. Go to the Stackdriver Monitoring Console.

    Go to Stackdriver Monitoring

  2. Hover over Alerting and click Create a Policy in the menu that appears.

  3. On the Create New Alerting Policy page that opens, click the Add Condition button.

  4. Enter "TPU runtime low MXU utilization alert" in the Condition field.

  5. In the Find resource type and metric field, enter "matrix_unit_utilization_percent". Stackdriver automatically loads the created metric logging.googleapis.com/user/matrix_unit_utilization_percent. Select that metric. The Resource type field should be filled in automatically with TPU Worker.

  6. In the Configuration section, change Condition to is below, enter 5 in the Threshold field, and select 1 hour in the For drop-down menu. Click the Save button.

  7. In the Notification Channel Type drop-down menu, select Email and enter your email address in the Email address text box. Click the Add Notification Channel button.

  8. In the Documentation field, enter any information that will help you identify the problem when the alert fires.

  9. In the text box beneath Name this policy, enter "TPU runtime low MXU utilization alert". Click the Save button at the bottom.
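
To make the condition concrete, the following sketch models "below 5 for 1 hour" over a list of samples. It illustrates the rule only and is not how Stackdriver evaluates alert policies. The function name is_below_for, the 10-minute sampling interval, and the sample values are all assumptions made for the example.

```python
def is_below_for(samples, threshold, duration_s):
    """Return True if the most recent samples stay below threshold for duration_s.

    samples is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    run_start = None  # timestamp at which the current below-threshold run began
    for timestamp, value in samples:
        if value < threshold:
            if run_start is None:
                run_start = timestamp
        else:
            run_start = None  # any sample at or above the threshold resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= duration_s


# Hypothetical MXU utilization samples taken every 10 minutes.
low = [(0, 10.0)] + [(600 * i, 3.0) for i in range(1, 8)]
recovered = low + [(4800, 6.0)]

print(is_below_for(low, threshold=5, duration_s=3600))
print(is_below_for(recovered, threshold=5, duration_s=3600))
```

In the first series utilization has stayed below 5 for a full hour, so the condition holds. In the second the final sample rises above 5, which corresponds to the alert being cleared.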