Cloud TPU monitoring and logging dashboards

The Cloud TPU monitoring and debugging dashboards repository contains all the infrastructure and logic required to monitor and debug workflows running on TPU VMs. It requires Terraform, an open source tool that uses configuration files to define and manage Google Cloud infrastructure.

Monitoring dashboard

The monitoring dashboard displays the following Cloud TPU metrics:

CPU utilization on TPU worker: cpu/utilization
Memory usage by TPU VM: memory/usage
Bytes received by a Pod over the network: network/received_bytes_count
Bytes sent by a Pod over the network: network/sent_bytes_count
TensorCore idle duration of TPU chip: tpu/tensorcore/idle_duration

Cloud Monitoring automatically generates metrics data. To view the monitoring dashboard, go to the Google Cloud console, select Monitoring, and then select Dashboards. In the list of dashboards, select GCE - TPU Monitoring Dashboard. The dashboard is composed of panes, each displaying metrics data. For each Cloud TPU metric, the dashboard displays the current values, the mean, and the maximum in separate panes. The mean and maximum panes help you identify outlier TPU VMs for further investigation.

You can also view TPU metrics using the Metrics Explorer in Cloud Monitoring in the Google Cloud console. For more information, see Monitoring Cloud TPU VMs.
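
The short metric names listed earlier in this section can be turned into Cloud Monitoring filter strings for use in Metrics Explorer. The following is a minimal sketch, not an official tool. It assumes the metrics live under the tpu.googleapis.com/ prefix, which this page does not state, so confirm the full metric type in Metrics Explorer first. The function name tpu_metric_filter is hypothetical.

```shell
# ASSUMPTION: TPU metric types use the tpu.googleapis.com/ prefix.
TPU_METRIC_PREFIX="tpu.googleapis.com"

# Hypothetical helper: builds a Cloud Monitoring filter for one TPU metric.
tpu_metric_filter() {
  # $1 is the short metric name from the list above, e.g. cpu/utilization
  printf 'metric.type="%s/%s"\n' "$TPU_METRIC_PREFIX" "$1"
}

# Print a filter for each metric shown on the monitoring dashboard.
for metric in cpu/utilization memory/usage network/received_bytes_count \
              network/sent_bytes_count tpu/tensorcore/idle_duration; do
  tpu_metric_filter "$metric"
done
```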

Logging dashboard

The logging dashboard displays two sections:

Log Metrics: displays the count of stack trace entries collected for each TPU VM worker.
Logs Panel: displays all the stack trace entries with columns for severity, timestamp, and a summary of the entry. You can filter the logs panel by a string or by TPU worker labels such as node_id and worker_id. For example, adding the filter resource.labels.node_id:"testing" displays all the trace entries collected for the TPU VM worker whose node ID is set to "testing".
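
The same label filter can also be used outside the dashboard. The following sketch builds the filter string from the example above and passes it to gcloud logging read. The helper names are hypothetical, and the limit of 10 entries is an arbitrary choice for illustration.

```shell
# Hypothetical helper: builds the Logs Panel filter for one TPU node.
tpu_node_filter() {
  # $1 is the node_id label value, e.g. testing
  printf 'resource.labels.node_id:"%s"\n' "$1"
}

# Hypothetical wrapper: reads the 10 most recent matching log entries.
# Requires the gcloud CLI and an authenticated session.
read_tpu_node_logs() {
  gcloud logging read "$(tpu_node_filter "$1")" --limit=10
}
```

For example, read_tpu_node_logs testing would query entries for the node used in the filter example above.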

Cloud Logging collects log entries generated by Google Cloud services, third-party services, ML frameworks, and your code. You can install the cloud-tpu-diagnostics PyPI package to write Python stack traces when a fault or exception occurs or when your workload is unresponsive. For more information, see Debugging Cloud TPU VMs.

To view the logging dashboard, go to the Google Cloud console, select Monitoring, and then Dashboards. Select GCE - TPU Logging Dashboard in the list of dashboards.
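
Once the dashboards have been deployed, you can also check from the command line that they exist. The following is a rough sketch with a hypothetical helper name. It assumes the dashboard display names contain the string TPU, as the names on this page do.

```shell
# Hypothetical helper: prints the display names of dashboards in the current
# gcloud project that mention TPU. Requires the gcloud CLI and an
# authenticated session.
list_tpu_dashboards() {
  gcloud monitoring dashboards list --format="value(displayName)" | grep "TPU"
}
```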

Configure the metrics and logging dashboards

Follow these instructions to configure the metrics and logging dashboards.

Clone the dashboard repository

You can clone the repository directly from the Cloud TPU Monitoring and Debugging GitHub repository or from the command line:

git clone https://github.com/google/cloud-tpu-monitoring-debugging.git

Install Terraform

Install Terraform on your local machine.
Run terraform init to add the necessary plugins and create the .terraform directory.
Run terraform init -upgrade to install any available updates.
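
Before running the steps above, it can help to confirm that Terraform is actually on your PATH. The following sketch uses a hypothetical helper, require_tool, that works for any command name.

```shell
# Hypothetical helper: fails with a message if a required tool is not on PATH.
require_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing required tool: $1" >&2
    return 1
  fi
}

# Usage (run in your own shell):
#   require_tool terraform && terraform version
```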

Configure permissions for monitoring and logging

The Monitoring Admin, Monitoring Dashboard Configuration Editor, and Logging Admin roles are required to deploy monitoring resources to your Google Cloud project. For more information on monitoring roles, see Grant access to Cloud Monitoring.
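
If you administer the project, one way to grant these roles is with gcloud projects add-iam-policy-binding. The following sketch assumes that the role names above map to the role IDs shown in the code; verify them in IAM before use. The helper name and the example member are hypothetical.

```shell
# ASSUMPTION: these role IDs correspond to the role names above.
REQUIRED_ROLES="roles/monitoring.admin roles/monitoring.dashboardEditor roles/logging.admin"

# Hypothetical helper: grants each role to one principal on one project.
# $1 = project ID, $2 = member, e.g. user:alice@example.com
grant_dashboard_roles() {
  for role in $REQUIRED_ROLES; do
    gcloud projects add-iam-policy-binding "$1" \
      --member="$2" --role="$role" || return 1
  done
}
```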

Create a Cloud Storage bucket

Create a Cloud Storage bucket to store Terraform state files. You can also use an existing Cloud Storage bucket. Using a Cloud Storage bucket ensures that multiple users can run Terraform at the same time and access the same infrastructure.

In Cloud Shell, create the Cloud Storage bucket:

gcloud storage buckets create gs://${GCS_BUCKET_NAME}

Enable Object Versioning to keep the history of your deployments. Enabling Object Versioning increases storage costs, which you can mitigate by configuring Object Lifecycle Management to delete old state versions.
```
gcloud storage buckets update gs://${GCS_BUCKET_NAME} --versioning
```
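
The commands in this section expect the GCS_BUCKET_NAME variable to be set. The following sketch sets it to a placeholder name and wraps both commands in a hypothetical helper that ends by describing the bucket, so that you can inspect its versioning setting in the output.

```shell
# Placeholder name: bucket names are globally unique, so choose your own.
export GCS_BUCKET_NAME="my-tpu-dashboard-tfstate"

# Hypothetical helper: creates the bucket, turns on Object Versioning, and
# prints the bucket's metadata for inspection.
create_state_bucket() {
  gcloud storage buckets create "gs://${GCS_BUCKET_NAME}" &&
  gcloud storage buckets update "gs://${GCS_BUCKET_NAME}" --versioning &&
  gcloud storage buckets describe "gs://${GCS_BUCKET_NAME}"
}
```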

Initialize Terraform

Run terraform init inside the gcp_resources/gce directory of the repository. The command prompts you to enter the name of a Cloud Storage bucket to store Terraform state files.
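
To skip the interactive prompt, Terraform lets you pass backend settings with the -backend-config flag. The following sketch assumes that the repository's backend reads the bucket name from a setting called bucket, which this page does not confirm. The helper name is hypothetical.

```shell
# Hypothetical helper: initializes Terraform in the combined dashboard
# directory without prompting for the state bucket.
# $1 = path to the cloned repository, $2 = state bucket name
# ASSUMPTION: the backend accepts a "bucket" setting.
init_dashboards() {
  ( cd "$1/gcp_resources/gce" &&
    terraform init -backend-config="bucket=$2" )
}
```

The subshell keeps the directory change from affecting your current shell.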

Deploy the monitoring and logging dashboards

Run terraform apply inside the gcp_resources/gce directory to deploy the monitoring and logging dashboards in your Google Cloud project. The command prompts you to provide values for the following variables:

var.monitoring_dashboard_config
var.project_name
var.stack_trace_bucket_config

To deploy only the monitoring dashboard, run terraform init and terraform apply inside the gcp_resources/gce/dashboard/monitoring_dashboard directory. To deploy only the logging dashboard, run the same commands inside the gcp_resources/gce/dashboard/logging_dashboard directory.
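
Instead of answering the prompts, you can supply the three variables with Terraform's -var flag. The following sketch assumes that the two configuration variables are declared with object or map types, so that Terraform parses their values as expressions. The helper name and the TERRAFORM_BIN override are hypothetical conveniences, not part of the repository.

```shell
# Hypothetical helper: runs terraform apply with all three variables set.
# $1 = project name, $2 = monitoring config, $3 = stack trace bucket config
# TERRAFORM_BIN can be set to echo for a dry run that prints the arguments.
apply_dashboards() {
  "${TERRAFORM_BIN:-terraform}" apply \
    -var="project_name=$1" \
    -var="monitoring_dashboard_config=$2" \
    -var="stack_trace_bucket_config=$3"
}

# Usage (run inside gcp_resources/gce):
#   apply_dashboards my-project '{"outlier_count":5}' '{}'
```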

Example configurations

When you run terraform apply, Terraform prompts you for some configuration variable values. This section provides some example configurations you can use. Each example shows the output displayed by the terraform apply command, followed by a suggested value for each configuration variable after the Enter a value prompt. Input must be in JSON format surrounded by curly braces ("{}").

Example 1

This example specifies the Google Cloud project name and uses the default values for the rest of the configuration.

var.monitoring_dashboard_config
    Configuration for monitoring dashboard:
    {
      "node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
      "outlier_count": <the number of outliers to show on dashboard, default to 10 if not set>
    }
    Enter {} to set default configuration for monitoring dashboard.

  Enter a value: {}

var.project_name
  Name of gcp project

  Enter a value: <gcp_project_name>

var.stack_trace_bucket_config
    Configuration to create a log bucket to store stack traces:
    {
      "bucket_name": "name of log bucket to create",
      "retention_days": <number of days to retain stack traces, default to 30 days if not set>
    }
    Enter {} to not create separate bucket for stack traces.

  Enter a value: {}

Example 2

This example configures the dashboard to display 5 outlier TPU VMs.

var.monitoring_dashboard_config
    Configuration for monitoring dashboard:
    {
      "node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
      "outlier_count": <number of outliers to show on dashboard, default to 10 if not set>
    }
    Enter {} to set default configuration for monitoring dashboard.

  Enter a value: {"outlier_count":5}

var.project_name
  Name of gcp project

  Enter a value: <gcp_project_name>

var.stack_trace_bucket_config
    Configuration to create a log bucket to store stack traces:
    {
      "bucket_name": "name of log bucket to create",
      "retention_days": <number of days to retain stack traces, default to 30 days if not set>
    }
    Enter {} to not create separate bucket for stack traces.

  Enter a value: {}

Example 3

This configuration creates a dashboard that shows 10 outlier TPU VMs using "test" as the prefix for the node names generated by the dashboard:

var.monitoring_dashboard_config
    Configuration for monitoring dashboard:
    {
      "node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
      "outlier_count": <number of outliers to show on dashboard, default to 10 if not set>
    }
    Enter {} to set default configuration for monitoring dashboard.

  Enter a value: {"node_prefix":"test"}

var.project_name
  Name of gcp project

  Enter a value: <gcp_project_name>

var.stack_trace_bucket_config
    Configuration to create a log bucket to store stack traces:
    {
      "bucket_name": "name of log bucket to create",
      "retention_days": <number of days to retain stack traces, default to 30 days if not set>
    }
    Enter {} to not create separate bucket for stack traces.

  Enter a value: {}

Example 4

This configuration creates a dashboard that shows 5 outlier VMs using "test" as the prefix for the node names generated by the dashboard.

var.monitoring_dashboard_config
    Configuration for monitoring dashboard:
    {
      "node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
      "outlier_count": "number of outliers to show on dashboard, default to 10 if not set"
    }
    Enter {} to set default configuration for monitoring dashboard.

  Enter a value: {"node_prefix":"test", "outlier_count":5}

var.project_name
  Name of gcp project

  Enter a value: <gcp_project_name>

var.stack_trace_bucket_config
    Configuration to create a log bucket to store stack traces:
    {
      "bucket_name": "name of log bucket to create",
      "retention_days": <number of days to retain stack traces, default to 30 days if not set>
    }
    Enter {} to not create separate bucket for stack traces.

  Enter a value: {}

Example 5

This example configures a log bucket to store stack traces. The stack traces use the default 30-day retention period.

var.monitoring_dashboard_config
    Configuration for monitoring dashboard:
    {
      "node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
      "outlier_count": "number of outliers to show on dashboard, default to 10 if not set"
    }
    Enter {} to set default configuration for monitoring dashboard.

  Enter a value: {}

var.project_name
  Name of gcp project

  Enter a value: <gcp_project_name>

var.stack_trace_bucket_config
    Configuration to create a log bucket to store stack traces:
    {
      "bucket_name": "name of log bucket to create",
      "retention_days": <number of days to retain stack traces, default to 30 days if not set>
    }
    Enter {} to not create separate bucket for stack traces.

  Enter a value: {"bucket_name":"stack_trace_log_bucket"}

Example 6

This example configures a log bucket to store stack traces with a retention period of 20 days.

var.monitoring_dashboard_config
    Configuration for monitoring dashboard:
    {
      "node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
      "outlier_count": "number of outliers to show on dashboard, default to 10 if not set"
    }
    Enter {} to set default configuration for monitoring dashboard.

  Enter a value: {}

var.project_name
  Name of gcp project

  Enter a value: <gcp_project_name>

var.stack_trace_bucket_config
    Configuration to create a log bucket to store stack traces:
    {
      "bucket_name": "name of log bucket to create",
      "retention_days": <number of days to retain stack traces, default to 30 days if not set>
    }
    Enter {} to not create separate bucket for stack traces.

  Enter a value: {"bucket_name":"stack_trace_log_bucket", "retention_days":20}
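
The answers in any of these examples can also be saved to a variable file, so that terraform apply does not prompt for them. The following sketch writes the Example 6 values to terraform.tfvars.json, a file name that Terraform loads automatically from the working directory. It assumes that the two configuration variables accept JSON objects, and <gcp_project_name> remains a placeholder to replace with your own project name.

```shell
# Writes the Example 6 answers to a variable file in the current directory.
# Run this inside the directory where you run terraform apply.
cat > terraform.tfvars.json <<'EOF'
{
  "project_name": "<gcp_project_name>",
  "monitoring_dashboard_config": {},
  "stack_trace_bucket_config": {
    "bucket_name": "stack_trace_log_bucket",
    "retention_days": 20
  }
}
EOF
```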