Cloud TPU monitoring and logging dashboards
The Cloud TPU monitoring and debugging dashboards repository contains all the infrastructure and logic required to monitor and debug workflows running on TPU VMs. It requires Terraform, an open source tool that uses configuration files to define and manage Google Cloud infrastructure.
Monitoring dashboard
The monitoring dashboard displays the following Cloud TPU metrics:
- CPU utilization on TPU worker: cpu/utilization
- Memory usage by TPU VM: memory/usage
- Network bytes received by a Pod over the network: network/received_bytes_count
- Network bytes sent by a Pod over the network: network/sent_bytes_count
- TensorCore idle duration of TPU chip: tpu/tensorcore/idle_duration
Cloud Monitoring automatically generates metrics data. To view the monitoring dashboard, go to the Google Cloud console, select Monitoring, and then Dashboards. Select GCE - TPU Monitoring Dashboard in the list of dashboards. The Cloud TPU monitoring dashboard is composed of a number of panes, each displaying metrics data. For each Cloud TPU metric the dashboard displays the current metric values, the mean, and the maximum values in separate panes. The mean and maximum panes enable you to identify outlier TPU VMs for further investigation.
You can also view TPU metrics using Metrics Explorer in Cloud Monitoring in the Google Cloud console. For more information, see Monitoring Cloud TPU VMs.
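When you look up these metrics in Metrics Explorer or through the Monitoring API, you need a full metric type rather than the short names listed above. The following sketch prints one filter string per dashboard metric. It assumes the metrics are published under the tpu.googleapis.com/ prefix, which this page does not state, so confirm the prefix against the metric list in your project.

```shell
# Assumption: Cloud TPU metrics are published under this prefix.
METRIC_PREFIX="tpu.googleapis.com"

# Short metric names shown on the monitoring dashboard.
TPU_METRICS="cpu/utilization memory/usage network/received_bytes_count network/sent_bytes_count tpu/tensorcore/idle_duration"

# Print a Cloud Monitoring filter string for one short metric name.
metric_filter() {
  printf 'metric.type = "%s/%s"\n' "$METRIC_PREFIX" "$1"
}

# Print one filter per dashboard metric.
for metric in $TPU_METRICS; do
  metric_filter "$metric"
done
```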
Logging dashboard
The logging dashboard displays two sections:
- Log Metrics: displays the count of stack trace entries collected for each TPU VM worker.
- Logs Panel: displays all the stack trace entries with columns for severity, timestamp, and a summary of the entry. You can filter this logs panel by a string or by TPU worker labels such as node_id and worker_id. For example, adding the filter resource.labels.node_id:"testing" displays all the trace entries collected for the TPU VM worker with node_id set to "testing".
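The same filter can also be used outside the dashboard. The following is a minimal sketch that builds the filter from the example above and wraps a gcloud logging read call in a helper function. The read_tpu_traces function name, the PROJECT_ID variable, and the limit of 20 entries are illustrative choices, not part of the dashboard repository.

```shell
# Filter from the example above. The ":" operator matches entries whose
# node_id label contains the given string.
NODE_ID="testing"
LOG_FILTER="resource.labels.node_id:\"${NODE_ID}\""

# Read the most recent matching entries with the gcloud CLI.
# PROJECT_ID is a placeholder that you must set before calling this function.
read_tpu_traces() {
  gcloud logging read "$LOG_FILTER" --project="$PROJECT_ID" --limit=20
}

# Show the filter that read_tpu_traces would send.
echo "$LOG_FILTER"
```

Call read_tpu_traces after setting PROJECT_ID to list the entries in your terminal.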
Cloud Logging collects log entries generated by Google Cloud services, third-party services, ML frameworks, and your code. You can install the cloud-tpu-diagnostics PyPI package to write Python stack traces when a fault or exception occurs or when your workload is unresponsive. For more information, see Debugging Cloud TPU VMs.
To view the logging dashboard, go to the Google Cloud console, select Monitoring, and then Dashboards. Select GCE - TPU Logging Dashboard in the list of dashboards.
Configure the metrics and logging dashboards
Follow these instructions to configure the metrics and logging dashboards.
Clone the dashboard repository
You can clone the repository directly from the Cloud TPU Monitoring and Debugging GitHub repository or from the command line:
git clone https://github.com/google/cloud-tpu-monitoring-debugging.git
Install Terraform
- Install Terraform on your local machine.
- Run terraform init to add the necessary plugins and create the .terraform directory.
- Run terraform init -upgrade to install any available updates.
Configure permissions for monitoring and logging
The Monitoring Admin, Monitoring Dashboard Configuration Editor, and Logging Admin roles are required to deploy monitoring resources to your Google Cloud project. For more information on monitoring roles, see Grant access to Cloud Monitoring.
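If you administer the project, one way to grant these roles is with gcloud projects add-iam-policy-binding. The following sketch assumes that the three roles named above map to the role IDs shown in the DASHBOARD_ROLES variable. The grant_dashboard_roles function and the example member are hypothetical; verify the role IDs in the IAM documentation before running it.

```shell
# Assumption: role IDs for Monitoring Admin, Monitoring Dashboard
# Configuration Editor, and Logging Admin, in that order.
DASHBOARD_ROLES="roles/monitoring.admin roles/monitoring.dashboardEditor roles/logging.admin"

# Grant every role in DASHBOARD_ROLES to one principal.
# $1: project ID
# $2: member, for example user:alice@example.com (hypothetical)
grant_dashboard_roles() {
  for role in $DASHBOARD_ROLES; do
    gcloud projects add-iam-policy-binding "$1" \
      --member="$2" --role="$role" || return 1
  done
}
```

For example, grant_dashboard_roles my-project user:alice@example.com adds all three bindings and stops at the first failure.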
Create a Cloud Storage bucket
Create a Cloud Storage bucket to store Terraform state files. You can also use an existing Cloud Storage bucket. Using a Cloud Storage bucket ensures that multiple users can run Terraform at the same time and access the same infrastructure.
In Cloud Shell, create the Cloud Storage bucket:
gcloud storage buckets create gs://${GCS_BUCKET_NAME}
Enable Object Versioning to keep the history of your deployments. Enabling Object Versioning increases storage costs, which you can mitigate by configuring Object Lifecycle Management to delete old state versions.
gcloud storage buckets update gs://${GCS_BUCKET_NAME} --versioning
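To limit the storage cost of versioned state files, you can attach an Object Lifecycle Management policy to the bucket. The following sketch writes a policy file that deletes a noncurrent object version once five newer versions exist. The threshold of 5, the lifecycle.json file name, and the apply_lifecycle function are illustrative assumptions; choose values that match how much deployment history you want to keep.

```shell
# Hypothetical policy: delete a noncurrent state version once 5 newer
# versions of the same object exist.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"isLive": false, "numNewerVersions": 5}
    }
  ]
}
EOF

# Apply the policy to the state bucket created above.
apply_lifecycle() {
  gcloud storage buckets update "gs://${GCS_BUCKET_NAME}" \
    --lifecycle-file=lifecycle.json
}
```

Call apply_lifecycle after GCS_BUCKET_NAME is set to the name of your state bucket.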
Initialize Terraform
Run terraform init
inside the gcp_resources/gce
directory of the repository.
The command prompts you to enter the name of a Cloud Storage bucket to store
Terraform state files.
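To skip the prompt, for example in a script, you can pass the bucket name with Terraform's -backend-config option. This sketch assumes that the repository's backend block leaves the bucket attribute unset, which is what the prompt suggests. The default bucket name and the init_with_bucket function are placeholders.

```shell
# Use the bucket created earlier, or fall back to a placeholder name.
GCS_BUCKET_NAME="${GCS_BUCKET_NAME:-my-terraform-state-bucket}"

# Backend setting passed on the command line instead of at the prompt.
BACKEND_ARG="-backend-config=bucket=${GCS_BUCKET_NAME}"

# Run this function from inside the gcp_resources/gce directory.
init_with_bucket() {
  terraform init "$BACKEND_ARG"
}
```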
Deploy the monitoring and logging dashboards
Run terraform apply
inside the gcp_resources/gce
directory to deploy the
monitoring and logging dashboards in your Google Cloud project. The command
prompts you to provide values for the following variables:
var.monitoring_dashboard_config
var.project_name
var.stack_trace_bucket_config
If you would like to deploy only the monitoring dashboard, run terraform init
and terraform apply
inside the gcp_resources/gce/dashboard/monitoring_dashboard
directory. Similarly, run these commands inside the
gcp_resources/gce/dashboard/logging_dashboard
directory to deploy only the
logging dashboard.
Example configurations
When you run terraform apply, Terraform prompts you for some configuration
variable values. This section provides some example configurations you can use.
Each example shows the prompts displayed by the terraform apply command, with a
suggested value after each "Enter a value:" prompt. Values for
var.stack_trace_bucket_config and var.monitoring_dashboard_config must be in
JSON format surrounded by curly braces ("{}").
Example 1
This example specifies the Google Cloud project name and uses the default values for the rest of the configuration.
var.project_name
Name of gcp project
Enter a value: <gcp_project_name>
var.stack_trace_bucket_config
Configuration to create a log bucket to store stack traces:
{
"bucket_name": "name of log bucket to create",
"retention_days": <number of days to retain stack traces, default to 30 days if not set>
}
Enter {} to not create separate bucket for stack traces.
Enter a value: {}
var.monitoring_dashboard_config
Configuration for monitoring dashboard:
{
"node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
"outlier_count": <the number of outliers to show on dashboard, default to 10 if not set>
}
Enter {} to set default configuration for monitoring dashboard.
Enter a value: {}
Example 2
This example configures the dashboard to display 5 outlier TPU VMs.
var.project_name
Name of gcp project
Enter a value: <gcp_project_name>
var.stack_trace_bucket_config
Configuration to create a log bucket to store stack traces:
{
"bucket_name": "name of log bucket to create",
"retention_days": <number of days to retain stack traces, default to 30 days if not set>
}
Enter {} to not create separate bucket for stack traces.
Enter a value: {}
var.monitoring_dashboard_config
Configuration for monitoring dashboard:
{
"node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
"outlier_count": <number of outliers to show on dashboard, default to 10 if not set>
}
Enter {} to set default configuration for monitoring dashboard.
Enter a value: {"outlier_count":5}
Example 3
This configuration creates a dashboard that shows 10 outlier TPU VMs (the default) using "test" as the prefix for the node names generated by the dashboard:
var.project_name
Name of gcp project
Enter a value: <gcp_project_name>
var.stack_trace_bucket_config
Configuration to create a log bucket to store stack traces:
{
"bucket_name": "name of log bucket to create",
"retention_days": <number of days to retain stack traces, default to 30 days if not set>
}
Enter {} to not create separate bucket for stack traces.
Enter a value: {}
var.monitoring_dashboard_config
Configuration for monitoring dashboard:
{
"node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
"outlier_count": <number of outliers to show on dashboard, default to 10 if not set>
}
Enter {} to set default configuration for monitoring dashboard.
Enter a value: {"node_prefix":"test"}
Example 4
This configuration creates a dashboard that shows 5 outlier VMs using "test" as the prefix for the node names generated by the dashboard.
var.project_name
Name of gcp project
Enter a value: <gcp_project_name>
var.stack_trace_bucket_config
Configuration to create a log bucket to store stack traces:
{
"bucket_name": "name of log bucket to create",
"retention_days": <number of days to retain stack traces, default to 30 days if not set>
}
Enter {} to not create separate bucket for stack traces.
Enter a value: {}
var.monitoring_dashboard_config
Configuration for monitoring dashboard:
{
"node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
"outlier_count": "number of outliers to show on dashboard, default to 10 if not set"
}
Enter {} to set default configuration for monitoring dashboard.
Enter a value: {"node_prefix":"test", "outlier_count":5}
Example 5
This example configures a log bucket to store stack traces. The stack traces have a default 30-day retention policy.
var.project_name
Name of gcp project
Enter a value: <gcp_project_name>
var.stack_trace_bucket_config
Configuration to create a log bucket to store stack traces:
{
"bucket_name": "name of log bucket to create",
"retention_days": <number of days to retain stack traces, default to 30 days if not set>
}
Enter {} to not create separate bucket for stack traces.
Enter a value: {"bucket_name":"stack_trace_log_bucket"}
var.monitoring_dashboard_config
Configuration for monitoring dashboard:
{
"node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
"outlier_count": "number of outliers to show on dashboard, default to 10 if not set"
}
Enter {} to set default configuration for monitoring dashboard.
Enter a value: {}
Example 6
This example configures a log bucket for stack traces with a retention period of 20 days.
var.project_name
Name of gcp project
Enter a value: <gcp_project_name>
var.stack_trace_bucket_config
Configuration to create a log bucket to store stack traces:
{
"bucket_name": "name of log bucket to create",
"retention_days": <number of days to retain stack traces, default to 30 days if not set>
}
Enter {} to not create separate bucket for stack traces.
Enter a value: {"bucket_name":"stack_trace_log_bucket", "retention_days":20}
var.monitoring_dashboard_config
Configuration for monitoring dashboard:
{
"node_prefix": "prefix used to generate the node name in multislice node provision, default to empty string",
"outlier_count": "number of outliers to show on dashboard, default to 10 if not set"
}
Enter {} to set default configuration for monitoring dashboard.
Enter a value: {}
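Instead of typing these values at each prompt, you can keep them in a variables file. The following sketch combines the answers from Example 4 and Example 6 in a terraform.tfvars.json file, which Terraform loads automatically from the directory where you run terraform apply. It assumes the three variables accept the same JSON objects that the prompts accept. Replace the <gcp_project_name> placeholder with your project name.

```shell
# Write the answers from Example 4 and Example 6 to a variables file.
# Create this file in the directory where you run terraform apply.
cat > terraform.tfvars.json <<'EOF'
{
  "project_name": "<gcp_project_name>",
  "stack_trace_bucket_config": {
    "bucket_name": "stack_trace_log_bucket",
    "retention_days": 20
  },
  "monitoring_dashboard_config": {
    "node_prefix": "test",
    "outlier_count": 5
  }
}
EOF
```

With this file in place, terraform apply should no longer prompt for the variables it defines.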