Sample Policies

This page provides a cookbook of specific alerting policies. These can be used for inspiration and to bootstrap policies of your own design.

Generating JSON or YAML

You can represent alerting policies in two data formats, JSON and YAML. The Cloud SDK can read and write both formats; the REST API accepts JSON.

To generate representations of your existing alerting policies and notification channels in YAML (the default output format), use the gcloud alpha monitoring policies list and describe commands, or the gcloud alpha monitoring channels list and describe commands, respectively.

For example, this command retrieves a single policy and captures the output in the file test-policy.yaml:

gcloud alpha monitoring policies describe projects/a-gcp-project/alertPolicies/12669073143329903307 > test-policy.yaml

To generate JSON representations of your existing alerting policies and notification channels, add the standard --format=json flag to the same commands.
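
For example, this command retrieves the same policy as above and captures the output as JSON in the file test-policy.json (an illustrative filename):

gcloud alpha monitoring policies describe projects/a-gcp-project/alertPolicies/12669073143329903307 --format=json > test-policy.json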

Creating new policies from existing ones

As illustrated in the backup/restore example, you can use saved policies to create new instances of them, or as a starting point for creating similar policies. Before using saved policies as the basis for new ones, make the following edits:

  • Remove the following fields from any notification channels:
    • name
    • verificationStatus
  • Create notification channels before referring to them in alerting policies
  • Remove the following fields from any alerting policies:
    • name
    • condition.name
    • creationRecord
    • mutationRecord
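
After making these edits, you can create the new channels and policies from the saved files with the Cloud SDK. The following is a sketch only, assuming the edited channel and policy are saved as channel.yaml and test-policy.yaml, and that your Cloud SDK version supports the --channel-content-from-file and --policy-from-file flags:

gcloud alpha monitoring channels create --channel-content-from-file=channel.yaml
gcloud alpha monitoring policies create --policy-from-file=test-policy.yaml

Create the channels first, note the name values the service assigns to them, and reference those names in each policy's notificationChannels field before creating the policy.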

Policy samples

The policies here are organized using the same terminology that the Stackdriver Monitoring console uses, for example, “rate-of-change policy”, but there are really only two types of conditions underlying all these classifications:

  • A threshold condition; almost all of the policy types mentioned in the UI are variants on a threshold condition
  • An absence condition

In the samples here, these are indicated by the conditionThreshold and conditionAbsent fields. See the reference page for Condition for more information.

Metric-threshold policy

A metric-threshold policy is one that detects when some value crosses a predetermined boundary. Threshold policies let you know that something is approaching an important point, so you can take some action. For example, an alert when available disk space falls below 10 percent of total disk space warns you that your system may run out of disk space soon.

The following policy uses average CPU usage as an indicator of the health of a group of VMs. It causes an alert when the average CPU utilization, grouped by project, instance, and zone and measured over 60-second intervals, exceeds a threshold of 90 percent for 15 minutes (900 seconds):

{
    "displayName": "Very high CPU usage",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "CPU usage is extremely high",
            "conditionThreshold": {
                "aggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_MEAN",
                        "groupByFields": [
                            "project",
                            "resource.label.instance_id",
                            "resource.label.zone"
                        ],
                        "perSeriesAligner": "ALIGN_MAX"
                    }
                ],
                "comparison": "COMPARISON_GT",
                "duration": "900s",
                "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\"
                           AND resource.type=\"gce_instance\"",
                "thresholdValue": 0.9,
                "trigger": {
                    "count": 1
                }
            }
        }
    ]
}
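
Because the REST API accepts JSON, one way to create this policy is to POST it to the alertPolicies.create method. The following curl invocation is a sketch, assuming the policy is saved in a file named cpu-policy.json with the wrapped filter string joined onto a single line so that the file is valid JSON:

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -d @cpu-policy.json \
     "https://monitoring.googleapis.com/v3/projects/a-gcp-project/alertPolicies"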

For another example, see Metric threshold.

Metric-absence policy

A metric-absence policy is triggered when no data is written to a metric for the specified duration.

One way to demonstrate this is to create a custom metric that nothing ever writes to. You don't need a custom metric for this kind of policy, but using one makes it easy, for demonstration purposes, to ensure that nothing actually writes to it.

Here's a sample descriptor for a custom metric. You could create the metric using the APIs Explorer.

{
  "description": "Number of times the pipeline has run",
  "displayName": "Pipeline runs",
  "metricKind": "GAUGE",
  "type": "custom.googleapis.com/pipeline_runs",
  "labels": [
    {
      "description": "The name of the pipeline",
      "key": "pipeline_name",
      "valueType": "STRING"
    }
  ],
  "unit": "1",
  "valueType": "INT64"
}
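
As an alternative to the APIs Explorer, you could create the descriptor by sending it to the metricDescriptors.create method. The following curl invocation is a sketch, assuming the descriptor above is saved in a file named pipeline_runs.json:

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -d @pipeline_runs.json \
     "https://monitoring.googleapis.com/v3/projects/a-gcp-project/metricDescriptors"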

See Using Custom Metrics for more information.

The following alerting policy is triggered if no data is written to this metric for slightly more than an hour (3900 seconds); in other words, your hourly pipeline has failed to run. Note that the condition used here is conditionAbsent.

{
    "displayName": "Data ingestion functioning",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Hourly pipeline is up",
            "conditionAbsent": {
                "duration": "3900s",
                "filter": "resource.type=\"global\"
                           AND metric.type=\"custom.googleapis.com/pipeline_runs\"
                           AND metric.label.pipeline_name=\"hourly\""
            }
        }
    ]
}
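
For context, this condition never triggers as long as the pipeline keeps writing points to the metric. Each run of the hourly pipeline could record itself by POSTing a single point to the timeSeries.create method (https://monitoring.googleapis.com/v3/projects/a-gcp-project/timeSeries). A sketch of such a request body, with a placeholder timestamp:

{
  "timeSeries": [
    {
      "metric": {
        "type": "custom.googleapis.com/pipeline_runs",
        "labels": { "pipeline_name": "hourly" }
      },
      "resource": {
        "type": "global",
        "labels": { "project_id": "a-gcp-project" }
      },
      "points": [
        {
          "interval": { "endTime": "2018-06-01T12:00:00Z" },
          "value": { "int64Value": "1" }
        }
      ]
    }
  ]
}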

For another example, see Metric absence.

Rate-of-change policy

This policy alerts you when CPU utilization is increasing at a high rate:

{
  "displayName": "High CPU rate of change",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU usage is increasing at a high rate",
      "conditionThreshold": {
         "aggregations": [
           {
             "alignmentPeriod": "900s",
             "perSeriesAligner": "ALIGN_PERCENT_CHANGE",
           }],
        "comparison": "COMPARISON_GT",
        "duration": "180s",
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
        "thresholdValue": 0.5,
        "trigger": {
          "count": 1
         }
      }
    }
  ]
}

For another example, see Metric rate of change.

Group-aggregate policy

This policy alerts you when the average CPU utilization across the resource group called “GKE cluster” exceeds a threshold:

{
    "displayName": "CPU utilization across GKE cluster exceeds 10 percent",
    "combiner": "OR",
    "conditions": [
         {
            "displayName": "Group Aggregate Threshold across All Instances in Group GKE cluster",
            "conditionThreshold": {
                "filter": "group.id=\"3691870619975147604\" AND metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.1,
                "duration": "300s",
                "trigger": {
                    "count": 1
                },
                "aggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_MEAN",
                        "crossSeriesReducer": "REDUCE_MEAN",
                        "groupByFields": [
                              "project",
                              "resource.label.instance_id",
                              "resource.label.zone"
                        ]
                    },
                    {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_SUM",
                        "crossSeriesReducer": "REDUCE_MEAN"
                    }
                ]
            }
        }
    ]
}

This policy assumes the existence of the following group:

{
    "name": "projects/a-gcp-project/groups/3691870619975147604",
    "displayName": "GKE cluster",
    "filter": "resource.metadata.name=starts_with(\"gke-kuber-cluster-default-pool-6fe301a0-\")"
}
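
The numeric group ID in the policy's filter, 3691870619975147604, is the last component of the group's name field. If you need to look up a group ID, one way is to list the project's groups with the groups.list method, for example:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     "https://monitoring.googleapis.com/v3/projects/a-gcp-project/groups"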

For another example, see Group-aggregate threshold.

Uptime-check policy

The status of uptime checks appears on the Stackdriver Monitoring console, but you can use an alerting policy to notify you directly if the uptime check fails.

For example, the following JSON describes an uptime check on the Google Cloud site. It checks the site's availability every 5 minutes.

The uptime check was created with the Stackdriver Monitoring console. The JSON representation here was created by listing the uptime checks in the project using the API; see uptimeCheckConfigs.list. You can also create uptime checks with the API.

{
    "name": "projects/a-gcp-project/uptimeCheckConfigs/uptime-check-for-google-cloud-site",
    "displayName": "Uptime check for Google Cloud site",
    "monitoredResource": {
        "type": "uptime_url",
        "labels": {
            "host": "cloud.google.com"
      }
    },
    "httpCheck": {
        "path": "/index.html",
        "port": 80,
        "authInfo": {}
    },
    "period": "300s",
    "timeout": "10s",
    "contentMatchers": [
        {}
    ]
}
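
If you prefer the API to the console, you could create a similar check by sending this configuration to the uptimeCheckConfigs.create method. The following curl invocation is a sketch, assuming the configuration is saved in a file named uptime-check.json with the name field removed (the service assigns the name when the check is created):

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -d @uptime-check.json \
     "https://monitoring.googleapis.com/v3/projects/a-gcp-project/uptimeCheckConfigs"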

To create an alerting policy for an uptime check, you refer to the uptime check by UPTIME_CHECK_ID. This ID is set when the check is created; it appears as the last component of the name field. It is not visible in the UI. If you are using the API, it is returned by the uptimeCheckConfigs.create method.

The ID is derived from the displayName, which in this case was set in the UI. This can be verified by listing the uptime checks and looking at the name value.

In this uptime check, the ID is uptime-check-for-google-cloud-site.
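
To verify the ID, list the project's uptime checks with the uptimeCheckConfigs.list method and inspect the name field of each configuration, for example:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     "https://monitoring.googleapis.com/v3/projects/a-gcp-project/uptimeCheckConfigs"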

The alerting policy below will be triggered if the uptime check fails, and it will send a notification to the specified notification channel:

{
    "displayName": "Google Cloud site uptime failure",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Failure of uptime check_id uptime-check-for-google-cloud-site",
            "conditionThreshold": {
                "aggregations": [
                    {
                        "alignmentPeriod": "1200s",
                        "perSeriesAligner": "ALIGN_NEXT_OLDER",
                        "crossSeriesReducer": "REDUCE_COUNT_FALSE",
                        "groupByFields": [ "resource.label.*" ]
                    }
                ],
                "comparison": "COMPARISON_GT",
                "duration": "600s",
                "filter": "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\"
                           AND metric.label.check_id=\"uptime-check-for-google-cloud-site\"
                           AND resource.type=\"uptime_url\"",
                "thresholdValue": 1,
                "trigger": {
                    "count": 1
                }
            }
        }
    ],
    "notificationChannels": [
        "projects/a-gcp-project/notificationChannels/2798987108321357979"
    ]
}

Note how the policy specifies the uptime check to monitor: in the condition's filter, the check_id metric label carries the value of the UPTIME_CHECK_ID.

AND metric.label.check_id=\"uptime-check-for-google-cloud-site\"

See Monitoring Filters for more information.

For another example, see Uptime check health.

Process-health policy

A process-health policy can notify you if the number of processes that match some pattern crosses a threshold. This can be used to tell you, for example, that a process has stopped running.

This policy sends a notification to the specified notification channel when, for more than 5 minutes, no process whose name contains the string nginx is running as user www:

{
    "displayName": "Server health",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Process 'nginx' is not running",
            "conditionThreshold": {
                "filter": "select_process_count(\"has_substring(\\\"nginx\\\")\", \"www\") AND resource.type=\"gce_instance\"",
                "comparison": "COMPARISON_LT",
                "thresholdValue": 1,
                "duration": "300s"
            }
        }
    ],
    "notificationChannels": [
        "projects/a-gcp-project/notificationChannels/16476255324959532809"
    ]
}

For another example, see Process health.
