Incidents for metric-based alerts


An incident, also called an alert, is a record of the triggering of an alerting policy. Unless an alerting policy is snoozed or disabled, Cloud Monitoring opens an incident when a condition of an alerting policy is triggered. The incident contains information that you can use to investigate the cause of the notification.

This document describes how you can view, investigate, and manage incidents for metric-based alerting policies.

Before you begin

To get the permissions that you need to view incidents by using the Google Cloud console, ask your administrator to grant you the Monitoring Viewer (roles/monitoring.viewer) IAM role on your project. For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.

To get the permissions that you need to manage incidents by using the Google Cloud console, ask your administrator to grant you the Monitoring Editor (roles/monitoring.editor) IAM role on your project. For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.

For more information about Cloud Monitoring roles, see Control access with Identity and Access Management.

Finding incidents

To see a list of incidents, do the following:

  1. In the Google Cloud console toolbar, click  Navigation menu, and then select Monitoring:


  2. In the Monitoring navigation pane, select  Alerting:

    • The Summary pane lists the number of open incidents.
    • The Incidents pane displays the most recent open incidents. To list the most recent incidents in the table, including those that are closed, click Show closed incidents.
  3. Optional: To view the details of a specific incident, select the incident in the list. The Incident details page opens. For information about this page, see the Investigating incidents section of this page.

Finding older incidents

The Incidents pane on the Alerting page shows the most recent open incidents. To locate older incidents, do one of the following:

  • To page through the entries in the Incidents table, click  Newer or  Older.

  • To navigate to the Incidents page, click See all incidents. From the Incidents page, you can do all the following:

    • Show closed incidents: To list all incidents in the table, click Show closed incidents.
    • Filter incidents: For information about adding filters, see Filtering incidents.
    • Acknowledge, silence, or close an incident: To access these options, click  More options in the incident's row, and make a selection from the menu. For more information, see Managing incidents.

Filtering incidents

When you enter a value on the filter bar, only incidents that match the filter are listed in the Incidents table. If you add multiple filters, then an incident is displayed only if it satisfies all the filters.

To add a filter to the table of incidents, do the following:

  1. On the Incidents page, click  Filter table and then select a filter property. Filter properties include all the following:

    • State of the incident
    • Name of the alerting policy
    • When the incident was opened or closed
    • Metric type
    • Resource type
  2. Select a value from the secondary menu or enter a value in the filter bar.

    For example, if you select Metric type and enter usage_time, then you might see only the following options in the secondary menu:

    agent.googleapis.com/cpu/usage_time
    compute.googleapis.com/guest/container/cpu/usage_time
    container.googleapis.com/container/cpu/usage_time
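The filtering behavior described above can be illustrated with a small local model. This is a sketch only, not the console's implementation: the `IncidentRow` record, its field names, and the use of case-insensitive substring matching are assumptions made for illustration. It shows the two ideas from this section: an incident is listed only if it satisfies every filter, and entered text narrows the list of suggested values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentRow:
    # Hypothetical record; the field names are illustrative and are not
    # taken from any Cloud Monitoring API schema.
    policy_name: str
    state: str          # "open", "acknowledged", or "closed"
    metric_type: str
    resource_type: str


def filter_incidents(rows, filters):
    """Keep only the rows that satisfy every filter (logical AND).

    `filters` maps a field name to the text entered for it. Substring
    matching is an assumption of this sketch.
    """
    return [
        row for row in rows
        if all(text.lower() in getattr(row, field).lower()
               for field, text in filters.items())
    ]


def suggest_values(known_values, text):
    """Return the known values that contain the entered text, sorted."""
    return sorted(value for value in known_values if text in value)
```

With this model, adding a second filter can only shrink the result, never grow it, which matches the rule that an incident must satisfy all filters to be displayed.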
    

Investigating incidents

After you have found the incident you want to investigate, go to the Incident details page for that incident. To view the details, click on the incident summary in the table of incidents on either the Alerting page or the Incidents page.

Alternatively, if you received a notification that includes a link to the incident, then you can use that link to view the incident details.

[Screenshot of the Incident details page, which provides summary information and investigative tools for an incident.]

The Incident details page provides the following information:

  • Status information, including:

    • Name: The name of the alerting policy that caused this incident.
    • Status: The status of the incident: open, acknowledged, or closed.
    • Duration: The length of time for which the incident was open.
  • Information about the alerting policy that caused the incident:

    • Condition pane: identifies the condition in the alerting policy that caused the incident.

    • Message pane: provides a brief explanation of the cause based on the configuration of the condition in the alerting policy. This pane is always populated.

    • Documentation pane: shows the documentation template for notifications that you provided when creating the alerting policy. This information might include a description of what the alerting policy monitors and include tips for mitigation.

      If you skipped this field when creating the alerting policy, then this pane reports "No documentation is configured."

  • Labels: reports the following:
    • The labels and values for the monitored resource and metric of the time series that triggered the alerting policy. This information can help you identify the specific monitored resource that caused the incident.

      When you use variables in documentation for metric labels, Monitoring omits the label from notifications when the label value doesn't start with a digit, a letter, a forward slash (/), or an equal sign (=).

    • Any user-specified labels and values that you defined on the alerting policy. You can use these labels for organizing and identifying alerting policies. Labels associated with a policy are listed in the Policy Labels section, while labels defined as part of a condition are listed in the Metric labels section. For example usage, see Add severity levels to an alerting policy.
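The rule above about metric-label values in documentation variables can be expressed as a short check. This is a minimal sketch under stated assumptions: the function name is hypothetical, and "letter" is interpreted as an ASCII letter because the text doesn't say whether non-ASCII letters qualify.

```python
import re

# A value is kept only when its first character is a digit, a letter,
# a forward slash (/), or an equal sign (=).
# Assumption: "letter" means an ASCII letter (A-Z, a-z).
_ALLOWED_FIRST_CHAR = re.compile(r"[A-Za-z0-9/=]")


def label_kept_in_notification(value: str) -> bool:
    """Return True if a label value would be kept, False if omitted."""
    # re.match anchors at the start of the string, so only the first
    # character is tested. An empty value has no first character and
    # is therefore treated as omitted.
    return _ALLOWED_FIRST_CHAR.match(value) is not None
```

A check like this can be useful when you choose label values that you intend to reference from documentation variables: for example, a value such as `_internal` would be omitted under this rule, while `us-central1-a` would be kept.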

The Incident details page also provides tools for investigating the incident:

  • Incident timeline: Shows two visual representations of the incident:

    • A red bar above a time axis represents the incident; the length and position of the bar reflect the duration of the incident.
    • A chart shows the time-series data and threshold used by the alerting policy that caused the incident. The incident was opened when some time series met a condition of the alerting policy.

    The time axis indicates the duration of the incident with two labeled dots. The position of these dots on the time axis determines the range of data shown on the chart that accompanies the incident timeline. By default, one dot is positioned at the opening of the incident and one at the close of the incident, or at the current time if the incident is still open.

    You can modify the time range on the incident timeline and the chart:

    • To change the time range shown on the chart, drag either of the dots along the time axis. By using this technique, you can focus on specific intervals, for example, around the beginning or end of the incident.

      Changing the chart by dragging the dots on the axis sets a custom value in the Time Span menu and disables the menu. To enable the Time Span menu, click Reset.

    • To change the range of time shown on the timeline, select a range from the Time Span menu.
  • Links to other troubleshooting tools. The configuration of your project and alerting policy and the age of the incident determine which links are available.
    • To see the details page for the alerting policy, click View policy.
    • To edit the definition of the alerting policy, click Edit policy.
    • To go to a dashboard of performance information for the resource, click View resource details.
    • To see related log entries in Logs Explorer, click View logs. For more information, see Using Logs Explorer.
    • To investigate the data in the chart, click View in Metrics Explorer.
  • Annotations: Provides a log of your findings, results, suggestions, or other comments from your investigation of the incident.
    • To add an annotation, enter text in the field and click Add comment.
    • To discard the comment, click Cancel.

You can also acknowledge, silence, or close incidents from the Incident details page. For more information, see Managing incidents.

Managing incidents

Incidents are in one of the following states:

  •  Open: The policy's set of conditions is being met or there isn't data to indicate that the condition is no longer met. If a policy contains multiple conditions, then incidents are opened depending on how those conditions are combined. See Combining conditions for more information.

  •  Acknowledged: The incident is open and has manually been marked as acknowledged. Typically, this status indicates that the incident is being investigated.

  •  Closed: The system observed that the condition stopped being met, you closed the incident, or 7 days passed without an observation that the condition continued to be met.

When you configure an alerting policy, ensure that the steady state provides a signal when everything is OK. This signal is necessary so that the error-free state can be identified and, if an incident is open, so that the incident can be closed. If no signal indicates that the error condition has stopped, then an open incident stays open for 7 days after the policy fires.

For example, if you create a policy that notifies you when the count of errors is more than 0, ensure that it produces a count of 0 errors when there aren't any errors. If the policy returns null or empty in the error-free state, then there is no signal to indicate when the errors have stopped. In some situations, Monitoring Query Language (MQL) supports the ability for you to specify a default value that is used when no measured value is available. For an example, see Using ratio.
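The following toy model illustrates why an explicit zero matters. It is a sketch, not how Monitoring evaluates conditions: the function names are hypothetical, samples are plain integers, and `None` stands for an interval with no measured value. The `with_default` helper mimics the idea of supplying a default value when no measurement is available.

```python
from typing import List, Optional, Sequence


def first_all_clear(samples: Sequence[Optional[int]],
                    threshold: int = 0) -> Optional[int]:
    """Return the index of the first sample that shows the error
    condition has stopped, or None if no such signal exists.

    A missing sample (None) is not a signal in either direction.
    """
    violated = False
    for index, value in enumerate(samples):
        if value is None:
            continue
        if value > threshold:
            violated = True
        elif violated:
            return index
    return None


def with_default(samples: Sequence[Optional[int]],
                 default: int = 0) -> List[int]:
    """Replace missing samples with a default value."""
    return [default if value is None else value for value in samples]


# Error counts that are reported only when errors occur.
sparse = [None, 3, 2, None, None, None]
print(first_all_clear(sparse))                # None: no all-clear signal
print(first_all_clear(with_default(sparse)))  # 3: the first explicit zero
```

In the sparse series, nothing ever indicates that the errors stopped, so in this model the incident would have to wait for the auto-close behavior. After zeros are filled in, the all-clear is visible at the first interval without errors.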

Acknowledging incidents

We recommend that you mark an incident as acknowledged when you begin investigating the cause of the incident.

To mark an incident as acknowledged, do the following:

  1. In the Incidents pane of the Alerting dashboard, click See all incidents.
  2. On the Incidents page, find the incident that you want to acknowledge, and then do one of the following:

    • Click  More options and then select Acknowledge.
    • Open the details page for the incident and then click Acknowledge incident.

Silencing incidents

To close all open incidents associated with a condition of an alerting policy, silence one incident associated with that condition. For example, assume that an alerting policy has one condition that monitors 10 time series. The condition is met if any time series goes above a threshold of one. If five of the time series exceed the threshold, then five incidents are created. If you silence any one of these incidents, then all five incidents are closed.

Silencing an incident doesn't resolve the underlying cause of the incident. That is, if a condition for that alerting policy is met on the next alerting cycle, then a new incident for that condition is opened.

When an alerting policy contains multiple conditions, silencing an incident for one condition doesn't close any incidents that are open for the other conditions.

To silence an incident, do the following:

  1. In the Incidents pane of the Alerting dashboard, click See all incidents.
  2. On the Incidents page, find the incident that you want to silence, click  More options, and then select Silence associated condition.

Closing incidents

You can let Monitoring close an incident for you, or you can close an incident after observations stop arriving. If you close an incident and then data arrives that indicates the condition is met, then a new incident is created. Also, if you close an incident, that action doesn't close any other incidents that are open for the same condition. This behavior is different from silencing an incident, which closes all open incidents for the same condition.
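The difference between closing and silencing can be sketched with a small in-memory model. The `IncidentRecord` class and both functions are hypothetical and exist only to illustrate the scope of each action; they don't correspond to any Cloud Monitoring API.

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    # Hypothetical record used only for this illustration.
    incident_id: str
    condition: str
    state: str = "open"   # "open", "acknowledged", or "closed"


def close_incident(incidents, incident_id):
    """Close exactly one incident; other incidents are left untouched."""
    for incident in incidents:
        if incident.incident_id == incident_id:
            incident.state = "closed"


def silence_condition(incidents, incident_id):
    """Close every incident that isn't already closed and that shares
    a condition with the selected incident."""
    target = next(i for i in incidents if i.incident_id == incident_id)
    for incident in incidents:
        if incident.condition == target.condition:
            incident.state = "closed"


# Five incidents for one condition, plus one for a different condition.
incidents = [IncidentRecord(f"a{n}", "cond-a") for n in range(5)]
incidents.append(IncidentRecord("b0", "cond-b"))

close_incident(incidents, "a0")     # affects only a0
silence_condition(incidents, "a1")  # affects a1 through a4, but not b0
```

In both cases, nothing prevents a new incident from opening later if data arrives that indicates the condition is met again.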

Monitoring automatically closes an incident when any of the following occur:

  • Metric-threshold conditions:

    • An observation arrives that indicates that the threshold isn't violated.
    • No observations arrive and the condition is configured to close incidents when observations stop arriving.
    • No observations arrive for the auto-close duration of the alerting policy and the condition doesn't automatically close incidents when observations stop arriving. To configure the auto-close duration, you can use the Google Cloud console or the Cloud Monitoring API. By default, the auto-close duration is seven days. The minimum auto-close duration is 30 minutes.
  • Metric-absence conditions:

    • An observation occurs.
    • No observations arrive for 24 hours after the auto-close duration of the alerting policy expires. To configure the auto-close duration, you can use the Google Cloud console or the Cloud Monitoring API. By default, the auto-close duration is seven days.
  • Forecast conditions:

    • A forecast is produced and it predicts that the time series won't violate the threshold within the forecast window.
    • No observations arrive for 10 minutes and the condition is configured to close incidents when observations stop arriving.
    • No observations arrive for the auto-close duration of the alerting policy and the condition doesn't automatically close incidents when observations stop arriving.

For example, suppose that an alerting policy generated an incident because the HTTP response latency was greater than 2 seconds for 10 consecutive minutes. If the next measurement of the HTTP response latency is less than or equal to 2 seconds, then the incident is closed. Similarly, if no data at all is received for seven days, which is the default auto-close duration, then the incident is closed.
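The automatic-close rules listed above can be summarized as a decision function. This is a simplified sketch, not Monitoring's implementation: all names are hypothetical, and `clear_signal` stands for whichever event ends the condition (an observation that doesn't violate the threshold, any observation for a metric-absence condition, or a forecast that predicts no violation). The text gives no delay for metric-threshold conditions that close when observations stop arriving, so the zero delay used here is an assumption.

```python
from datetime import timedelta
from enum import Enum


class ConditionKind(Enum):
    THRESHOLD = "metric-threshold"
    ABSENCE = "metric-absence"
    FORECAST = "forecast"


DEFAULT_AUTO_CLOSE = timedelta(days=7)
ABSENCE_EXTRA_WAIT = timedelta(hours=24)

# Delay before "close when observations stop arriving" takes effect.
# The 10-minute forecast value comes from the text. The zero delay for
# metric-threshold conditions is an assumption of this sketch.
MISSING_DATA_DELAY = {
    ConditionKind.THRESHOLD: timedelta(0),
    ConditionKind.FORECAST: timedelta(minutes=10),
}


def should_auto_close(kind, *, clear_signal, no_data_for,
                      close_on_missing_data=False,
                      auto_close=DEFAULT_AUTO_CLOSE):
    """Return True if the incident would be closed automatically.

    no_data_for: how long no observations have arrived (timedelta).
    """
    if clear_signal:
        return True
    if kind is ConditionKind.ABSENCE:
        # Closed 24 hours after the auto-close duration expires.
        return no_data_for >= auto_close + ABSENCE_EXTRA_WAIT
    if close_on_missing_data and no_data_for > timedelta(0):
        if no_data_for >= MISSING_DATA_DELAY[kind]:
            return True
    # Otherwise, wait for the full auto-close duration without data.
    return no_data_for >= auto_close
```

For example, under these assumptions, a metric-threshold incident with no data for six days and the default settings stays open, while the same incident is closed after seven days without data.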

To close an incident, do the following:

  1. In the Incidents pane of the Alerting dashboard, click See all incidents.
  2. On the Incidents page, find the incident that you want to close, and then do one of the following:

    • Click  More options and then select Close this incident.
    • Open the details page for the incident and then click Close incident.

If you see the message Unable to close incident with active conditions, then the incident can't be closed because data has been received within the most recent alerting period.

If you see the message Unable to close incident. Please try again in a few minutes., then the incident couldn't be closed due to an internal error.

Data retention and limits

For information about limits and about the retention period of incidents, see Limits for alerting and uptime checks.

What's next