Incidents for metric-based alerting policies

An incident is a record of when the condition or conditions of an alerting policy are met. If an alerting policy contains multiple conditions, then the alerting policy specifies whether meeting one condition is sufficient to cause an incident to be created. Typically, when conditions are met, Cloud Monitoring opens an incident and sends a notification. However, incidents aren't created under the following circumstances:

The policy is snoozed or disabled.
The number of alerting policies or incidents exceeds existing limits for alerting.

For each incident, Monitoring creates an Incident details page where you can manage the incident and review information that can help you troubleshoot the failure. For example, the Incident details page shows the incident timeline and a chart of the metric data being monitored. You can also find links to related incidents and log entries.

This document describes how you can find your incidents. It also describes how you can use the Incident details page to manage incidents for metric-based alerting policies, which evaluate time-series data stored by Cloud Monitoring.

This feature is supported only for Google Cloud projects. For App Hub configurations, select the App Hub host project or management project.

Before you begin

To get the permissions that you need to view and manage incidents by using the Google Cloud console, ask your administrator to grant you the following IAM roles on your project:

View incidents by using the Google Cloud console:
- Monitoring Cloud Console Incident Viewer (roles/monitoring.cloudConsoleIncidentViewer)
- Stackdriver Accounts Viewer (roles/stackdriver.accounts.viewer)
Manage incidents by using the Google Cloud console:
- Monitoring Cloud Console Incident Editor (roles/monitoring.cloudConsoleIncidentEditor)
- Stackdriver Accounts Viewer (roles/stackdriver.accounts.viewer)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

For more information about Cloud Monitoring roles, see Control access with Identity and Access Management.

Find incidents

To see a list of incidents in your Google Cloud project, do the following:

In the Google Cloud console, go to the Alerting page:
Go to Alerting

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the toolbar of the Google Cloud console, select your Google Cloud project. For App Hub configurations, select the App Hub host project or management project.

The Alerting page displays information about your alerting policies, snoozes, and incidents:
- The Summary pane lists the number of open incidents.
- The Incidents pane displays the most recent open incidents. To list the most recent incidents in the table, including those that are closed, click Show closed incidents.
To view the details of a specific incident, select the incident in the list.

The Incident details page opens. For more information about the Incident details page, see the Investigate an incident section of this page.

Find older incidents

The Incidents pane on the Alerting page shows the most recent open incidents. To locate older incidents, do one of the following:

To page through the entries in the Incidents table, click Newer or Older.
To navigate to the Incidents page, click See all incidents. From the Incidents page, you can do all the following:
- Show closed incidents: To list all incidents in the table, click Show closed incidents.
- Filter incidents: For information about adding filters, see Filter incidents.
- Acknowledge or close an incident, or snooze its alerting policy. To access these options, click More options in the incident's row, and make a selection from the menu. For more information, see Manage incidents.

Filter incidents

When you enter a value on the filter bar, only incidents that match the filter are listed in the Incidents table. If you add multiple filters, then an incident is displayed only if it satisfies all the filters.

To add a filter to the table of incidents, do the following:

On the Incidents page, click Filter table and then select a filter property. Filter properties include all the following:
- State of the incident
- Name of the alerting policy
- When the incident was opened or closed
- Metric type
- Resource type
Select a value from the secondary menu or enter a value in the filter bar.
For example, if you select Metric type and enter usage_time, then you might see only the following options in the secondary menu:
```
agent.googleapis.com/cpu/usage_time
compute.googleapis.com/guest/container/cpu/usage_time
container.googleapis.com/container/cpu/usage_time
```
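
The filter behavior described in this section can be sketched in a few lines of Python. This is only an illustration: the `Incident` record, its field names, and the `apply_filters` helper are hypothetical and aren't part of any Cloud Monitoring API. The sketch also assumes that a metric-type filter behaves like a substring match, which the `usage_time` example suggests but this document doesn't state.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Incident:
    # Hypothetical record; the field names are illustrative only.
    state: str          # "open", "acknowledged", or "closed"
    policy_name: str
    metric_type: str


Filter = Callable[[Incident], bool]


def apply_filters(incidents: Iterable[Incident], filters: List[Filter]) -> List[Incident]:
    """Keep only the incidents that satisfy every filter (logical AND)."""
    return [i for i in incidents if all(f(i) for f in filters)]


incidents = [
    Incident("open", "High CPU", "agent.googleapis.com/cpu/usage_time"),
    Incident("closed", "High CPU", "container.googleapis.com/container/cpu/usage_time"),
    Incident("open", "Disk full", "agent.googleapis.com/disk/percent_used"),
]

# Assumption: the metric-type filter is modeled as a substring match.
by_metric: Filter = lambda i: "usage_time" in i.metric_type
by_state: Filter = lambda i: i.state == "open"

# Only the first incident satisfies both filters.
matching = apply_filters(incidents, [by_metric, by_state])
```

Adding a filter can only narrow the list, because an incident must satisfy every filter to be displayed.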

Investigate an incident

The Incident details page contains information that can help you identify the cause of an incident.

Explore metric data

To analyze the state of your metric before and after your incident occurred, use the Alert Metrics chart. This chart shows a timeline and the time series that caused the condition of your alerting policy to be met.

You can adjust the range of the timeline to look for trends and patterns in your metric data relative to the incident:

To toggle between showing only the time series that caused the condition to be met and showing all the time series that the condition evaluates, click Show all timeseries.
To change the time range displayed by the chart, you can use the time-range selector in the toolbar, or highlight time ranges on the chart with your pointer.

You can also analyze your metric data in greater detail by viewing it in the Metrics Explorer. To do so, go to the Alert Metrics chart and then click Explore Data. By default, the Metrics Explorer aggregates and filters metric data so that the metric chart aligns with the time series shown on the Alert Metrics timeline.

Explore log entries

The Logs pane on the Incident details page shows log entries that match the resource type and labels of the monitored resource for your metric. You can analyze these log entries to find additional information that might help you troubleshoot your incident.

To view the log entries in the Logs Explorer, click View in Logs Explorer, and then select a scoping project. The Logs Explorer provides additional tools to analyze log entry data, such as a timeline of when related log entries were created.
To view and edit the query used to filter the log entries in the Metrics Explorer, click Explore Data.

View application information

For alerting policies associated with an App Hub application, go to the Associated with application section. One entry in this section lists the application ID and links to a dashboard displaying information about the application. The second entry lists either a workload or service, and links to a dashboard.

View supplementary information

The Labels section shows the labels and values for the monitored resource and metric of the time series that caused the incident, as well as user labels defined in the alerting policy. This information might help you identify the specific monitored resource that caused the incident. For more information, see Annotate incidents with labels.

The Documentation section shows the documentation template for notifications that you provided when creating the alerting policy. This information might include a description of what the alerting policy monitors and include tips for mitigation. For more information, see Annotate notifications with user-defined documentation.

If you didn't configure documentation for your alerting policy, then the Documentation pane shows "No documentation is configured."

To help you discover underlying issues across your application, you can explore incidents related to other alerting policy conditions.

The Related incidents section shows a list of incidents that match one of the following:

The incident was created when a condition of the same alerting policy was met.
The incident shares a label with the incident shown on the Incident details page.
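
The two matching rules above can be expressed as a small predicate. The following Python sketch is an illustration only: the `is_related` function and its parameters are hypothetical, and it assumes that "shares a label" means that both incidents have the same label key with the same value.

```python
from typing import Dict


def is_related(
    shown_policy: str,
    shown_labels: Dict[str, str],
    other_policy: str,
    other_labels: Dict[str, str],
) -> bool:
    """Return True if the other incident would be listed as related."""
    # Rule 1: the incident comes from a condition of the same alerting policy.
    same_policy = shown_policy == other_policy
    # Rule 2 (assumption): at least one label key has the same value in both.
    shares_label = any(other_labels.get(k) == v for k, v in shown_labels.items())
    return same_policy or shares_label
```

Because either rule is sufficient, incidents from unrelated alerting policies can still appear in the list when they share a label, such as a zone or an instance ID.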

Manage incidents

Incidents are in one of the following states:

Open: The alerting policy's set of conditions is being met, or there isn't data to indicate that the conditions are no longer met. If an alerting policy contains multiple conditions, then incidents are opened depending on how those conditions are combined. For more information, see Policies with multiple conditions.
Acknowledged: The incident is open and has manually been marked as acknowledged. Typically, this status indicates that the incident is being investigated.
Closed: The system observed that the condition stopped being met, you closed the incident, or 7 days passed without an observation that the condition continued to be met.
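
One way to picture these states is as a small state machine. The following Python sketch is a simplified model, not Monitoring's implementation: the event names `acknowledge` and `close` are hypothetical, and `close` stands for any of the closing causes listed above.

```python
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    CLOSED = "closed"


# Hypothetical event names. "close" covers a manual close, an observation
# that the condition stopped being met, and the 7-day expiry.
TRANSITIONS = {
    (IncidentState.OPEN, "acknowledge"): IncidentState.ACKNOWLEDGED,
    (IncidentState.OPEN, "close"): IncidentState.CLOSED,
    (IncidentState.ACKNOWLEDGED, "close"): IncidentState.CLOSED,
}


def next_state(state: IncidentState, event: str) -> IncidentState:
    """Return the state after an event; other combinations leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```

In this model, a closed incident has no outgoing transitions. That matches the behavior described later in this document: if the condition is met again after an incident is closed, a new incident is created.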

When you configure an alerting policy, ensure that the steady state provides a signal when everything is OK. This signal is necessary so that the error-free state can be identified and, if an incident is open, so that the incident can be closed. If no signal indicates that the error condition has stopped, then an open incident stays open for 7 days after the alerting policy fires.

For example, if you create an alerting policy that notifies you when the count of errors is more than 0, ensure that it produces a count of 0 errors when there aren't any errors. If the alerting policy returns null or empty in the error-free state, then there is no signal to indicate when the errors have stopped. If necessary, PromQL lets you specify a default value that is used when no measured value is available.
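
The difference between "no errors" and "no signal" can be illustrated with a short Python sketch. It only models the idea and doesn't query Cloud Monitoring or run PromQL. The helper names `condition_met` and `fill_missing` are hypothetical, and `None` stands for a point in time with no measured value.

```python
from typing import List, Optional


def condition_met(error_count: Optional[int]) -> Optional[bool]:
    """Evaluate `count > 0`; return None when there is no measurement."""
    if error_count is None:
        return None  # No signal: can't tell whether the errors have stopped.
    return error_count > 0


def fill_missing(samples: List[Optional[int]], default: int = 0) -> List[int]:
    """Replace missing measurements with a default value."""
    return [default if s is None else s for s in samples]


raw = [3, 1, None, None]  # Errors occur, then the metric goes silent.

# Without a default, the last two points give no signal that errors stopped.
without_default = [condition_met(s) for s in raw]

# With a default of 0, the last two points are an explicit "all clear".
with_default = [condition_met(s) for s in fill_missing(raw)]
```

In the first case, the final evaluations are `None`, so nothing indicates that the incident can be closed. In the second case, they are `False`, which is the kind of explicit signal that lets an open incident close.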

Acknowledge incidents

We recommend that you mark an incident as acknowledged when you begin investigating the cause of the incident.

To mark an incident as acknowledged, do the following:

In the Incidents pane of the Alerting page, click See all incidents.
On the Incidents page, find the incident that you want to acknowledge, and then do one of the following:
- Click More options and then select Acknowledge.
- Open the details page for the incident and then click Acknowledge incident.

If your alerting policy is configured to send repeated notifications, then acknowledging an incident doesn't stop the notifications. To stop them, do one of the following:

Create a snooze for the alerting policy.
Disable the alerting policy.

Snooze an alerting policy

To prevent Monitoring from creating incidents and sending notifications during a specific time period, snooze the related alerting policy. When you snooze an alerting policy, Monitoring also closes all incidents related to the alerting policy.

To create a snooze for an incident that you are viewing, do the following:

On the Incident details page, click Snooze Policy.
Select the snooze duration. After you select the snooze duration, the snooze begins immediately.

You can also snooze an alerting policy from the Incidents page by finding the incident that you want to snooze, clicking More options, and then selecting Snooze. You can snooze alerting policies during outages to prevent further notifications during the troubleshooting process.
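
If you prefer to script snoozes instead of using the console, the Cloud Monitoring API has a snooze resource. The following Python sketch only builds a request body and doesn't send it. The field names (`displayName`, `criteria.policies`, `interval.startTime`, `interval.endTime`) reflect our understanding of that resource and should be verified against the API reference before use. The `build_snooze_body` helper and the project and policy IDs in the usage lines are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_snooze_body(
    policy_name: str,
    duration: timedelta,
    display_name: str,
    now: Optional[datetime] = None,
) -> dict:
    """Build a snooze body that starts now and lasts for `duration`.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    start = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    end = start + duration
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamp in UTC
    return {
        "displayName": display_name,
        # Assumed field: full resource names of the policies to snooze.
        "criteria": {"policies": [policy_name]},
        "interval": {
            "startTime": start.strftime(fmt),
            "endTime": end.strftime(fmt),
        },
    }


# Hypothetical project and policy IDs.
body = build_snooze_body(
    "projects/my-project/alertPolicies/1234567890",
    timedelta(hours=2),
    "Snooze during outage investigation",
)
```

As with a snooze created in the console, the interval in this sketch starts immediately.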

Close incidents

You can let Monitoring close an incident for you, or you can close an incident after observations stop arriving. If you close an incident and then data arrives that indicates the condition is met, then a new incident is created. When you close an incident, that action doesn't close any other incidents that are open for the same condition. If you snooze an alerting policy, then open incidents are closed when the snooze starts.

Monitoring automatically closes an incident when any of the following occur:

Metric-threshold conditions:
- An observation arrives that indicates that the threshold isn't violated.
- No observations arrive, the condition is configured to close incidents when observations stop arriving, and the state of the underlying resource is either unknown or isn't disabled.
  
  Note: If the state of a resource is known to be disabled, then the incident isn't closed when data stops arriving. However, you can close the incident manually. Monitoring can determine the state of a resource when the resource contains the metadata.system_labels.state label and when the alerting policy isn't written with the Monitoring Query Language. For more information, see Incident isn't closed when data stops arriving.
- No observations arrive for the auto-close duration of the alerting policy and the condition isn't configured to automatically close incidents when observations stop arriving. To configure the auto-close duration, you can use the Google Cloud console or the Cloud Monitoring API. By default, the auto-close duration is seven days. The minimum auto-close duration is 30 minutes.
Metric-absence conditions:
- An observation occurs.
- No observations arrive for 24 hours after the auto-close duration of the alerting policy expires. To configure the auto-close duration, you can use the Google Cloud console or the Cloud Monitoring API. By default, the auto-close duration is seven days.
Forecast conditions:
- A forecast is produced and it predicts that the time series won't violate the threshold within the forecast window.
- No observations arrive for 10 minutes, the condition is configured to close incidents when observations stop arriving, and the state of the underlying resource is either unknown or isn't disabled.
  
  Note: If the state of a resource is known to be disabled, then the incident isn't closed when data stops arriving. However, you can close the incident manually. Monitoring can determine the state of a resource when the resource contains the metadata.system_labels.state label and when the alerting policy isn't written with the Monitoring Query Language. For more information, see Incident isn't closed when data stops arriving.
- No observations arrive for the auto-close duration of the alerting policy and the condition isn't configured to automatically close incidents when observations stop arriving.
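
When you set the auto-close duration through the Cloud Monitoring API, the value is part of the alerting policy. The following Python sketch builds only that fragment of a policy body. It assumes that the field is `alertStrategy.autoClose` and that it takes a duration string in seconds, so confirm both against the API reference. The `auto_close_strategy` helper is hypothetical, and the default and minimum values are the ones stated in the list above.

```python
from datetime import timedelta

MIN_AUTO_CLOSE = timedelta(minutes=30)  # Minimum stated in this document.
DEFAULT_AUTO_CLOSE = timedelta(days=7)  # Default stated in this document.


def auto_close_strategy(duration: timedelta = DEFAULT_AUTO_CLOSE) -> dict:
    """Return a policy fragment that sets the auto-close duration."""
    if duration < MIN_AUTO_CLOSE:
        raise ValueError("auto-close duration must be at least 30 minutes")
    # Assumed field name and format: a duration string such as "1800s".
    return {"alertStrategy": {"autoClose": f"{int(duration.total_seconds())}s"}}


one_hour = auto_close_strategy(timedelta(hours=1))
```

Validating the minimum locally is a convenience in this sketch; the API remains the authority on which values are accepted.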

For example, an alerting policy generated an incident because the HTTP response latency was greater than 2 seconds for 10 consecutive minutes. If the next measurement of the HTTP response latency is less than or equal to 2 seconds, then the incident is closed. Similarly, if no data at all is received for seven days, then the incident is closed.
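
The rules for metric-threshold conditions, applied to this latency example, can be sketched as a single decision function. This Python sketch is a simplified model and not Monitoring's implementation: the function and parameter names are hypothetical, it assumes a greater-than threshold, and it ignores details such as alignment periods.

```python
from datetime import timedelta
from typing import Optional


def should_auto_close(
    new_observation: Optional[float],
    threshold: float,
    time_without_observations: timedelta,
    close_when_data_stops: bool,
    resource_known_disabled: bool = False,
    auto_close: timedelta = timedelta(days=7),
) -> bool:
    """Decide whether an open incident for a `value > threshold` condition closes."""
    if new_observation is not None:
        # An observation arrived: close only if the threshold isn't violated.
        return new_observation <= threshold
    if close_when_data_stops:
        # Data stopped and the condition closes incidents in that case,
        # unless the resource is known to be disabled.
        return not resource_known_disabled
    # Otherwise, wait for the auto-close duration to elapse without data.
    return time_without_observations >= auto_close


# A latency of 1.5 seconds is below the 2-second threshold, so the incident closes.
closes_on_recovery = should_auto_close(1.5, 2.0, timedelta(0), close_when_data_stops=False)

# No data for seven days also closes the incident.
closes_on_silence = should_auto_close(None, 2.0, timedelta(days=7), close_when_data_stops=False)
```

Both results are `True`, matching the two outcomes described in the example above.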

To close an incident, do the following:

In the Incidents pane of the Alerting page, click See all incidents.
On the Incidents page, find the incident that you want to close, and then do one of the following:
- Click View more and then select Close incident.
- Open the Incident details page for that incident and then click Close incident.

If you see the message "Unable to close incident with active conditions", then the incident can't be closed because data has been received within the most recent alerting period.

If you see the message "Unable to close incident. Please try again in a few minutes.", then the incident couldn't be closed due to an internal error.

Data retention and limits

For information about limits and about the retention period of incidents, see Limits for alerting.

What's next

To create and manage alerting policies with the Cloud Monitoring API or from the command line, see Manage alerting policies by API.

For a detailed conceptual treatment of alerting policies, see Behavior of metric-based alerting policies.