An incident is a record of the triggering of an alerting policy. Cloud Monitoring opens an incident when a condition of an alerting policy has been met.
This page describes how you can view, investigate, and manage incidents.
To find a list of incidents, do the following:
In the Cloud Console toolbar, click Navigation menu, and then select Monitoring.
In the Monitoring navigation pane, select Alerting.
On the alerting dashboard page, the Summary pane lists the number of open incidents, and the Incidents table displays the most recent incidents. By default, closed incidents aren't listed. To include closed incidents in the table, click Show closed incidents.
Finding older incidents
The Incidents table on the Alerting page shows only the most recent incidents. To locate older incidents, do one of the following:
Page through the entries in the Incidents table by clicking Newer or Older.
Navigate to the Incidents page by clicking See all incidents.
By default, this table displays all open incidents. To include closed incidents in the table, click Show closed incidents.
To control which incidents are listed, add filters. For more information, see Filtering incidents.
To manage incidents or alerting policies from this table, click More options in the incident's row, and make a selection from the menu of options. For more information on acknowledging or silencing an incident, see Managing incidents.
To filter the table of incidents, do the following:
On the Incidents page, click Filter table and then select a filtering attribute. You can filter on the following:
- State of the incident
- Name of the alerting policy
- When the incident was opened or closed
- Metric type
- Resource type
Select a value from the secondary menu or enter a value in the filter bar. When you enter a value on the filter bar, the list of options shows only the options that contain the value you entered.
For example, if you select Metric type and enter usage_time, then you might see only the following options in the secondary menu:
- agent.googleapis.com/cpu/usage_time
- compute.googleapis.com/guest/container/cpu/usage_time
- container.googleapis.com/container/cpu/usage_time
If you add multiple filters, then an incident is displayed only if it satisfies all of the filters.
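The two filtering rules above (substring matching of options, and combining multiple filters with AND) can be sketched in a few lines of Python. This is only an illustration of the behavior described on this page, not how the console is implemented: the `Incident` record, its field names, the sample policy names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical record; the field names are illustrative only.
    policy_name: str
    state: str
    metric_type: str
    resource_type: str


def matching_options(options, typed):
    """Return only the options that contain the typed text."""
    return [opt for opt in options if typed in opt]


def apply_filters(incidents, filters):
    """Keep an incident only if it satisfies every (attribute, value) filter."""
    return [
        inc for inc in incidents
        if all(getattr(inc, attr) == value for attr, value in filters.items())
    ]


# Sample values for illustration.
METRIC_TYPES = [
    "agent.googleapis.com/cpu/usage_time",
    "compute.googleapis.com/guest/container/cpu/usage_time",
    "container.googleapis.com/container/cpu/usage_time",
    "compute.googleapis.com/instance/cpu/utilization",
]

INCIDENTS = [
    Incident("High CPU", "open",
             "agent.googleapis.com/cpu/usage_time", "gce_instance"),
    Incident("High CPU", "closed",
             "agent.googleapis.com/cpu/usage_time", "gce_instance"),
    Incident("Container CPU", "open",
             "container.googleapis.com/container/cpu/usage_time", "gke_container"),
]

# Typing "usage_time" narrows the option list to the three matching metric types.
usage_time_options = matching_options(METRIC_TYPES, "usage_time")

# Two filters at once: an incident must match both to be listed.
open_agent_cpu = apply_filters(
    INCIDENTS,
    {"state": "open", "metric_type": "agent.googleapis.com/cpu/usage_time"},
)
```

Adding a filter can only shrink the result, because every filter must hold for an incident to remain in the table.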
After you have found the incident you want to investigate, go to the Incident details page for that incident. To view the details, click the incident summary in the table of incidents on either the Alerting page or the Incidents page.
The following screenshot shows the details page for an incident:
The Incident details page provides the following information:
Status information, including:
- Name: The name of the alerting policy that caused this incident.
- Status: The status of the incident: open, acknowledged, or closed.
- Duration: The length of time for which the incident was open.
Information about the alerting policy that caused the incident:
- Condition: The condition in the alerting policy that caused the incident.
- Message: A brief explanation of the cause based on the configuration of the condition in the alerting policy. This pane is always populated.
- Documentation: The (optional) documentation for notifications provided when the alerting policy was created. This information might include a description of what the alerting policy monitors and include tips for mitigation. Because documentation is optional, this pane might be empty.
Labels: The labels and values for the monitored resource and metric of the time series that triggered the alerting policy. This information can help you identify the specific monitored resource that caused the incident.
The Incident details page also provides tools for investigating the incident:
Incident timeline: Shows two visual representations of the incident:
- A red bar above a time axis represents the incident; the length and position of the bar reflect the duration of the incident.
- A chart shows the time-series data and threshold used by the alerting policy that caused the incident. The incident was opened when some time series met a condition of the alerting policy.
The time axis indicates the duration of the incident with two labeled dots. The position of these dots on the time axis determines the range of data shown on the chart that accompanies the incident timeline. By default, one dot is positioned at the opening of the incident and one at the close of the incident, or at the current time if the incident is still open.
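As a rough sketch of the default positions of the two dots, the following Python function computes the chart's default time range from an incident's open and close times. The function name and parameters are hypothetical and exist only to illustrate the rule described above.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def default_chart_range(
    opened_at: datetime,
    closed_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return the default (start, end) of the chart for an incident."""
    if closed_at is not None:
        # Closed incident: the dots mark the opening and the close.
        return opened_at, closed_at
    # Open incident: the second dot sits at the current time.
    return opened_at, now or datetime.now(timezone.utc)
```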
You can modify the time range on the incident timeline and the chart:
To change the time range shown on the chart, drag either of the dots along the time axis. By using this technique, you can focus on specific intervals, for example, around the beginning or end of the incident.
Changing the chart by dragging the dots on the axis sets a custom value in the Time Span menu and disables the menu. To enable the Time Span menu, click Reset.
To change the range of time shown on the timeline, select a range from the Time Span menu.
Links to other troubleshooting tools. The configuration of your project and alerting policy and the age of the incident determine which links are available.
- To see the details page for the alerting policy, click View policy.
- To edit the definition of the alerting policy, click Edit policy.
- To go to a dashboard of performance information for the resource, click View resource details.
- To see related log entries in Logs Explorer, click View logs. For more information, see Using Logs Explorer.
- To investigate the data in the chart, click View in Metrics Explorer.
Annotations: Provides a log of your findings, results, suggestions, or other comments from your investigation of the incident.
- To add an annotation, enter text in the field and click Add comment.
- To discard the comment, click Cancel.
You can also acknowledge or silence incidents from the Incident details page. For more information, see Managing incidents.
Incidents are in one of three states:
Open: The policy's set of conditions is being met, or there isn't data to indicate that the condition is no longer met. If a policy contains multiple conditions, then incidents are opened depending on how those conditions are combined. See Combining conditions for more information.
Acknowledged: The incident is open and has manually been marked as acknowledged. Typically, this status indicates that the incident is being investigated.
Closed: The system observed that the condition stopped being met, or 7 days passed without an observation that the condition continued to be met.
When you configure an alerting policy, ensure that the steady state provides a signal when everything is OK. This signal is necessary to identify the error-free state and, if an incident is open, to close that incident. If no signal indicates that the error condition has stopped, then an open incident stays open for 7 days after the policy fires.
For example, if you create a policy that notifies you when the count of errors is more than 0, ensure that it produces a count of 0 errors when there aren't any errors. If the policy returns null or empty in the error-free state, then there is no signal to indicate when the errors have stopped. In some situations, Monitoring Query Language (MQL) supports the ability for you to specify a default value that is used when no measured value is available. For an example, see Using ratio.
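The difference between "no data" and "a count of 0" can be illustrated with a small Python sketch. This is not MQL and does not use any Monitoring API; the function names and the sample series are hypothetical. It only shows why a missing value gives no signal that the errors have stopped, while a default of 0 does.

```python
from typing import List, Optional


def condition_met(error_count: Optional[int]) -> Optional[bool]:
    """Evaluate 'count of errors > 0'. None means there is no signal at all."""
    if error_count is None:
        return None
    return error_count > 0


def with_default(samples: List[Optional[int]], default: int = 0) -> List[int]:
    """Replace missing samples with a default value."""
    return [default if s is None else s for s in samples]


# Hypothetical series: errors occur, then the source stops reporting.
raw = [3, 1, None, None]

# Without a default, the last two points carry no information.
signals_raw = [condition_met(s) for s in raw]

# With a default of 0, the last two points say that the condition is not met.
signals_filled = [condition_met(s) for s in with_default(raw)]
```

In the first case nothing ever reports that the condition stopped being met, so an open incident would remain open until the 7-day limit. In the second case the trailing `False` values are the signal that allows the incident to be closed.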
To mark an incident as acknowledged, do the following:
- In the Incidents pane of the Alerting dashboard, click See all incidents. This opens the Incidents window.
To acknowledge an incident, do one of the following:
- For the incident that you want to acknowledge, select more_vert More options and then select Acknowledge.
- Open the details page for the incident that you want to acknowledge and then click Acknowledge incident.
You must have the Monitoring Editor role to acknowledge incidents; for more information, see Access control: Predefined roles.
If you silence a condition, then all open incidents with that condition are silenced, and you won't receive an alert notification when the condition stops being met. Silencing a condition removes the incident from the active incidents display. When you investigate an incident, you should acknowledge it instead of silencing it.
Silencing an incident doesn't resolve the underlying cause of the incident. That is, if the condition that generated the incident continues to be met on the next alerting cycle, then the incident is re-opened.
To silence a condition, do the following:
- In the Incidents pane of the Alerting dashboard, click See all incidents.
- On the Incidents page, find the incident whose condition you want to silence, select More options, and then select Silence associated condition.
Incidents are closed automatically; you can't close an incident. An incident is closed when the system observes that the condition is no longer being met or when 7 days have passed without an observation that the condition is still being met.
For example, assume you have an alerting policy that is configured to generate an incident if the HTTP response latency is above 2 seconds for 10 consecutive minutes, and that an incident was opened. If the next measurement of the HTTP response latency is equal to or below 2 seconds, then the incident is closed. Similarly, if no data at all is received for 7 days, then the incident is closed.
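The closing rule in this example can be sketched as a single Python function. This is a simplified model under stated assumptions, not the service's implementation: the function and constant names are hypothetical, the 10-minute duration window used to open the incident is ignored, and treating exactly 7 days as sufficient to close is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

THRESHOLD_SECONDS = 2.0               # latency threshold from the example
AUTO_CLOSE_AFTER = timedelta(days=7)  # limit when no data arrives


def should_close(
    new_latency: Optional[float],
    last_met_at: datetime,
    now: datetime,
) -> bool:
    """Decide whether an open incident is closed.

    new_latency: the next measurement in seconds, or None if no data arrived.
    last_met_at: the last time the condition was observed to be met.
    """
    if new_latency is not None:
        # A measurement at or below the threshold closes the incident;
        # a measurement above it means the condition is still met.
        return new_latency <= THRESHOLD_SECONDS
    # No data: close only after 7 days without an observation
    # (the exact boundary is an assumption).
    return now - last_met_at >= AUTO_CLOSE_AFTER
```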
- To create and manage alerting policies with the Cloud Monitoring API or from the command line, see Managing alerting policies by API.
- For a detailed conceptual treatment of alerting policies, see Alerting behavior.