Stackdriver Incident Response and Management (IRM) is a product within Stackdriver for managing and responding to incidents. It augments Stackdriver Monitoring alerting with information and tools to help reduce the time it takes to mitigate and resolve an incident.
The IRM dashboard contains two lists for your Workspace: Available alerts and Incidents. Using this dashboard, you can search your alerts and incidents, and see high-level information about them.
In IRM, an alert indicates that a parameter is outside its expected range. Monitoring-based alerts appear automatically in the Available alerts list, and remain until they are triaged.
IRM displays alerts that result from a triggered Stackdriver Monitoring alerting policy. If you created an incident without an alert, IRM creates a corresponding alert in IRM for you, but it is not based on an underlying Stackdriver Monitoring alerting policy.
An alert can be associated with only one incident, but there can be multiple alerts associated with a single incident.
All alerts are listed either in the Available alerts list or in the Incident Details view as Added alerts.
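The association rule above (one incident per alert, many alerts per incident) can be sketched as a small in-memory model. This is only an illustration: `AlertBook`, `Incident`, and their methods are hypothetical names, not part of any IRM API.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    added_alerts: list = field(default_factory=list)


class AlertBook:
    """Hypothetical bookkeeping for alerts; not an IRM API."""

    def __init__(self, alert_ids):
        # Every known alert starts out unassigned, that is, "available".
        self._incident_of = {alert_id: None for alert_id in alert_ids}

    def add_to_incident(self, alert_id, incident):
        # An alert can be associated with only one incident.
        if self._incident_of[alert_id] is not None:
            raise ValueError(f"{alert_id} already belongs to an incident")
        self._incident_of[alert_id] = incident
        # An incident may accumulate any number of alerts.
        incident.added_alerts.append(alert_id)

    def available_alerts(self):
        # Alerts that have not been added to any incident yet.
        return [a for a, inc in self._incident_of.items() if inc is None]
```

In this sketch every alert is either returned by `available_alerts()` or held in exactly one incident's `added_alerts`, which mirrors the two places where IRM lists alerts.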
Alert Details view
The Alert Details view provides the following:
- A summary of the alert details, including the underlying alerting policy, GCP details, and time that the violation started.
- An interactive chart of the metric that violated a condition and triggered an alerting policy.
- Insights, which provide relevant additional context for your alert.
- A Take action pane where you can select next steps.
The Take action pane offers three options for an alert:
- Dismiss: If you select this option, an incident is created, and the alert is assigned to that incident. IRM automatically assigns the incident a Negligible severity, marks its status as Resolved, and moves it to the Resolved Incidents list.
- New incident: Select this option to begin the creation of a new incident. The alert is assigned to the new incident, and displays in the Added alerts list.
- Add to existing incident: Select this option to add the alert to an existing active incident.
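The effect of the three options on incident data can be sketched as follows. The functions and the `Incident` class are hypothetical and only model the outcomes described above; the assumption that "active" means "not Resolved" follows the definition of active incidents elsewhere on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    severity: str = "(Not set)"   # initial default severity
    stage: str = "Detected"       # stage of a freshly created incident
    added_alerts: list = field(default_factory=list)


def dismiss(alert_id):
    # Dismissing still creates an incident, but one that is already closed out.
    return Incident(severity="Negligible", stage="Resolved", added_alerts=[alert_id])


def new_incident(alert_id):
    # The alert becomes the first entry in the new incident's Added alerts list.
    return Incident(added_alerts=[alert_id])


def add_to_existing(alert_id, incident):
    # Only active incidents accept new alerts (assumption: active = not Resolved).
    if incident.stage == "Resolved":
        raise ValueError("incident is not active")
    incident.added_alerts.append(alert_id)
    return incident
```

Note that in all three cases the alert ends up attached to an incident, which is why it leaves the Available alerts list.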
In the IRM dashboard, click on an alert to go to the Alert Details view.
The Available alerts list contains alerts for your Workspace that haven't been added to an incident or dismissed.
You can see this list on the IRM dashboard. It's a best practice to add alerts to incidents promptly, keeping this list empty.
From the Available alerts list, you can:
- Click an alert to go to the Alert Details view.
- Select one or more alerts to use as the basis for a new incident.
- Select one or more alerts and add them to an active incident.
If you don't see a relevant alert in your Available alerts list, you can manually create an incident; go to Create an incident without an alert for instructions.
IRM relies on the response team to add alerts to relevant incidents. Adding alerts helps scope the impact of an incident and provides situational awareness for your response team, both now and in future incidents.
Grouping related alerts into a single incident focuses data collection, interactions, and visibility as your response team responds to an incident. However, incidents involving multiple services (say, a frontend, backend, and a common dependency) might require two incidents (one for frontend+backend, another for the common dependency) because the response team members might have different goals.
You can add one or more alerts to an incident from the Available alerts list on the Investigate tab of the Incident Details view. When you add an alert to an incident, the alert appears in the Added alerts table, which is also on the Investigate tab.
The alert status, shown in the Available alerts list, indicates whether an alert is within its normal parameters, according to its underlying alerting policy:
- Recovered: A previously firing alert is currently within its normal parameters.
- Firing: The alert is currently outside its normal parameters.
Insights provide additional information for an alert, and are automatically generated by IRM. They can point you to the potential triggers and contributing factors for an alert, which can help accelerate your investigation.
Insights, if they are available, appear in the Alert Details view in an Insights panel under the alerting metric chart.
When the response team is notified that a Monitoring alerting policy has triggered, they review the alert. If they determine that it requires organized management, they can create an incident in IRM. IRM also allows response teams to create an incident without an alert. While a given incident might or might not require full incident management protocol, all incidents tracked in IRM benefit from structured incident data.
The incidents listed in Stackdriver Monitoring are also displayed in IRM. However, IRM gives you additional context and opportunities to add structured data to your incidents.
All IRM incidents are classified by an incident stage. The stage changes automatically to Triaged when the response team sets the incident's initial severity classification; you can also change the stage manually. It is a best practice to add other information to the incident, such as updates, tags, and links.
Incident Details view
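The Incident Details view is split into three tabs:
- The Overview tab, where you can update the incident summary and manage roles, subscriptions, links, and escalation status.
- The Investigate tab, which features two lists: Added alerts and Available alerts. In this tab, you can also view and write investigation updates.
- The Related incidents tab, which lists two types of incidents that relate to your current incident: similar incidents and duplicate incidents.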
Above each of the three tabs in the Incident Details view, a toolbar persists. The toolbar includes: title, elapsed time since first alert (not editable), severity classification, tags, stage, and primary communications channel ("Comms"). You can edit these fields from each of the three tabs.
In the IRM dashboard, click on an incident to go to the Incident Details view.
Active incidents are incidents that are still being managed. Incidents stay active until their stage is set to Resolved.
Clicking on one of these incidents takes you to the Incident Details view, where you can track your investigation and coordinate with others.
Similar incidents are incidents whose aspects, such as underlying alerting policy, overlap with the current incident. Similarity is determined by IRM; you can't add incidents to this list. Response teams can use similar incidents to provide historic context on the past occurrence of the current incident.
Similar incidents, if they are available, appear in the Related incidents tab in the Incident Details view.
Duplicate incidents are incidents that have been marked as duplicate by the response team.
When an incident is marked as duplicate, the incident's alerts move to the authoritative incident. The duplicate incident becomes an artifact and isn't intended for resolution. Since the duplicate incident is effectively closed, its incident stage shows as Duplicate, but its real stage belongs to the authoritative incident.
An incident can only be marked as a duplicate of an incident that is not itself a duplicate.
Duplicate incidents, if they are available, appear in the Related incidents tab in the Incident Details view.
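The duplicate rules above can be sketched in a few lines. This is a hypothetical model, not IRM's implementation: `mark_duplicate` and `effective_stage` are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Incident:
    title: str
    stage: str = "Detected"
    added_alerts: list = field(default_factory=list)
    duplicate_of: Optional["Incident"] = None


def mark_duplicate(duplicate, authoritative):
    # The target must not itself be a duplicate.
    if authoritative.duplicate_of is not None:
        raise ValueError("target incident is itself a duplicate")
    # The duplicate's alerts move to the authoritative incident.
    authoritative.added_alerts.extend(duplicate.added_alerts)
    duplicate.added_alerts.clear()
    duplicate.duplicate_of = authoritative
    duplicate.stage = "Duplicate"


def effective_stage(incident):
    # A duplicate shows "Duplicate", but its real stage is the authoritative one's.
    if incident.duplicate_of is not None:
        return incident.duplicate_of.stage
    return incident.stage
```

Because a duplicate can only point at a non-duplicate, `effective_stage` never needs to follow more than one link.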
Incident summaries provide broad visibility for the incident. A useful incident summary includes the incident's impact, investigation status, and the next scheduled update. The response team should continually update an incident's summary as the incident progresses.
You can update the incident summary from the Overview tab of the Incident Details view.
Incidents can be in the following stages:
- Detected: One or more alerts have been added to create the incident, but severity has not been set.
- Triaged: The team has completed the initial triage of the incident, and has set the incident's severity.
- Mitigated: The incident is no longer impacting users.
- Resolved: The incident no longer requires active incident response.
- Duplicate: This is not a true incident stage, but represents that the incident has been marked as a duplicate incident.
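As a rough illustration of how stages behave, the sketch below models the automatic move to Triaged when the initial severity is set, alongside manual stage changes. The class and method names are hypothetical. The sketch also assumes that the automatic change applies only while the incident is still in the Detected stage, which the text does not state explicitly.

```python
from enum import Enum


class Stage(Enum):
    # Duplicate is left out because it is not a true incident stage.
    DETECTED = "Detected"
    TRIAGED = "Triaged"
    MITIGATED = "Mitigated"
    RESOLVED = "Resolved"


class Incident:
    def __init__(self):
        self.stage = Stage.DETECTED
        self.severity = None  # None stands in for "(Not set)"

    def set_severity(self, severity):
        first_classification = self.severity is None
        self.severity = severity
        # Setting the initial severity moves a Detected incident to Triaged.
        if first_classification and self.stage is Stage.DETECTED:
            self.stage = Stage.TRIAGED

    def set_stage(self, stage):
        # Manual stage changes are always allowed in this sketch.
        self.stage = stage

    @property
    def is_active(self):
        # Incidents stay active until their stage is set to Resolved.
        return self.stage is not Stage.RESOLVED
```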
Incident subscriptions let you receive notifications when an incident's data is updated. You can configure the types of notification events and select notification channels at the incident level.
To view, add, or update subscriptions, go to the Subscriptions pane on the Overview tab of the Incident Details view.
You can add links to useful information that resides outside the IRM tool; for example, bugs (including Jira issues) or instructions for using new communication channels.
You can provide or view these links in the Links pane on the Overview tab of the Incident Details view.
Severity classifications help the response team prioritize and understand an incident's context quickly. The following severity classifications are available:
- (Not set): The initial default state.
- Negligible: Incident is not user-facing. It has little-to-no impact on production, but might deserve some follow-up action items to be tracked at a low priority.
- Minor: External users might not have noticed; internal users were inconvenienced.
- Medium: A significant number of internal users were significantly impacted; users were able to use workarounds.
- Major: User-visible but no lasting damage to your services or customers; possible noticeable revenue loss.
- Huge: Major user-facing outage; significant revenue loss.
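If you track incidents in your own tooling, an ordered enumeration is one way to represent these classifications so that incidents can be compared and sorted. The sketch below is an assumption-laden example: the numeric values and the placement of `NOT_SET` at the bottom are choices made here, not something IRM defines.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric values are arbitrary; only their relative order matters.
    NOT_SET = 0       # assumption: unclassified sorts below everything else
    NEGLIGIBLE = 1
    MINOR = 2
    MEDIUM = 3
    MAJOR = 4
    HUGE = 5


def by_priority(incidents):
    """Sort (title, severity) pairs with the most severe first."""
    return sorted(incidents, key=lambda item: item[1], reverse=True)
```

A team that prefers to look at unclassified incidents first could give `NOT_SET` the highest value instead.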
Tags provide flexible, concise labels for reporting and future reference in recurrences of similar incidents. IRM provides a set of tags to help ensure data uniformity.
You can set and view tags in the Incident Details view toolbar.
Upon review, if you think an incident is important enough to warrant a quicker response and greater visibility to stakeholders, you can escalate the incident.
You can change and view escalation status on the Overview tab of the Incident Details view. An incident's escalation status is marked by a triangle icon by its title.
Presets are sets of predetermined values that you can add to an incident when you escalate it. Leveraging presets can expedite the tasks of the response team, quickly applying default values and key information to important incidents.
You can create, change, and view your Workspace's presets by selecting the gear icon in the Workspace's settings toolbar. For details, go to Set up incident presets.
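Conceptually, applying a preset during escalation amounts to copying a bundle of default values onto the incident. The sketch below is purely illustrative: the source does not list which fields a preset holds, so `severity`, `tags`, and `comms` are assumptions, as are all class and function names.

```python
from dataclasses import dataclass, field


@dataclass
class Preset:
    # Field names are assumptions; the source does not say what a preset contains.
    name: str
    severity: str
    tags: list = field(default_factory=list)
    comms: str = ""


@dataclass
class Incident:
    title: str
    severity: str = "(Not set)"
    tags: list = field(default_factory=list)
    comms: str = ""
    escalated: bool = False


def escalate(incident, preset=None):
    incident.escalated = True
    if preset is not None:
        incident.severity = preset.severity
        # Merge tags without duplicating ones that are already present.
        for tag in preset.tags:
            if tag not in incident.tags:
                incident.tags.append(tag)
        # Only overwrite the comms channel when the preset provides one.
        if preset.comms:
            incident.comms = preset.comms
    return incident
```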
Investigation updates let you capture, in granular detail, key learnings, actions, and milestones during your investigation.
You can write and view these updates in the Investigate tab of the Incident Details view.
It's useful to assign specific people to common incident response roles. Distinct roles help to focus the information flow and the activities of each member of the response team.
Roles are not based on seniority. Rather, they are assigned based on availability and ability to debug and mitigate an incident. At any time, each member of a response team might be assigned any of these roles.
During an incident, the response team focuses on the activities of their respective roles, until the role is transferred or the incident is over.
You can view and assign roles from the Roles pane on the Overview tab of the Incident Details view.
The following sections describe the functions of the different roles.
Incident Commander role
The Incident Commander (IC) coordinates efforts of the response team to address an active incident.
During an incident, the response team reports to the Incident Commander. The Incident Commander's role is not to personally resolve the incident, but to make sure that the incident gets resolved.
The Incident Commander sits at the top of the pyramid structure into which the response team organizes itself.
The Incident Commander’s responsibilities include:
- Build and maintain the ad hoc response team.
- Coordinate the parts of the response, set priorities, and delegate activities.
- Know the current status of major events during the incident.
- Act as Communications Lead, if one is not assigned.
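The last responsibility is a simple fallback rule. As a minimal sketch, assuming roles are kept in a plain dictionary keyed by role name (a hypothetical representation, not an IRM data structure):

```python
def communications_owner(roles):
    """Return who handles communications for an incident.

    `roles` maps a role name to the assigned person. A missing or empty
    "Communications Lead" entry falls back to the Incident Commander.
    """
    return roles.get("Communications Lead") or roles.get("Incident Commander")
```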
Communications Lead role
The Communications Lead (CL) leads the communications portion of the response. The Communications Lead is responsible for clear, timely communication and decides which channels are appropriate for recording and conveying messages.
The Communications Lead does the following:
- Keeps everyone outside the response team informed.
- Fields questions and other incoming information about the incident.
Operations Lead role
The Operations Lead (OL) manages the immediate, detailed, technical, and tactical work of the incident response, which is typically the largest aspect of the response. A number of operations team members often help the Operations Lead troubleshoot, mitigate, and resolve the incident. The Operations Lead pulls in additional responders as needed, and creates dynamic subteams to prevent overlap in efforts.
The Operations Lead does the following during an incident response:
- Develops and executes an incident action plan.
- Requests additional resources to support operations.
- Allocates resources among the various operational parts of the incident response.
- Maintains close contact with the Incident Commander, Communications Lead, Primary and Secondary Responders, and others involved with managing the incident.
Primary Responder role
The Primary Responder role executes the technical response for the incident. The Primary Responder maintains close contact with the Operations Lead during the incident and, if there is one assigned, uses the Secondary Responder for additional technical support.
Secondary Responder role
The Secondary Responder assists the Primary Responder when the Primary Responder needs help with the incident.
IRM relies on Workspaces to provide access to Stackdriver resources. For example, Stackdriver Monitoring lets you create alerting policies to make you aware of issues, which you can then respond to and manage using IRM. These alerting policies belong to a Workspace. To access these policies, IRM must have access to the Workspace.
Alerting policies, including conditions, notification channels, and documentation, are covered in much more detail in Using alerting policies.