Active Assist change risk recommenders: Introducing a new way to prevent misconfigurations
Dima Melnyk
Senior Product Manager
Xiang Wang
Software Engineer
Avoid common configuration errors that can cause production incidents, outages, and data loss
TL;DR: Active Assist can now help you reduce the risk of cloud infrastructure misconfigurations by intelligently flagging common risky changes to your most important resources and providing recommendations to prevent and mitigate issues. Initially, Active Assist can safeguard against high-risk deletions of projects and service accounts, and against risky changes to IAM policies. This feature is available to all customers via the Recommender API and gcloud immediately, and will be progressively rolled out in the Cloud console within the next few weeks. Check out this blog and the documentation to learn more.
Misconfigurations of cloud infrastructure can happen due to human errors (e.g., accidentally deleting a critical resource) or unexpected workload changes (e.g., a workload rapidly growing and exceeding its resource limits). Recent research studies show that nearly 70% of production incidents are caused by human errors, costing businesses millions of dollars in lost revenue and productivity. And that's just the tip of the iceberg. While not all misconfigurations result in outages, many waste cycles on recovering the system or go completely unnoticed, surfacing later as hard-to-root-cause performance problems and reliability issues.
Examples of common, repeatable misconfigurations, and their impact
Existing misconfiguration prevention and mitigation solutions often require complex configuration and maintenance of rigid guardrails, and can be difficult to use effectively. This can lead to misconfigurations being overlooked or detected too late. Another issue is the low signal-to-noise ratio, where operators are bombarded by numerous warnings for even minor changes.
Today, we're excited to announce change risk recommendations, a new category of Active Assist recommendations that redefines guardrails by making them smart and automated. Unlike traditional solutions, Active Assist doesn’t require configuration and maintenance, and provides a simple way to prevent and detect common misconfigurations. This helps you reduce risk, improve operational resilience, and save time and money.
Preventing misconfigurations with Active Assist
Active Assist analyzes usage activity across cloud resources in your organization and uses machine learning to automatically identify the ones that are most important to your business operations from the change management perspective. These resources are typically the ones that have the most usage and dependencies relative to other resources in your organization, and have a higher risk of breaking something in the system if misconfigured.
Once Active Assist has identified the most important resources in your environment, it can generate recommendations to prevent high-risk changes to those resources. For example, it can warn you about the risks of deleting an important project, including the potential blast radius in terms of impacted resources.
Risky change detection and prevention with Active Assist
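To make the idea of a blast radius concrete, here is a toy sketch that walks a dependency map and collects everything downstream of a resource. It is only an illustration of the concept: the resource names and the `DEPENDENTS` map are hypothetical, and this is not how Active Assist computes its risk assessments.

```python
from collections import deque

# Hypothetical dependency map: each key is a resource, each value lists the
# resources that depend on it directly. None of these names come from
# Active Assist; they only illustrate the idea.
DEPENDENTS = {
    "project/payments": ["sa/billing-runner", "bucket/invoices"],
    "sa/billing-runner": ["job/nightly-billing"],
    "bucket/invoices": [],
    "job/nightly-billing": [],
    "project/sandbox": [],
}


def blast_radius(resource, dependents):
    """Return every resource reachable from `resource` via dependency edges."""
    seen = set()
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            # Skip resources already visited, and the starting resource itself.
            if child not in seen and child != resource:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(blast_radius("project/payments", DEPENDENTS))
# → ['bucket/invoices', 'job/nightly-billing', 'sa/billing-runner']
```

A resource with many transitive dependents, like the hypothetical payments project here, is the kind of resource where a deletion warning is most useful.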
As part of the initial launch of this feature, Active Assist can assess risks associated with the following configuration changes:
Change risk recommendations supported as a part of this initial release
Let’s take a quick look at each of these guardrails.
1. Preventing risky project deletions
Given that projects are basic containers used to organize resources in Google Cloud, they are very important from a change management perspective. Due to the complexity and diversity of cloud workloads, including non-obvious dependencies that can span multiple projects, important projects are often deleted accidentally. Deleting an important project can be very difficult to recover from, especially if it impacts resources that can’t be restored easily, like objects in Cloud Storage. Active Assist can warn you about the potential impact of deleting a project that’s been identified as important based on its usage activity.
Example project deletion warning in the Google Cloud console
From there, you can examine the risks associated with this particular project deletion by clicking on the View risk assessment button.
Example risk assessment explaining why this project deletion might be unsafe
2. Preventing risky service account deletions
We received feedback from many early adopters that accidental deletion of critical service accounts is another common misconfiguration that often results in downtime, because deleted service accounts are difficult to restore or roll back. For service accounts that are in active use, Active Assist can warn you about the potential impact on other resources, and recommend disabling the service account first. As with the other change risk recommendations, you can examine the associated risk assessment to understand usage details and dependencies.
Example service account deletion warning in the console
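The "disable first" advice can be folded into a script as a two-step plan. The sketch below only builds the two gcloud commands, in order, and prints them. The helper name and the service account email are hypothetical, and how long to wait between the two steps is left to you.

```python
import shlex


def disable_then_delete_plan(service_account_email):
    """Return the two gcloud commands, in order: disable now, delete later."""
    base = ["gcloud", "iam", "service-accounts"]
    return [
        base + ["disable", service_account_email],
        # Run only after a waiting period with no reported breakage.
        base + ["delete", service_account_email],
    ]


# Hypothetical service account email, used only for illustration.
email = "ci-deployer@example-project.iam.gserviceaccount.com"
for command in disable_then_delete_plan(email):
    print(shlex.join(command))
```

Disabling is reversible, so if something breaks during the waiting period you can re-enable the account instead of trying to recover a deleted one.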
3. Preventing risky changes to IAM policies
Identity and Access Management (IAM) policies define who can take what actions on which resources. This seemingly simple construct can become incredibly complex to manage at scale, and represents another common source of misconfigurations, especially when essential permissions are revoked from a user or service account, or when a user or service account is removed from an essential role. This can lead to wasted time and effort to recover, or downtime caused by users or service accounts being denied access to the resources they need. For roles that are in active use, Active Assist can warn you about the potential impact of removing permissions and suggest an alternative, less privileged role.
Example role change warning in the console
In addition to seeing the automatic warnings in the console, you can also check risks associated with supported changes by running gcloud commands with the --recommend=yes flag. For example, the following command will abort if deleting the “staging_v8” project is unsafe:
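As a hedged sketch of how an automation script might wrap such a command, the helpers below build and run it. Two assumptions are not confirmed above: that the flag is attached to the standard `gcloud projects delete` command, and that an aborted deletion shows up as a non-zero exit status. The helper names are hypothetical.

```python
import subprocess


def build_guarded_delete(project_id):
    """Build the argv for a project deletion that asks for a risk check first."""
    # --recommend=yes is the flag named in this post; pairing it with
    # `gcloud projects delete` is an assumption.
    return ["gcloud", "projects", "delete", project_id, "--recommend=yes"]


def run_guarded_delete(project_id):
    """Run the command; return True only if gcloud exits with status 0.

    Treating a non-zero exit status as "deletion aborted" is an assumption.
    """
    result = subprocess.run(build_guarded_delete(project_id), check=False)
    return result.returncode == 0


# Only print the command here; call run_guarded_delete() to execute it.
print(" ".join(build_guarded_delete("staging_v8")))
# → gcloud projects delete staging_v8 --recommend=yes
```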
You can also query the Recommender API directly as part of your custom scripts or automation workflows. For example, the following calls check if deleting the “staging_v8” project is unsafe and retrieve detailed risk assessments associated with the given change.
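A minimal sketch of what such calls could look like over REST, using only the Python standard library. The URL layout follows the Recommender API's general pattern of listing recommendations and insights under a project and location. The recommender and insight type IDs are assumptions to confirm against the documentation, and the helper names are hypothetical. The code builds the request but does not send it.

```python
import urllib.request

API_ROOT = "https://recommender.googleapis.com/v1"
# Assumed IDs for the project change-risk recommender and insight type;
# confirm both against the documentation before relying on them.
RECOMMENDER_ID = "google.resourcemanager.project.ChangeRiskRecommender"
INSIGHT_TYPE_ID = "google.resourcemanager.project.ChangeRiskInsight"


def recommendations_url(project_id):
    """URL that lists change risk recommendations for a project."""
    return (f"{API_ROOT}/projects/{project_id}/locations/global"
            f"/recommenders/{RECOMMENDER_ID}/recommendations")


def insights_url(project_id):
    """URL that lists the insights (risk assessments) for a project."""
    return (f"{API_ROOT}/projects/{project_id}/locations/global"
            f"/insightTypes/{INSIGHT_TYPE_ID}/insights")


def authorized_get(url, access_token):
    """Build (but do not send) a GET request carrying an OAuth bearer token."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}, method="GET")


# "staging_v8" is the example project from the text; the token is a
# placeholder, for instance the output of `gcloud auth print-access-token`.
request = authorized_get(recommendations_url("staging_v8"), "ACCESS_TOKEN")
print(request.get_method(), request.full_url)
```

To execute the call, pass the request to `urllib.request.urlopen` and parse the JSON response. An empty list of recommendations would suggest that no change risk was flagged for the project, though the exact response semantics should be checked in the documentation.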
Getting started with change risk recommendations today
Active Assist can now help you reduce the risk of common cloud misconfigurations caused by human errors, such as accidentally deleting critical resources or making incorrect IAM policy changes. It does this by intelligently flagging risky changes to your most important resources and providing recommendations to prevent issues before they occur.
To get started, read the documentation to learn more about how to implement the pre-submit guardrails with change risk recommendations. You can also learn more about how Active Assist can help you tackle out-of-quota issues, one of the most common workload-inflicted misconfigurations.
We hope Active Assist’s smart guardrails help you reduce the rate and impact of misconfigurations in your environment. We welcome your feedback, including suggestions for new guardrails! Please don’t hesitate to reach out to your Google Cloud account team to set up a meeting or shoot us a note at active-assist-feedback@google.com.