Exploring the machine learning models behind Cloud IAM Recommender
Xiang Wang
Software Engineer
Abhi Yadav
Product Manager
To help you fine-tune your Google Cloud environment, we offer a family of ‘recommenders’ that suggest ways to optimize how you configure your infrastructure and security settings. But unlike many other recommendation engines, which use policy-based rules, some Google Cloud recommenders use machine learning (ML) to generate their suggestions. In this blog post, we’ll look at one of these recommenders, the Cloud Identity and Access Management (IAM) Recommender, and take you behind the scenes of the ML that powers its functionality.
IAM Recommender in action
IAM Recommender helps security professionals enforce the principle of least privilege by identifying and removing unwanted access to GCP resources. It uses machine learning to determine what access users actually need by analyzing their permission usage over a 90-day period.
For example, a user, Anita, might have been given the Project Editor role when a new Google Cloud Platform (GCP) project was spun up, which grants her more than two thousand permissions. Elisa, the Cloud Admin, might have granted her far more access than required, simply because she did not fully understand Anita’s needs.
Here’s how Cloud IAM Recommender helps. Elisa can now use IAM Recommender to analyze Anita’s permissions usage, and determine that she only needs day-to-day access to the Compute Engine service, and occasional access to Cloud Storage services. Using ML, IAM Recommender predicts what Anita will need in the long-term, and recommends the Compute Engine Admin and the Storage Object Viewer roles. Elisa can choose to apply the recommendations, removing thousands of unneeded permissions in the process. This minimizes the potential attack surface and helps her organization stay compliant with governance best practices.
As simple as the idea might sound, it can be challenging to fully capture a given user's intent and permission needs. On the one hand, we want to make timely recommendations after processing a reasonable amount of usage history (e.g., 90 days). On the other hand, there could be some permission usage missing from our observation window—for example, some operations could be interrupted when a user goes on vacation, or, like cron jobs, only happen very infrequently. This is one of the ways that we leverage ML: using inference to fill those small but crucial gaps and improve the accuracy of our recommendations.
Training the model
Once we have normalized the usage logs, we run an ML model to answer the question: "Given that a user used permissions A, B, and C in the last 90 days, what other permissions might they need in order to do their job?" We train our model to answer this question using two sets of signals:
Common co-occurrence patterns in the observed history. The fact that a user used permissions A, B, and C in the past hints that A, B, and C are related in some way and are needed together to carry out a task on GCP. If our ML model observes this pattern frequently enough across a large user base, then the next time a different user uses permissions A and B, the model can suggest that they might need permission C as well (see the sketch below).
Domain knowledge as encoded in the role definitions. Cloud IAM provides hundreds of service-specific predefined roles. If a set of permissions co-occurs in the same predefined role, it is a strong signal that the role creators determined those permissions should be granted together.
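To make the co-occurrence signal concrete, here is a minimal sketch in Python. It is not the production model: the usage histories, permission names, and confidence threshold below are all hypothetical, and it simply counts how often pairs of permissions are used together across users.

```python
from collections import Counter
from itertools import combinations

# Hypothetical 90-day histories: each entry is the set of permissions one user exercised.
usage_histories = [
    {"compute.instances.get", "compute.instances.list", "compute.disks.get"},
    {"compute.instances.get", "compute.instances.list"},
    {"storage.objects.get", "storage.objects.list"},
    {"compute.instances.get", "compute.instances.list", "compute.disks.get"},
]

# Count how often each permission, and each pair of permissions, is used together.
pair_counts = Counter()
single_counts = Counter()
for perms in usage_histories:
    single_counts.update(perms)
    pair_counts.update(combinations(sorted(perms), 2))

def suggest(observed, min_confidence=0.6):
    """Suggest unobserved permissions that frequently co-occur with observed ones."""
    scores = Counter()
    for perm in observed:
        for (a, b), count in pair_counts.items():
            other = b if a == perm else a if b == perm else None
            if other and other not in observed:
                # Confidence: an estimate of P(other | perm) from co-occurrence counts.
                scores[other] = max(scores[other], count / single_counts[perm])
    return [p for p, score in scores.items() if score >= min_confidence]

print(suggest({"compute.instances.get", "compute.instances.list"}))
# ['compute.disks.get'] -- seen together often enough across other users
```

A real model generalizes far beyond pairwise counts, but the intuition is the same: frequent co-usage across many users becomes evidence for an inferred permission.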
Our ML model uses both of these signals as input attributes, and each attribute is an IAM permission name, such as iam.serviceAccounts.get, or bigquery.tables.list. To further capture the semantics encoded in the permission name, which can be easily understood by a human but not by a machine, we employ word embedding, a technique that is widely used in Natural Language Processing applications. The key idea is to project a large number of words (in our case thousands of permission names) to a lower-dimensional vector space where we can calculate the similarity between a pair of words, which is a reflection of the actual semantics of these two words. For example, bigquery.datasets.get and bigquery.tables.list will become very "close" to each other after embedding.
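As a rough illustration of the embedding idea, the sketch below factorizes a small, made-up permission co-occurrence matrix with truncated SVD, a classic count-based embedding technique, and compares permissions by cosine similarity. The post doesn't specify the production embedding method, so treat the counts, dimensions, and technique here as assumptions.

```python
import numpy as np

# Hypothetical vocabulary and co-occurrence matrix built from role
# definitions and usage histories (diagonal holds each permission's own count).
perms = ["bigquery.datasets.get", "bigquery.tables.list",
         "compute.instances.get", "storage.objects.get"]
cooc = np.array([
    [10, 9, 1, 0],
    [9, 11, 1, 1],
    [1, 1, 8, 2],
    [0, 1, 2, 7],
], dtype=float)

# Truncated SVD projects each permission into a low-dimensional vector space.
U, S, _ = np.linalg.svd(cooc)
dim = 2
embeddings = U[:, :dim] * S[:dim]  # one row per permission

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i = perms.index("bigquery.datasets.get")
j = perms.index("bigquery.tables.list")
k = perms.index("storage.objects.get")
print(cosine(embeddings[i], embeddings[j]))  # high: frequently used together
print(cosine(embeddings[i], embeddings[k]))  # lower: rarely related
```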
Google Cloud takes precautionary measures to maintain the privacy of our users; for example, no data from one customer is ever shared with another. In addition, we deploy an anonymization scheme to achieve k-anonymity before feeding the usage history data into our training pipeline. First, we drop all personally identifiable information (PII), such as the user ID associated with each permission usage pattern. Then we drop all usage patterns that do not show up frequently enough across GCP. The global model trained on the anonymized data can be further customized for each organization using Federated Learning.
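The anonymization step can be sketched as a simple two-pass filter. The threshold K, the record format, and the user IDs below are hypothetical stand-ins; the real pipeline's details aren't public.

```python
from collections import defaultdict

K = 20  # hypothetical k-anonymity threshold

# Raw records pair a user with a set of permissions they used together.
raw_records = [
    ("user-123", frozenset({"compute.instances.get", "compute.instances.list"})),
    ("user-456", frozenset({"compute.instances.get", "compute.instances.list"})),
    # ... many more records
]

# Count how many *distinct* users exhibit each usage pattern.
users_per_pattern = defaultdict(set)
for user_id, pattern in raw_records:
    users_per_pattern[pattern].add(user_id)

# Keep a pattern only if at least K different users share it, then drop
# the user IDs entirely -- each surviving pattern is now k-anonymous.
training_data = [
    pattern
    for pattern, users in users_per_pattern.items()
    if len(users) >= K
]
```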
Right-size your permissions
Now we’re ready to make some recommendations! For a given user, we retrieve their usage history for the trailing 90 days along with their current role definition, and feed both into the trained model. The model then predicts which unobserved permissions this user is likely to need. We combine the inferred permissions with the observed usage and rank roles to recommend the least permissive role(s) that cover them all. This helps ensure our recommendations are safe: any access that has previously been used is not removed.
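The exact serving logic isn't described in this post, but the "least permissive role(s) that cover them all" step behaves like a set-cover problem. Here is a hedged sketch with toy role definitions; real predefined roles contain far more permissions, and roles/editor has thousands.

```python
# Hypothetical, heavily simplified role definitions.
ROLES = {
    "roles/compute.admin": {
        "compute.instances.get", "compute.instances.list",
        "compute.instances.create", "compute.disks.get",
    },
    "roles/compute.viewer": {
        "compute.instances.get", "compute.instances.list", "compute.disks.get",
    },
    "roles/storage.objectViewer": {
        "storage.objects.get", "storage.objects.list",
    },
    "roles/editor": {
        "compute.instances.get", "compute.instances.list",
        "compute.instances.create", "compute.disks.get",
        "storage.objects.get", "storage.objects.list",
        "iam.serviceAccounts.get",  # stand-in for the many extra permissions
    },
}

def recommend_roles(needed):
    """Greedy set cover: pick roles that cover all needed permissions
    while granting as few unneeded permissions as possible."""
    needed = set(needed)
    remaining = set(needed)
    chosen = []
    while remaining:
        candidates = [r for r, perms in ROLES.items() if perms & remaining]
        if not candidates:
            raise ValueError(f"no predefined role covers: {remaining}")
        # Prefer roles that grant the fewest permissions outside the needed
        # set, breaking ties by how much of the remaining need they cover.
        best = min(
            candidates,
            key=lambda r: (len(ROLES[r] - needed), -len(ROLES[r] & remaining)),
        )
        chosen.append(best)
        remaining -= ROLES[best]
    return chosen

# Observed usage plus ML-inferred permissions for a user like Anita.
needed = {
    "compute.instances.get", "compute.instances.list",
    "compute.instances.create", "compute.disks.get",
    "storage.objects.get", "storage.objects.list",
}
print(recommend_roles(needed))
# ['roles/compute.admin', 'roles/storage.objectViewer']
```

For the Anita example above, this greedy pass lands on the Compute Engine Admin and Storage Object Viewer roles rather than the much broader Editor role, because Editor would grant permissions she never needs.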
Better yet, unlike deterministic rules-based recommendations that become outdated over time, the ML model adapts to change. So as your footprint in Google Cloud grows, or as Google Cloud adds more services and permissions, the model evolves with these changes to keep its recommendations relevant.
To learn more about IAM Recommender, check out the documentation.