Safe rollouts with Anthos Config Management

This document shows cluster operators and platform administrators how to safely roll out changes across multiple environments using Anthos Config Management. Anthos Config Management can help you avoid errors that affect all of your environments simultaneously.

Anthos Config Management lets you manage single clusters, multi-tenant clusters, and multi-cluster Kubernetes configurations by using files stored in a Git repository. Anthos Config Management combines three technologies: Config Sync, Policy Controller, and Config Connector. Config Sync watches for updates to all files in the Git repository and applies changes to all relevant clusters automatically. Policy Controller manages and enforces policies for objects in your clusters. Config Connector uses Google Kubernetes Engine (GKE) custom resources to manage cloud resources.

Config Sync configurations can represent several things, including the following:

Anthos Config Management is especially suited to deploy configurations, policies, and workloads needed to run the platform that you build on top of Anthos—for example, security agents, monitoring agents, and certificate managers.

Although you can deploy user-facing applications with Anthos Config Management, we don't recommend linking their release lifecycle to the release lifecycle of the administrative workloads mentioned earlier. Instead, we recommend that you use a tool dedicated to application deployment, such as a continuous deployment tool, so that application teams can be in charge of their release schedule.

Anthos Config Management is a powerful product that can manage many elements, so you need guardrails to avoid errors that have a major impact. This document describes several methods to create guardrails. The first section covers staged rollouts, the second section focuses on tests and validations, and the third section explains how to use Policy Controller to create guardrails. The fourth section shows how to monitor Anthos Config Management deployments.

You can use most of the methods discussed in this document, even if you're using only Config Sync and not the full Anthos Config Management product. If you are not using the full Anthos Config Management product but still want to implement the methods involving Policy Controller, you can do so by using Gatekeeper. The exceptions to this rule are methods that rely on the Anthos Config Management page in the Google Cloud console. You can also use several of the methods described in this document at the same time. In the following section, a table indicates which methods are compatible for simultaneous use.

Implementing staged rollouts with Anthos Config Management

In a multi-cluster environment, which is a common situation for Anthos users, we don't recommend applying a configuration change across all the clusters at the same time. A staged rollout, cluster per cluster—or even namespace per namespace, if you use namespaces as the boundary between applications—is much safer because it reduces the blast radius of any error.

Following are several ways to implement staged rollouts with Anthos Config Management:

  • Use Git commits or tags to manually apply the changes that you want to the clusters.
  • Use Git branches to automatically apply the changes when the changes are merged. You can use different branches for different groups of clusters.
  • Use ClusterSelector and NamespaceSelector objects to selectively apply changes to subgroups of clusters or namespaces.

All methods for staged rollouts have advantages and disadvantages. The following table shows which of these methods you can use at the same time.

Are X compatible with Y? | Git commits or tags | Git branches | Cluster selectors | Namespace selectors
Git commits or tags | - | Not compatible | Compatible | Compatible
Git branches | Not compatible | - | Compatible | Compatible
Cluster selectors | Compatible | Compatible | - | Compatible
Namespace selectors | Compatible | Compatible | Compatible | -

The following decision tree helps you decide when to use one of the staged rollout methods.

Decision tree for rollout methods.

Use Git commits or tags

Compared to the other staged rollout methods, using Git commits or tags provides the most control and is the safest. You can use the Anthos Config Management page in the console to update multiple clusters at the same time. Use this method if you want to apply changes to your clusters one by one, and to control exactly when this happens.

In this method, you "pin" each cluster to a specific version (either a commit or a tag) of your Anthos Config Management repository. This method is similar to using the Git commit as a container image tag. You implement this method by specifying the commit or the tag in the spec.git.syncRev field of the ConfigManagement custom resource. If you synchronize configs from multiple repositories, you implement this method by updating the RootSync and RepoSync custom resources instead. For more information about the configuration fields, see configuring the Operator. If you manage your ConfigManagement custom resources with a tool like kustomize, you can reduce the amount of manual work required to roll out changes. With such a tool, you only need to change the syncRev parameter in one place, and then selectively apply the new ConfigManagement custom resource to your clusters in the order, and at the pace, that you choose.
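
As an illustration of the kustomize approach, the following is a minimal sketch of a per-cluster overlay. The directory layout (a shared base directory that holds the ConfigManagement manifest, plus one overlay directory per cluster), the cluster name, and the 1.2.4 tag are all hypothetical. The sketch also assumes that the base manifest already sets spec.clusterName and spec.git.syncRev, because the replace operation requires the target path to exist.

```yaml
# overlays/my-cluster/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Shared ConfigManagement manifest used by every cluster
  - ../../base
patches:
  - target:
      kind: ConfigManagement
      name: config-management
    # JSON 6902 patch: only the per-cluster values change
    patch: |-
      - op: replace
        path: /spec/clusterName
        value: my-cluster
      - op: replace
        path: /spec/git/syncRev
        value: "1.2.4"
```

With a layout like this, rolling a cluster forward would amount to changing the syncRev value in that cluster's overlay and applying the rendered output to that cluster only.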

Additionally, if you are using Anthos Config Management (and not Config Sync), you have access to the Anthos Config Management page in the Google Cloud console. This page lets you update the syncRev parameter for multiple clusters belonging to the same environ at the same time.

For example, the following ConfigManagement definition configures Anthos Config Management to use the 1.2.3 tag:

apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  # clusterName is required and must be unique among all managed clusters
  clusterName: my-cluster
  git:
    syncRepo: git@example.com:anthos/config-management.git
    # Pin the cluster using a tag
    syncRev: 1.2.3
    secretType: ssh

If you apply this configuration to your cluster, Anthos Config Management will use the 1.2.3 tag of the example.com:anthos/config-management.git repository.

To update a cluster, change the spec.git.syncRev field to the new value for the cluster. This lets you define which clusters get updated and when. If you need to roll back a change, change the spec.git.syncRev field back to its former value.
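
If you synchronize from multiple repositories, the equivalent pin lives in the RootSync (or RepoSync) object. The following sketch assumes the configsync.gke.io/v1beta1 API, where the field is spec.git.revision rather than spec.git.syncRev. The Secret name git-creds is an assumption; use the name of the Secret that holds your SSH key.

```yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  git:
    repo: git@example.com:anthos/config-management.git
    # Pin the cluster using a tag (a full commit ID also works here)
    revision: 1.2.3
    auth: ssh
    secretRef:
      # Assumed Secret name; replace with your own
      name: git-creds
```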

The following diagram illustrates the rollout process for this method. First, you commit changes to the Anthos Config Management repository, and then you update the ConfigManagement definitions on all the clusters:

Rollout process for Git commits and tags.

We recommend the following actions:

  • Use Git commit IDs rather than tags. Because of the way that Git functions, you have a guarantee that commit IDs will never change. For example, a git push --force can't change the commit that Anthos Config Management is using. This approach is useful for auditing purposes and to track which commit you are using in logs. Additionally, unlike tags, commit IDs require no extra step to create.
  • If you prefer using Git tags instead of Git commit IDs, and you're using GitLab, protect the tags to keep them from being moved or deleted. The other major Git solutions do not have this feature.
  • If you want to update multiple clusters at the same time, you can do that in the Anthos Config Management console page. For you to update multiple clusters at once, they need to be part of the same environ (and be in the same project).

Use Git branches

If you want changes to be applied to clusters as soon as they are merged in your Git repository, configure Anthos Config Management to use Git branches instead of commits or tags. In this method, you can create multiple long-lived branches in your Git repository, and configure Anthos Config Management in different clusters to read its configuration from different branches.

For example, a simple pattern has two branches:

  • A staging branch for non-production clusters.
  • A master branch for production clusters.

For non-production clusters, create the ConfigManagement object with the spec.git.syncBranch field set to staging. For production clusters, create the ConfigManagement object with the spec.git.syncBranch parameter set to master. If you synchronize configs from multiple repositories, make this configuration in the RootSync and RepoSync custom resources instead. For more information, see configuring the Operator.

For example, the following ConfigManagement definition configures Anthos Config Management to use the master branch:

apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  # clusterName is required and must be unique among all managed clusters
  clusterName: my-cluster
  git:
    syncRepo: git@example.com:anthos/config-management.git
    # This cluster will apply the configuration
    # available on the master branch.
    syncBranch: master
    secretType: ssh
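
For comparison, a non-production cluster that follows the same pattern might use a definition like the following sketch. Only the branch and the cluster name differ; my-staging-cluster is a hypothetical name.

```yaml
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  # Hypothetical name; must be unique among all managed clusters
  clusterName: my-staging-cluster
  git:
    syncRepo: git@example.com:anthos/config-management.git
    # This cluster applies changes as soon as they are
    # merged into the staging branch.
    syncBranch: staging
    secretType: ssh
```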

The following diagram illustrates the rollout process for this method:

Rollout process for Git branches.

You can adapt this pattern to specific needs, using more than two branches, or using branches that are mapped to something other than environments. If you need to roll back a change, use the git revert command to create a new commit on the same branch that reverts the changes from the previous commit.

We recommend the following actions:

  • When dealing with multiple clusters, use at least two Git branches to help distinguish between production and non-production clusters.
  • Most Git solutions let you use the protected branches feature to prevent deletions or unreviewed changes of those branches. For more information, see the documentation for GitHub, GitLab, and Bitbucket.

Use ClusterSelector and NamespaceSelector objects

Git branches are a good way of doing a staged rollout of changes across multiple clusters that will eventually all have the same policies. However, if you want to roll out a change only to a subset of clusters or of namespaces, then use the ClusterSelector and NamespaceSelector objects. These objects have a similar goal: they let you apply objects only to clusters or namespaces that have specific labels.

For example:

  • By using ClusterSelector objects, you can apply different policies to clusters, depending on which country they are located in, for various compliance regimes.
  • By using NamespaceSelector objects, you can apply different policies to namespaces used by an internal team and by an external contractor.
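
As a sketch of the second case, the following NamespaceSelector matches namespaces that carry a hypothetical owner: external-contractor label, and a RoleBinding opts in to that selector through an annotation. The label key and value, the object names, and the group address are illustrative assumptions. The built-in view ClusterRole is used to grant read-only access.

```yaml
kind: NamespaceSelector
apiVersion: configmanagement.gke.io/v1
metadata:
  name: external-contractor
spec:
  selector:
    matchLabels:
      # Hypothetical label applied to contractor namespaces
      owner: external-contractor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: contractor-viewers
  annotations:
    # Apply this object only in namespaces matched by the selector
    configmanagement.gke.io/namespace-selector: external-contractor
subjects:
  - kind: Group
    name: contractors@example.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
```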

ClusterSelector and NamespaceSelector objects also let you implement advanced testing and release methodologies, such as the following:

  • Canary releases of policies, where you deploy a new policy to a small subset of clusters and namespaces for a long time to study the policy's impact.
  • A/B testing, where you deploy different versions of the same policy to different clusters to study the difference of the policy versions' impact and then choose the best one to deploy everywhere.

For example, imagine an organization with several production clusters. The platform team has already created two categories of production clusters, called canary-prod and prod, using Anthos Config Management, Cluster, and ClusterSelector objects (see configuring only a subset of clusters).
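
The canary-prod category in this scenario could be declared with objects like the following sketch. The cluster name and the environment label key are hypothetical; the Cluster object's name is assumed to match the spec.clusterName value in that cluster's ConfigManagement object.

```yaml
kind: Cluster
apiVersion: clusterregistry.k8s.io/v1alpha1
metadata:
  # Hypothetical cluster; matches spec.clusterName on that cluster
  name: my-canary-cluster
  labels:
    environment: canary-prod
---
kind: ClusterSelector
apiVersion: configmanagement.gke.io/v1
metadata:
  # Referenced by the configmanagement.gke.io/cluster-selector annotation
  name: canary-prod
spec:
  selector:
    matchLabels:
      environment: canary-prod
```

A prod selector would follow the same shape with a different label value.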

The platform team wants to roll out a policy with Policy Controller to enforce the presence of a team label on namespaces in order to identify which team each namespace belongs to. They have already rolled out a version of this policy in dry run mode, and now they want to enforce it on a small number of clusters. Using ClusterSelector objects, they create two different K8sRequiredLabels resources that are applied to different clusters.

  • The K8sRequiredLabels resource is applied to clusters of type prod, with an enforcementAction parameter set to dryrun:

    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: ns-must-have-team
      annotations:
        configmanagement.gke.io/cluster-selector: prod
    spec:
      enforcementAction: dryrun
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Namespace"]
      parameters:
        labels:
          - key: "team"
    
  • The K8sRequiredLabels resource is applied to clusters of type canary-prod, without the enforcementAction parameter, meaning that the policy is actually enforced:

    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: ns-must-have-team
      annotations:
        configmanagement.gke.io/cluster-selector: canary-prod
    spec:
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Namespace"]
      parameters:
        labels:
          - key: "team"
    

The configmanagement.gke.io/cluster-selector annotation allows the team to enforce the policy only in clusters of type canary-prod, preventing any unintended side-effects from spreading to the whole production fleet. For more information about the dry run feature of Policy Controller, see creating constraints.

We recommend the following actions:

  • Use ClusterSelector and NamespaceSelector objects if you need to apply a configuration change to only a subset of clusters or namespaces indefinitely or for a long time.
  • If you roll out a change by using selectors, be very careful. If you use Git commits, any error affects only one cluster at a time, because you're rolling out cluster by cluster. If you use Git branches, any error can affect all the clusters that use that branch. If you use selectors, an error can affect all clusters at once.

Implementing reviews, tests, and validations

One advantage of Anthos Config Management is that it manages everything declaratively—Kubernetes resources, cloud resources, and policies. This means that files in a source control management system represent the resources (Git files, in the case of Anthos Config Management). This characteristic lets you implement development workflows that you already use for an application's source code: reviews and automated testing.

Implement reviews

Because Anthos Config Management is based on Git, you can use your preferred Git solution to host the Anthos Config Management repository. Your Git solution probably has a code review feature, which you can use to review changes made to the Anthos Config Management repository.

The best practices for reviewing changes to the Anthos Config Management repository are the same as with a normal code review, as follows:

Because of the sensitivity of the Anthos Config Management codebase, we also recommend that, if possible with your Git solution, you make the following configurations:

By using these different features, you can enforce approvals for each change request to the Anthos Config Management codebase. For example, you can ensure that each change is approved at least by a member of the platform team (who operates the fleet of clusters), and by a member of the security team (who is in charge of defining and implementing security policies).

We recommend the following action:

  • Enforce peer reviews on the Anthos Config Management repository, and protect the Git branches that are used by your clusters.

Implement automated tests

A common best practice when working on a codebase is to implement continuous integration. This means that you configure automated tests to run when a change request is created or updated. Automated tests can catch many errors before a human reviews the change request. This tightens the feedback loop for the developer. You can implement the same idea, using the same tools, for the Anthos Config Management repository.

For example, a good place to start is to run the nomos vet command automatically on new changes. This command checks that the syntax of your Anthos Config Management repository is valid. You can implement this test by using Cloud Build by following the validating configs tutorial. You can integrate Cloud Build with GitHub and Bitbucket.

As you can see in the validating configs tutorial, the test is done by using a container image. You can therefore implement the test in any Continuous Integration solution that runs containers, not only Cloud Build. Specifically, you can implement it with GitLab CI, following this example, which also includes tests for Policy Controller.
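
To make the container-based approach concrete, the following is a minimal Cloud Build sketch, not the configuration from the tutorial. The image name is an assumption: substitute whichever image you use that ships the nomos binary and a shell. The --clusters "" flag is the one described in this document for restricting validation to offline checks, and /workspace is where Cloud Build places the checked-out repository.

```yaml
# cloudbuild.yaml
steps:
  - id: nomos-vet
    # Assumed image name; replace with an image that contains nomos
    name: gcr.io/config-management-release/nomos:stable
    entrypoint: /bin/sh
    args:
      - -c
      # Validate the repository without contacting any cluster
      - 'nomos vet --clusters "" --path /workspace'
```

The same command can be placed in the job definition of any other CI system that runs containers.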

To tighten the feedback loop even more, you can ask that users run the nomos vet command as a Git pre-commit hook. One caveat is that some users might not have access to the Kubernetes clusters managed by Anthos Config Management, so they might not be able to run the full validation from their workstation. In that case, run the nomos vet --clusters "" command to restrict the validation to semantic and syntactic checks.
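
One possible way to wire up such a hook is the open source pre-commit framework, which this document does not otherwise cover. The following sketch assumes that the framework and the nomos binary are both installed on the user's workstation; the hook ID and name are arbitrary.

```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: nomos-vet
        name: nomos vet
        # Offline validation only; no cluster access required
        entry: nomos vet --clusters "" --path .
        language: system
        # Validate the whole repository, not individual files
        pass_filenames: false
        always_run: true
```

A plain script in .git/hooks/pre-commit that runs the same command would work equally well.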

You can implement any other test that you think is necessary or useful. If you use Policy Controller, you can implement automated tests of suggested changes against its policies, as outlined in Test changes against Policy Controller policies.

We recommend the following action:

  • Implement tests in a continuous integration pipeline. Run at least the nomos vet command on all suggested changes.

Using Policy Controller

Policy Controller is a Kubernetes dynamic admission controller. When you install and configure Policy Controller, Kubernetes can reject changes that don't comply with predefined rules, which are called policies.

Following are two example use cases of Policy Controller:

  • Enforce the presence of specific labels on Kubernetes objects.
  • Prevent the creation of privileged pods.
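
As a sketch of the second use case, the following constraint uses the K8sPSPPrivilegedContainer template from the Gatekeeper policy library. It assumes that this constraint template is already installed on the cluster; the constraint name is arbitrary.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: psp-privileged-container
spec:
  match:
    kinds:
      # Evaluate every Pod that is created or updated
      - apiGroups: [""]
        kinds: ["Pod"]
```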

A library of policy templates is available for implementing the most commonly used policies, but you can write your own with a powerful language called Rego. Using Policy Controller, you can, for example, restrict the hostnames that users can configure in an ingress (for more information, see this tutorial).

Like Config Sync, Policy Controller is part of the Anthos Config Management product. Policy Controller and Config Sync have different, but complementary, use cases, as follows:

  • Config Sync is a GitOps-style tool that lets you create any Kubernetes object, potentially in multiple clusters at the same time. As mentioned in the introduction, Config Sync is especially useful for managing policies.
  • Policy Controller lets you define policies for objects that can be created in Kubernetes. You define these policies in custom resources, which are Kubernetes objects themselves.

The preceding features create a bidirectional relationship between the two applications. You can use Config Sync to create the policies that are enforced by Policy Controller, and you can use those policies to control exactly which objects Config Sync (or any other process) can create, as shown in the following diagram:

Config Sync and Policy Controller.

The Git repository, Config Sync, Policy Controller, Kubernetes, a continuous deployment (CD) system, and users all interact with each other, in the following ways:

  • Users interact with the Anthos Config Management Git repository to create, update, or delete Kubernetes objects.
  • Config Sync reads its configuration from the Anthos Config Management Git repository.
  • Config Sync interacts with the Kubernetes API server to create objects, which include policies for Policy Controller.
  • The CD system also interacts with the Kubernetes API server to create objects. It can create constraints for Policy Controller. However, we recommend that you use Anthos Config Management for this use case because it gives you a centralized place to manage and test the constraints.
  • The Kubernetes API server either accepts or rejects the creation of objects by Config Sync and by the CD system, based on the response from Policy Controller.
  • Policy Controller gives that response based on the policies that it reads from the Kubernetes API server.

The following diagram illustrates these interactions:

Interactions between Git repository, Config Sync, Policy Controller, Kubernetes, a continuous deployment system, and users.

Policy Controller can prevent policy violations that escape human reviewers and automated tests, so you can consider it the last line of defense for your Kubernetes clusters. Policy Controller also becomes more useful as the number of human reviewers grows for Anthos Config Management. Due to the phenomenon of social loafing, the more reviewers that you have, the less likely it is that they are consistently enforcing the rules defined in your organization.

Test changes against Policy Controller policies

If you use Policy Controller, you can add a few steps to your continuous integration pipeline (see Implement automated tests) to automatically test suggested changes against policies. Automating the tests gives quicker and more visible feedback to the person who suggests the change. If you don't test the changes against the policies in the continuous integration pipeline, then you have to rely on the system described in Monitor rollouts to be alerted of Anthos Config Management syncing errors. Testing the changes against the policies exposes any violation clearly, and early, to the person who suggests the change.

You can implement this test in Cloud Build by following the Using Policy Controller in a CI pipeline tutorial. As mentioned earlier in Implement automated tests, you can integrate Cloud Build with GitHub and Bitbucket. You can also implement this test with GitLab CI. See this repository for an implementation example.

We recommend the following action:

  • If you use Policy Controller, validate the suggested changes against its policies in your continuous integration pipeline.

Monitoring rollouts

Even if you implement all the guardrails that this document covers, errors can still occur. Following are two common types of errors:

  • Errors that pose no problem to Config Sync itself, but prevent your workloads from working properly, such as an overly restrictive NetworkPolicy that prevents components of your workload from communicating.
  • Errors that make it impossible for Config Sync to apply changes to a cluster, such as an invalid Kubernetes manifest, or an object rejected by an admission controller. The methods explained earlier should catch most of these errors.

Detecting the errors described in the first preceding bullet is almost impossible at the level of Anthos Config Management, because this requires understanding the state of each of your workloads. For this reason, detecting these errors is best done by your existing monitoring system that alerts you when an application is misbehaving.

Detecting the errors described in the second preceding bullet—which should be rare if you have implemented all the guardrails—requires a specific setup. By default, Anthos Config Management writes errors to its logs (which you will find, by default, in Cloud Logging). Errors are also displayed in the Anthos Config Management console page. Neither logs nor the console are usually enough to detect errors, because you probably don't monitor them at all times. The simplest way to automate error detection is to run the nomos status command, which tells you if there's an error in a cluster.

You can also set up a more advanced solution with automatic alerts for errors. Anthos Config Management exposes metrics in the Prometheus format. You can use Prometheus to scrape these metrics, you can configure the import of Prometheus metrics into Cloud Monitoring, or you can use any monitoring solution compatible with the Prometheus format. For more information, see monitoring Anthos Config Management.

After you have the Anthos Config Management metrics in your monitoring system, create an alert to notify you when the gkeconfig_monitor_errors metric is greater than 0. For more information, see managing alerting policies for Cloud Monitoring, or alerting rules for Prometheus.
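
If you use Prometheus, such an alert could look like the following rule file. This is a sketch: the group name, alert name, five-minute hold period, and severity label are assumptions to adapt to your own alerting conventions. Only the metric name comes from this document.

```yaml
groups:
  - name: config-management
    rules:
      - alert: ConfigManagementSyncErrors
        # Fire when any error is reported
        expr: gkeconfig_monitor_errors > 0
        # Hypothetical hold period to ignore short-lived errors
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Anthos Config Management reports errors
```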

Summary of mechanisms for safe rollouts with Anthos Config Management

The following table summarizes the various mechanisms described earlier in this document. None of these mechanisms is exclusive. You can choose to use some of them or all of them, for different purposes.

Mechanism | What it's good for | What it's not good for | Example use case
Git commit IDs and tags | Use specific Git commit IDs or tags to precisely control which cluster changes are applied on. | Don't use Git commit IDs or tags for long-lived differences between clusters. Use cluster selectors. | All your clusters are configured to apply the 12345 Git commit. You make a change with a new commit, abcdef, that you want to test. You change the configuration of a single cluster to use this new commit to validate the change.
Git branches | Use multiple Git branches when you want to roll out the same change to multiple environments, one after the other. | Don't use multiple Git branches for long-lived differences between clusters. The branches will significantly diverge and will be hard to merge back together. | First merge the change in the staging branch, where it will be picked up by staging clusters. Then merge the change in the master branch, where it will be picked up by production clusters.
Cluster selectors and namespace selectors | Use selectors for long-lived differences between clusters and namespaces. | Don't use selectors for a staged rollout across multiple environments. If you want to test a modification first in staging, and then deploy it in production, use separate Git branches. | If the application teams need full access to development clusters, but read-only access to production clusters, use the ClusterSelector object to apply the correct RBAC policies only to the relevant clusters.
Peer reviews | Use peer reviews to ensure that the relevant teams approve the changes. | Human reviewers don't catch all errors, especially items like syntax errors. | Your organization mandates that the security team must review configuration changes that affect multiple systems. Have a security team member review the changes.
Automated tests in continuous integration pipeline | Use automated tests to catch errors in suggested changes. | Automated tests can't fully replace a human reviewer. Use both. | Running a nomos vet command on all suggested changes confirms that the repository is a valid Anthos Config Management configuration.
Policy Controller | Enforce organization-wide policies, and implement guardrails directly at the Kubernetes API server level. | Policy Controller can't be used to create, update, or delete policies (that's the role of Anthos Config Management). Policy Controller can only enforce policies. | The security team uses Anthos Config Management to create a Policy Controller constraint to prevent users from creating privileged containers, even in namespaces that are directly managed by the application teams.
Test changes against Policy Controller constraints | Make sure that Policy Controller is not rejecting changes when Anthos Config Management applies them. | Testing changes against Policy Controller constraints in a continuous integration pipeline is not a replacement for enabling Policy Controller on the clusters. | Every namespace must have a "team" label to identify its owner. A user wants to create a new namespace, and forgets to add this label in their suggested change. The continuous integration pipeline catches the error before a human reviews the change.
Monitor syncing errors | Be sure that Anthos Config Management actually applies changes to your clusters. | Syncing errors occur only if Anthos Config Management tries to apply an invalid repository or if the Kubernetes API server rejects some of the objects. If you haven't codified all your constraints in Policy Controller policies, then resources that violate those constraints won't be detected. | A user bypasses all your tests and reviews and commits an invalid change to the Anthos Config Management repository. This change can't be applied to your clusters. If you're monitoring syncing errors, you'll be alerted if an error is made.

What's next