Safe rollouts with Anthos Config Management
This document shows cluster operators and platform administrators how to safely roll out changes across multiple environments using Anthos Config Management. Anthos Config Management can help you avoid errors that affect all of your environments simultaneously.
Anthos Config Management lets you manage single clusters, multi-tenant clusters, and multi-cluster Kubernetes configurations by using files stored in a Git repository. Anthos Config Management combines three technologies: Config Sync, Policy Controller, and Config Connector. Config Sync watches for updates to all files in the Git repository and applies changes to all relevant clusters automatically. Policy Controller manages and enforces policies for objects in your clusters. Config Connector uses Google Kubernetes Engine (GKE) custom resources to manage cloud resources.
Config Sync configurations can represent several things, including the following:
- Standard GKE objects, such as NetworkPolicy, DaemonSet, or RoleBinding resources.
- Google Cloud resources, such as Compute Engine instances or Cloud SQL databases, through Config Connector.
- Constraints on the configurations themselves, through Policy Controller.
Anthos Config Management is especially suited to deploy configurations, policies, and workloads needed to run the platform that you build on top of Anthos—for example, security agents, monitoring agents, and certificate managers.
Although you can deploy user-facing applications with Anthos Config Management, we don't recommend linking their release lifecycle to the release lifecycle of the administrative workloads mentioned earlier. Instead, we recommend that you use a tool dedicated to application deployment, such as a continuous deployment tool, so that application teams can be in charge of their release schedule.
Anthos Config Management is a powerful product that can manage many elements, so you need guardrails to avoid errors that have a major impact. This document describes several methods to create guardrails. The first section covers staged rollouts, the second section focuses on tests and validations, and the third section explains how to use Policy Controller to create guardrails. The fourth section shows how to monitor Anthos Config Management deployments.
You can use most of the methods discussed in this document even if you're using only Config Sync rather than the full Anthos Config Management product. If you're not using the full product but still want to implement the methods that involve Policy Controller, you can do so by using Gatekeeper. The exceptions are methods that rely on the Anthos Config Management page in the Google Cloud console, such as updating the Anthos Config Management configuration in the console. You can also use several of the methods described in this document at the same time; in the following section, a table indicates which methods are compatible for simultaneous use.
Implementing staged rollouts with Anthos Config Management
In a multi-cluster environment, which is a common situation for Anthos users, we don't recommend applying a configuration change across all the clusters at the same time. A staged rollout, cluster by cluster (or even namespace by namespace, if you use namespaces as the boundary between applications), is much safer because it reduces the blast radius of any error.
Following are several ways to implement staged rollouts with Anthos Config Management:
- Use Git commits or tags to manually apply the changes that you want to the clusters.
- Use Git branches to automatically apply the changes when the changes are merged. You can use different branches for different groups of clusters.
- Use `ClusterSelector` and `NamespaceSelector` objects to selectively apply changes to subgroups of clusters or namespaces.
All methods for staged rollouts have advantages and disadvantages. The following table shows which of these methods you can use at the same time.
Are X compatible with Y? | Git commits or tags | Git branches | Cluster selectors | Namespace selectors |
---|---|---|---|---|
Git commits or tags | | Not compatible | Compatible | Compatible |
Git branches | Not compatible | | Compatible | Compatible |
Cluster selectors | Compatible | Compatible | | Compatible |
Namespace selectors | Compatible | Compatible | Compatible | |
The following decision tree helps you decide when to use one of the staged rollout methods.
Use Git commits or tags
Compared to the other staged rollout methods, using Git commits or tags provides the most control and is the safest. Use this method if you want to apply changes to your clusters one by one and to control exactly when that happens. You can also use the Anthos Config Management page in the Google Cloud console to update multiple clusters at the same time.
In this method, you "pin" each cluster to a specific version (either a commit
or a tag) of your Anthos Config Management repository. This method is
similar to
using the Git commit as a container image tag.
You implement this method by specifying the commit, tag, or hash in the
spec.git.revision
field of the RootSync
or RepoSync
custom resource.
If you manage your `RootSync` or `RepoSync` custom resources with a tool like kustomize, you can reduce the amount of manual work required to roll out changes. With such a tool, you only need to change the `revision` parameter in one place, and then selectively apply the new `RootSync` or `RepoSync` custom resource to your clusters in the order, and at the pace, that you choose.
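For example, the following kustomization is a minimal sketch of this approach. The file layout and the revision value are illustrative, and the base `rootsync.yaml` file is assumed to contain a `RootSync` object like the one shown in the next example:

```yaml
# kustomization.yaml: pin the Config Sync revision in one place.
# The base rootsync.yaml is assumed to contain a RootSync object
# named root-sync, like the example shown later in this section.
resources:
  - rootsync.yaml
patches:
  - target:
      kind: RootSync
      name: root-sync
    patch: |-
      - op: replace
        path: /spec/git/revision
        value: 1.2.4   # the new version to roll out
```

You can then render the output (for example, with `kustomize build`) and apply it cluster by cluster.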
Additionally, if you are using Anthos Config Management (and not Config Sync alone), you have access to the Anthos Config Management page in the Google Cloud console. This page lets you update the `revision` parameter for multiple clusters belonging to the same fleet at the same time. However, if you have an automated system to update the Anthos Config Management configuration, we recommend against using the console to change this configuration.
For example, the following `RootSync` definition configures Anthos Config Management to use the `1.2.3` tag:

```yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceType: git
  sourceFormat: unstructured
  git:
    repo: git@example.com:anthos/config-management.git
    revision: 1.2.3
    auth: ssh
```
If you apply this configuration to your cluster, Anthos Config Management uses the `1.2.3` tag of the `example.com:anthos/config-management.git` repository.
To update a cluster, change the `spec.git.revision` field to the new value for the cluster. This lets you define which clusters get updated and when. If you need to roll back a change, change the `spec.git.revision` field back to its former value.
The following diagram illustrates the rollout process for this method. First, you commit changes to the Anthos Config Management repository, and then you update the RootSync definitions on all the clusters:
We recommend the following actions:
- Use Git commit IDs rather than tags. Because of the way that Git functions, you have a guarantee that commit IDs never change. For example, a `git push --force` can't change the commit that Anthos Config Management is using, the way it can with a tag. This approach is useful for auditing purposes and for tracking which commit you're using in logs. Additionally, unlike tags, commit IDs require no extra step to create.
- If you prefer using Git tags instead of Git commit IDs, and you're using GitLab, protect the tags to keep them from being moved or deleted. The other major Git solutions don't have this feature.
- If you want to update multiple clusters at the same time, you can do so on the Anthos Config Management page in the Google Cloud console. To update multiple clusters at once, the clusters must be part of the same fleet and in the same project.
Use Git branches
If you want changes to be applied to clusters as soon as they are merged in your Git repository, configure Anthos Config Management to use Git branches instead of commits or tags. In this method, you can create multiple long-lived branches in your Git repository, and configure Anthos Config Management in different clusters to read its configuration from different branches.
For example, a simple pattern has two branches:
- A `staging` branch for non-production clusters.
- A `master` branch for production clusters.
For non-production clusters, create the `RootSync` or `RepoSync` object with the `spec.git.branch` field set to `staging`. For production clusters, create the `RootSync` or `RepoSync` object with the `spec.git.branch` field set to `master`.
For example, the following `RootSync` definition configures Anthos Config Management to use the `master` branch:

```yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  git:
    repo: git@example.com:anthos/config-management.git
    branch: master
    auth: ssh
```
The following diagram illustrates the rollout process for this method:
You can adapt this pattern to specific needs, using more than two branches, or using branches that are mapped to something other than environments. If you need to roll back a change, use the `git revert` command to create a new commit on the same branch that reverts the changes from the previous commit.
We recommend the following actions:
- When dealing with multiple clusters, use at least two Git branches to help distinguish between production and non-production clusters.
- Most Git solutions let you use the protected branches feature to prevent deletions or unreviewed changes of those branches. For more information, see the documentation for GitHub, GitLab, and Bitbucket.
Use ClusterSelector and NamespaceSelector objects
Git branches are a good way of doing a staged rollout of changes across multiple clusters that will eventually all have the same policies. However, if you want to roll out a change only to a subset of clusters or namespaces, use the `ClusterSelector` and `NamespaceSelector` objects. These objects have a similar goal: they let you apply objects only to clusters or namespaces that have specific labels.
For example:
- By using `ClusterSelector` objects, you can apply different policies to clusters depending on which country they are located in, for various compliance regimes.
- By using `NamespaceSelector` objects, you can apply different policies to namespaces used by an internal team and by an external contractor.
`ClusterSelector` and `NamespaceSelector` objects also let you implement advanced testing and release methodologies, such as the following:
- Canary releases of policies, where you deploy a new policy to a small subset of clusters and namespaces for a long time to study the policy's impact.
- A/B testing, where you deploy different versions of the same policy to different clusters to compare the impact of the policy versions, and then choose the best one to deploy everywhere.
For example, imagine an organization with several production clusters. The platform team has already created two categories of production clusters, called `canary-prod` and `prod`, using Anthos Config Management, `Cluster`, and `ClusterSelector` objects (see Use ClusterSelectors).
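For example, the `canary-prod` category might be defined with a labeled cluster registration and a matching selector. The following is a minimal sketch based on the ClusterSelector documentation; the cluster name and label are illustrative:

```yaml
# Registers a cluster with a label identifying its category.
apiVersion: clusterregistry.k8s.io/v1alpha1
kind: Cluster
metadata:
  name: cluster-east-1   # illustrative name
  labels:
    environment: canary-prod
---
# Selects all clusters that carry the canary-prod label.
apiVersion: configmanagement.gke.io/v1
kind: ClusterSelector
metadata:
  name: canary-prod
spec:
  selector:
    matchLabels:
      environment: canary-prod
```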
The platform team wants to roll out a policy with Policy Controller to enforce
the presence of a team label on namespaces in order to identify which team each
namespace belongs to. They have already rolled out a version of this policy in
dry run mode, and now they want to enforce it on a small number of clusters.
Using `ClusterSelector` objects, they create two different `K8sRequiredLabels` resources that are applied to different clusters.
- The `K8sRequiredLabels` resource is applied to clusters of type `prod`, with the `enforcementAction` parameter set to `dryrun`:

  ```yaml
  apiVersion: constraints.gatekeeper.sh/v1beta1
  kind: K8sRequiredLabels
  metadata:
    name: ns-must-have-team
    annotations:
      configmanagement.gke.io/cluster-selector: prod
  spec:
    enforcementAction: dryrun
    match:
      kinds:
        - apiGroups: [""]
          kinds: ["Namespace"]
    parameters:
      labels:
        - key: "team"
  ```
- The `K8sRequiredLabels` resource is applied to clusters of type `canary-prod`, without the `enforcementAction` parameter, meaning that the policy is actually enforced:

  ```yaml
  apiVersion: constraints.gatekeeper.sh/v1beta1
  kind: K8sRequiredLabels
  metadata:
    name: ns-must-have-team
    annotations:
      configmanagement.gke.io/cluster-selector: canary-prod
  spec:
    match:
      kinds:
        - apiGroups: [""]
          kinds: ["Namespace"]
    parameters:
      labels:
        - key: "team"
  ```
The `configmanagement.gke.io/cluster-selector` annotation allows the team to enforce the policy only in clusters of type `canary-prod`, preventing any unintended side effects from spreading to the whole production fleet. For more information about the dry run feature of Policy Controller, see creating constraints.
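`NamespaceSelector` objects follow the same pattern at the namespace level. As a minimal sketch (with illustrative names and labels), the following selector could target namespaces used by an external contractor; any configuration annotated with `configmanagement.gke.io/namespace-selector: external-contractor` would then be applied only to the matching namespaces:

```yaml
# Selects namespaces labeled as belonging to an external contractor.
apiVersion: configmanagement.gke.io/v1
kind: NamespaceSelector
metadata:
  name: external-contractor
spec:
  selector:
    matchLabels:
      owner: external-contractor   # illustrative label
```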
We recommend the following actions:
- Use `ClusterSelector` and `NamespaceSelector` objects if you need to apply a configuration change to only a subset of clusters or namespaces, either indefinitely or for a long time.
- If you roll out a change by using selectors, be very careful. If you use Git commits, any error affects only one cluster at a time, because you're rolling out cluster by cluster. But if you use Git branches, any error can affect all the clusters that use that branch. With selectors, an error can affect all the selected clusters at once.
Implementing reviews, tests, and validations
One advantage of Anthos Config Management is that it manages everything declaratively: Kubernetes resources, cloud resources, and policies. This means that the resources are represented by files in a source control management system (Git, in the case of Anthos Config Management). This characteristic lets you implement the development workflows that you already use for an application's source code: reviews and automated testing.
Implement reviews
Because Anthos Config Management is based on Git, you can use your preferred Git solution to host the Anthos Config Management repository. Your Git solution probably has a code review feature, which you can use to review changes made to the Anthos Config Management repository.
The best practices for reviewing changes to the Anthos Config Management repository are the same as with a normal code review, as follows:
- Practice trunk-based development.
- Work in small batches.
- Ensure that code review is done synchronously or at least promptly.
- The person who reviews and approves the change should not be the same person who suggested the change.
Because of the sensitivity of the Anthos Config Management codebase, we also recommend that, if possible with your Git solution, you make the following configurations:
- Protect the branches that are directly used by clusters. See the documentation for GitHub, GitLab, and Bitbucket. GitLab also lets you protect tags.
- After the branches are protected, you can refine the approvals that are
needed to merge a change:
- On GitHub, enable required reviews, and optionally use the CODEOWNERS file to control who can approve changes for subsections of the repository.
- For GitLab, follow the recommendations for managing who can approve merge requests in the Best practices for policy management with Anthos Config Management and GitLab article.
- On Bitbucket, combine default reviewers with default merge checks. Optionally, you can use a Code Owners plugin for Bitbucket Server, available on the Atlassian Marketplace, to control who can approve changes for subsections of the repository.
By using these different features, you can enforce approvals for each change request to the Anthos Config Management codebase. For example, you can ensure that each change is approved at least by a member of the platform team (who operates the fleet of clusters), and by a member of the security team (who is in charge of defining and implementing security policies).
We recommend the following action:
- Enforce peer reviews on the Anthos Config Management repository, and protect the Git branches that are used by your clusters.
Implement automated tests
A common best practice when working on a codebase is to implement continuous integration. This means that you configure automated tests to run when a change request is created or updated. Automated tests can catch many errors before a human reviews the change request. This tightens the feedback loop for the developer. You can implement the same idea, using the same tools, for the Anthos Config Management repository.
For example, a good place to start is to run the `nomos vet` command automatically on new changes. This command validates that your Anthos Config Management repository's syntax is valid. You can implement this test in Cloud Build by following the validating configs tutorial. You can integrate Cloud Build with the following options:
- Bitbucket, by using build triggers.
- GitHub, by using the Google Cloud Build GitHub application. Build triggers are also available for GitHub, but the GitHub application is the preferred method of integration.
As you can see in the validating configs tutorial, the test is done by using a container image. You can therefore implement the test in any continuous integration solution that runs containers, not only Cloud Build. Specifically, you can implement it with GitLab CI, following this example, which also includes tests for Policy Controller.
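For reference, the Cloud Build configuration for this test can be as small as a single step. The following is a sketch, not the tutorial's exact pipeline; it assumes the published `nomos` container image and a repository checked out at `/workspace`:

```yaml
# cloudbuild.yaml: run nomos vet against the checked-out repository.
# --clusters "" restricts validation to semantic and syntactic checks,
# because the build environment has no access to the clusters.
steps:
  - name: gcr.io/config-management-release/nomos:stable
    args: ["nomos", "vet", "--clusters", "", "--path", "/workspace"]
```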
To tighten the feedback loop even more, you can ask users to run the `nomos vet` command as a Git pre-commit hook. One caveat is that some users might not have access to the Kubernetes clusters managed by Anthos Config Management, so they might not be able to run the full validation from their workstation. In that case, run the `nomos vet --clusters ""` command to restrict the validation to semantic and syntactic checks.
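If your users rely on the pre-commit framework, the hook can be declared in the repository itself. This is a hypothetical configuration that assumes `nomos` is installed on each workstation:

```yaml
# .pre-commit-config.yaml: run nomos vet before every commit.
repos:
  - repo: local
    hooks:
      - id: nomos-vet
        name: nomos vet
        entry: nomos vet --clusters ""
        language: system
        pass_filenames: false
```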
You can implement any other test that you think is necessary or useful. If you use Policy Controller, you can implement automated tests of suggested changes against its policies, as outlined in Test changes against Policy Controller policies.
We recommend the following action:
- Implement tests in a continuous integration pipeline. Run at least the `nomos vet` command on all suggested changes.
Using Policy Controller
Policy Controller is a Kubernetes dynamic admission controller. When you install and configure Policy Controller, Kubernetes can reject changes that don't comply with predefined rules, which are called policies.
Following are two example use cases of Policy Controller:
- Enforce the presence of specific labels on Kubernetes objects.
- Prevent the creation of privileged pods (see the sketch after this list).
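As an illustration of the second use case, the following constraint is a minimal sketch. It assumes that the `K8sPSPPrivilegedContainer` template from the policy template library is installed on the cluster:

```yaml
# Rejects any pod that requests a privileged container.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: disallow-privileged-containers
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```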
A library of policy templates is available for implementing the most commonly used policies, but you can write your own with a powerful language called Rego. Using Policy Controller, you can, for example, restrict the hostnames that users can configure in an ingress (for more information, see this tutorial).
Like Config Sync, Policy Controller is part of the Anthos Config Management product. Policy Controller and Config Sync have different, but complementary, use cases, as follows:
- Config Sync is a GitOps-style tool that lets you create any Kubernetes object, potentially in multiple clusters at the same time. As mentioned in the introduction, Config Sync is especially useful for managing policies.
- Policy Controller lets you define policies for objects that can be created in Kubernetes. You define these policies in custom resources, which are Kubernetes objects themselves.
The preceding features create a bidirectional relationship between the two applications. You can use Config Sync to create the policies that are enforced by Policy Controller, and you can use those policies to control exactly which objects Config Sync (or any other process) can create, as shown in the following diagram:
The Git repository, Config Sync, Policy Controller, Kubernetes, a continuous deployment (CD) system, and users all interact with each other, in the following ways:
- Users interact with the Anthos Config Management Git repository to create, update, or delete Kubernetes objects.
- Config Sync reads its configuration from the Anthos Config Management Git repository.
- Config Sync interacts with the Kubernetes API server to create objects, which include policies for Policy Controller.
- The CD system also interacts with the Kubernetes API server to create objects. It can create constraints for Policy Controller. However, we recommend that you use Anthos Config Management for this use case because it gives you a centralized place to manage and test the constraints.
- The Kubernetes API server either accepts or rejects the creation of objects by Config Sync and by the CD system, based on the response from Policy Controller.
- Policy Controller gives that response based on the policies that it reads from the Kubernetes API server.
The following diagram illustrates these interactions:
Policy Controller can prevent policy violations that escape human reviewers and automated tests, so you can consider it the last line of defense for your Kubernetes clusters. Policy Controller also becomes more useful as the number of human reviewers grows for Anthos Config Management. Due to the phenomenon of social loafing, the more reviewers that you have, the less likely it is that they are consistently enforcing the rules defined in your organization.
Test changes against Policy Controller policies
If you use Policy Controller, you can add a few steps to your continuous integration pipeline (see Implement automated tests) to automatically test suggested changes against policies. Automating the tests gives quicker and more visible feedback to the person who suggests the change. If you don't test the changes against the policies in the continuous integration pipeline, then you have to rely on the system described in Monitor rollouts to be alerted of Anthos Config Management syncing errors. Testing the changes against the policies exposes any violation clearly, and early, to the person who suggests the change.
You can implement this test in Cloud Build by following the Using Policy Controller in a CI pipeline tutorial. As mentioned earlier in Implement automated tests, you can integrate Cloud Build with GitHub and Bitbucket. You can also implement this test with GitLab CI. See this repository for an implementation example.
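If you prefer a different approach, one hypothetical alternative (not the exact pipeline from those tutorials) is to evaluate the repository with the open source `gator` CLI from the Gatekeeper project, which tests manifests against constraints offline:

```yaml
# cloudbuild.yaml sketch: the image is hypothetical and would need
# to contain the gator binary; config-root is an illustrative path.
steps:
  - name: gcr.io/my-project/gator:latest
    args: ["gator", "test", "-f", "/workspace/config-root"]
```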
We recommend the following action:
- If you use Policy Controller, validate the suggested changes against its policies in your continuous integration pipeline.
Monitoring rollouts
Even if you implement all the guardrails that this document covers, errors can still occur. Following are two common types of errors:
- Errors that pose no problem to Config Sync itself, but prevent your workloads from working properly, such as an overly restrictive NetworkPolicy that prevents components of your workload from communicating.
- Errors that make it impossible for Config Sync to apply changes to a cluster, such as an invalid Kubernetes manifest, or an object rejected by an admission controller. The methods explained earlier should catch most of these errors.
Detecting the errors described in the first preceding bullet is almost impossible at the level of Anthos Config Management, because this requires understanding the state of each of your workloads. For this reason, detecting these errors is best done by your existing monitoring system that alerts you when an application is misbehaving.
Detecting the errors described in the second preceding bullet (which should be rare if you have implemented all the guardrails) requires a specific setup. By default, Anthos Config Management writes errors to its logs, which you can find, by default, in Cloud Logging. Errors are also displayed in the Anthos Config Management console page. Neither logs nor the console are usually enough to detect errors, because you probably don't monitor them at all times. The simplest way to automate error detection is to run the `nomos status` command, which tells you if there's an error in a cluster.
You can also set up a more advanced solution with automatic alerts for errors. Anthos Config Management exposes metrics in the Prometheus format. For more information, see monitoring Anthos Config Management.
After you have the Anthos Config Management metrics in your monitoring system, create an alert that notifies you when the `gkeconfig_monitor_errors` metric is greater than 0. For more information, see managing alerting policies for Cloud Monitoring, or alerting rules for Prometheus.
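For example, if you use Prometheus, an alerting rule for this metric could look like the following sketch (the group name, duration, and labels are illustrative):

```yaml
# prometheus-rules.yaml: fire an alert when Anthos Config Management
# reports a non-zero number of syncing errors.
groups:
  - name: config-management
    rules:
      - alert: ConfigManagementSyncError
        expr: gkeconfig_monitor_errors > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Anthos Config Management reports syncing errors
```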
Summary of mechanisms for safe rollouts with Anthos Config Management
The following table summarizes the various mechanisms described earlier in this document. None of these mechanisms is exclusive. You can choose to use some of them or all of them, for different purposes.
Mechanism | What it's good for | What it's not good for | Example use case |
---|---|---|---|
Git commit IDs and tags | Use specific Git commit IDs or tags to precisely control which cluster changes are applied on. | Don't use Git commit IDs or tags for long-lived differences between clusters. Use cluster selectors instead. | All your clusters are configured to apply the `12345` Git commit. You make a change with a new commit, `abcdef`, that you want to test. You change the configuration of a single cluster to use this new commit to validate the change. |
Git branches | Use multiple Git branches when you want to roll out the same change to multiple environments, one after the other. | Don't use multiple Git branches for long-lived differences between clusters. The branches will significantly diverge and will be hard to merge back together. | First merge the change in the `staging` branch, where it will be picked up by staging clusters. Then merge the change in the `master` branch, where it will be picked up by production clusters. |
Cluster selectors and namespace selectors | Use selectors for long-lived differences between clusters and namespaces. | Don't use selectors for a staged rollout across multiple environments. If you want to test a modification first in staging and then deploy it in production, use separate Git branches. | If the application teams need full access to development clusters but read-only access to production clusters, use the `ClusterSelector` object to apply the correct RBAC policies only to the relevant clusters. |
Peer reviews | Use peer reviews to ensure that the relevant teams approve the changes. | Human reviewers don't catch all errors, especially items like syntax errors. | Your organization mandates that the security team must review configuration changes that affect multiple systems. Have a security team member review the changes. |
Automated tests in continuous integration pipeline | Use automated tests to catch errors in suggested changes. | Automated tests can't fully replace a human reviewer. Use both. | Running a `nomos vet` command on all suggested changes confirms that the repository is a valid Anthos Config Management configuration. |
Policy Controller | Enforce organization-wide policies, and implement guardrails directly at the Kubernetes API server level. | Policy Controller can't be used to create, update, or delete policies (that's the role of Anthos Config Management). Policy Controller can only enforce policies. | The security team uses Anthos Config Management to create a Policy Controller constraint that prevents users from creating privileged containers, even in namespaces that are directly managed by the application teams. |
Test changes against Policy Controller constraints | Make sure that Policy Controller is not rejecting changes when Anthos Config Management applies them. | Testing changes against Policy Controller constraints in a continuous integration pipeline is not a replacement for enabling Policy Controller on the clusters. | Every namespace must have a "team" label to identify its owner. A user wants to create a new namespace and forgets to add this label in their suggested change. The continuous integration pipeline catches the error before a human reviews the change. |
Monitor syncing errors | Be sure that Anthos Config Management actually applies changes to your clusters. | Syncing errors occur only if Anthos Config Management tries to apply an invalid repository or if the Kubernetes API server rejects some of the objects. If you haven't codified all your constraints in Policy Controller policies, then resources that violate those constraints won't be detected. | A user bypasses all your tests and reviews and commits an invalid change to the Anthos Config Management repository. This change can't be applied to your clusters. If you're monitoring syncing errors, you'll be alerted if an error is made. |
Rollout strategy example
This section uses the concepts introduced in the rest of this document to help you create an end-to-end rollout strategy across all the clusters in your organization. This strategy assumes that you have separate fleets for development, staging, and production (as shown in Fleet Example 1 - Approach 1).
In this scenario, you configure each cluster to synchronize with the Anthos Config Management Git repository by using a specific Git commit. Deploying a change to a given fleet is a four-step process:
- You first update a single cluster (the "canary" cluster) in the fleet to use the new commit.
- You validate that everything works as expected by running tests and monitoring the rollout.
- You update the rest of the clusters in the fleet.
- You validate again that everything works as expected.
To deploy a change across all your clusters, you repeat this process for each fleet. You can technically apply this method with any Git commit, from any branch. However, we suggest that you adopt the following process to identify problems early in the review process:
- When someone opens a change request in the Anthos Config Management Git repository, deploy that change to one of the development clusters.
- If the change request is accepted and merged in your main branch, run the full deployment across all fleets as described earlier.
While some changes might target only a specific fleet, we recommend that you eventually deploy all changes to all fleets. This strategy eliminates the problem of tracking which fleet should sync with which commit. Pay special attention to changes that target only the production fleet, because proper testing will not have been possible in the previous fleets; for example, wait longer for issues to surface between deploying to the canary clusters and deploying to the rest of the clusters.
To summarize, a full end-to-end deployment looks like this:
- Someone opens a change request.
- Automated tests and validations run, and a manual review is done.
- You trigger a job manually to deploy the change to the canary cluster in the development fleet. Automated end-to-end tests run in this cluster.
- If everything is OK, you merge the change request on the main branch.
- The merge triggers an automated job to deploy the new main branch tip commit to the canary cluster in the development fleet. Automated end-to-end tests run in this cluster (to detect potential incompatibilities between two change requests that have been created and merged approximately at the same time).
- The following jobs run one after the other (you trigger them manually, or after a predefined time to allow for user reports of regressions), as sketched after this list:
- Deploy to all the clusters of the development fleet.
- Run tests and validations in the clusters of the development fleet.
- Deploy to the canary cluster of the staging fleet.
- Run tests and validations in the canary cluster of the staging fleet.
- Deploy to all the clusters of the staging fleet.
- Run tests and validations in the clusters of the staging fleet.
- Deploy to the canary cluster of the production fleet.
- Run tests and validations in the canary cluster of the production fleet.
- Deploy to all the clusters of the production fleet.
- Run tests and validations in the clusters of the production fleet.
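As an illustration, the development-fleet portion of this flow could be expressed in a CI configuration like the following GitLab CI sketch. The job names, stage names, and the `update-revision.sh` helper script are all hypothetical; each job would pin the `RootSync` revision of its target clusters to the new commit:

```yaml
# .gitlab-ci.yml sketch: staged deployment of a merged commit.
# update-revision.sh is a hypothetical helper that pins the RootSync
# revision of the targeted clusters to the given commit.
stages:
  - dev-canary
  - dev-fleet

deploy-dev-canary:
  stage: dev-canary
  script: ./update-revision.sh --fleet dev --canary-only "$CI_COMMIT_SHA"

deploy-dev-fleet:
  stage: dev-fleet
  when: manual   # or a delayed job, to allow time for regression reports
  script: ./update-revision.sh --fleet dev --all "$CI_COMMIT_SHA"

# The same pair of stages repeats for the staging and production fleets.
```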
What's next
- Read about monitoring Anthos Config Management.
- Read about fleets.
- Learn how to validate your app against company policies in a continuous integration pipeline.
- Read about best practices for policy management with Anthos Config Management and GitLab.