This document shows cluster operators and platform administrators how to safely roll out changes across multiple environments using Config Sync. This approach can help you avoid errors that affect all of your environments simultaneously.
Config Sync lets you manage single clusters, multi-tenant clusters, and multi-cluster Kubernetes configurations by using files stored in a Git repository.
Configurations can represent several things, including the following:
- Standard Kubernetes objects, such as NetworkPolicy, DaemonSet, or RoleBinding resources.
- Google Cloud resources, such as Compute Engine instances or Cloud SQL databases, through Config Connector.
- Constraints on the configurations themselves, through Policy Controller.
Config Sync is especially suited to deploy configurations, policies, and workloads needed to run the platform that you build on top of Google Kubernetes Engine (GKE) Enterprise edition—for example, security agents, monitoring agents, and certificate managers.
Although you can deploy user-facing applications with Config Sync, we don't recommend linking their release lifecycle to the release lifecycle of the administrative workloads mentioned earlier. Instead, we recommend that you use a tool dedicated to application deployment, such as a continuous deployment tool, so that application teams can be in charge of their release schedule.
Config Sync is a powerful product that can manage many elements, so you need guardrails to avoid errors that have a major impact. This document describes several methods to create guardrails. The first section covers staged rollouts and the second section focuses on tests and validations. The third section shows how to monitor your deployments.
Implementing staged rollouts with Config Sync
In a multi-cluster environment, which is a common situation for GKE Enterprise users, we don't recommend applying a configuration change across all the clusters at the same time. A staged rollout, cluster by cluster, is much safer because it reduces the potential impact of any error.
There are several ways to implement staged rollouts with Config Sync:
- Use Git commits or tags to manually apply the changes that you want to the clusters.
- Use Git branches to automatically apply the changes when the changes are merged. You can use different branches for different groups of clusters.
- Use ClusterSelector and NamespaceSelector objects to selectively apply changes to subgroups of clusters or namespaces.
All methods for staged rollouts have advantages and disadvantages. The following table shows which of these methods you can use at the same time:
Compatibility | Git commits or tags | Git branches | Cluster selectors | Namespace selectors |
---|---|---|---|---|
Git commits or tags | | Not compatible | Compatible | Compatible |
Git branches | Not compatible | | Compatible | Compatible |
Cluster selectors | Compatible | Compatible | | Compatible |
Namespace selectors | Compatible | Compatible | Compatible | |
The following decision tree can help you decide when to use one of the staged rollout methods.
Use Git commits or tags
Compared to the other staged rollout methods, using Git commits or tags provides the most control and is the safest. You can use the Config Sync page in the Google Cloud console to update multiple clusters at the same time. Use this method if you want to apply changes to your clusters one by one, and to control exactly when this happens.
In this method, you "pin" each cluster to a specific version (either a commit or a tag) of your repository. This method is similar to using the Git commit as a container image tag. You implement this method by specifying the commit, tag, or hash in the spec.git.revision field of the RootSync or RepoSync custom resource.
If you manage your RootSync or RepoSync custom resources with a tool like Kustomize, you can reduce the amount of manual work required for rollouts. With such a tool, you only need to change the revision parameter in one place, and then selectively apply the new RootSync or RepoSync custom resource to your clusters in the order, and at the pace, that you choose.
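For example, a minimal Kustomize layout might look like the following sketch. The file names, the root-sync target, and the 1.2.4 revision value are illustrative; the point is that the patch keeps the pinned revision in a single place:

```yaml
# kustomization.yaml (illustrative): keeps the pinned revision in one place
resources:
  - rootsync.yaml          # the RootSync definition for your clusters

patches:
  - target:
      kind: RootSync
      name: root-sync
    patch: |-
      - op: replace
        path: /spec/git/revision
        value: "1.2.4"     # change this single value for each rollout
```

You can then render and apply the output (for example, with kubectl apply -k) against each cluster, in the order and at the pace that you choose.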
Additionally, you can use the Google Cloud console to update the revision parameter for multiple clusters belonging to the same fleet at the same time. However, if you have an automated system to update your configurations, we don't recommend using the Google Cloud console to make configuration changes.
For example, the following RootSync definition configures Config Sync to use the 1.2.3 tag:
```yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceType: git
  sourceFormat: unstructured
  git:
    repo: git@example.com:gke/config-sync.git
    revision: 1.2.3
    auth: ssh
```
If you apply this configuration to your cluster, Config Sync uses the 1.2.3 tag of the example.com:gke/config-sync.git repository.
To update a cluster, change the spec.git.revision field to the new value for the cluster. This lets you define which clusters get updated and when. If you need to roll back a change, change the spec.git.revision field back to its former value.
The following diagram illustrates the rollout process for this method. First, you commit changes to the Config Sync repository, and then you update the RootSync definitions on all the clusters:
We recommend the following actions:
- Use Git commit IDs rather than tags. Because of the way that Git functions, you have a guarantee that a commit ID never changes. For example, a git push --force can't change the commit that Config Sync is using. This approach is useful for auditing purposes and for tracking which commit you are using in logs. Additionally, unlike tags, commit IDs don't require an extra creation step.
- If you prefer using Git tags instead of Git commit IDs, you can protect your tags if you're using a Git solution that supports protection.
- If you want to update multiple clusters at the same time, you can do that in the Google Cloud console. To update multiple clusters at once, they must be part of the same fleet (and be in the same project).
Use Git branches
If you want changes to be applied to clusters as soon as they are merged in your Git repository, configure Config Sync to use Git branches instead of commits or tags. In this method, you create multiple long-lived branches in your Git repository, and configure Config Sync in different clusters to read its configuration from different branches.
For example, a simple pattern has two branches:
- A staging branch for non-production clusters.
- A main branch for production clusters.
For non-production clusters, create the RootSync or RepoSync object with the spec.git.branch field set to staging. For production clusters, create the RootSync or RepoSync object with the spec.git.branch parameter set to main.
For example, the following RootSync definition configures Config Sync to use the main branch:
```yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  git:
    repo: git@example.com:gke/config-sync.git
    branch: main
    auth: ssh
```
The following diagram illustrates the rollout process for this method:
You can adapt this pattern to specific needs, using more than two branches, or using branches that are mapped to something other than environments. If you need to roll back a change, use the git revert command to create a new commit on the same branch that reverts the changes from the previous commit.
We recommend the following actions:
- When dealing with multiple clusters, use at least two Git branches to help distinguish between production and non-production clusters.
- Most Git solutions let you use the protected branches feature to prevent deletions or unreviewed changes of those branches. For more information, see the documentation for GitHub, GitLab, and Bitbucket.
Use ClusterSelector and NamespaceSelector objects
Git branches are a good way of doing a staged rollout of changes across multiple clusters that will eventually all have the same policies. However, if you want to roll out a change only to a subset of clusters or namespaces, then use the ClusterSelector and NamespaceSelector objects. These objects have a similar goal: they let you apply objects only to clusters or namespaces that have specific labels.
For example:
- By using ClusterSelector objects, you can apply different policies to clusters depending on which country they are located in, for various compliance regimes.
- By using NamespaceSelector objects, you can apply different policies to namespaces used by an internal team and by an external contractor.
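For reference, a NamespaceSelector is a label selector over namespaces. The following sketch is illustrative; the owner label and the selector name are assumptions:

```yaml
# Illustrative NamespaceSelector that matches namespaces labeled owner: external-contractor
apiVersion: configmanagement.gke.io/v1
kind: NamespaceSelector
metadata:
  name: external-contractor-namespaces
spec:
  selector:
    matchLabels:
      owner: external-contractor
```

Objects annotated with configmanagement.gke.io/namespace-selector: external-contractor-namespaces are then applied only in the namespaces that carry the matching label.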
ClusterSelector and NamespaceSelector objects also let you implement advanced testing and release methodologies, such as the following:
- Canary releases of policies, where you deploy a new policy to a small subset of clusters and namespaces for a long time to study the policy's impact.
- A/B testing, where you deploy different versions of the same policy to different clusters to study the difference of the policy versions' impact and then choose the best one to deploy everywhere.
For example, imagine an organization with several production clusters. The platform team has already created two categories of production clusters, called canary-prod and prod, using Cluster and ClusterSelector objects (see Use ClusterSelectors).
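For example, the canary-prod category could be defined with a Cluster object and a ClusterSelector object similar to the following sketch; the cluster name and the environment label are assumptions:

```yaml
# Illustrative Cluster registration that carries the label used for selection
apiVersion: clusterregistry.k8s.io/v1alpha1
kind: Cluster
metadata:
  name: prod-cluster-3
  labels:
    environment: canary-prod
---
# Illustrative ClusterSelector referenced by the cluster-selector annotation
# used in the constraints that follow
apiVersion: configmanagement.gke.io/v1
kind: ClusterSelector
metadata:
  name: canary-prod
spec:
  selector:
    matchLabels:
      environment: canary-prod
```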
The platform team wants to roll out a policy with Policy Controller to enforce the presence of a team label on namespaces in order to identify which team each namespace belongs to. They have already rolled out a version of this policy in dry run mode, and now they want to enforce it on a small number of clusters.
Using ClusterSelector objects, they create two different K8sRequiredLabels resources that are applied to different clusters.
- The K8sRequiredLabels resource is applied to clusters of type prod, with an enforcementAction parameter set to dryrun:

  ```yaml
  apiVersion: constraints.gatekeeper.sh/v1beta1
  kind: K8sRequiredLabels
  metadata:
    name: ns-must-have-team
    annotations:
      configmanagement.gke.io/cluster-selector: prod
  spec:
    enforcementAction: dryrun
    match:
      kinds:
        - apiGroups: [""]
          kinds: ["Namespace"]
    parameters:
      labels:
        - key: "team"
  ```
- The K8sRequiredLabels resource is applied to clusters of type canary-prod, without the enforcementAction parameter, meaning that the policy is actually enforced:

  ```yaml
  apiVersion: constraints.gatekeeper.sh/v1beta1
  kind: K8sRequiredLabels
  metadata:
    name: ns-must-have-team
    annotations:
      configmanagement.gke.io/cluster-selector: canary-prod
  spec:
    match:
      kinds:
        - apiGroups: [""]
          kinds: ["Namespace"]
    parameters:
      labels:
        - key: "team"
  ```
The configmanagement.gke.io/cluster-selector annotation allows the team to enforce the policy only in clusters of type canary-prod, preventing any unintended side effects from spreading to the whole production fleet. For more information about the dry run feature of Policy Controller, see creating constraints.
We recommend the following actions:
- Use ClusterSelector and NamespaceSelector objects if you need to apply a configuration change to only a subset of clusters or namespaces indefinitely or for a long time.
- If you roll out a change by using selectors, be very careful. If you use Git commits, any error affects only one cluster at a time, because you're rolling out cluster by cluster. But if you use Git branches, any error can affect all the clusters that use that branch. If you use selectors, an error can affect all clusters at once.
Implementing reviews, tests, and validations
One advantage of Config Sync is that it manages everything declaratively: Kubernetes resources, cloud resources, and policies. The managed resources are represented by files in a source control management system (a Git repository, in the case of Config Sync). This characteristic lets you implement development workflows that you already use for an application's source code: reviews and automated testing.
Implement reviews
Because Config Sync is based on Git, you can use your preferred Git solution to host the Config Sync repository. Your Git solution probably has a code review feature, which you can use to review changes made to the Config Sync repository.
The best practices for reviewing changes to your repository are the same as with a normal code review, as follows:
- Practice trunk-based development.
- Work in small batches.
- Ensure that code review is done synchronously or at least promptly.
- The person who reviews and approves the change should not be the same person who suggested the change.
Because of the sensitivity of the Config Sync codebase, we also recommend that, if possible with your Git solution, you make the following configurations:
- Protect the branches that are directly used by clusters. See the documentation for GitHub, GitLab, and Bitbucket. GitLab also lets you protect tags.
- After the branches are protected, you can refine the approvals that are
needed to merge a change:
- On GitHub, enable required reviews.
- For GitLab, use Code Owners to delegate approval permissions on a file or directory basis. You can use merge request approvals to require different people from different teams to approve a request before it's merged.
- On Bitbucket, combine default reviewers with default merge checks. Optionally, a Code Owners plugin for Bitbucket Server is available on the Atlassian Marketplace to control who can approve changes for subsections of the repository.
By using these different features, you can enforce approvals for each change request to your codebase. For example, you can ensure that each change is approved at least by a member of the platform team (who operates the fleet of clusters), and by a member of the security team (who is in charge of defining and implementing security policies).
We recommend the following action:
- Enforce peer reviews on your repository and protect the Git branches that are used by your clusters.
Implement automated tests
A common best practice when working on a codebase is to implement continuous integration. This means that you configure automated tests to run when a change request is created or updated. Automated tests can catch many errors before a human reviews the change request. This tightens the feedback loop for the developer. You can implement the same idea, using the same tools, for the Config Sync repository.
For example, a good place to start is to run the nomos vet command automatically on new changes. This command validates that your Config Sync repository's syntax is valid. You can implement this test by using Cloud Build by following the validating configs tutorial. You can integrate Cloud Build with the following options:
- Bitbucket, by using build triggers.
- GitHub, by using the Google Cloud Build GitHub application. Build triggers are also available for GitHub, but the GitHub application is the preferred method of integration.
As you can see in the validating configs tutorial, the test is done by using a container image. You can therefore implement the test in any continuous integration solution that runs containers, not only Cloud Build.
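For reference, a minimal Cloud Build configuration might look like the following sketch. The image path follows the validating configs tutorial, and the --no-api-server-check flag skips the checks that require cluster access; adjust both for your environment:

```yaml
# Illustrative cloudbuild.yaml: runs nomos vet on every change request
steps:
  - name: gcr.io/config-management-release/nomos:stable
    entrypoint: nomos
    args: ['vet', '--no-api-server-check', '--path', '/workspace']
```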
To tighten the feedback loop even more, you can ask that users run the nomos vet command as a Git pre-commit hook. One caveat is that some users might not have access to the Kubernetes clusters managed by Config Sync, and they might not be able to run the full validation from their workstation. Run the nomos vet --clusters "" command to restrict the validation to semantic and syntactic checks.
We recommend the following actions:
- Implement tests in a continuous integration pipeline.
- Run at least the nomos vet command on all suggested changes.
Monitoring rollouts
Even if you implement all the guardrails that this document covers, errors can still occur. Following are two common types of errors:
- Errors that pose no problem to Config Sync itself, but prevent your workloads from working properly, such as an overly restrictive NetworkPolicy that prevents components of your workload from communicating.
- Errors that make it impossible for Config Sync to apply changes to a cluster, such as an invalid Kubernetes manifest, or an object rejected by an admission controller. The methods explained earlier should catch most of these errors.
Detecting the errors described in the first preceding bullet is almost impossible at the level of Config Sync because this requires understanding the state of each of your workloads. For this reason, detecting these errors is best done by your existing monitoring system that alerts you when an application is misbehaving.
Detecting the errors described in the second preceding bullet—which should be rare if you have implemented all the guardrails—requires a specific setup. By default, Config Sync writes errors to its logs (which you can find in Cloud Logging). Errors are also displayed in the Config Sync Google Cloud console page. Neither logs nor the console are usually enough to detect errors, because you probably don't monitor them at all times. The simplest way to automate error detection is to run the nomos status command, which tells you if there's an error in a cluster.
You can also set up a more advanced solution with automatic alerts for errors. Config Sync exposes metrics in the Prometheus format. For more information, see monitoring Config Sync.
After you have the Config Sync metrics in your monitoring system, create an alert to notify you when the gkeconfig_monitor_errors metric is greater than 0. For more information, see managing alerting policies for Cloud Monitoring, or alerting rules for Prometheus.
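For example, a minimal Prometheus alerting rule, assuming the metric keeps its default name in your monitoring system, might look like the following sketch:

```yaml
# Illustrative Prometheus alerting rule for Config Sync syncing errors
groups:
  - name: config-sync
    rules:
      - alert: ConfigSyncErrors
        expr: gkeconfig_monitor_errors > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Config Sync reports syncing errors on at least one cluster
```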
Summary of mechanisms for safe rollouts with Config Sync
The following table summarizes the various mechanisms described earlier in this document. None of these mechanisms is exclusive. You can choose to use some of them or all of them, for different purposes.
Mechanism | What it's good for | What it's not good for | Example use case |
---|---|---|---|
Git commit IDs and tags | Use specific Git commit IDs or tags to precisely control which cluster changes are applied on. | Don't use Git commit IDs or tags for long-lived differences between clusters. Use cluster selectors. | All your clusters are configured to apply the 12345 Git commit. You make a change with a new commit, abcdef, that you want to test. You change the configuration of a single cluster to use this new commit to validate the change. |
Git branches | Use multiple Git branches when you want to roll out the same change to multiple environments, one after the other. | Don't use multiple Git branches for long-lived differences between clusters. The branches will significantly diverge and will be hard to merge back together. | First merge the change in the staging branch, where it will be picked up by staging clusters. Then merge the change in the main branch, where it will be picked up by production clusters. |
Cluster selectors and namespace selectors | Use selectors for long-lived differences between clusters and namespaces. | Don't use selectors for a staged rollout across multiple environments. If you want to test a modification first in staging, and then deploy it in production, use separate Git branches. | If the application teams need full access to development clusters, but read-only access to production clusters, use the ClusterSelector object to apply the correct RBAC policies only to the relevant clusters. |
Peer reviews | Use peer reviews to ensure that the relevant teams approve the changes. | Human reviewers don't catch all errors, especially items like syntax errors. | Your organization mandates that the security team must review configuration changes that affect multiple systems. Have a security team member review the changes. |
Automated tests in continuous integration pipeline | Use automated tests to catch errors in suggested changes. | Automated tests can't fully replace a human reviewer. Use both. | Running a nomos vet command on all suggested changes confirms that the repository is a valid Config Sync configuration. |
Monitor syncing errors | Be sure that Config Sync actually applies changes to your clusters. | Syncing errors occur only if Config Sync tries to apply an invalid repository or if the Kubernetes API server rejects some of the objects. | A user bypasses all your tests and reviews and commits an invalid change to the Config Sync repository. This change can't be applied to your clusters. If you're monitoring syncing errors, you'll be alerted if an error is made. |
Rollout strategy example
This section uses the concepts introduced in the rest of this document to help you create an end-to-end rollout strategy across all the clusters in your organization. This strategy assumes that you have separate fleets for development, staging, and production (as shown in Fleet Example 1 - Approach 1).
In this scenario, you configure each cluster to synchronize with your Git repository using a specific Git commit. Deploying a change to a given fleet is a four-step process (see the sketch after the following list):
- You update a single (the "canary") cluster in the fleet to use the new commit first.
- You validate that everything works as expected by running tests and monitoring the rollout.
- You update the rest of the clusters in the fleet.
- You validate again that everything works as expected.
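As a minimal sketch, assuming the Git commit pinning method described earlier and placeholder commit IDs, step 1 only changes the revision field of the canary cluster's RootSync; the other clusters in the fleet keep the previous commit until step 3:

```yaml
# Canary cluster only: pinned to the new commit (placeholder ID)
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  git:
    repo: git@example.com:gke/config-sync.git
    revision: bbbb2222   # new commit under test; other clusters stay on aaaa1111
    auth: ssh
```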
To deploy a change across all your clusters, you repeat this process for each fleet. You can technically apply this method with any Git commit, from any branch. However, we suggest that you adopt the following process to identify problems early in the review process:
- When someone opens a change request in the Config Sync Git repository, deploy that change to one of the development clusters.
- If the change request is accepted and merged in your main branch, run the full deployment across all fleets as described earlier.
While some changes might target only a specific fleet, we recommend that you deploy all changes to all fleets eventually. This strategy eliminates the problem of tracking which fleet should sync with which commit. Pay special attention to the changes that target only the production fleet because proper testing will not have been possible in previous fleets. For example, this means waiting longer for issues to surface between deploying to the canary clusters and to the rest of the clusters.
To summarize, a full end-to-end deployment looks like this:
- Someone opens a change request.
- Automated tests and validations run, and a manual review is done.
- You trigger a job manually to deploy the change to the canary cluster in the development fleet. Automated end-to-end tests run in this cluster.
- If everything is OK, you merge the change request on the main branch.
- The merge triggers an automated job to deploy the new main branch tip commit to the canary cluster in the development fleet. Automated end-to-end tests run in this cluster (to detect potential incompatibilities between two change requests that have been created and merged approximately at the same time).
- The following jobs run one after the other (you trigger them manually, or
after a predefined time to allow for user reports of regressions):
- Deploy to all the clusters of the development fleet.
- Run tests and validations in the clusters of the development fleet.
- Deploy to the canary cluster of the staging fleet.
- Run tests and validations in the canary cluster of the staging fleet.
- Deploy to all the clusters of the staging fleet.
- Run tests and validations in the clusters of the staging fleet.
- Deploy to the canary cluster of the production fleet.
- Run tests and validations in the canary cluster of the production fleet.
- Deploy to all the clusters of the production fleet.
- Run tests and validations in the clusters of the production fleet.
What's next
- Read about monitoring Config Sync.
- Read about fleets.
- Learn how to validate your app against company policies in a continuous integration pipeline.