Best practices for operating large-scale deployments
This page contains a series of recommendations and best practices for managing and operating multiple GKE, Anthos, or Anthos-attached cluster deployments using Anthos Config Management.
The best practices in this guide are relevant for multi-cluster Kubernetes deployments, which have the following characteristics:
- The control plane for all deployments is powered by Anthos capabilities.
- Multiple deployment environments are needed for development, quality assurance testing, staging, and production purposes.
- Deployments span multiple clusters and multiple regions.
- Kubernetes clusters are intended for use in single and multi-tenant scenarios.
- Application teams (tenants) coordinate with platform administrators for application dependencies and production software releases.
- Application teams and platform administrators collaborate using Git, which supports either pull request (PR) or merge request (MR) based workflows.
Create groupings for deployments and clusters
When operating large scale Kubernetes deployments, it's important to apply consistent policies and practices within individual deployment environments (like staging, or production). To help you create groupings for your deployments, we recommend that you use fleets. Each deployment environment should be configured as a fleet because it simplifies the management and operation of the grouped clusters. Fleets also provide you with the ability to update Anthos Config Management controller configurations across multiple clusters, as needed.
Fleets offer deployment environment granularity. For additional granularity, use ClusterSelector objects to create groups that let Kubernetes objects be scoped to individual or multiple clusters within a fleet. For example, you can have groups organized by regions or usage scenarios.
When defining metadata, it's important to use a consistent and stable naming convention, for example [environment name]-[region name]-[cluster name]. This approach enables teams to scope Kubernetes objects at the cluster, region, or environment level. The following Cluster and ClusterSelector objects show you how to configure individual clusters and create logical groups using Kubernetes label selectors:
```yaml
kind: Cluster
apiVersion: clusterregistry.k8s.io/v1alpha1
metadata:
  name: dev-eastus-cluster1
  labels:
    env: dev
    region: eastus
---
kind: ClusterSelector
apiVersion: configmanagement.gke.io/v1
metadata:
  name: dev-eastus
spec:
  selector:
    matchLabels:
      env: dev
      region: eastus
```
This ClusterSelector would select the dev-eastus-cluster1 cluster since it has the env: dev and region: eastus labels. You can then reference the ClusterSelector in another config.
For objects located in Config Sync root repositories, use either the configmanagement.gke.io/cluster-selector or the configsync.gke.io/cluster-name-selector annotation to specify their deployment location. The following Pod objects show you how to scope Kubernetes objects to individual clusters or cluster groups:
```yaml
# Using ClusterSelector name
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    configmanagement.gke.io/cluster-selector: dev-eastus
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80
---
# Using Cluster name
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    configsync.gke.io/cluster-name-selector: dev-eastus-cluster1
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80
```
The practice of mapping environments and clusters provides the necessary capabilities for properly scoping individual Kubernetes objects when managing and operating large-scale deployments.
Organize platform and team repositories
When you map platform repositories to deployment environments, there are two supported approaches:
- A branch-based approach, where each environment maps to a distinct Git branch.
- A folder-based approach, where each environment maps to a distinct folder on the main branch.
Select the approach that best matches your team's workflows and Git management strategies. In both approaches, the platform repository should be an unstructured repository to enable simplified management of downstream tenant-based namespace dependencies. Due to the expected number of namespace folders, using hierarchical repositories is not recommended when managing large-scale deployments.
In the branch-based approach, each environment is mapped to a distinct branch in the platform repository. The main branch is used for regular iterative development, and PRs or MRs are used to merge changes from main to the individual environment branches. Config Sync is configured to sync from each branch head for non-production deployment environments. For production clusters, you should configure Config Sync to sync from a specific Git commit. The following RootSync examples show you how you can configure Config Sync for development environments and production environments, specifically in the spec.git settings:
```yaml
# DEV EXAMPLE
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/example/root-repo
    branch: dev
    dir: "config"
    auth: token
---
# PROD EXAMPLE
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/example/prod-repo
    revision: ee341e3c896ccf731c2efb9e42162c8ca74757ac
    dir: "config"
    auth: token
```
In the folder-based approach, you map each environment to a distinct folder within the platform repository. All development happens on the main branch in a root folder, and each environment is mapped to a sub-folder containing environment-specific Kustomize overlays. You should configure Config Sync to sync from each of the corresponding folders for non-production deployment environments. For production clusters, configure Config Sync to sync from a specific Git commit.
The following example shows how you can organize a folder-based repository structure. There are distinct folders for each environment in the configsync folder, with corresponding Kustomize bases and overlays in the configsync-src folder:
```
├── configsync
│   ├── prod
│   │   ├── ~g_v1_namespace_default.yaml
│   │   ├── networking.k8s.io_v1_networkpolicy_deny-all.yaml
│   │   ├── rbac.authz.k8s.io_v1_rolebinding_prod-admin-rolebinding.yaml
│   │   └── rbac.authz.k8s.io_v1_role_prod-admin.yaml
│   ├── staging
│   │   ├── ~g_v1_namespace_default.yaml
│   │   ├── networking.k8s.io_v1_networkpolicy_deny-all.yaml
│   │   ├── rbac.authz.k8s.io_v1_rolebinding_staging-admin-rolebinding.yaml
│   │   └── rbac.authz.k8s.io_v1_role_staging-admin.yaml
│   └── dev
│       ├── ~g_v1_namespace_default.yaml
│       ├── networking.k8s.io_v1_networkpolicy_deny-all.yaml
│       ├── rbac.authz.k8s.io_v1_rolebinding_dev-admin-rolebinding.yaml
│       └── rbac.authz.k8s.io_v1_role_dev-admin.yaml
├── configsync-src
│   ├── base
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── networkpolicy.yaml
│   │   ├── rolebinding.yaml
│   │   └── role.yaml
│   ├── staging
│   │   └── kustomization.yaml
│   ├── dev
│       └── kustomization.yaml
├── README.md
└── scripts
    └── render.sh
```
For a tutorial showing you how to use the folder-based approach, see Create policies for a multi-tenant cluster.
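To make the overlay structure concrete, the following sketch shows what a per-environment kustomization.yaml in configsync-src might contain. The file contents and the env label are assumptions for illustration; only the folder layout comes from the structure above.

```yaml
# configsync-src/dev/kustomization.yaml (hypothetical contents)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Pull in the shared manifests from configsync-src/base.
resources:
- ../base
# Example environment-specific customization (assumed): label all
# rendered objects so they can be traced back to the dev environment.
commonLabels:
  env: dev
```

A rendering step (for example, the scripts/render.sh file in the tree above) would then run a command such as `kustomize build configsync-src/dev -o configsync/dev` so that Config Sync consumes only fully rendered manifests from the configsync folder.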
Choose single tenant or multi-tenant clusters
Large-scale deployments often have a mix of single-tenant (one team per cluster) and multi-tenant (multiple teams per cluster) clusters. Whether teams select single-tenant or multi-tenant clusters depends on individual tenant resource needs, scaling considerations, lifecycle management, and ongoing maintenance operations. For more information, refer to the Enterprise multi-tenancy and Cluster multi-tenancy guides.
In single-tenant cluster configurations, platform teams and tenant teams use a "shared responsibility" model for cluster management and applications. Platform teams operate single-tenant clusters, while tenant teams deploy and operate their applications. When using Config Sync in single-tenant scenarios, there is a single root repository that is used for cluster management. Tenant teams either adopt a pull request (PR) or merge request (MR) based strategy for collaboration with platform teams, or tenants are given push access to the root repository. In PR and MR based approaches, platform teams are responsible for merging in tenant team changes. With push access, tenant teams require elevated Git permissions to commit and merge their changes. The choice of strategy depends on which Git workflow most closely aligns with existing processes.
In multi-tenant cluster configurations, platform teams manage clusters using the Config Sync root repository and configure Config Sync to use multiple namespace repositories. Each tenant team is given access to a Namespace and corresponding repository. The repository is organized similarly to the root repository, incorporating either the branch-based or folder-based approach, and configured to follow the appropriate mapping approach for non-production (branch or folder) and production environments (Git commit or tag).
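As a sketch of the multi-tenant setup, the following RepoSync object scopes a tenant team's repository to its namespace. The tenant name, repository URL, and directory are hypothetical; the object kind and fields follow the Config Sync configsync.gke.io/v1beta1 API also used in the RootSync examples above.

```yaml
# Hypothetical RepoSync for a tenant team that owns the "tenant-a" namespace.
# Objects synced from this repository are restricted to that namespace.
apiVersion: configsync.gke.io/v1beta1
kind: RepoSync
metadata:
  name: repo-sync
  namespace: tenant-a
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/example/tenant-a-repo
    branch: main
    dir: "config"
    auth: token
```

The platform team typically pairs each RepoSync with a RoleBinding in the tenant namespace so that the tenant's synced objects are applied with appropriately limited RBAC permissions.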
Orchestrate safe rollouts
For non-production environments, changes can be rolled out as they are committed and merged in platform repositories. This approach provides the most flexibility.
For production environments, when teams release new or updated services to production, the platform team should execute a safe rollout strategy. Clusters should be pinned to specific Git commit hashes, and you should update Config Sync's configuration to point at the new commit each time a change is promoted to the production environment. This approach maximizes safety for production environments.
To automate the deployment of new RootSync and RepoSync objects (and reduce the errors that come with imperative operations), use CI/CD pipelines, such as Cloud Build, to update the commit hashes that production clusters are pinned to.
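A minimal Cloud Build sketch of this update step follows. The cluster name, region, and substitution usage are assumptions for illustration; the step re-pins the production RootSync to the commit that triggered the build.

```yaml
# Hypothetical cloudbuild.yaml: pin the production RootSync to the
# commit being released. COMMIT_SHA is a built-in Cloud Build substitution.
steps:
- name: 'gcr.io/cloud-builders/kubectl'
  args:
  - 'patch'
  - 'rootsync'
  - 'root-sync'
  - '-n'
  - 'config-management-system'
  - '--type=merge'
  - '-p'
  - '{"spec":{"git":{"revision":"${COMMIT_SHA}"}}}'
  env:
  # Assumed production cluster coordinates; the kubectl builder uses
  # these variables to fetch cluster credentials.
  - 'CLOUDSDK_COMPUTE_REGION=us-east1'
  - 'CLOUDSDK_CONTAINER_CLUSTER=prod-eastus-cluster1'
```

If you keep RootSync manifests themselves under version control, an alternative is for the pipeline to commit the updated revision field to that repository rather than patching the cluster directly, which keeps the change reviewable.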
To protect branches and refine the approvals process, all platform and team repositories should also implement a review strategy.
For more information, see Safe rollouts using Anthos Config Management.
Develop a dependency management strategy
When platform teams and tenant teams use a shared responsibility model for operating clusters and applications, they should develop a management strategy for sharing artifacts. Artifacts might be individual Kubernetes objects such as GPU or storage configurations, or bundled Kubernetes applications such as database or caching tools. Tenant teams should treat these artifacts as dependencies and collaborate with the platform team using different package management approaches.
Dependency management approaches include using kpt packages, publishing Helm charts to Artifact Registry, or having platform teams deploy namespaced objects in their platform repositories. The choice of dependency management approach depends on existing platform and tenant team workflows and experience with the associated tools.
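As an illustration of the Helm chart approach, the commands below publish a chart to Artifact Registry as an OCI artifact. The chart name, project, region, and repository name are all hypothetical.

```
# Authenticate Helm against Artifact Registry (hypothetical region).
gcloud auth print-access-token | \
  helm registry login -u oauth2accesstoken --password-stdin us-east1-docker.pkg.dev

# Package the chart and push it to a hypothetical Helm repository.
helm package ./charts/cache-tool
helm push cache-tool-0.1.0.tgz oci://us-east1-docker.pkg.dev/my-project/my-helm-repo
```

Tenant teams can then reference the published chart version as a dependency, which keeps artifact versions explicit and auditable across environments.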
Enforce policies
Anthos Config Management's Policy Controller enables you to enforce fully programmable policies on your clusters. You can use these policies to shift security left and guard against violations during development and test time, as well as runtime violations. Platform teams should maintain a centralized repository for the policies that are used for policy validation and admission control.
You should also ensure that policy validation at development and test time is orchestrated by CI tools. You can use an approach similar to the one described in running Policy Controller in a CI pipeline.
To ensure that Kubernetes objects are always deployed to a specific cluster, region, or environment, you should also use policies that require specific annotations.
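A sketch of such a policy follows, assuming the K8sRequiredAnnotations template from the Policy Controller constraint template library is installed; the constraint name and message are hypothetical. It rejects Pods that don't declare a cluster-selector annotation, so every object states its deployment scope explicitly.

```yaml
# Hypothetical constraint; requires the K8sRequiredAnnotations
# constraint template to be available on the cluster.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredAnnotations
metadata:
  name: require-cluster-selector-annotation
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    message: "All Pods must declare a configmanagement.gke.io/cluster-selector annotation."
    annotations:
    - key: configmanagement.gke.io/cluster-selector
      # Accept any non-empty value.
      allowedRegex: ".+"
```

The same pattern can be applied to other kinds, or to the configsync.gke.io/cluster-name-selector annotation, depending on which scoping mechanism your repositories use.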
Deprecate imperative operations
You should move any imperative cluster operations to a fully declarative, repository-backed approach. This approach ensures that all cluster or environment configurations are synchronized with repositories and any changes are easily tracked and reviewed. For existing imperative operations, consider migrating to a Kustomize or kpt based workflow, using "base plus overlay" or package management approaches.
In certain unique scenarios, it might be necessary to deactivate synchronization between individual cluster objects and the upstream repository. If you need to deactivate synchronization, set the configmanagement.gke.io/managed annotation to disabled to halt management of a managed object.
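For example, the annotation might be applied as follows; the ConfigMap name and namespace are hypothetical:

```yaml
# Example: stop Config Sync from managing this object. The "disabled"
# value halts reconciliation between the cluster object and the repository.
apiVersion: v1
kind: ConfigMap
metadata:
  name: legacy-config
  namespace: default
  annotations:
    configmanagement.gke.io/managed: disabled
```

Because the object remains in the repository, re-enabling management later only requires removing or changing the annotation.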
To avoid long-term configuration drift or staleness, deploy a Policy Controller policy to log any such objects where the configmanagement.gke.io/managed annotation has been set to disabled.
What's next
- Learn about Best practices for policy management with Anthos Config Management and GitLab.
- Discover more about Safe rollouts with Anthos Config Management.
- Read the Anthos security blueprint about Enforcing locality restrictions for clusters on Google Cloud.
- Read the Anthos security blueprint about Enforcing policies.
- Discover more about Modern CI/CD with Anthos: A software delivery framework.