Best practices for operating large-scale deployments

This page contains a series of recommendations and best practices for managing and operating multiple GKE, Anthos, or Anthos-attached cluster deployments using Anthos Config Management.

The best practices in this guide are relevant for multi-cluster Kubernetes deployments, which have the following characteristics:

  • The control plane for all deployments is powered by Anthos capabilities.
  • Multiple deployment environments are needed for development, quality assurance testing, staging, and production purposes.
  • Deployments span multiple clusters and multiple regions.
  • Kubernetes clusters are intended for use in single and multi-tenant scenarios.
  • Application teams (tenants) coordinate with platform administrators for application dependencies and production software releases.
  • Application teams and platform administrators collaborate using Git, which supports either pull request (PR) or merge request (MR) based workflows.

Create groupings for deployments and clusters

When operating large-scale Kubernetes deployments, it's important to apply consistent policies and practices within individual deployment environments (such as staging or production). To help you create groupings for your deployments, we recommend that you use fleets and Kustomize.

Each deployment environment should be configured as a fleet because it simplifies the management and operation of the grouped clusters. Fleets also provide you with the ability to update Anthos Config Management controller configurations across multiple clusters, as needed.

Kustomize enables you to create additional groupings. For resources that are the same on each cluster, place their configs in a shared Kustomize base that each cluster references. For resources that are specific to a cluster, place the configs in that cluster's overlay. For an example of using Kustomize, see the Multi-Cluster Access and Quota tutorial.

Mapping environments to fleets and clusters to Kustomize overlays lets you properly scope individual Kubernetes objects when managing and operating large-scale deployments.

Organize platform and team repositories

To simplify management, the platform repository should be an unstructured repository. Because of the expected number of namespace directories, we don't recommend hierarchical repositories for large-scale deployments.

In the directory-based approach, you map each environment to a distinct directory within the platform repository. All development happens on the main branch in a root directory, and each environment is mapped to a subdirectory containing environment-specific Kustomize overlays. For non-production deployment environments, configure Config Sync to sync from the corresponding directory. For production clusters, configure Config Sync to sync from a specific Git commit.
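As a rough sketch of the non-production case, a RootSync object along the following lines points Config Sync at a single environment directory. The repository URL, the auth mode, and the overlays/dev path are assumptions for illustration (the path follows the example layout shown later on this page); adjust them to match your own repository.

```yaml
# Hypothetical RootSync for a dev cluster. The repo URL and auth mode are placeholders.
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured      # matches the unstructured platform repository
  git:
    repo: https://example.com/platform/config-source.git   # assumed URL
    branch: main                  # all development happens on the main branch
    dir: overlays/dev             # environment-specific directory to sync from
    auth: none                    # assumption: readable without credentials
```

Because the cluster follows the branch and directory rather than a fixed commit, every merge to main that touches this directory is picked up on the next sync.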

The following example shows how you can organize a directory-based repository structure:

config-source/
├── base
│   ├── foo
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   └── serviceaccount.yaml
│   ├── kustomization.yaml
│   ├── pod-creator-clusterrole.yaml
│   └── pod-creator-rolebinding.yaml
├── cloudbuild.yaml
├── overlays
│   ├── dev
│   │   └── kustomization.yaml
│   └── prod
│       └── kustomization.yaml
└── README.md

The config-source directory includes the base/ manifests and the dev/ and prod/ Kustomize overlays, which are located under overlays/. Each directory contains a kustomization.yaml file, which lists the files Kustomize should manage and apply to the cluster. The dev/kustomization.yaml and prod/kustomization.yaml files each define a series of patches that modify the base/ resources for that specific environment.
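To make the relationship between the base and an overlay concrete, the following sketch shows what two of the kustomization.yaml files in the tree might contain. The file names come from the tree above, but the patched object's kind and name (pod-creators) and the subject value are hypothetical, since the manifests themselves are not shown on this page.

```yaml
# base/kustomization.yaml: lists the manifests shared by every environment.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- foo                             # subdirectory with its own kustomization.yaml
- pod-creator-clusterrole.yaml
- pod-creator-rolebinding.yaml
---
# overlays/dev/kustomization.yaml: reuses the base and patches it for dev.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- target:
    kind: RoleBinding             # assumed kind of the object to patch
    name: pod-creators            # hypothetical object name
  patch: |-
    - op: replace
      path: /subjects/0/name
      value: dev-team@example.com # hypothetical dev-only subject
```

The two documents are separate files, shown together here only for brevity. A prod overlay would follow the same shape, with its own patch values.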

For a tutorial showing you how to use the directory-based approach, see Using Config Sync in multiple environments with automated rendering.

Choose single tenant or multi-tenant clusters

Large-scale deployments often have a mix of single-tenant (one team per cluster) and multi-tenant (multiple teams per cluster) clusters. Whether teams select single-tenant or multi-tenant clusters depends on individual tenant resource needs, scaling considerations, lifecycle management, and ongoing maintenance operations. For more information, refer to the Enterprise multi-tenancy and Cluster multi-tenancy guides.

In single-tenant cluster configurations, platform teams and tenant teams use a "shared responsibility" model for cluster management and applications. Platform teams operate single-tenant clusters, while tenant teams deploy and operate their applications. When using Config Sync in single-tenant scenarios, a single root repository is used for cluster management. Tenant teams either adopt a pull request (PR) or merge request (MR) based strategy for collaboration with platform teams, or are given push access to the root repository. In PR-based and MR-based approaches, platform teams are responsible for merging tenant team changes. With push access, tenant teams require elevated Git permissions to commit and merge their own changes. The choice of strategy depends on which Git workflow most closely aligns with existing processes.

In multi-tenant cluster configurations, platform teams manage clusters using the Config Sync root repository and configure Config Sync to use multiple namespace repositories. Each tenant team is given access to a Namespace and a corresponding repository. Each namespace repository is organized similarly to the root repository, using either the branch-based or directory-based approach, and follows the same mapping for non-production environments (branch or directory) and production environments (Git commit or tag).
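A namespace repository is attached to a cluster with a RepoSync object in the tenant's Namespace. The following is a minimal sketch, assuming a tenant Namespace called tenant-a and a placeholder repository URL; neither name comes from this guide.

```yaml
# Hypothetical RepoSync for a tenant namespace named tenant-a.
apiVersion: configsync.gke.io/v1beta1
kind: RepoSync
metadata:
  name: repo-sync
  namespace: tenant-a             # assumed tenant Namespace
spec:
  sourceFormat: unstructured
  git:
    repo: https://example.com/tenant-a/config.git   # assumed tenant repository URL
    branch: main
    dir: overlays/dev             # directory-based mapping, as in the root repository
    auth: none                    # assumption: readable without credentials
```

The platform team would typically declare this object in the root repository, together with the RBAC permissions that the namespace reconciler needs, so that tenants can change only the objects in their own Namespace.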

Orchestrate safe rollouts

For non-production environments, changes can be rolled out as they are committed and merged in platform repositories. This approach provides the most flexibility.

For production environments, when teams release new or updated services to production, the platform team should execute a safe rollout strategy. Clusters should be pinned to specific Git commit hashes, and you should regularly update Config Sync's settings to use the new commit after there's been a change to the production environment. This approach maximizes safety for production environments.
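Pinning can be expressed with the revision field of the RootSync object. The sketch below assumes the same placeholder repository URL and directory layout as a non-production cluster; COMMIT_SHA stands in for a real commit hash and is not a literal value.

```yaml
# Hypothetical RootSync for a production cluster, pinned to one commit.
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://example.com/platform/config-source.git   # assumed URL
    branch: main
    revision: COMMIT_SHA          # placeholder: replace with the full commit hash to release
    dir: overlays/prod            # production overlay directory
    auth: none                    # assumption: readable without credentials
```

With this configuration, new commits on main have no effect on the cluster until the revision value is changed, which turns each production release into an explicit, reviewable change.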

To automate the deployment of new RootSync and RepoSync objects (and reduce the errors that imperative operations can introduce), use a CI/CD pipeline, such as Cloud Build, to statefully update the commit hashes that production clusters sync from.

To protect branches and refine the approvals process, all platform and team repositories should also implement a review strategy.

For more information, see Safe rollouts using Anthos Config Management.

Develop a dependency management strategy

When platform teams and tenant teams use a shared responsibility model for operating clusters and applications, they should develop a management strategy for sharing artifacts. Artifacts might be individual Kubernetes objects such as GPU or storage configurations, or bundled Kubernetes applications such as database or caching tools. Tenant teams should treat these artifacts as dependencies and collaborate with the platform team using different package management approaches.

Dependency management approaches include using kpt package management capabilities, publishing Helm charts using Artifact Registry, or having platform teams deploy namespaced objects in their platform repositories. The approach you choose depends on existing platform and tenant team workflows and experience with the associated tools.

Enforce policies

Anthos Config Management's Policy Controller enables you to enforce fully programmable policies on your clusters. You can use these policies to shift security left and guard against violations during development and test time, as well as runtime violations. Platform teams should maintain a centralized repository for the policies that are used for policy validation and admission control.

You should also ensure that policy validation at development and test time is orchestrated by CI tools. You can use an approach similar to the one described in running Policy Controller in a CI pipeline.

To ensure that Kubernetes objects are always deployed to a specific cluster, region, or environment, you should also use policies that require specific annotations.
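One way to require annotations is the K8sRequiredAnnotations constraint from the Gatekeeper constraint template library. The following sketch assumes that template is installed on the cluster; the constraint name, the annotation key, and the allowed values are hypothetical.

```yaml
# Hypothetical constraint: every Deployment must declare its target environment.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredAnnotations
metadata:
  name: require-environment-annotation    # hypothetical constraint name
spec:
  enforcementAction: deny                 # reject objects that violate the constraint
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment"]
  parameters:
    message: "Deployments must declare a target environment."
    annotations:
    - key: example.com/environment        # hypothetical annotation key
      allowedRegex: "^(dev|staging|prod)$"
```

Storing this constraint in the centralized policy repository lets the same rule be checked in CI and at admission time.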

Deprecate imperative operations

You should move any imperative cluster operations to a fully declarative, repository-backed approach. This approach ensures that all cluster or environment configurations are synchronized with repositories and that any changes are easily tracked and reviewed. For existing imperative operations, consider migrating to a Kustomize-based or kpt-based workflow, using "base plus overlay" or package management approaches.

In rare scenarios, it might be necessary to deactivate synchronization between individual cluster objects and the upstream repository. If you need to deactivate synchronization, update the configmanagement.gke.io/managed annotation to halt management of a managed object. To avoid long-term configuration drift or staleness, deploy a Policy Controller policy that logs any object for which management has been deactivated through this annotation.
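As an illustration, the annotation might be set as follows. This sketch assumes that the value disabled is what halts management and that the annotation is added to the object's config in the repository; the ConfigMap itself is a hypothetical object.

```yaml
# Hypothetical object that Config Sync should stop managing.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tuning-overrides          # hypothetical object name
  namespace: foo
  annotations:
    configmanagement.gke.io/managed: disabled   # assumed value that halts management
data:
  log-level: debug
```

Committing the annotation through the repository keeps the decision to stop managing the object visible in Git history, even though later changes to the object are no longer synchronized.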

What's next