Best practices for operating large-scale deployments

This page describes recommendations and best practices for managing and operating multiple GKE, Anthos, or Anthos-attached cluster deployments using Anthos Config Management.

The best practices in this guide are relevant for multi-cluster Kubernetes deployments, which have the following characteristics:

  • The control plane for all deployments is powered by Anthos capabilities.
  • Multiple deployment environments are needed for development, quality assurance testing, staging, and production purposes.
  • Deployments span multiple clusters and multiple regions.
  • Kubernetes clusters are intended for use in single and multi-tenant scenarios.
  • Application teams (tenants) coordinate with platform administrators for application dependencies and production software releases.
  • Application teams and platform administrators collaborate using Git, which supports either pull request (PR) or merge request (MR) based workflows.

Create groupings for deployments and clusters

When operating large-scale Kubernetes deployments, it's important to apply consistent policies and practices within each deployment environment (such as staging or production). To help you create groupings for your deployments, we recommend that you use fleets. Configure each deployment environment as a fleet, which simplifies the management and operation of the grouped clusters. Fleets also let you update Anthos Config Management controller configurations across multiple clusters as needed.

Fleets provide granularity at the deployment environment level. For additional granularity, use Cluster and ClusterSelector objects to create groups that let Kubernetes objects be scoped to individual or multiple clusters within a fleet. For example, you can organize groups by region or by usage scenario.

When defining metadata, it's important to use a consistent and stable naming convention, for example, [environment name]-[region name]-[cluster name]. This approach lets teams scope Kubernetes objects at the cluster, region, or environment level. The following Cluster and ClusterSelector objects show you how to configure individual clusters and create logical groups using Kubernetes label selectors:

kind: Cluster
apiVersion: clusterregistry.k8s.io/v1alpha1
metadata:
  name: dev-eastus-cluster1
  labels:
    env: dev
    region: eastus
---
kind: ClusterSelector
apiVersion: configmanagement.gke.io/v1
metadata:
  name: dev-eastus
spec:
  selector:
    matchLabels:
      env: dev
      region: eastus

This ClusterSelector would select the dev-eastus-cluster1 cluster since it has both the env: dev and region: eastus labels. You can then reference the ClusterSelector in another config.

For objects located in Config Sync root repositories, use either the configmanagement.gke.io/cluster-selector or configsync.gke.io/cluster-name-selector annotations to specify their deployment location. The following Pod objects show you how to scope Kubernetes objects to individual clusters or cluster groups:

# Using ClusterSelector name
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    configmanagement.gke.io/cluster-selector: dev-eastus
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80
---
# Using Cluster name
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    configsync.gke.io/cluster-name-selector: dev-eastus-cluster1
spec:
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80

Mapping environments and clusters in this way gives you the capabilities you need to properly scope individual Kubernetes objects when managing and operating large-scale deployments.

Organize platform and team repositories

When you map platform repositories to deployment environments, there are two supported approaches:

  • Branch-based
  • Folder-based

Select the approach that best matches your team's workflows and Git management strategies. In both approaches, the platform repository should be an unstructured repository to simplify management of downstream tenant-based namespace dependencies. Because of the expected number of namespace folders, we don't recommend hierarchical repositories for managing large-scale deployments.

Branch-based approach

In the branch-based approach, each environment is mapped to a distinct branch within the root repository. The main branch is used for regular iterative development, and PRs or MRs are used to merge changes from main to individual branches. Config Sync is configured to sync from each branch head for non-production deployment environments. For production clusters, you should configure Config Sync to sync from a specific Git commit.

The following RootSync objects show how to configure Config Sync for development and production environments, specifically in the spec.git.branch and spec.git.revision fields:

# DEV EXAMPLE
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/example/root-repo
    branch: dev
    dir: "config"
    auth: token

---

# PROD EXAMPLE
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/example/root-repo
    revision: ee341e3c896ccf731c2efb9e42162c8ca74757ac
    dir: "config"
    auth: token

Folder-based approach

In the folder-based approach, you map each environment to a distinct folder within the platform repository. All development happens on the main branch in a root folder, and each environment is mapped to a sub-folder containing environment-specific Kustomize overlays. You should configure Config Sync to sync from each of the corresponding folders for non-production deployment environments. For production clusters, configure Config Sync to sync from a specific Git commit.

The following example shows how you can organize a folder-based repository structure. There are distinct folders for each environment in the configsync folder and corresponding Kustomize bases in configsync-src:

├── configsync
│   ├── prod
│   │   ├── ~g_v1_namespace_default.yaml
│   │   ├── networking.k8s.io_v1_networkpolicy_deny-all.yaml
│   │   ├── rbac.authz.k8s.io_v1_rolebinding_prod-admin-rolebinding.yaml
│   │   └── rbac.authz.k8s.io_v1_role_prod-admin.yaml
│   ├── staging
│   │   ├── ~g_v1_namespace_default.yaml
│   │   ├── networking.k8s.io_v1_networkpolicy_deny-all.yaml
│   │   ├── rbac.authz.k8s.io_v1_rolebinding_staging-admin-rolebinding.yaml
│   │   └── rbac.authz.k8s.io_v1_role_staging-admin.yaml
│   └── dev
│       ├── ~g_v1_namespace_default.yaml
│       ├── networking.k8s.io_v1_networkpolicy_deny-all.yaml
│       ├── rbac.authz.k8s.io_v1_rolebinding_dev-admin-rolebinding.yaml
│       └── rbac.authz.k8s.io_v1_role_dev-admin.yaml
├── configsync-src
│   ├── base
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── networkpolicy.yaml
│   │   ├── rolebinding.yaml
│   │   └── role.yaml
│   ├── prod
│   │   └── kustomization.yaml
│   ├── staging
│   │   └── kustomization.yaml
│   └── dev
│       └── kustomization.yaml
├── README.md
└── scripts
    └── render.sh
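
For non-production environments, you then point Config Sync at the corresponding folder. The following RootSync sketch assumes the repository layout above and syncs a development cluster from the configsync/dev folder; the repository URL is a placeholder:

# DEV EXAMPLE (folder-based)
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/example/root-repo
    branch: main
    dir: "configsync/dev"
    auth: token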

For a tutorial showing you how to use the folder-based approach, see Create policies for a multi-tenant cluster.

Choose single-tenant or multi-tenant clusters

Large-scale deployments often have a mix of single-tenant (one team per cluster) and multi-tenant (multiple teams per cluster) clusters. Whether teams select single-tenant or multi-tenant clusters depends on individual tenant resource needs, scaling considerations, lifecycle management, and ongoing maintenance operations. For more information, refer to the Enterprise multi-tenancy and Cluster multi-tenancy guides.

In single-tenant cluster configurations, platform teams and tenant teams use a "shared responsibility" model for cluster management and applications. Platform teams operate single-tenant clusters, while tenant teams deploy and operate their applications. When you use Config Sync in single-tenant scenarios, a single root repository is used for cluster management. Tenant teams either adopt a pull request (PR) or merge request (MR) based strategy for collaborating with platform teams, or they are given push access to the root repository. In PR-based and MR-based approaches, platform teams are responsible for merging tenant team changes. With push access, tenant teams need elevated Git permissions to commit and merge their own changes. Choose the strategy whose Git workflow most closely aligns with your existing processes.

In multi-tenant cluster configurations, platform teams manage clusters using the Config Sync root repository and configure Config Sync to use multiple namespace repositories. Each tenant team is given access to a Namespace and a corresponding namespace repository. Each namespace repository is organized similarly to the root repository, using either the branch-based or folder-based approach, and follows the appropriate mapping for non-production environments (branch or folder) and production environments (Git commit or tag).
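
As an illustration, the following RepoSync sketch configures a tenant namespace to sync from its own repository; the tenant-1 namespace and repository URL are hypothetical:

apiVersion: configsync.gke.io/v1beta1
kind: RepoSync
metadata:
  name: repo-sync
  namespace: tenant-1
spec:
  git:
    repo: https://github.com/example/tenant-1-repo
    branch: main
    dir: "config"
    auth: token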

Orchestrate safe rollouts

For non-production environments, changes can be rolled out as they are committed and merged into platform repositories. This approach provides the most flexibility.

For production environments, when teams release new or updated services, the platform team should execute a safe rollout strategy. Pin production clusters to specific Git commit hashes, and update the Config Sync configuration to point to a new commit hash each time the production configuration changes. This approach maximizes safety for production environments.

To automate the deployment of new RootSync and RepoSync objects (and to reduce the errors that come with imperative operations), use a CI/CD pipeline such as Cloud Build to update production cluster commit hashes in a controlled, repeatable way.
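
As a sketch of this automation, the following Cloud Build configuration patches a production RootSync to a new commit hash; the cluster name, region, and substitution value are placeholders:

# cloudbuild.yaml
steps:
- name: 'gcr.io/cloud-builders/kubectl'
  args:
  - 'patch'
  - 'rootsync'
  - 'root-sync'
  - '--namespace=config-management-system'
  - '--type=merge'
  - '--patch={"spec":{"git":{"revision":"${_TARGET_COMMIT}"}}}'
  env:
  # The kubectl builder uses these variables to fetch cluster credentials.
  - 'CLOUDSDK_COMPUTE_REGION=us-east1'
  - 'CLOUDSDK_CONTAINER_CLUSTER=prod-cluster1'
substitutions:
  _TARGET_COMMIT: 'REPLACE_ME' # supply the target commit hash at build time

Alternatively, the pipeline can commit the updated RootSync manifest to a repository and apply it from there, which keeps the change itself reviewable.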

To protect branches and refine the approval process, all platform and team repositories should also implement a review strategy.

For more information, see Safe rollouts using Anthos Config Management.

Develop a dependency management strategy

When platform teams and tenant teams use a shared responsibility model for operating clusters and applications, they should develop a management strategy for sharing artifacts. Artifacts might be individual Kubernetes objects, such as GPU or storage configurations, or bundled Kubernetes applications, such as database or caching tools. Tenant teams should treat these artifacts as dependencies and collaborate with the platform team using an agreed-upon package management approach.

Dependency management approaches include using kpt package management capabilities, publishing Helm charts to Artifact Registry, or having platform teams deploy namespaced objects in their platform repositories. Which approach you choose depends on your existing platform and tenant team workflows and experience with the associated tools.
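
For example, in a Kustomize-based workflow, a tenant overlay can pin a platform-provided package at a specific version. The repository path and tag in the following sketch are hypothetical:

# kustomization.yaml in a tenant overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- github.com/example/platform-packages/storage-config?ref=v1.2.0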

Enforce policies

Anthos Config Management's Policy Controller lets you enforce fully programmable policies on your clusters. You can use these policies to shift security left and guard against violations at development and test time, as well as at runtime. Platform teams should maintain a centralized repository for the policies that are used for policy validation and admission control.

You should also ensure that policy validation at development and test time is orchestrated by CI tools. You can use an approach similar to the one described in running Policy Controller in a CI pipeline.

To ensure that Kubernetes objects are always deployed to a specific cluster, region, or environment, you should also use policies that require specific annotations.
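
For example, the following constraint sketch assumes that the K8sRequiredAnnotations template from the Policy Controller constraint template library is installed; the constraint name, match scope, and regular expression are illustrative:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredAnnotations
metadata:
  name: require-cluster-selector
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    message: "Pods must declare a cluster-selector annotation."
    annotations:
    - key: configmanagement.gke.io/cluster-selector
      allowedRegex: "dev-.*|staging-.*|prod-.*"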

Deprecate imperative operations

You should move any imperative cluster operations to a fully declarative, repository-backed approach. This approach ensures that all cluster or environment configurations are synchronized with repositories and any changes are easily tracked and reviewed. For existing imperative operations, consider migrating to a Kustomize or kpt based workflow, using "base plus overlay" or package management approaches.

In certain unique scenarios, it might be necessary to deactivate synchronization between individual cluster objects and the upstream repository. If you need to deactivate synchronization, set the configmanagement.gke.io/managed annotation on the object to disabled so that Config Sync stops managing it. To avoid long-term configuration drift or staleness, deploy a Policy Controller policy that logs any objects whose managed annotation has been set to disabled.
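
For example, the following sketch stops Config Sync from managing a ConfigMap; the object shown is arbitrary:

apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
  namespace: default
  annotations:
    configmanagement.gke.io/managed: disabled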

What's next