Migrating containers to Google Cloud: Migrating to a multi-cluster GKE environment

Last reviewed 2023-05-08 UTC

This document helps you plan, design, and implement your migration from a Google Kubernetes Engine (GKE) environment to a new GKE environment. Moving apps from one environment to another can be a challenging task, so you need to plan and execute your migration carefully.

This document is part of a multi-part series about migrating to Google Cloud. For an overview of the series, see Migration to Google Cloud: Choosing your migration path.

This document is part of a series that discusses migrating containers to Google Cloud.

This document is useful if you're planning to migrate from a GKE environment to another GKE environment. This document is also useful if you're evaluating the opportunity to migrate and want to explore what it might look like.

Reasons to migrate from a GKE environment to another GKE environment can include the following:

  • Enabling GKE features available only on cluster creation. GKE is constantly evolving with new features and security fixes. To benefit from most new features and fixes, you might need to upgrade your GKE clusters and node pools to a newer GKE version, either through auto-upgrade or manually.

    Some new GKE features can't be enabled on existing clusters, and they require you to create new GKE clusters with those features enabled. For example, you can enable VPC-native networking, GKE Dataplane V2, or metadata concealment only when you create new clusters. You can't update the configuration of existing clusters to enable those features after their creation.

  • Implementing an automated provisioning and configuration process for your infrastructure. If you manually provision and configure your infrastructure, you can design and implement an automated process to provision and configure your GKE clusters, instead of relying on manual, error-prone methods.

When you design the architecture of your new environment, we recommend that you consider a multi-cluster GKE environment. By provisioning and configuring multiple GKE clusters in your environment, you do the following:

  • Reduce the chances of introducing a single point of failure in your architecture. For example, if a cluster suffers an outage, other clusters can take over.
  • Benefit from the greater flexibility that a multi-cluster environment provides. For example, by applying changes to a subset of your clusters, you can limit the impact of issues caused by erroneous configuration changes. You can then validate the changes before you apply them to your remaining clusters.
  • Let your workloads communicate across clusters. For example, workloads deployed in a cluster can communicate with workloads deployed in another cluster.

The guidance in this document is also applicable to a single-cluster GKE environment. When you migrate to a single-cluster GKE environment, your environment is less complex to manage compared to a multi-cluster environment. However, a single-cluster environment doesn't benefit from the increased flexibility, reliability, and resilience of a multi-cluster GKE environment.

The following diagram illustrates the path of your migration journey.

Migration path with four phases.

The framework illustrated in the preceding diagram has the following phases, which are defined in Migration to Google Cloud: Getting started:

  1. Assessing and discovering your workloads.
  2. Planning and building a foundation.
  3. Deploying your workloads.
  4. Optimizing your environment.

You follow the preceding phases during each migration step. This document also relies on concepts that are discussed in Migrating containers to Google Cloud: Migrating Kubernetes to GKE. It includes links where appropriate.

Assessing your environment

In the assessment phase, you gather information about your source environment and the workloads that you want to migrate. This assessment is crucial for your migration because it helps you rightsize the resources that you need for the migration and for your target environment. In the assessment phase, you do the following:

  1. Build a comprehensive inventory of your apps.
  2. Catalog your apps according to their properties and dependencies.
  3. Train and educate your teams on Google Cloud.
  4. Build an experiment and proof of concept on Google Cloud.
  5. Calculate the total cost of ownership (TCO) of the target environment.
  6. Choose the workloads that you want to migrate first.

The following sections rely on Migration to Google Cloud: Assessing and discovering your workloads. However, they provide information that is specific to assessing workloads that you want to migrate to new GKE clusters.

In Migrating Kubernetes to GKE, Assessing your environment describes how to assess Kubernetes clusters and resources, such as ServiceAccounts and PersistentVolumes. That information also applies to assessing your GKE environment.

Build your inventories

To scope your migration, you must understand your current GKE environment. You start by gathering information about your clusters, and then you focus on your workloads deployed in those clusters and the workloads' dependencies. At the end of the assessment phase, you have two inventories: one for your clusters, and one for the workloads deployed in those clusters.

In Migrating Kubernetes to GKE, Build your inventories describes how to build the inventories of your Kubernetes clusters and workloads. It is also applicable to building the inventories of your GKE environments. Before you proceed with this document, follow that guidance to build the inventory of your Kubernetes clusters.

After you follow the Migrating Kubernetes to GKE guidance to build your inventories, refine them. To complete the inventory of your GKE clusters and node pools, consider GKE-specific aspects and features of each cluster and node pool, such as the GKE version and any features that can be enabled only at cluster creation.

When you build your inventory, you might find some GKE clusters that need to be decommissioned as part of your migration. Some Google Cloud resources aren't deleted when you delete the GKE clusters that created them. Make sure that your migration plan includes retiring those resources.

For information about other potential GKE-specific aspects and features, review the GKE documentation.

Complete the assessment

After you build the inventories related to your GKE clusters and workloads, complete the rest of the activities of the assessment phase in Migrating containers to Google Cloud: Migrating Kubernetes to GKE.

Planning and building your foundation

In the plan phase, you provision and configure the foundation: the cloud infrastructure and services that support your workloads on Google Cloud. In the plan phase, you do the following:

  • Build a resource hierarchy.
  • Configure identity and access management.
  • Set up billing.
  • Set up network connectivity.
  • Harden your security.
  • Set up monitoring and alerting.

When you set up network connectivity, ensure that you have enough IP addresses in your subnets to allocate for nodes, Pods, and Services, and plan your IP address allocations carefully. For example, you can configure privately used public IPs for GKE. The secondary IP address ranges that you set for Pods and Services on your clusters can't be changed after you allocate them. Take particular care if you allocate a Pod or Service range of /22 (1024 addresses) or smaller, because you might run out of IP addresses for Pods and Services as your clusters grow. For example, by default, GKE reserves a /24 range (256 addresses) from the Pod range for each node, so a /22 Pod range supports at most four nodes. For more information, see IP address range planning.
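For example, the following sketch defines a subnet with generously sized secondary ranges for Pods and Services by using Config Connector's ComputeSubnetwork resource. The names and CIDR ranges are illustrative assumptions; an equivalent gcloud command or Terraform configuration works just as well:

```yaml
# Sketch: a subnet with secondary ranges sized for cluster growth.
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeSubnetwork
metadata:
  name: gke-subnet
spec:
  region: us-central1
  ipCidrRange: 10.0.0.0/20       # Primary range, used by nodes.
  networkRef:
    name: my-vpc
  secondaryIpRange:
  - rangeName: pods
    ipCidrRange: 10.4.0.0/14     # Pod range, much larger than /22.
  - rangeName: services
    ipCidrRange: 10.8.0.0/20     # Service range.
```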

We recommend that you use a separate shared subnet for internal load balancers that you create for your GKE environment. When you use a Kubernetes Service of type: LoadBalancer, you can specify a load balancer subnet. When you configure internal HTTP(S) load balancers, you must configure a proxy-only subnet.
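For example, the following sketch requests an internal passthrough load balancer whose IP address is allocated from a dedicated subnet. The Service name, selector, ports, and the ilb-subnet name are illustrative assumptions; the annotations are the ones that GKE documents for internal LoadBalancer Services:

```yaml
# Sketch: an internal load balancer that draws its IP address from a
# dedicated shared subnet.
apiVersion: v1
kind: Service
metadata:
  name: internal-app
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
    networking.gke.io/internal-load-balancer-subnet: "ilb-subnet"
spec:
  type: LoadBalancer
  selector:
    app: internal-app
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
```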

To build the foundation of your GKE environment, complete the activities of the planning and building phase in Migrating containers to Google Cloud: Migrating Kubernetes to GKE.

Deploying your workloads

In the deployment phase, you do the following:

  1. Provision and configure the target environment.
  2. Migrate data from your source environment to the target environment.
  3. Deploy your workloads in the target environment.

This section provides information that is specific to deploying workloads to GKE. It builds on the information in Migrating Kubernetes to GKE: Deploying your workloads.

Evaluate your runtime platform and environments

To have a more flexible, reliable, and maintainable infrastructure, we recommend that you design and implement a multi-cluster architecture, in which you have multiple production GKE clusters in your environment. For example, multiple GKE clusters let you implement advanced cluster lifecycle strategies, such as rolling upgrades or blue-green upgrades. For more information about multi-cluster GKE architecture designs and their benefits, see Multi-cluster GKE upgrades using Multi Cluster Ingress.

When you run your environment across multiple clusters, there are additional challenges to consider, such as the following:

  • You need to adapt configuration management, service discovery and communication, application rollouts, and load balancing for incoming traffic.
  • You likely need to run extra software on your clusters, and to operate extra automation and infrastructure.

To address these challenges, you might need Continuous Integration/Continuous Deployment (CI/CD) pipelines to update the configuration of clusters sequentially to minimize the impact of mistakes. You might also need load balancers to distribute traffic from one cluster to other clusters.
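For example, the following Cloud Build sketch applies a configuration change to one cluster, waits for the rollout to succeed there, and only then updates a second cluster. The cluster names, zones, manifest path, and Deployment name are illustrative assumptions:

```yaml
# Sketch: roll a configuration change out to clusters sequentially.
steps:
# Apply the change to the first (canary) cluster.
- id: apply-to-canary-cluster
  name: gcr.io/cloud-builders/kubectl
  args: ['apply', '-f', 'k8s/']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=us-central1-a'
  - 'CLOUDSDK_CONTAINER_CLUSTER=cluster-canary'
# Fail the pipeline if the rollout doesn't succeed.
- id: verify-canary-rollout
  name: gcr.io/cloud-builders/kubectl
  args: ['rollout', 'status', 'deployment/my-app', '--timeout=300s']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=us-central1-a'
  - 'CLOUDSDK_CONTAINER_CLUSTER=cluster-canary'
# Only reached if the preceding steps succeeded.
- id: apply-to-second-cluster
  name: gcr.io/cloud-builders/kubectl
  args: ['apply', '-f', 'k8s/']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=us-central1-b'
  - 'CLOUDSDK_CONTAINER_CLUSTER=cluster-prod'
```

Because Cloud Build stops the pipeline when a step fails, a bad change that breaks the rollout on the first cluster never reaches the second cluster.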

Manually managing your infrastructure is error prone and exposes you to issues due to misconfiguration, and lack of internal documentation about the current state of your infrastructure. To help mitigate the risks due to those issues, we recommend that you apply the infrastructure as code pattern. When you apply this pattern, you treat the provisioning of your infrastructure the same way you handle the source code of your workloads.
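For example, the following minimal sketch describes a GKE cluster declaratively by using Config Connector, so that you can store the cluster definition in version control alongside your workload manifests. It assumes that Config Connector is installed and configured in a management cluster; the cluster name, location, and node count are illustrative, and tools like Terraform work equally well:

```yaml
# Sketch: a declarative GKE cluster definition managed by Config Connector.
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: prod-cluster-1
spec:
  location: us-central1
  initialNodeCount: 3
```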

There are several architecture options for your multi-cluster GKE environment, described later in this section. Choosing one option over the others depends on several factors, and no option is inherently better than the others. Each type has its own strengths and weaknesses. To choose a type of architecture, do the following:

  1. Establish a set of criteria to evaluate the types of architectures of multi-cluster GKE environments.
  2. Assess each option against the evaluation criteria.
  3. Choose the option that best suits your needs.

To establish the criteria to evaluate the architecture types of multi-cluster GKE environments, use the environment assessment that you completed to identify the features that you need. Order the features according to importance. For example, after assessing your workloads and environment, you might consider the following evaluation criteria, listed in potential order of importance:

  1. Google-managed solution. Do you prefer Google-managed or self-managed services and products?
  2. Interfaces to interact with the architecture. Is there a machine-readable interface that you can interact with? Is the interface defined as an open standard? Does the interface support declarative directives, imperative directives, or both?
  3. Expose services outside the GKE environment. Does the architecture let you expose your workloads outside the GKE cluster that they are deployed to?
  4. Inter-cluster communication. Does the architecture support communication channels between clusters? Do your workloads support a distributed architecture? This criterion is important to support workloads with a distributed design, such as Jenkins.
  5. Traffic management. Does the architecture support advanced traffic management features, such as fault injection, traffic shifting, request timeouts and retries, circuit breakers, and traffic mirroring? Are those features ready to use or do you have to implement them by yourself?
  6. Provision and configure additional tools. Do you need to provision and configure additional hardware or software components?

Google Cloud provides the following options for designing the architecture of a multi-cluster GKE environment. To choose the best option for your workloads, assess each option against the evaluation criteria that you established. Use an arbitrary, ordered scale to assign each design option a score against each evaluation criterion. For example, you can assign each option a score on a scale from 1 to 10 against each evaluation criterion. The following options are presented in increasing order of the effort that's required to manage the new multi-cluster GKE environment:

  1. Multi Cluster Ingress and Multi-Cluster Service Discovery
  2. Multi Cluster Ingress and Anthos Service Mesh
  3. Traffic Director
  4. Kubernetes and self-managed DNS record updates

The following sections describe these options in detail, and include a list of criteria to evaluate each option. You might be able to assign scores against some of the criteria by reading the product documentation. For example, by reading the documentation, you can evaluate Anthos Service Mesh against some of the evaluation criteria that you previously established. However, to assign scores against other criteria, you might need to design and execute more in-depth benchmarks and simulations. For example, you might need to benchmark the performance of different multi-cluster GKE architectures to assess whether they suit your workloads.

Multi Cluster Ingress and Multi-Cluster Service Discovery

Multi Cluster Ingress is a Google-managed service that lets you expose workloads regardless of which GKE cluster they are deployed to. It also lets you configure shared load balancers across GKE clusters and across regions. For more information about how to use Multi Cluster Ingress in a multi-cluster GKE environment, see Multi-cluster GKE upgrades using Multi Cluster Ingress and Supporting your migration with Istio mesh expansion.
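For example, the following sketch exposes a workload that runs in several clusters of the same fleet behind one shared load balancer. The namespace, names, labels, and ports are illustrative assumptions; you apply the MultiClusterService and MultiClusterIngress resources to the fleet's config cluster:

```yaml
# Sketch: expose the same workload across fleet clusters behind one
# shared load balancer by using Multi Cluster Ingress.
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: my-app-mcs
  namespace: my-ns
spec:
  template:
    spec:
      selector:
        app: my-app
      ports:
      - name: web
        protocol: TCP
        port: 8080
        targetPort: 8080
---
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: my-app-ingress
  namespace: my-ns
spec:
  template:
    spec:
      backend:
        serviceName: my-app-mcs
        servicePort: 8080
```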

Multi-Cluster Service Discovery is a Kubernetes-native cross-cluster service discovery and connectivity mechanism for GKE. Multi-Cluster Service Discovery builds on the Kubernetes Service resource to help apps discover and connect to each other across cluster boundaries. To evaluate Multi Cluster Ingress and Multi-Cluster Service Discovery against the criteria that you established earlier, use the following list, numbered in order of relative importance:

  1. Google-managed solution. Multi-Cluster Service Discovery is a fully managed feature of GKE. It configures resources (Cloud DNS zones and records, firewall rules, and Traffic Director) so that you don't have to manage them.
  2. Interfaces to interact with the architecture. Services are exported to other clusters by using a declarative Kubernetes resource called ServiceExport (see the sketch after this list).
  3. Expose services outside the GKE environment. When you use Multi Cluster Ingress and Multi-Cluster Service Discovery, you can expose your workloads outside your GKE clusters, regardless of where you deployed them.
  4. Inter-cluster communication. You export existing Services to other GKE clusters by declaring a ServiceExport object. GKE clusters in the same fleet automatically import Services that you export using ServiceExport objects. Multi-Cluster Service Discovery sets up a virtual IP address for each exported Service. Multi-Cluster Service Discovery automatically configures Traffic Director resources, Cloud DNS, and firewall rules to discover and connect to Services by using a simple variation of the Kubernetes DNS convention. For example, to reach the my-svc Service, you can use the my-svc.my-ns.svc.clusterset.local name.
  5. Traffic management. Multi-Cluster Service Discovery configures simple layer 3/4 connectivity and relies on DNS for service discovery. It doesn't provide any traffic management capabilities.
  6. Provision and configure additional tools. You can set up Multi-Cluster Service Discovery by enabling Google APIs. It doesn't require the installation of any additional tools. For more information, see Migrating containers to Google Cloud: Migrating to a multi-cluster GKE environment with Multi-Cluster Service Discovery and Multi Cluster Ingress.
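As a sketch of the export mechanism that items 2 and 4 of the preceding list describe, the following manifest exports the my-svc Service in the my-ns namespace, matching the DNS example in the list:

```yaml
# Sketch: export an existing Service to the other GKE clusters in the
# same fleet. The ServiceExport name and namespace must match the
# name and namespace of the Service that you export.
apiVersion: net.gke.io/v1
kind: ServiceExport
metadata:
  namespace: my-ns
  name: my-svc
```

Workloads in the other clusters of the fleet can then reach the Service at my-svc.my-ns.svc.clusterset.local.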

Multi Cluster Ingress and Anthos Service Mesh

A service mesh is an architectural pattern that helps with the network challenges of distributed apps. These challenges include service discovery, load balancing, fault tolerance, traffic control, observability, authentication, authorization, and encryption-in-transit. Typical service mesh implementations consist of a data plane and a control plane. The data plane is responsible for directly handling traffic and forwarding it to destination workloads, usually by using sidecar proxies. The control plane refers to the components that configure the data plane.

When you implement the service mesh pattern in your infrastructure, you can choose between two Google Cloud products: Anthos Service Mesh and Traffic Director. Both products provide a control plane for configuring application layer (L7) networking across multiple GKE clusters. Anthos Service Mesh is based on Istio and offers declarative open source APIs. Traffic Director is based on a combination of Google load-balancing features and open source technologies, and it offers imperative Google APIs.

Anthos Service Mesh is a Google-managed suite of tools that lets you connect, manage, secure, and monitor your workloads regardless of which GKE cluster they are deployed to and without modifying your app code. For more information about how to use Anthos Service Mesh to set up a mesh that spans multiple GKE clusters, see Add GKE clusters to Anthos Service Mesh. To evaluate Multi Cluster Ingress and Anthos Service Mesh against the criteria that you established earlier, use the following list, numbered in order of relative importance:

  1. Google-managed solution. Both Multi Cluster Ingress and Anthos Service Mesh are fully managed products. You don't need to provision those products, because Google manages them for you.
  2. Interfaces to interact with the architecture. Anthos Service Mesh uses Istio at its core. The Anthos Service Mesh API supports declarative configuration based on the Kubernetes resource model.
  3. Expose services outside the GKE environment. Multi Cluster Ingress and Anthos Service Mesh Ingress gateways let you expose your workloads outside of your GKE clusters.
  4. Inter-cluster communication. Anthos Service Mesh sets up secure communication channels directly between Pods regardless of the cluster they are running in. This setup lets you avoid spending additional effort to provision and configure these communication channels. Anthos Service Mesh uses the concept of fleets and service sameness to extend GKE service discovery to multiple clusters. Therefore, you don't need to modify your workloads to discover workloads running on other clusters in the mesh.
  5. Traffic management. Anthos Service Mesh provides advanced traffic management features that you can use to control how incoming traffic is secured and routed to workloads. For example, Anthos Service Mesh supports all the Istio traffic management features, such as fault injection, request timeouts and retries, circuit breakers, traffic mirroring, and traffic shifting. You can also use these traffic-management features to simplify your migration to a new GKE environment. For example, you can gradually shift traffic from your old environment to the new one (see the sketch after this list).
  6. Provision and configure additional tools. To use Multi Cluster Ingress, you need to meet Multi-Cluster Service Discovery prerequisites, but you don't need to install additional tools in your GKE clusters. To use Anthos Service Mesh, you need to install it in your clusters.
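As an illustration of the traffic-shifting capability that item 5 of the preceding list mentions, the following sketch uses the Istio traffic management APIs that Anthos Service Mesh supports. The hostname, subset names, and labels are illustrative assumptions; the weights send 90% of requests to the legacy workloads and 10% to the new ones:

```yaml
# Sketch: weighted traffic shifting between legacy and new workloads.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
  namespace: my-ns
spec:
  host: my-app.my-ns.svc.cluster.local
  subsets:
  - name: legacy
    labels:
      env: legacy
  - name: new
    labels:
      env: new
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
  namespace: my-ns
spec:
  hosts:
  - my-app.my-ns.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-app.my-ns.svc.cluster.local
        subset: legacy
      weight: 90
    - destination:
        host: my-app.my-ns.svc.cluster.local
        subset: new
      weight: 10
```

To complete a migration, you gradually adjust the weights until the new workloads receive 100% of the traffic.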

Traffic Director

Traffic Director is a managed control plane for application networking. It lets you provision and configure rich service mesh topologies, with advanced traffic management and observability features. For more information about Traffic Director, see Traffic Director overview and Traffic Director features. To provision and configure a service mesh spanning multiple GKE clusters, you can use a multi-cluster or a multi-environment Traffic Director configuration. To evaluate Traffic Director against the criteria that you established earlier, use the following list, numbered in order of relative importance:

  1. Google-managed solution. Traffic Director is a fully managed product. You don't need to provision it, because Google manages it for you.
  2. Interfaces to interact with the architecture. You can configure Traffic Director by using the Google Cloud console, Google Cloud CLI, Traffic Director API, or tools like Terraform. Traffic Director supports an imperative configuration model, and it's built on open source products and technologies, such as xDS and gRPC.
  3. Expose services outside the GKE environment. Traffic Director provisions and configures load balancers to handle incoming traffic from outside the service network.
  4. Inter-cluster communication. The Traffic Director control plane offers APIs that let you group endpoints (such as GKE Pods on multiple clusters) into service backends. These backends are then routable from other clients in the mesh. Traffic Director isn't directly integrated with GKE service discovery, but you can automate the integration by using an open source controller such as gke-autoneg-controller. You can also optionally use Multi-Cluster Service Discovery to extend GKE service discovery to multiple clusters.
  5. Traffic management. Traffic Director provides advanced traffic management features that you can use to simplify your migration to a new GKE environment and to enhance the reliability of your architecture. For information about configuring features like fine-grained traffic routing, weight-based traffic splitting, traffic mirroring, and fine-tuned load balancing, see Configuring advanced traffic management.
  6. Provision and configure additional tools. Traffic Director doesn't run in your GKE clusters. For information about provisioning and configuring Traffic Director, see Preparing for Traffic Director setup. To configure the sidecar proxies that Traffic Director needs in order to include your workloads in the service network, see Deploying Traffic Director with Envoy on GKE Pods.

Kubernetes and self-managed DNS record updates

If you don't want to install additional software on your cluster and you don't need the features that a service mesh provides, you can choose the Kubernetes and self-managed DNS record updates option.

Although you can configure inter-cluster discovery and connectivity by using this option, we recommend that you choose one of the other options described in this document. The effort that's needed to operate a self-managed solution greatly outweighs the benefits that you might get in return, and this option has important limitations, as the rest of this section describes.

When you create a Service of type: LoadBalancer or an Ingress object in a GKE cluster, GKE automatically creates Network Load Balancers and HTTP(S) Load Balancers to expose that Service through a load balancer IP address. You can use the IP addresses of the load balancers to communicate with your Services. However, we recommend that you avoid depending on IP addresses directly. Instead, map those IP addresses to DNS records by using Cloud DNS, or to Service Directory endpoints that you can query by using DNS, and configure your clients to use those DNS records. You can deploy multiple instances of the Service and map all the resulting load balancer IP addresses to the related DNS record or Service Directory endpoint.

To retire an instance of a Service, first remove the related load balancer IP address from the relevant DNS record or Service Directory endpoint. Then, after the DNS caches of your clients are updated, retire the Service.

You can configure your workloads to communicate with each other across different GKE clusters. To do so, first expose your Services outside the cluster by using internal TCP/UDP load balancers or internal HTTP(S) load balancers. Then, map the IP addresses of the load balancers to DNS records or Service Directory endpoints. Finally, in each cluster, create Services of type: ExternalName that point to those DNS records or Service Directory endpoints.
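For example, the following minimal sketch creates a local alias for a workload that runs in another cluster and that you exposed at a DNS name that you manage. The Service name, namespace, and DNS name are illustrative assumptions:

```yaml
# Sketch: a local alias for a workload that another GKE cluster exposes
# through an internal load balancer and a DNS record that you manage.
apiVersion: v1
kind: Service
metadata:
  name: billing-api
  namespace: my-ns
spec:
  type: ExternalName
  externalName: billing-api.example.internal
```

Clients in this cluster can then call billing-api.my-ns.svc.cluster.local, and the cluster's DNS returns a CNAME record that points to billing-api.example.internal.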

Optionally, you can use an extra Ingress controller to share a single load balancer and Cloud DNS record or Service Directory endpoint across multiple workloads. For example, if you provision an Ingress controller in a cluster, you can configure it to route requests that arrive at the load balancer that GKE creates for that controller to multiple Services. Using an extra Ingress controller reduces the number of DNS records or Service Directory endpoints that you need to manage.
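For example, the following sketch, with an illustrative hostname, paths, and Service names, shares one load balancer and one DNS record between two workloads by using a standard Kubernetes Ingress:

```yaml
# Sketch: fan out requests from one load balancer to multiple Services.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shared-ingress
  namespace: my-ns
spec:
  rules:
  - host: apps.example.internal
    http:
      paths:
      - path: /billing
        pathType: Prefix
        backend:
          service:
            name: billing-api
            port:
              number: 80
      - path: /orders
        pathType: Prefix
        backend:
          service:
            name: orders-api
            port:
              number: 80
```

With this configuration, you manage a single DNS record for apps.example.internal instead of one record per workload.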

To evaluate Kubernetes and self-managed DNS record updates against the criteria that you established earlier, use the following list, numbered in order of relative importance:

  1. Google-managed solution. You self-manage the Kubernetes objects that are part of this solution. Cloud DNS, Service Directory, and Load Balancing are Google-managed services.
  2. Interfaces to interact with the architecture. GKE uses Kubernetes at its core, and it supports both imperative and declarative configuration models.
  3. Expose services outside the GKE environment. You can use DNS records, Service Directory endpoints, and load balancers to expose services to clients outside your GKE clusters.
  4. Inter-cluster communication. Services of type: ExternalName let you define endpoints that point to services deployed in other GKE clusters. This configuration lets the services communicate with each other as if they were deployed in the same cluster.
  5. Traffic management. The solution doesn't offer additional traffic-management capabilities other than those already offered by Kubernetes and GKE. For example, this option doesn't support partitioning traffic between different clusters.
  6. Provision and configure additional tools. This option doesn't require additional software to be provisioned and configured in your GKE clusters. Optionally, you might install an Ingress controller.

Selecting an architecture

After you assign a score to each option against every criterion, calculate the total score of each option by adding all of the scores for that option. For example, if an option scored 10 against one criterion and 6 against another criterion, the total score of that option is 16.

You can also assign different weights to the score against each criterion so that you can represent the importance of each criterion for your evaluation. For example, if a Google-managed solution is more important than the support of a distributed workload architecture in your evaluation, you might define multipliers to reflect that: a 1.0 multiplier for Google-managed solution and a 0.7 multiplier for distributed-workload architecture. You then use these multipliers to calculate the total score of an option.
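For example, suppose that an option scores 8 against the Google-managed solution criterion and 6 against the distributed-workload architecture criterion. With the preceding multipliers, the weighted total score of that option is (8 × 1.0) + (6 × 0.7) = 12.2.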

After you calculate the total score of each option that you evaluated, order the options by their total score, in descending order. Then, pick the option with the highest score as your architecture of choice.

There are multiple ways to represent this data—for example, you can visualize the results with a chart suitable to represent multivariate data, such as a radar chart.

Migrate data from your old environment to your new environment

For guidance about migrating data from your old environment to your new environment, see Migrating Kubernetes to GKE, Migrate data from your old environment to your new environment.

Deploy your workloads

For guidance about deploying your workloads in the target environment, see Migrating Kubernetes to GKE, Deploy your workloads.

All the proposed architectures in this document let you migrate your workloads from an existing GKE environment to a new, multi-cluster environment without any downtime or cut-over window. To migrate your workloads without any downtime, do the following:

  1. Temporarily integrate your existing, legacy GKE clusters in the new, multi-cluster GKE environment.
  2. Deploy instances of your workloads in your new, multi-cluster environment.
  3. Gradually migrate traffic from your existing environment, so that you can gradually migrate your workloads to the new GKE clusters, and then retire the legacy GKE clusters.

Complete the deployment

After you provision and configure your runtime platform and environments, complete the activities described in Migrating Kubernetes to GKE, Deploying your workloads.

Optimizing your environment

Optimization is the last phase of your migration. In this phase, you make your environment more efficient than it was before. To optimize your environment, complete multiple iterations of the following repeatable loop until your environment meets your optimization requirements:

  1. Assessing your current environment, teams, and optimization loop.
  2. Establishing your optimization requirements and goals.
  3. Optimizing your environment and your teams.
  4. Tuning the optimization loop.

To optimize your GKE environment, see Migrating Kubernetes to GKE, Optimizing your environment.

What's next