Support your migration with Istio mesh expansion

Last reviewed 2023-11-02 UTC

This document shows an architecture that uses an Istio service mesh to migrate from a legacy environment, like an on-premises data center running applications in virtual machines, to Google Kubernetes Engine (GKE). Using a service mesh can reduce the complexity of migration and refactoring because it decouples network functions from service functions.

This document explains the rationale of the migration and outlines its high-level phases. You can use this architecture to perform a migration in one operation (sometimes called a big-bang migration), or to perform a gradual, feature-by-feature migration. The accompanying deployment document guides you through an example migration.

This architecture and its deployment guide are intended for IT professionals who manage a complex infrastructure that they want to gradually migrate and modernize, while minimizing the following:

  • Downtime
  • Refactoring effort
  • Operational complexity of your network

The concepts that are explained apply to any cloud, so the document assumes that you're familiar with cloud technologies, containers, and microservices.

As described in Migrate to Google Cloud: Get started, there are several patterns for migrating to the cloud. The architecture in this document uses a refactor pattern (sometimes called move and improve), where the pattern is applied to each feature of the application, rather than to the application as a whole.

During the migration, the application has a hybrid architecture where some features are on Google Cloud and some are still in the legacy environment. After the migration is finished, the complete application is hosted on Google Cloud.

Architecture

The following diagram shows how you can use a service mesh to route traffic either to microservices running in the source environment or to Google Cloud:

Architecture that uses a service mesh to route traffic either to microservices running in the legacy environment or to Google Cloud.

The architecture includes the following components:

  • A source environment, in this case a Compute Engine instance that hosts the example workload to migrate. The source environment can also be on-premises or hosted in other cloud platforms.

  • A service mesh, in this case Istio Gateway, which links different services together. The service mesh provides high-value networking features such as service discovery, secure communications, load balancing, traffic management, and observability.

    A typical implementation of a service mesh pairs each service with a proxy that provides those features. The service proxy is most often referred to as a sidecar. The role of the sidecar is to augment and improve the application it's attached to, often without the awareness of the application. In the accompanying deployment guide, you use Envoy as a sidecar proxy.

  • A workload that contains an application that's made up of microservices. A microservice is a standalone component that's built to accommodate an application feature. In this architecture, the application is composed of different microservices that are indistinguishable to users. For example, a component that handles book reviews is a microservice.

    In the microservices pattern, the application is the aggregate of multiple microservices, each with a specific goal. For example, you might have one microservice that handles the book ratings and another that handles book reviews. Those microservices should be loosely coupled, and they should interface with each other through well-defined APIs. They can be written in different languages and frameworks (as in polyglot applications), and they can have different lifecycles.

  • A container that serves to further define the boundaries of each microservice. Because the target environment in this architecture is orchestrated with Kubernetes, we recommend that you containerize your microservices by using the following tools:

    • Docker is a tool to isolate user-space level programs at the operating system level. It runs packages called containers.
    • Kubernetes is the leading orchestration solution for containerized workloads. It provides features like service discovery, load balancing, self-healing pods and nodes, and secret and configuration management.
    • GKE is a managed, production-ready Kubernetes environment, part of Google Cloud.

    For advice on how to containerize these microservices, see Best practices for building containers and Best practices for operating containers.

To perform a migration by using a service mesh, you complete the following phases:

  • Assess the legacy environment: Gather information and establish a set of requirements for the target environment and a baseline for testing and validation.
  • Build a foundation in the target environment: Provision your target environment and apply an infrastructure-as-code methodology to make your infrastructure auditable, repeatable, and automatically provisionable.
  • Deploy services and start routing traffic to the target environment: Complete a single operation to deploy all the microservice instances at the same time, or complete a gradual deployment to deploy one microservice at a time.
  • Stop routing traffic to the legacy environment: Set up routing rules to route traffic to services that are running in the target environment only.
  • Retire the legacy environment: Update your DNS records to point to the load balancer you set up during the target environment provisioning phase.

This document describes some design considerations for this type of migration, and the accompanying deployment guide provides detailed steps to complete the migration.

Design considerations

This section provides guidance to help you develop an architecture that meets your specific requirements for reliability, operational efficiency, cost, and performance optimization.

Reliability

The following sections describe considerations for the reliability of your migration.

Use a gradual migration strategy

Although this architecture can be used for a single-operation migration, we recommend that you use a gradual migration strategy. Completing a migration in one operation is difficult because of the challenges and risks that are involved in migrating one or more applications all at once. When you have constraints on time and budget, focusing on a single-operation migration doesn't leave much capacity for work on new application features.

By contrast, a gradual, feature-by-feature migration has a lower overall complexity due to the smaller size of the workload to migrate: a single feature has a smaller footprint compared to a whole application. A gradual migration lets you spread the risk across smaller migration events, instead of on a single, high-stakes exercise. A gradual migration also lets the migration team plan, design, and develop multiple migration strategies to accommodate different kinds of features.

For guidance on how to choose which features to migrate first and how to migrate stateful features, see Choosing the apps to migrate first in "Migrate to Google Cloud: Assess and discover your workloads."

Use a compliance test suite

To help facilitate your migration, we recommend that you use a compliance test suite. A compliance test suite is a set of tests that you can run against an environment to check whether it satisfies a given set of requirements. If the environment is valid, it meets the requirements. For example, you can validate the response to a test request, or you can check if the dependencies of the application are installed.

You can start with monitoring, tracing, and service mesh visualization tools for a manual validation. Then you can implement the test suite and evolve it over time in the following ways:

  • Load test: You can evolve the test suite by automatically sending test traffic to the environment and evaluating the results.
  • Compliance testing tool: You can design and develop a test suite using a dedicated tool.

Run the compliance test suite against the legacy environment as a baseline and also against the target environment during and after the migration.

Your test suites should focus on the following aspects to validate during the migration:

  • Provisioning: Provision the resources that you need in your environment before you configure them.
  • Configuration: After you provision resources in the environment, configure them to your application needs. Having a configuration test suite ensures that the environment is ready to host your application.
  • Business logic: After you validate the provisioning and the configuration of the environment, validate the business logic of your application. For example, you could validate the responses to requests.

Chef InSpec, RSpec, and Serverspec are tools to develop automated compliance test suites. They work on any platform, and they're extensible so that you can implement your own controls, starting from the built-in controls.

When you provision your target environment, you can use the compliance test suite to validate it. You might have to update the test suite to account for the eventual differences between the legacy and the target environment, like hardware and network topology. Keep in mind that you're moving from an on-premises environment, where you have full control, to a public cloud environment, where you usually don't have full access to the whole stack.

Before you route traffic from the load-balancing layer of the legacy environment to the target environment, we recommend that you run the business logic compliance test suite against the microservices that are exposed by the target environment. The test suite can help to ensure that the microservices are working as expected. For example, you can validate the responses returned by services that are exposed by the service mesh. You can keep the original route as a backup option in case you need to roll back the changes and revert to the legacy environment. By routing a small portion of production traffic to instances of the microservices that are running in the target environment, you can build confidence in the target environment and increase total traffic over time.

Disallow cross-environment requests

During the compliance test phase, we recommend that you refine the routing rules to disallow cross-environment requests, so that when a client request hits an environment, either legacy or target, it stays in that environment. Disallowing cross-environment requests helps to ensure that tests correctly validate the target environment. If you don't disallow these requests, a test might report success on the source environment instead of the target environment without your knowledge.

Retire the legacy environment

Before you retire the legacy environment, verify that all of the following is true:

  • No traffic is being routed to instances of microservices running in the legacy environment.
  • No traffic comes through the interfaces of the legacy environment.
  • The target environment has been completely validated.

When these conditions are met, you can update your DNS records to point to the load balancer you set up during the target environment provisioning phase. Don't retire the legacy environment unless you're sure that the target environment has been validated. And don't limit the validation to the business logic only, but consider other nonfunctional requirements like performance and scalability.

Operational efficiency

The following sections describe considerations for operational efficiency of your migration.

Assess the legacy environment

Before you design or implement a migration plan, you should assess the legacy environment to gather information and to establish a set of requirements for the target environment and a baseline for testing and validation. You start by building a catalog of all the application features to migrate. For each feature, you should be able to answer this non-exhaustive set of questions:

  • What are the runtime environment and performance requirements?
  • Are there any dependencies on other features?
  • Is this feature business critical?
  • Is this feature stateless or stateful?
  • How much refactoring is expected to migrate it?
  • Can this feature afford a cut-over window?

For more information, see Migrate to Google Cloud: Assess and discover your workloads.

Stateless versus stateful resources

The architecture uses a service mesh because it decouples service functions (how business logic is implemented) from network functions (how and when to route traffic to service functions). In the accompanying deployment, you use Istio as the service mesh.

In the legacy environment, most service calls don't involve the network, because they occur in a monolithic platform. In a microservices architecture, communications between services occur over a network, and services must handle this different model. A service mesh abstracts away functions to handle network communications, so you don't have to implement them in each application. A service mesh also reduces the operational complexity of the network because it provides secure communication channels, load balancing, traffic management, and observability features that don't require any configuration.

By deploying and configuring a service mesh, you can dynamically route traffic to either the legacy environment or the target environment. You don't need to modify the configuration of the application to support your migration, because traffic management is transparent to services in the mesh.

Although this approach works well for stateless features, you need to do additional planning and refactoring in order to migrate features that are stateful, latency sensitive, or highly coupled to other features:

  • Stateful: When you migrate stateful features, you have to migrate data as well, in order to minimize downtime and to mitigate synchronization and integrity issues during the migration. You can read more about data migration strategies in the Evaluating data migration approaches section of "Migration to Google Cloud: Transferring your large datasets."
  • Latency sensitive: If a feature is sensitive to latency when communicating with other features, you might need to deploy additional components during the migration process. Proxies that can prefetch data or caching layers are commonly used to mitigate this sensitivity.
  • Highly coupled to other features: If two or more features are highly coupled, you might have to migrate them at the same time. Although this approach is easier than migrating a whole application, it might be harder than migrating a single feature.

Register services with the service mesh

Because your legacy environment isn't directly integrated with the service mesh, when you configure your migration, you have to manually register all services that are running in the legacy environment with the service mesh. If your environment is already running in Kubernetes, you can automate the registration by using the built-in integration with the service mesh APIs.

After you register the legacy environment, you use the service mesh to expose the microservices that are running in the legacy environment. You can then gradually route traffic to microservices from the interfaces of the legacy environment to the interfaces of the target environment.

Clients won't see any service disruption, because they access the interfaces of the two environments through a load-balancing layer. Traffic routing inside the service mesh is transparent to clients, so clients won't know that routing configuration has been changed.

Cost optimization

This architecture uses the following billable components of Google Cloud:

When you deploy the architecture, you can use the pricing calculator to generate a cost estimate based on your projected usage.

Performance optimization

In this architecture, traffic is almost equally split between the services that are running in Compute Engine and the services that are running in GKE. The service mesh balances traffic between the instances of a service that it selects, whether they're running in the GKE cluster or in the Compute Engine instances. If you want all the traffic to go through the service mesh, you need to reconfigure the workloads that are running in Compute Engine to point to the services in the service mesh, instead of connecting to them directly. You can configure the load-balancing policy for the revisions of each service with the DestinationRule resource.

Deployment

To deploy this architecture, see Deploy your migration with Istio mesh expansion.

What's next