Supporting your migration with Istio mesh expansion: Concept

This article is the first part of a series that discusses using a service mesh to migrate feature by feature from a legacy environment, like an on-premises data center running applications in virtual machines, to Google Kubernetes Engine (GKE).

The series consists of this conceptual article and the accompanying tutorial. This article explains the rationale of the migration and outlines its high-level steps. The tutorial guides you through an example migration.

Introduction

This article is intended for IT professionals who are in charge of a complex infrastructure that they want to gradually migrate and modernize, while minimizing the following:

  • Downtime
  • Refactoring effort
  • Operational complexity of your network

The concepts explained here apply to any cloud. The article assumes that you're familiar with cloud technologies, containers, and microservices.

As described in Hybrid and multi-cloud patterns and practices, there are three main patterns for migrating to the cloud: lift and shift, improve and move, and rip and replace. This article describes an improve-and-move pattern, where the pattern is applied to each feature of the application, rather than to the application as a whole.

During the migration, the application has a hybrid architecture where some features are on Google Cloud and some are still in the legacy environment. After the migration is finished, the complete application is hosted on Google Cloud.

Terminology

application

In this article, an application is a complete software system with many features. Users perceive an application as a single unit. For example, a website selling books is an application.
feature

A feature is a unit of functionality of an application, like the book review feature of a bookstore application. Features consist of microservices.

Features can be either stateless, where they don't depend on any kind of data, or stateful, where they have a dependency on data.

microservice

A microservice is a standalone component that is built to accommodate an application feature. In this article, an application is composed of different microservices that are indistinguishable to users. For example, a component that handles book reviews is a microservice.

In the microservices pattern, the application is the aggregate of multiple microservices, each with a specific goal. For example, you might have one microservice that handles the book ratings and another that handles book reviews. Those microservices should be loosely coupled, and they should interface with each other through well-defined APIs. They can be written in different languages and frameworks (as in polyglot applications), and they can have different life cycles.

To further define the boundaries of each microservice, you can containerize them.

service mesh

A service mesh is software that links different services together and provides high-value networking features such as service discovery, secure communication, load balancing, traffic management, monitoring, and observability.

A typical service mesh implementation pairs each service with a proxy that provides those features. The service proxy is most often referred to as a sidecar. The role of the sidecar is to augment and improve the application it's attached to, often without the application's knowledge.
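
In Kubernetes, for example, Istio can attach the sidecar proxy automatically. The following sketch assumes that Istio is already installed in the cluster and uses a hypothetical bookstore namespace; labeling the namespace causes every pod that's subsequently deployed in it to receive an injected Envoy sidecar, with no changes to the application:

    # Enable automatic sidecar injection for every new pod in the namespace.
    # The namespace name "bookstore" is illustrative.
    kubectl label namespace bookstore istio-injection=enabled

    # Verify that the label was applied.
    kubectl get namespace bookstore --show-labels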

migration

A migration is a process to move features of one or more applications running in a legacy environment to a destination environment, like Google Cloud. In this article, a migration can be either of the following kinds:

  • Big-bang migration: when you migrate all the features of an application at once.
  • Gradual, feature-by-feature migration: when you migrate one feature at a time.
compliance test suite

A compliance test suite is a set of tests that you can run against an environment to check whether it satisfies a given set of requirements. If the environment satisfies those requirements, it's considered valid. For example, you could validate the response to a test request, or you could check that the dependencies of the application are installed.

You can start with monitoring, tracing, and service mesh visualization tools for a manual validation. Then you can implement the test suite and evolve it over time:

  • Load testing. You can evolve the test suite by automatically sending test traffic to the environment and evaluating the results.
  • Compliance testing tool. You can design and develop the test suite by using a dedicated tool.

Why choose a gradual migration strategy?

A big-bang migration is difficult because of the challenges and risks involved in migrating one or more applications in a single exercise. When you have constraints on time and budget, focusing on a big-bang migration doesn't leave much capacity for work on new application features.

By contrast, a gradual, feature-by-feature migration has lower overall complexity because of the smaller size of the workload to migrate: a single feature has a smaller footprint than a whole application. A gradual migration lets you spread the risk across smaller migration events, instead of concentrating it in a single, high-stakes exercise. A gradual migration also lets the migration team plan, design, and develop multiple migration strategies to accommodate different kinds of features.

You can find guidance on how to choose which features to migrate first and how to migrate stateful features in Migrating a monolithic application to microservices on Google Kubernetes Engine.

Why use a service mesh?

A service mesh decouples service functions (how business logic is implemented) from network functions (how and when to route traffic to service functions).

In the legacy environment, most service calls don't involve the network, because they occur in a monolithic platform. In microservices architectures, communications between services occur over a network and services must deal with this different model. A service mesh abstracts away functions to deal with network communications, so you don't have to implement them in each application. A service mesh also reduces the operational complexity of the network because it provides secure communication channels, load balancing, traffic management, monitoring, and observability features out of the box.

In the accompanying tutorial, you use Istio as the service mesh. Istio features include the following (a configuration sketch follows the list):

  • Traffic management: fine-grained control of traffic with rich routing rules for HTTP, gRPC, WebSocket, and TCP traffic.
  • Request resiliency features: retries, failovers, circuit breakers, and fault injection.
  • A pluggable policy layer and configuration API supporting access control and rate limiting.
  • Automatic metrics, logs, and traces for all traffic within a cluster, including cluster ingress and egress.
  • Secure service-to-service communication with service account–based authentication and authorization.
  • Facilitation of testing and deployment tasks, such as A/B testing, canary rollouts, fault injection, and rate limiting.
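
As an example of how these features are configured, the following sketch shows an Istio VirtualService that adds retries and timeouts to a hypothetical reviews service. The service name is illustrative, and no application code changes are needed:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: reviews            # illustrative service name
    spec:
      hosts:
      - reviews
      http:
      - route:
        - destination:
            host: reviews
        retries:
          attempts: 3          # retry failed requests up to three times
          perTryTimeout: 2s    # time out each attempt after two seconds
        timeout: 10s           # overall request deadline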

You can also visualize your service mesh. Tools like Kiali integrate with Istio to show which services are part of a service mesh and how they are connected.

The following screenshot shows an example of a Kiali service graph representing an Istio mesh.

Kiali service graph representing an Istio mesh

By deploying and configuring a service mesh, you can dynamically route traffic to either the legacy environment or the target environment. You don't need to modify the configuration of the application to support your migration, because traffic management is transparent to services in the mesh.

Although this approach works well for stateless features, you need to do additional planning and refactoring in order to migrate features that are stateful, latency sensitive, or highly coupled to other features:

  • Stateful. When migrating stateful features, you have to migrate data as well, minimizing downtime and dealing with synchronization and integrity issues during the migration. You can read more about data migration strategies in Migrating a monolithic application to microservices on Google Kubernetes Engine.
  • Latency sensitive. If a feature is sensitive to latency when communicating with other features, you might need to deploy additional components during the migration process. Proxies that are capable of prefetching data or caching layers are commonly used to mitigate this sensitivity.
  • Highly coupled to other features. If two or more features are highly coupled, you might have to migrate them at the same time. Although this approach is easier than migrating a whole application, it might be harder than migrating a single feature.

The migration plan

This section presents a plan to perform a gradual, feature-by-feature migration using a service mesh. The plan is composed of the following phases:

  1. Assess the legacy environment.
  2. Build a foundation in the target environment.
  3. Deploy services in the target environment and start routing traffic to the target environment.
  4. Stop routing traffic to the legacy environment.
  5. Retire the legacy environment.

Assess the legacy environment

Before any migration design or implementation activity, you assess the legacy environment to gather information and to establish a set of requirements for the target environment and a baseline for testing and validation. You start by building a catalog of all the application features to migrate. For each feature, you should be able to answer this (non-exhaustive) set of questions:

  • What are the runtime environment and performance requirements?
  • Are there any dependencies on other features?
  • Is this feature business critical?
  • Is this feature stateless or stateful?
  • How much refactoring is expected to migrate it?
  • Can this feature afford a cut-over window?

For more about the assessment process and about which features to migrate first, see Migrating a monolithic application to microservices on Google Kubernetes Engine. After the assessment, run the compliance test suite against the legacy environment to establish a baseline, and run it again against the target environment during and after the migration.

Your test suites should focus on the following aspects to validate during the migration:

  • Provisioning. Verify that the resources you need are provisioned in your environment before you configure them.
  • Configuration. After you provision resources in the environment, configure them to your application needs. Having a configuration test suite ensures that the environment is ready to host your application.
  • Business logic. After validating the provisioning and the configuration of the environment, validate the business logic of your application. For example, you could validate the responses to requests.

InSpec, RSpec, and Serverspec are tools to develop automated compliance test suites. They are platform agnostic and extensible so you can implement your own controls, starting from the built-in ones.
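
For illustration, the following minimal InSpec control sketches the kind of checks such a suite can contain. The endpoint URL, port, and control name are placeholders, not part of the accompanying tutorial:

    # Check that the environment is configured to serve the application.
    control 'app-smoke-01' do
      impact 1.0
      title 'Application endpoint responds'

      # Validate the business logic with a test request.
      describe http('http://app.example.com/healthz') do
        its('status') { should cmp 200 }
      end

      # Validate the configuration: the service port is listening.
      describe port(8080) do
        it { should be_listening }
      end
    end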

The following diagram shows an example of a legacy environment:

Legacy environment

Provision the target environment

Now that you have enough information about the environments and the applications to migrate, you can provision the target environment according to the requirements you set during your assessment.

During this phase, you can use the compliance test suite to validate the target environment. You might have to update the test suite to account for any differences between the legacy and the target environments, such as hardware and network topology. Keep in mind that you are moving from an on-premises environment, where you have full control, to a public cloud environment, where you usually don't have full access to the whole stack.

Because you are provisioning a new environment, we recommend applying an infrastructure-as-code methodology to make your infrastructure auditable, repeatable, and automatically provisionable.
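
For example, you can keep the provisioning of the target environment in a versioned script, or in a dedicated tool like Terraform, so that the environment can be re-created on demand. The following sketch uses the gcloud CLI; the cluster name, zone, and size are placeholders:

    # Provision the target GKE cluster from a versioned script so that
    # the environment is repeatable and auditable.
    gcloud container clusters create target-cluster \
        --zone us-central1-a \
        --num-nodes 3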

The following diagram shows the legacy environment and the newly provisioned—and currently empty—target environment.

Legacy environment and (empty) target environment

Configure a service mesh

Your next step is to set up a service mesh that spans the legacy environment and the target environment. The service mesh handles connectivity for the microservices that run in the legacy environment and that you will migrate to the target environment. In this phase, the service mesh is empty, waiting for services to be registered, and it isn't receiving any production traffic yet.
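
For example, you can install Istio on the target GKE cluster with istioctl, as in the following sketch. Expanding the mesh to VMs in the legacy environment requires additional settings, which depend on the Istio version that you use:

    # Install Istio with the default configuration profile.
    # Mesh expansion to external VMs needs extra, version-dependent settings.
    istioctl install --set profile=default -y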

The following diagram shows the legacy environment and the empty service mesh in the target environment.

Legacy environment and empty service mesh

Add services in the legacy environment to the mesh

In this example, the legacy environment isn't directly integrated with the service mesh, so you have to manually register all services running in the legacy environment with the mesh. If your environment is already running on Kubernetes, you can automate the registration by using the native integration with the service mesh APIs.

At this stage, clients still use the legacy environment interfaces to access the microservices you are migrating. The service mesh doesn't receive any production traffic yet.
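
One way to register a VM-hosted service manually is with an Istio ServiceEntry. The following sketch describes a hypothetical ratings service running on a legacy VM; the host name, IP address, and port are placeholders:

    apiVersion: networking.istio.io/v1alpha3
    kind: ServiceEntry
    metadata:
      name: ratings-legacy
    spec:
      hosts:
      - ratings.legacy.example.com   # hypothetical name for the VM service
      location: MESH_INTERNAL        # treat the VM as part of the mesh
      ports:
      - number: 8080
        name: http
        protocol: HTTP
      resolution: STATIC
      endpoints:
      - address: 10.128.0.5          # placeholder IP address of the VM
        ports:
          http: 8080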

The following diagram shows that services running in the legacy environment are added to the service mesh.

Services running in the legacy environment are added to the service mesh

Expose services through the service mesh

With the legacy environment registered, you use the service mesh to expose the microservices running in the legacy environment. At this stage, you also start gradually routing traffic away from the interfaces of the legacy environment and toward the interfaces of the target environment.

Clients don't see any service disruption, because they access the interfaces of the two environments through a load balancing layer. Moreover, traffic routing inside the service mesh is transparent to clients. Clients won't know that a routing configuration change has occurred.
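
For example, the following sketch exposes the hypothetical ratings service from the previous section through the Istio ingress gateway. The gateway name and the routing match are illustrative:

    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: mesh-gateway
    spec:
      selector:
        istio: ingressgateway        # use Istio's default ingress gateway
      servers:
      - port:
          number: 80
          name: http
          protocol: HTTP
        hosts:
        - "*"
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: ratings
    spec:
      hosts:
      - "*"
      gateways:
      - mesh-gateway
      http:
      - match:
        - uri:
            prefix: /ratings
        route:
        - destination:
            host: ratings.legacy.example.com   # the ServiceEntry host
            port:
              number: 8080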

The following diagram shows that services running in the legacy environment are exposed through the service mesh.

Services running in the legacy environment are exposed through the service mesh

Deploy services in the target environment

In this phase, you deploy microservices in the target environment. This phase assumes that you've already containerized these microservices. You can then choose a deployment strategy to initialize the workload to migrate in the target environment:

  • Bulk deployment: you deploy all the microservice instances at the same time.
  • Gradual deployment: you deploy one microservice at a time.

The microservice instances in the target environment don't receive any production traffic yet.

When migrating stateful microservices, you also have to migrate the related data, minimizing downtime and dealing with synchronization and integrity issues. For more about data migration strategies, see Migrating a monolithic application to microservices on Google Kubernetes Engine.

Microservices running in the target environment are automatically registered in the service mesh, thanks to the native integration with Kubernetes and Istio. At this stage, clients still use the microservices running in the legacy environment that are exposed through the service mesh. The microservices running in the target environment still won't receive any production traffic.

The following diagram shows the expanded services that don't receive any production traffic, and the ones in the legacy environment exposed using the service mesh.

Expanded services that don't receive any production traffic, and the ones in the legacy environment exposed using the service mesh

Set up routing rules to split traffic

You now set up routing rules to split production traffic between services running in the legacy environment and ones running in the target environment, using the service mesh. You can start by routing a small portion of production traffic to instances of microservices running in the target environment. As you build confidence in the target environment through the test suite and robust monitoring, you can increase that portion of total traffic over time.
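
Continuing the earlier sketches, the following illustrative VirtualService splits traffic for the hypothetical ratings service, sending 90% to the legacy VM and 10% to the instance in the target environment. You adjust the weights over time:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: ratings
    spec:
      hosts:
      - "*"
      gateways:
      - mesh-gateway
      http:
      - match:
        - uri:
            prefix: /ratings
        route:
        - destination:
            host: ratings.legacy.example.com   # legacy VM (ServiceEntry host)
            port:
              number: 8080
          weight: 90                           # most traffic stays on legacy
        - destination:
            host: ratings                      # in-cluster service in the target environment
            port:
              number: 8080
          weight: 10                           # small portion goes to the target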

Because client requests are routed to both environments, you need additional planning and coordination for stateful microservices: related data has to be migrated as well, and during the migration there can be a transitory phase with multiple sources of truth.

The following diagram shows how traffic is split between the microservices running in the target environment and the ones running in the legacy environment.

Traffic split between the microservices running in the target environment and the ones running in the legacy environment

We also recommend that you refine the routing rules to disallow cross-environment requests, so when a client request hits an environment, either legacy or target, it stays in that environment.

Set up rules to route traffic to the target environment

In this phase, you update the routing rules to gradually route traffic to services running in the target environment only.
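
At the end of this phase, the route from the earlier sketches would send the full weight to the target environment, for example:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: ratings
    spec:
      hosts:
      - "*"
      gateways:
      - mesh-gateway
      http:
      - match:
        - uri:
            prefix: /ratings
        route:
        - destination:
            host: ratings              # target environment receives all traffic
            port:
              number: 8080
          weight: 100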

The following diagram shows how traffic is now routed only to the target environment, while the legacy environment is kept as a backup.

Traffic is now routed only to the target environment, while the legacy environment is kept as a backup

Retire the legacy environment

In this phase, you retire the legacy environment.

Before retiring the legacy environment, make sure of the following:

  • No traffic is being routed to instances of microservices running in the legacy environment.
  • No traffic comes through the interfaces of the legacy environment.
  • The target environment has been completely validated.

When these conditions are met, you can update your DNS records to point to the load balancer you set up during the target environment provisioning phase.
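
For example, if you use Cloud DNS, the record update might look like the following sketch. The zone name, domain, and IP addresses are placeholders:

    # Point the application's A record at the load balancer of the
    # target environment. All values shown are illustrative.
    gcloud dns record-sets transaction start --zone=app-zone
    gcloud dns record-sets transaction remove "203.0.113.10" \
        --name=www.example.com. --type=A --ttl=300 --zone=app-zone
    gcloud dns record-sets transaction add "198.51.100.27" \
        --name=www.example.com. --type=A --ttl=300 --zone=app-zone
    gcloud dns record-sets transaction execute --zone=app-zone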

The following diagram shows only the target environment, because the legacy environment has been retired.

Target environment (with legacy environment retired)

What's next