Migrating a monolithic application to microservices on Google Kubernetes Engine

This article guides you through the high-level concepts of migrating a website from a monolithic platform to a refactored, container-based microservices platform on Google Cloud Platform (GCP). The migration is done feature by feature, avoiding a large-scale migration event and its associated risks. This article is intended for IT professionals in charge of a complex website that's hosted on a monolithic platform and that they want to modernize. Reading this article doesn't require an in-depth knowledge of GCP or Kubernetes.

The goal of a migration like this is to provide a more nimble and scalable environment for individual features of the site, where the features can be more easily managed and updated than if they are part of the monolithic platform. Running in such an environment leads to faster improvements on each migrated feature, providing users with value along the way.

This article uses ecommerce websites as an example workload. Many ecommerce websites are built with monolithic and proprietary platforms, so they are good candidates for the migration described here. However, you can apply the principles described in this article to a wide range of workloads, as long as your systems and constraints are similar to those described here. For example, websites for booking hotels or renting cars would also be good candidates for this migration pattern.

As described in the document Hybrid and multi-cloud patterns and practices, there are three main patterns for migrating to the cloud: lift and shift, improve and move, and rip and replace. This article describes a specific flavor of the rip and replace pattern, where the pattern is applied incrementally to each feature of the application, rather than to the application as a whole.

In addition to this migration pattern, the article explores two hybrid patterns. During the migration itself, the application has a hybrid architecture where some features are in the cloud and some are still on-premises. After the migration is finished, the complete application is hosted in the cloud, but it still interacts with backend services that remain on-premises.

Terminology

application
In this article, an application is a complete software system, with potentially many features, that's perceived as a single unit by end users. For example, an ecommerce website is an application. Everything that can be done with an application is done with a specific feature (or functionality) of that application.
feature
A unit of functionality for an application. Features can be user-facing and core to the application (like the shopping cart in an ecommerce website), user-facing and required by the application (like logging in), or internal to the application (like stock management for an ecommerce website).
service
A standalone component of an application. In this article, an application is composed of different services that are invisible to the end users. For example, a database used by a website is a service.
monolithic application
An application built as a single deployable unit. (Also known simply as a monolith.) Examples include a single Java EE application or a single .NET Web Application. A monolithic application is often associated with a database and a client-side user interface.
microservice
A single service that's built to accommodate an application feature. In the microservices pattern, the application is the aggregate of multiple services, each having a specific goal. For example, you might have a service that handles the shopping cart for your customers, another one for handling the payment service, and another one for interfacing with a backend application for stock. Those microservices should be loosely coupled and they should interface with each other through well-defined APIs. They can be written in different languages and frameworks, and they can have different life cycles.
hybrid architecture
An architecture where you use a public cloud provider (such as GCP) in combination with private resources hosted in your own data center. There are many reasons and ways to implement a hybrid architecture; they're outlined in the article Hybrid and multi-cloud patterns and practices.
stateless and stateful services
An indication of whether a service directly manages data storage. This article uses the same definition of statelessness and statefulness as the Twelve-Factor App methodology. A service is stateless when it doesn't store any data itself; a stateful service does. For example, a service that handles the shopping carts of your customers is stateful, because the shopping carts need to be stored and retrieved. A service that checks the availability of items in the backend system is stateless—the data (state) is stored by the backend system, not by the service.

Why move to microservices?

Breaking down an application into microservices has the following advantages; most of these stem from the fact that microservices are loosely coupled.

  • The microservices can be independently tested and deployed. The smaller the unit of deployment, the easier the deployment.
  • They can be implemented in different languages and frameworks. For each microservice, you're free to choose the best technology for its particular use case.
  • They can be managed by different teams. The boundary between microservices makes it easier to dedicate a team to one or several microservices.
  • By moving to microservices, you loosen the dependencies between the teams. Each team has to care only about the APIs of the microservices they are dependent on. The team doesn't need to think about how those microservices are implemented, about their release cycles, and so on.
  • You can more easily design for failure. By having clear boundaries between services, it's easier to determine what to do if a service is down.

Microservices do have a few disadvantages when compared to monoliths:

  • Because a microservice-based app is a network of different services that often interact in ways that are not obvious, the overall complexity of the system tends to grow.
  • Unlike the internals of a monolith, microservices communicate over a network. In some circumstances, this can be seen as a security concern. Istio can address this concern by automatically encrypting the traffic between microservices with mutual TLS.
  • It can be hard to achieve the same level of performance as with a monolithic approach because of latencies between services.
  • The behavior of your system isn't determined by a single service, but by many services and their interactions. Because of this, understanding how your system behaves in production (its observability) is harder. Istio can help with this problem as well, by providing uniform telemetry for the traffic between your microservices.

While some companies like Google have been using microservices for many years, the concept (and its relation to service-oriented architecture, or SOA) was popularized by James Lewis and Martin Fowler in their Microservices article, and explored further in Sam Newman's book Building Microservices.

Overview of the migration process

The migration process described in this article is long and can take months to complete, because it's a large-scale project. This section maps the path from a monolithic, on-premises application to an application that's fully hosted on GCP and built with microservices.

The starting point: a monolithic, on-premises app

This article assumes that you're currently running a monolithic application on-premises. The following high-level architecture diagram is probably similar to your current architecture.

Architecture of typical monolithic application

This diagram isn't meant to be fully representative of your current systems. Some of the technologies that are typically found in an architecture like this are:

  • Relational databases, such as Oracle® database or SAP HANA
  • Ecommerce platforms, such as Oracle ATG or SAP Hybris
  • Apache Solr or other search engines

The end point: a microservices-based app on GCP

The migration process described here is meant to go from an on-premises, monolithic architecture to a microservices-based architecture running on GCP. The target architecture is described in Scalable commerce workloads using microservices. It looks like this:

Architecture after migrating to containers and GKE

In the target architecture, you run microservices in Google Kubernetes Engine (GKE). Kubernetes is a platform to manage, host, scale, and deploy containers. Containers are a portable way of packaging and running code. They are well suited to the microservices pattern, where each microservice can run in its own container. The microservices can make calls to your backend systems through a private network connection created by Cloud Interconnect or Cloud VPN. Alternatively, you can use Apigee to expose your backend services and secure access to them. These products, and how to choose between them, are described later in this document in the section Apigee, Cloud VPN, and Cloud Interconnect.

The microservices can also interact with a number of other GCP products, depending on their needs. Cloud Storage and Cloud SQL are two of the most common GCP products for an ecommerce application. Cloud Pub/Sub can be used as a message bus between different microservices for asynchronous work.
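
As a brief illustration of this message-bus pattern, the following Python sketch publishes an order event to a Cloud Pub/Sub topic so that other microservices can process it asynchronously. The project ID, topic name, and payload fields are hypothetical, not part of the reference architecture.

import json
from google.cloud import pubsub_v1

# Publish an "order placed" event so that other microservices (for example,
# billing or fulfillment) can react asynchronously. The project ID, topic
# name, and payload fields below are hypothetical.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "order-events")

def publish_order_placed(order_id, user_id):
    payload = json.dumps({
        "event": "order_placed",
        "order_id": order_id,
        "user_id": user_id,
    }).encode("utf-8")
    # publish() returns a future; result() blocks until Pub/Sub acknowledges.
    future = publisher.publish(topic_path, data=payload)
    future.result()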

You expose your application to the internet in two ways: directly through a GCP load balancer for assets such as images, and optionally through Apigee for your public APIs.

The following sections list the products that you might use in your target architecture. Not all of them are mandatory; whether you use them depends on your exact use case.

Networking

Virtual Private Cloud
Usage: A VPC is a software-defined private network where your resources (such as a GKE cluster) live. VPC includes features such as firewalling and routing. A VPC is globally available, allowing private connectivity across the world on top of Google-owned fiber networks.
Notes: Cloud VPC is mandatory in this architecture.

Cloud Interconnect
Usage: Cloud Interconnect extends your on-premises network to Google's network through a highly available, low-latency connection. You can use Dedicated Interconnect to connect directly to Google, or use Partner Interconnect to connect to Google through a supported service provider.
Notes: Your ecommerce website probably needs to call backend services on-premises such as a warehouse management system or a billing system. Cloud Interconnect is one of the solutions for this.

Cloud VPN
Usage: Cloud VPN securely extends your on-premises network to Google's network through an IPsec VPN tunnel. Traffic is encrypted and travels between the two networks over the public internet. Cloud VPN is useful for low-volume data connections.
Notes: Your ecommerce website probably needs to call backend services on-premises such as a warehouse management system or a billing system. Cloud VPN is one of the solutions for this.

Cloud Load Balancing
Usage: Cloud Load Balancing is Google's managed load-balancing solution. It supports both L4 (TCP/UDP) and L7 (HTTP) load balancing. It can also serve as an SSL (HTTPS) termination endpoint. Creating a GCP load balancer gives you a single anycast IP. All of your users who are trying to access this IP are automatically routed to the nearest Google point of presence for lower network latency, thanks to Google's Premium Tier network.
Notes: Cloud Load Balancing is mandatory in this architecture.

Cloud CDN
Usage: Cloud CDN (content delivery network) uses Google's globally distributed edge points of presence to cache HTTP(S) load-balanced content close to your users. Caching content at the edges of Google's network provides faster delivery of content to your users while reducing serving costs.
Notes: Cloud CDN isn't mandatory in this scenario, but it's recommended, especially for static content.

Platform

Google Kubernetes Engine (GKE)
Usage: GKE is Google's managed Kubernetes product for deploying containerized applications. GKE is fully compliant with open source Kubernetes, and it provides many advanced features such as regional clusters for high availability, private clusters, vertical pod autoscaling, cluster autoscaling, GPUs, and preemptible nodes.
Notes: GKE is mandatory in this architecture.

Istio
Usage: Istio reduces the complexity of managing microservice deployments by providing a uniform way to secure, connect, and monitor microservices.
Notes: Istio is not mandatory in this architecture, but it provides useful features such as enhanced monitoring, traffic encryption, and routing, as well as fault injection for testing the resilience of your application.

Apigee
Usage: Apigee is a managed API gateway. Using Apigee, you can securely design, publish, monitor, and monetize your APIs. You can use Apigee for exposing both public and private APIs. In a microservices architecture, public APIs are hosted on GCP, while the backend on-premises systems can be exposed as private APIs that are consumed only by microservices on GCP.
Notes: Apigee is not mandatory in this architecture, but it's recommended that all of your site's content be served by public APIs. An API gateway like Apigee provides many features for API management, such as quotas, versioning, and authentication.

Cloud Pub/Sub
Usage: Cloud Pub/Sub is a messaging and streaming system. If it's used in this architecture, it functions as a message bus between microservices, which allows asynchronous workflows between the microservices.
Notes: Cloud Pub/Sub isn't mandatory, but the publisher/subscriber pattern can help mitigate scaling problems.

Storage

Cloud Storage
Usage: Cloud Storage provides a unified object storage API. It's suitable for many use cases, such as serving website content, backing up and archiving data, and distributing large objects. For an ecommerce website, the main use for Cloud Storage is to store and serve static assets, such as product images and videos.
Notes: Cloud Storage's seamless scalability, coupled with Cloud CDN, makes it a good candidate for the microservices architecture described in this document. It's therefore recommended.

Cloud Firestore
Usage: Cloud Firestore is a fast, fully managed NoSQL document database.
Notes: Cloud Firestore isn't mandatory in this architecture, but it's well suited to several common use cases in ecommerce websites. For example, you can use it to store user shopping carts and user metadata.

Cloud SQL
Usage: Cloud SQL is Google's managed product for MySQL and PostgreSQL. These relational databases have a wide range of uses.
Notes: Cloud SQL is not mandatory in this scenario, but it's well suited to several common use cases in ecommerce websites. For example, you can use Cloud SQL to store orders, allowing easy aggregation and calculations.

Cloud Spanner
Usage: Cloud Spanner is a globally available, horizontally scalable, strongly consistent relational database. With an SLA for an uptime of 99.999% (for multi-regional instances), it guarantees very high availability of your business-critical data.
Notes: Cloud Spanner isn't mandatory in this scenario. Because it's a relational database that offers transactional consistency at a global scale, Cloud Spanner is well suited to ecommerce websites that cater to an international audience.

Cloud Memorystore
Usage: Cloud Memorystore for Redis is a managed version of Redis, an in-memory key-value database. As a low-latency database, Redis is well suited to data that's accessed frequently.
Notes: Cloud Memorystore isn't mandatory in this scenario, but it's well suited to several use cases common to websites, such as storing user session information and providing an application cache.

You might use other GCP products as well, but this list represents the most common ones found in an ecommerce architecture. A natural evolution of this architecture is to add components that gather intelligence from your data, using products like Cloud Bigtable, BigQuery, Cloud AutoML, and AI Platform.

You might also keep using some of your current technologies in this target architecture, either because they're still relevant to you, or because the cost of moving them is too high. Here are a few examples of how you can run common ecommerce technologies on GCP:

  • Google Cloud partners can manage your Oracle workloads so that there is submillisecond latency between those workloads and your GCP infrastructure. This also lets you reuse your existing licenses.
  • Thanks to the partnership between SAP and GCP, you can run a wide range of SAP workloads on GCP, including HANA and Hybris.
  • You can run Apache Solr on GKE, or use some of the Solr solutions available on Google Cloud Platform Marketplace.
  • You can run Elasticsearch on GKE (for example, using the GCP Marketplace solution) or use the managed service by Elastic built on GCP.

Apigee, Cloud VPN, and Cloud Interconnect

One of the most important decisions that you must make early in this project is how to handle communication between the new microservices hosted on GKE and your legacy system on-premises. There are two main solutions, and they can coexist: API-based communication, or communication over a private network connection.

In the API-based solution, you use an API management solution such as Apigee as a proxy between the two environments. This gives you precise control over what portions of your legacy systems you expose and how you expose them. It also lets you seamlessly refactor the implementation of an API (that is, moving from a legacy service to a microservice) without impacting the consumers of the API. The following diagram shows the interactions between Apigee, your on-premises systems, and your GCP systems.

Apigee as proxy in front of a combination of on-premises and GCP-based systems

Patterns for deploying Kubernetes APIs at scale with Apigee and the Apigee ebook Beyond ESB Architecture with APIs can help you with this design.

In a solution based on private connectivity, you connect your GCP and on-premises environments using a private network connection. The microservices communicate with your legacy systems over this connection. You can set up IPsec-based VPN tunnels with Cloud VPN. For larger bandwidth needs, Cloud Interconnect provides a highly available, low-latency connection. See Choose an interconnect type for an overview of the different options.

The following diagram shows the interactions between your on-premises systems and your GCP systems through Cloud VPN or Cloud Interconnect.

Cloud Interconnect or Cloud VPN connecting an on-premises system and a GCP-based system

An API-based solution is implemented by the application teams. It requires deeper integration with the legacy application from the beginning of the project than a solution based on private connectivity does, but it provides more management options in the long term. A solution based on Cloud VPN or Cloud Interconnect is implemented by a networking team and initially requires less integration with the legacy application. However, it doesn't provide any added value beyond connectivity in the long term.

The migration process

This section outlines the high-level steps that you need to follow to migrate to the new architecture.

Prerequisites

Before moving the first microservice to GKE, you need to prepare the GCP environment that you'll work in. Address the following before you put your first microservice in production on GKE:

  • Set up your GCP Organization, which is the global environment that will host your GCP resources. As part of this step, you also configure your Google identities—the accounts that the employees of your organization need in order to use Google products. This process is outlined in Best practices for enterprise organizations.
  • Design your GCP policies for control over your GCP resources. The article Policy design for enterprise customers will help you.
  • Create a plan to deploy your GCP resources, including the GKE clusters, using infrastructure as code (IaC). This will give you standardized, reproducible, and auditable environments. Tools such as Cloud Deployment Manager or Terraform are recommended for this. All the resources for IaC on GCP are available on the Infrastructure as code page.
  • Study the different GKE features and tweak them as needed. For a business-critical application, you might want to change some of the defaults and harden your clusters. Preparing a GKE environment for production and Hardening your cluster's security contain information on how to achieve this.
  • Build your continuous integration/continuous delivery (CI/CD) tooling for Kubernetes. You can use Cloud Build to build your container images, and Container Registry to store them and to detect vulnerabilities. You can also combine those products with your existing CI/CD tools. Take advantage of this work to implement the best practices for building and operating containers. Doing this early in the project will avoid problems when you're in production.
  • Depending on the option you choose, set up your Apigee account, or set up a private connection between GCP and your on-premises data center with Cloud VPN or Cloud Interconnect.

Migrating in stages

You should migrate the features of your ecommerce website one by one to the new environment, creating microservices when required. Those microservices can call back to the legacy system when needed.

This approach transforms one major migration and refactoring project into several smaller projects. Proceeding this way has two advantages:

  • Each smaller project is more easily bounded and easier to reason about than the overall migration project. Migrating the whole website in a single effort requires teams to understand more about interactions between systems, constraints, third-party systems that depend on the website being available, and so on. This increases the risk of errors.
  • Having many smaller projects gives you flexibility. If you have a smaller team, the team can tackle those projects one after the other without being overwhelmed. If you have several teams, you might be able to parallelize some of the work and lead several migrations at a time.

Choosing which features to migrate and when to migrate them is one of the most important decisions you will make during this stage of the project. When making this decision, you must take into account the web of dependencies between features. Some features might heavily depend on others to work correctly, while others might be fairly independent. The fewer dependencies a feature has, the easier it is to migrate. The issue of dependencies, along with other factors to consider when you decide which feature to migrate, is covered in more detail later in the section Which features should you migrate first?

Example: Migrating a shopping cart

To illustrate the migration process, this section runs through the migration of a single feature, namely the shopping cart of an ecommerce website.

To understand the dependencies of this feature, consider a typical user journey and how it can be implemented in a monolithic application:

  1. The user is browsing the website and finds an item they're interested in.
  2. The user clicks Add to my shopping cart. This action triggers an API call from the user's browser to the shopping-cart feature. This is the first dependency: the frontend is acting on the shopping cart.
  3. When the shopping-cart feature receives the call, it checks whether the item is in stock. This event triggers an API call from the shopping-cart feature to the system that handles stock. This is the second dependency: the shopping cart depends on the stock subsystem.
  4. If the item is in stock, the shopping-cart feature stores information like "user A has 1 instance of item X in their cart." This is the third dependency: the shopping cart needs a database to store this information.
  5. When the user checks out and goes through the payment process, the shopping cart is queried by the payment subsystem to compute the total. When the payment is complete, the payment subsystem notifies the shopping-cart feature to empty the shopping cart. This is the fourth dependency: the shopping cart is queried by the payment subsystem.

To summarize, the shopping-cart feature is called by the frontend and the payment subsystem, and it queries a database and the stock subsystem.

A document database is well suited for storing shopping carts. You don't need the power of relational databases for shopping-cart data, and shopping carts can easily be indexed by user IDs. Cloud Firestore is a managed and serverless NoSQL document database, and it's particularly well suited for this use case, so that's the data store suggested for the target architecture.
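
As a hedged sketch of what the new microservice's data layer might look like, the following Python code stores and retrieves carts in Cloud Firestore, keyed by user ID. The "carts" collection name and the document structure are assumptions, not part of the reference architecture.

from google.cloud import firestore

# Minimal cart storage keyed by user ID. The "carts" collection name and the
# document fields are assumptions.
db = firestore.Client()

def save_item(user_id, item_id, quantity):
    # merge=True updates only the given fields and creates the document
    # if it doesn't exist yet.
    db.collection("carts").document(user_id).set(
        {"items": {item_id: quantity}}, merge=True
    )

def get_cart(user_id):
    snapshot = db.collection("carts").document(user_id).get()
    return snapshot.to_dict() or {}

def clear_cart(user_id):
    # Called by the payment subsystem after a successful checkout.
    db.collection("carts").document(user_id).delete()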

You can migrate the shopping-cart data in several ways. (This is discussed in more detail later in the section Data migration strategies.) This document assumes that the scheduled maintenance approach described later is acceptable. Under these conditions, you can migrate the shopping-cart feature by following these high-level steps:

  1. Create a new microservice that implements a shopping-cart API. Use Cloud Firestore to store the shopping carts. Make sure that this new microservice can call the stock subsystem.
  2. Create a script that extracts the shopping carts from the legacy shopping-cart subsystem and writes them to Cloud Firestore. Write the script so you can rerun it as many times as needed, and have it copy only the shopping carts that changed since the last execution. (A sketch of such a script follows this list.)
  3. Create a script that does the same thing, but the other way around: it copies shopping carts from Cloud Firestore to your legacy system. You use this script only if you need to roll back the migration.
  4. Expose the shopping-cart API with Apigee.
  5. Prepare and test the modifications to the frontend and the payment subsystem so they can call your new shopping-cart microservice.
  6. Run the data-migration script you created in step 2.
  7. Put your website in maintenance mode.
  8. Rerun the data-migration script.
  9. Deploy the modifications from step 5 to your legacy production system.
  10. Disable maintenance mode in your website.

The shopping-cart feature of your website is now a microservice hosted on GCP. Step 5 is probably the hardest step in this process, because it requires you to modify the legacy system. However, this step gets easier as you migrate more features to microservices, because more and more of the dependencies will already be microservices. As such, they are more loosely coupled than in a monolithic application, and they are easier to modify and deploy.

As an alternative migration, you could have the new microservice call back to the original shopping-cart database, and then migrate the data to Cloud Firestore. To choose between those two migration models, consider the latency between the microservice and the original database, the complexity of the schema refactoring, and the data-migration strategy you want to adopt.

Which features should you migrate first?

This section helps you identify the features to migrate first—the initial migration effort. A divide-and-conquer approach is always preferable in order to reduce the inherent risks of complex migrations.

When you plan your migration, it's tempting to start with features that are trivial to migrate. This might represent a quick win, but might not be the best learning experience for your team. Instead of going straight to the migration, you should spend time evaluating all of the features and creating plans for their migration.

You can use the following list of evaluation areas as a framework for your feature-by-feature analysis:

  • Business processes
  • Design and development
  • Operations
  • People and teams

Business processes evaluation

The migration team will be learning and developing processes during the initial migration efforts, and they'll probably make mistakes. Because of this, these initial efforts shouldn't involve business-critical systems (to avoid impacting the main line of business) but they should represent significant use cases (for the team to have a learning opportunity). When you evaluate business processes, also consider processes that are related to compliance and licensing, not just the development processes.

Design and development evaluation

From the design and development point of view, ideal initial migration efforts are the ones that have the least number of dependencies on other features and data, and which are the easiest to refactor if necessary.

You should analyze each feature, considering its dependencies and the effort needed for refactoring. For the dependency analysis of each feature, look at the following:

  • The type of the dependency—whether the feature depends on data or on other features.
  • The scale of the dependency—how many features might be impacted by a change in the dependencies of this feature.

Migrating a feature with heavy data dependencies is usually a nontrivial task for these reasons:

  • If you decide to migrate features first and migrate the related data later, you must account for the increased network latency between producers, consumers, and datastores after the initial migration effort.
  • Data integrity and synchronization challenges will arise during the migration phase, because you might be temporarily reading from and writing data to multiple locations.

Another evaluation is the amount of refactoring needed for each feature. Potential refactoring depends on both the current design of the feature and its future direction. In your estimate, consider how that effort could have an impact on the related business processes.

Here are the most important questions you need to answer during this evaluation:

  • Which data does this feature use?
  • How much data does this feature use?
  • Does this feature need other features in order to work properly?
  • How many other features are affected by a change in this feature?
  • Are there any networking or connectivity requirements for this feature?
  • How does the current design of this feature impact refactoring?

Operations evaluation

From the operations point of view, you should also take into account which features can afford the downtime of a cut-over window. If you need to minimize downtime, migrating features that require high availability can mean extra work.

People and teams evaluation

Preferably, choose teams that have well-defined processes to lead the initial migration efforts. Additionally, the teams should be willing to pave the way for the migration journey and understand that they will encounter new challenges for which they must find solutions.

Choosing an initial migration effort

According to this evaluation framework, the ideal candidate for the initial migration effort should be challenging enough to be meaningful, but simple enough to minimize the risk of failure. The initial migration process should also:

  • Require little refactoring, considering both the feature itself and the related business processes.
  • Be stateless—that is, have no external data requirements.
  • Have few or no dependencies.

A migration plan example

The following list shows an example of a migration order:

  1. Platform frontend; that is, the user interface
  2. Stateless features, such as a currency-conversion service
  3. Features with independent datasets (datasets that have no dependencies on other datasets), such as a service to list your brick-and-mortar stores
  4. Features with shared datasets—the business logic of the ecommerce platform

Platform frontend and stateless features

Platform frontends and stateless features usually have few dependencies. Both are ideal initial migration candidates because they are nontrivial components of the architecture, yet they require limited refactoring: in the initial migration phase, the backend API is still served from the legacy data center or from the runtime environment hosted by another cloud provider.

For both platform frontend features and stateless features, focus on the integration and deployment pipeline. Because your GKE workloads must be containerized, you might have to do more operational work.

Features with independent datasets

The next components to migrate are features whose datasets are independent from other datasets. These independent datasets are easier to extract from your legacy system than datasets that have dependencies. (Of course, when compared to migrating stateless features, migrating features that have independent datasets requires additional work, namely creating and managing the new data store along with actually migrating the data.)

When you're planning data migration, you have a choice of storage systems. Because you want to modernize your applications, you can use the following:

  • Managed data storage services like Cloud Storage or Cloud Filestore to store files
  • Cloud SQL to migrate data from an RDBMS
  • Cloud Firestore to migrate data from a NoSQL database

Features with shared datasets

Features with shared datasets are the hardest to migrate. This is because, as detailed later, migrating data is a challenging task due to the requirements for consistency, distribution, access, and latency.

Data migration strategies

When migrating data, you can follow this general approach:

  1. Transfer data from the legacy site to the new site.
  2. Resolve any data integration issues that arise—for example, synchronizing the same data from multiple sources.
  3. Validate the data migration.
  4. Promote the new site to be the master copy.
  5. When you no longer need the legacy site as a fallback option, retire it.

You should base your data migration strategy on the following questions:

  • How much data do you need to migrate?
  • How often does this data change?
  • Can you afford the downtime represented by a cut-over window while migrating data?
  • What is your current data consistency model?

There is no best approach; choosing one depends on the environment and on your requirements.

The following sections present four data migration approaches:

  • Scheduled maintenance
  • Continuous replication
  • Y (writing and reading)
  • Data-access microservice

Each approach tackles different issues, depending on the scale and the requirements of the data migration.

The data-access microservice approach is the preferred option in a microservices architecture. However, the other approaches are useful for data migration. They're also useful during the transition period that's necessary in order to modernize your infrastructure to use the data-access microservice approach.

The following graph outlines the respective cut-over window sizes, refactoring effort, and flexibility of each of these approaches.

Bar graph with each bar showing relative values for flexibility, refactoring effort, and cut-over window size for each of the four approaches

Before following any of these approaches, make sure that you've set up the required infrastructure in the new environment.

Scheduled maintenance

The scheduled maintenance approach is ideal if your workloads can afford a cut-over window. (It's scheduled in the sense that you can plan when your cut-over window occurs.)

In this approach, your migration consists of these steps:

  1. Copy data that's currently in the legacy site to the new site. This initial copy minimizes the cut-over window; after it, you need to copy only the data that has changed in the meantime.
  2. Perform data validation and consistency checks to compare data in the legacy site against the copied data in the new site.
  3. Stop the workloads and services that have write access to the copied data, so that no further changes occur.
  4. Synchronize changes that occurred after the initial copy.
  5. Refactor workloads and services to use the new site.
  6. Start your workloads and services.
  7. When you no longer need the legacy site as a fallback option, retire it.

The scheduled maintenance approach places most of the burden on the operations side, because minimal refactoring of workloads and services is needed.

Continuous replication

Because not all workloads can afford a long cut-over window, you can build on the scheduled maintenance approach by providing a continuous replication mechanism after the initial copy and validation steps. When you design a mechanism like this, you should also take into account the rate at which changes are applied to your data; it might be challenging to keep two systems synchronized.

The continuous replication approach is more complex than the scheduled maintenance approach. However, the continuous replication approach minimizes the time for the required cut-over window, because it minimizes the amount of data that you need to synchronize. The sequence for a continuous replication migration is as follows:

  1. Copy data that's currently in the legacy site to the new site. This initial copy minimizes the cut-over window; after it, you need to copy only the data that has changed in the meantime.
  2. Perform data validation and consistency checks to compare data in the legacy site against the copied data in the new site.
  3. Set up a continuous replication mechanism from the legacy site to the new site.
  4. Stop the workloads and services that have access to the data to migrate (that is, to the data involved in the previous step).
  5. Refactor workloads and services to use the new site.
  6. Wait for the replication to fully synchronize the new site with the legacy site.
  7. Start your workloads and services.
  8. When you no longer need the legacy site as a fallback option, retire it.

As with the scheduled maintenance approach, the continuous replication approach places most of the burden on the operations side.

Y (writing and reading)

If your workloads have hard high-availability requirements and you cannot afford the downtime represented by a cut-over window, you need to take a different approach. For this scenario, you can use an approach that in this document is referred to as Y (writing and reading), which is a form of parallel migration. With this approach, the workload is writing and reading data in both the legacy site and the new site during the migration. (The letter Y is used here as a graphic representation of the data flow during the migration period.)

This approach is summarized as follows:

  1. Refactor workloads and services to write data both to the legacy site and to the new site and to read from the legacy site.
  2. Identify the data that was written before you enabled writes in the new site and copy it from the legacy site to the new site. Along with the refactoring above, this ensures that the data stores are aligned.
  3. Perform data validation and consistency checks that compare data in the legacy site against data in the new site.
  4. Switch read operations from the legacy site to the new site.
  5. Perform another round of data validation and consistency checks to compare data in the legacy site against the new site.
  6. Disable writing in the legacy site.
  7. When you no longer need the legacy site as a fallback option, retire it.

Unlike the scheduled maintenance and continuous replication approaches, the Y (writing and reading) approach shifts most of the efforts from the operations side to the development side due to the multiple refactorings.
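
As an illustration of the refactoring in steps 1 and 4, the following Python sketch wraps cart access in a dual-write, switchable-read store. The adapter objects (legacy_store and new_store) and the module-level read flag are hypothetical; each workload that touches cart data would go through a wrapper like this instead of calling a datastore directly.

# Sketch of the Y (writing and reading) pattern for cart data.
# legacy_store and new_store are hypothetical adapter objects that expose the
# same read_cart/write_cart interface over the legacy and new datastores.
READ_FROM_NEW_SITE = False  # flipped to True at step 4 of the approach

class DualWriteCartStore:
    def __init__(self, legacy_store, new_store):
        self.legacy = legacy_store
        self.new = new_store

    def write_cart(self, user_id, cart):
        # Step 1: every write goes to both sites so the data stores stay aligned.
        self.legacy.write_cart(user_id, cart)
        self.new.write_cart(user_id, cart)

    def read_cart(self, user_id):
        # Step 4: reads switch from the legacy site to the new site.
        store = self.new if READ_FROM_NEW_SITE else self.legacy
        return store.read_cart(user_id)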

Data-access microservice

If you want to reduce the refactoring effort necessary to follow the Y (writing and reading) approach, you can centralize data read and write operations by refactoring workloads and services to use a data-access microservice. This scalable microservice becomes the only entry point to your data storage layer, and it acts as a proxy for that layer. Of the approaches discussed here, this gives you the maximum flexibility, because you can refactor this component without impacting other components of the architecture and without requiring a cut-over window.

Using a data-access microservice is much like the Y (writing and reading) approach. The difference is that the refactoring efforts focus on the data-access microservice alone, instead of having to refactor all the workloads and services that access the data storage layer. This approach is summarized as follows:

  1. Refactor the data-access microservice to write data both in the legacy site and the new site. Reads are performed against the legacy site.
  2. Identify the data that was written before you enabled writes in the new site and copy it from the legacy site to the new site. Along with the refactoring above, this ensures that the data stores are aligned.
  3. Perform data validation and consistency checks comparing data in the legacy site against data in the new site.
  4. Refactor the data-access microservice to read from the new site.
  5. Perform another round of data validation and consistency checks comparing data in the legacy site against data in the new site.
  6. Refactor the data-access microservice to write only to the new site.
  7. When you no longer need the legacy site as a fallback option, retire it.

Like the Y (writing and reading) approach, the data-access microservice approach places most of the burden on the development side. However, it's significantly lighter compared to the Y (writing and reading) approach, because the refactoring efforts are focused on the data-access microservice.
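
To show the single-entry-point idea concretely, here is a minimal Python sketch of a data-access microservice exposed over HTTP. The use of Flask, the endpoint paths, and the in-memory placeholder for the storage layer are all assumptions; in practice the service would wrap the legacy datastore and the new one, as in the previous sketch.

from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder for the storage layer. In a real migration this object would
# encapsulate the dual-write and read-switch logic from the previous sketch,
# so that only this microservice ever needs to be refactored.
_carts = {}

@app.route("/carts/<user_id>", methods=["GET"])
def read_cart(user_id):
    return jsonify(_carts.get(user_id, {}))

@app.route("/carts/<user_id>", methods=["PUT"])
def write_cart(user_id):
    _carts[user_id] = request.get_json()
    return "", 204

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)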

Best practices for microservices

The following sections include a set of best practices to follow when designing, writing, and deploying microservices.

Designing microservices

When you design your microservices, follow a design pattern that lets you properly determine the context and the boundary for each microservice. This protects you from unnecessary fragmentation of your microservice workloads. It also lets you define a precise context (or domain) where your microservices are valid. An example of such a design pattern is domain-driven design (DDD).

API contracts

Each microservice should be invoked only through a set of interfaces, and each interface should be clearly defined by a contract that can be implemented using an API definition language like the OpenAPI Specification (formerly known as Swagger) or RAML. Having well-defined API contracts and interfaces allows you to develop tests as a main component of your solution (for example, by applying test-driven development) against these API interfaces.
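
For example, once the cart contract is defined, a contract test along the following lines can be written against it, even before the microservice is implemented. The base URL, endpoint path, and required fields are hypothetical and would come from your OpenAPI or RAML definition.

import requests

# Contract test (pytest style) for a hypothetical cart API. The base URL,
# path, and required fields would normally be derived from the API definition.
BASE_URL = "https://api.example.com/api/v1"

def test_get_cart_matches_contract():
    response = requests.get(f"{BASE_URL}/carts/test-user")
    assert response.status_code == 200
    assert response.headers["Content-Type"].startswith("application/json")
    body = response.json()
    # Fields that the (hypothetical) contract declares as required.
    assert "items" in body
    assert isinstance(body["items"], list)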

Managing changes

If you must introduce breaking changes in your API contracts, you should prepare in advance in order to minimize the refactoring effort that will be required for clients to cope with the changes. Two ways to deal with this issue are:

  • Versioning
  • Implementing a new microservice

Versioning

To give you flexibility in managing updates that might break existing clients, you should implement a versioning scheme for your microservices. Versioning lets you deploy updated versions of a microservice without affecting the clients that are using an existing version. If you use versioning, you must create a new version every time you introduce a change that breaks any part of an existing contract, even if the change is minimal.

You can implement versioning using one of the following schemes:

  • Global versioning
  • Resource versioning

When you implement a global versioning scheme, you version the entire API as a unit. One implementation is to build the version information into the resource URIs, as in the following examples:

/api/v1/entities or api.v3.domain.tld/entities

This kind of versioning scheme is easy to deploy, because versioning is managed in one place. But it's not as flexible as a resource versioning scheme. For an example of a global versioning scheme, see the Kubernetes API versioning scheme.
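
As a hedged sketch (using Flask, which is an assumption, as are the resource and field names), global versioning with URIs can look like the following: every version of the API gets its own URL prefix, and existing clients keep calling the old prefix until they're ready to move.

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1/entities/<entity_id>")
def get_entity_v1(entity_id):
    # Original contract.
    return jsonify({"id": entity_id, "name": "Example entity"})

@app.route("/api/v2/entities/<entity_id>")
def get_entity_v2(entity_id):
    # Breaking change (the name field was split), so it ships as a new version
    # of the whole API rather than silently changing v1.
    return jsonify({"id": entity_id, "display_name": "Example entity",
                    "slug": "example-entity"})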

A resource versioning scheme allows you to independently version every resource served by your microservice. For maximum flexibility, you can even version a given resource according to the operation on that resource (for example, the HTTP verb). You can implement this versioning scheme in various ways. You can use URIs or you can use custom or standard HTTP request headers, like the following example:

Accept: application/tld.domain.entities.v2+json

A resource versioning scheme can be harder to design than a global versioning scheme, because you need a generalized approach for all your entities. However, it's more flexible, because you can implement different and independent upgrade policies for your resources.
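
Resource versioning through the Accept header can be dispatched as in the following sketch (again using Flask; the media types mirror the example above and, like the payloads, are assumptions).

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/entities/<entity_id>")
def get_entity(entity_id):
    accept = request.headers.get("Accept", "")
    if "application/tld.domain.entities.v2+json" in accept:
        # v2 applies to this resource only; other resources can follow their
        # own, independent upgrade policies.
        return jsonify({"id": entity_id, "display_name": "Example entity"})
    # Clients that don't request a specific version get v1.
    return jsonify({"id": entity_id, "name": "Example entity"})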

Implementing a new microservice

Another approach to dealing with breaking changes is to implement a brand-new microservice that has a new contract. You might do this if the new version of the microservice requires so many changes that it makes more sense to write a new service than to update the existing one. Although this approach gives you maximum flexibility, it can result in microservices with overlapping functionality and contracts. After you implement the new microservice, you can gradually refactor clients to use it and then retire the old microservice.

What's next?
