Migrate to Google Cloud: Best practices for validating a migration plan

Last reviewed 2023-11-29 UTC

This document describes the best practices for validating the plan to migrate your workloads to Google Cloud. This document doesn't list all of the possible best practices for validating a migration plan, and it doesn't give you guarantees of success. Instead, it helps you to stimulate discussions about potential changes and improvements to your migration plan.

This document is useful if you're planning a migration from an on-premises environment, from a private hosting environment, or from another cloud provider to Google Cloud. The document is also useful if you're evaluating the opportunity to migrate and want to explore what it might look like.

This document is part of a multi-part series about migrating to Google Cloud.

Assessment

Performing a complete assessment helps you to develop a deep understanding of your workloads and environments. This understanding helps you to minimize the risk of issues during and after your migration to Google Cloud.

Make a complete assessment

Before you proceed with the steps that follow the assessment phase, complete the assessment of your workloads and environments. To make a complete assessment, consider the following items, which are often overlooked:

  • Inventory: Ensure that the inventory of the workloads to migrate is up to date and that you completed the assessment. For example, consider how fresh and reliable the source data is for your assessment, and what gaps might exist in the data.
  • Downtimes: Assess which workloads can tolerate downtime, and the maximum downtime that each of those workloads can tolerate. Migrating workloads with zero or nearly zero downtime is harder than migrating workloads that can tolerate downtime. To complete a zero-downtime migration, you need to design for and implement redundancy for each workload to migrate. You also need to coordinate these redundant instances.

    When you assess how much downtime a workload can tolerate, assess whether the business benefit of a zero-downtime migration is greater than the added migration complexity. Where possible, avoid creating a zero-downtime requirement for a workload.

  • Clustering and redundancy: Assess which workloads support clustering and redundancy. If a workload supports clustering and redundancy, you can deploy multiple instances of that workload, even across different environments, such as the source environment and the target environment. Clustered and redundant deployments might simplify the migration because those workloads coordinate with each other with limited intervention.

  • Configuration updates: Assess how you update the configuration of your workloads. For example, consider how you deliver updates to the configuration of each workload that you want to migrate. This consideration is critical for the success of your migration because you might have to update the configuration of your workloads while you migrate them to the target environment.

  • Generate multiple assessment reports: During the assessment phase, it might be useful to generate more than one assessment report to account for different scenarios. For example, you can generate reports to take into account different load profiles for your workloads, such as peak and off-peak times.

Assess the failure modes that your workloads support

Knowing how your workloads behave under exceptional circumstances helps you to ensure that you don't expose them to conditions from which they can't recover. As part of the assessment, gather information about the failure modes that your workloads support and the effects of those failure modes. Identify which failure modes your workloads can automatically recover from, and which failure modes need your intervention. For example, you can start by considering questions about possible failure modes, such as the following:

  • What happens if a workload loses connectivity to the network?
  • Is a workload able to resume its work from where it left off after being stopped?
  • What happens if the performance of a workload or its dependencies is inadequate?
  • What happens if there are two workloads that have the same identifier in the architecture?
  • What happens if a scheduled task doesn't run?
  • What happens if two workloads process the same request?

Another source for unsupported failure modes might be the migration plan itself. Determine whether your migration plan includes steps that depend on the success of a particular condition and whether it includes contingencies if the condition is not met. A plan that includes these types of conditions can indicate that the plan itself might fail or that individual components might fail during migration.

After you assess those failure modes and their effects, validate your findings in a non-critical environment by simulating failures and injecting faults that emulate those failure modes. For example, if a workload is designed to automatically recover after a network connectivity loss, validate the automatic recovery by forcibly interrupting its connectivity and restoring it afterwards.
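The connectivity example can be sketched as a small fault-injection test. This is a minimal illustration, not a real test harness: `FlakyNetwork`, `send_with_recovery`, and the retry count are hypothetical names and values chosen for the sketch, and a real workload would inject the fault at the network layer of a non-critical environment.

```python
class FlakyNetwork:
    """Hypothetical stand-in for a network dependency whose faults you control."""

    def __init__(self):
        self.connected = True

    def send(self, payload):
        if not self.connected:
            raise ConnectionError("injected fault: network unreachable")
        return f"ack:{payload}"


def send_with_recovery(network, payload, attempts, before_retry=None):
    """Try to send up to `attempts` times and return the acknowledgement.

    `before_retry` is a hook that the test uses to restore connectivity.
    The last ConnectionError is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return network.send(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            if before_retry is not None:
                before_retry(attempt)


def run_fault_injection_test():
    network = FlakyNetwork()
    network.connected = False  # Inject the fault: interrupt connectivity.

    def restore_on_second_failure(attempt):
        if attempt == 1:
            network.connected = True  # Restore connectivity afterwards.

    return send_with_recovery(
        network, "order-42", attempts=5, before_retry=restore_on_second_failure
    )
```

If the workload recovers as designed, `run_fault_injection_test()` returns the acknowledgement. If connectivity is never restored, the error surfaces, which tells you that this failure mode needs your intervention.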

Assess your data processing pipelines

Your workload assessment should be able to answer the following questions:

  • Are resources correctly sized for the migration?
  • How much time is required to migrate the data that your workloads need?
  • Can the target environment accommodate the full volume of data?
  • How do your workloads behave when they have to accommodate spikes in demand or spikes in the amount of data that they produce in a given time window?
  • If there are spikes in demand or spikes in the amount of data that your workloads produce, is there any adverse effect, such as increased latency or delays in responses?
  • After your workloads start, do they need time to ramp up to the expected levels of performance?

The results of this assessment are often models of the demand that your workloads satisfy and the data that the workloads produce in a given time window. When you gather data points to produce such models, consider that those data points might vary significantly between peak and non-peak time windows. For more information about how and what to monitor, see Service Level Objectives in the Site Reliability Engineering book.
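As a rough starting point for the question about data migration time, you can estimate transfer time from the data volume and the available bandwidth. The following sketch assumes decimal units (1 GB = 8,000 megabits) and a hypothetical `utilization` factor that represents the share of the link that the migration can actually use. Replace both inputs with values that you measure in your own environment.

```python
def transfer_hours(data_gb, bandwidth_mbps, utilization=0.7):
    """Estimate the hours needed to copy `data_gb` gigabytes over a link.

    `bandwidth_mbps` is the nominal link speed in megabits per second.
    `utilization` (assumed value) is the fraction of that speed that the
    migration traffic can sustain.
    """
    if data_gb < 0 or bandwidth_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("invalid input")
    megabits = data_gb * 8 * 1000  # Decimal units: 1 GB = 8,000 megabits.
    seconds = megabits / (bandwidth_mbps * utilization)
    return seconds / 3600


# 10 TB over a 1 Gbps link at 80% sustained utilization.
estimate = transfer_hours(10_000, 1_000, utilization=0.8)
```

With these example inputs, the estimate is a little under 28 hours. Because demand varies between peak and non-peak windows, it can help to compute the estimate for both.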

Ensure that you can update and deploy each workload to migrate

During the migration, you might need to update some of the workloads that you're migrating. For example, you might need to deploy a fix for an issue, or roll back a recent change that is causing an issue. For each workload that you're migrating, ensure that you can apply and deploy changes. For example, if you're migrating a workload for which you have the source code, ensure that you can access that source code, and that you can build, package, and deploy the source code as needed.

Your migration might include workloads that you can't apply and deploy changes to (such as proprietary software). In that scenario, refactor your migration plan to consider additional effort to mitigate the issues that might occur after you migrate those workloads.

Assess your network infrastructure

A functional network infrastructure is fundamental for the migration. You can use the network infrastructure as part of your migration tooling. For example, you can use load balancers and DNS servers to direct traffic according to your migration plan.

To avoid issues during the migration, it's important to assess your network infrastructure and evaluate to what extent it can support your migration. For example, you can start by considering questions about your load-balancing infrastructure, such as the following:

  • What happens when you reconfigure your load balancers?
  • How long does it take for the updated configuration to be in effect?
  • When migrating with zero downtime, what happens if you get a spike of traffic before the updated configuration is in place?

After you consider questions about your load-balancing infrastructure, next consider questions about your DNS infrastructure, such as the following:

  • Which DNS records should you update to point them to the target environment, and when should you update them?
  • Which clients are using those DNS records?
  • How is the time to live (TTL) configured for the DNS records to update?
  • Can you set the DNS record TTL to its minimum during the migration?
  • Do your DNS clients respect the TTL of the DNS records to update? For example, do your applications have client-side DNS caching that ignores the TTL that you've configured for the migration?
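The TTL questions translate into a simple timeline. The following sketch assumes that resolvers and clients respect the TTL: a cached record can live for up to the current TTL, so you lower the TTL at least that long before the cutover, and after the cutover clients converge within the lowered TTL. The function name and the example values are hypothetical, and clients that ignore the TTL aren't covered by this calculation.

```python
from datetime import datetime, timedelta, timezone


def ttl_schedule(cutover, current_ttl_s, migration_ttl_s):
    """Return (lower_ttl_by, converged_by) for a planned DNS cutover.

    lower_ttl_by: latest time to publish the lowered TTL so that records
        cached with the current TTL have expired before the cutover.
    converged_by: time by which TTL-respecting clients see the new record.
    """
    lower_ttl_by = cutover - timedelta(seconds=current_ttl_s)
    converged_by = cutover + timedelta(seconds=migration_ttl_s)
    return lower_ttl_by, converged_by


cutover = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
lower_ttl_by, converged_by = ttl_schedule(
    cutover, current_ttl_s=86_400, migration_ttl_s=300
)
```

In this example, a record with a one-day TTL needs its TTL lowered by 12:00 UTC on the day before the cutover, and clients that respect the five-minute TTL converge by 12:05 UTC on the cutover day.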

Migration planning

Thoroughly planning your migration helps you to avoid issues during and after the migration. Planning also helps you to avoid spending effort on unanticipated tasks.

Develop a rollback strategy for each step of the migration plan

During the migration, any step of the migration plan that you execute might result in unanticipated issues. To ensure that you're able to recover from those issues, prepare a rollback strategy for each step of the migration plan. To avoid losing time during an outage, do the following:

  • Ensure that your rollback strategies work by periodically reviewing and testing each rollback strategy.
  • Set a maximum-allowed execution time for each migration step. After this allowed execution time expires, your teams start rolling back the migration step.
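The maximum-allowed execution time can be expressed as a small decision rule that your teams or tooling evaluate while a step is running. This is a minimal sketch: `MigrationStep`, `next_action`, and the action labels are hypothetical names for illustration, not part of any Google Cloud tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MigrationStep:
    name: str
    started_at: datetime
    max_duration: timedelta  # Maximum-allowed execution time for the step.
    completed: bool = False


def next_action(step, now):
    """Decide what to do with a running migration step at time `now`."""
    if step.completed:
        return "proceed"
    if now - step.started_at >= step.max_duration:
        return "roll back"  # The allowed execution time has expired.
    return "wait"


started = datetime(2024, 1, 1, 9, 0)
step = MigrationStep("cut over database", started, timedelta(hours=2))
action_after_one_hour = next_action(step, started + timedelta(hours=1))
action_after_two_hours = next_action(step, started + timedelta(hours=2))
```

Agreeing on the limit before the migration starts removes the need to debate, during an outage, whether to keep waiting or to roll back.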

Even if you have rollback strategies ready for each step of the migration plan, some of those steps might still be potentially disruptive. A potentially disruptive step might cause some kind of loss even if you roll it back, such as a data loss. Assess which steps of the migration plan are potentially disruptive.

If you automated any step of the migration plan, ensure that you have a preplanned procedure for each automated step if there is a failure in the automation. As with rollback strategies, periodically review and test each preplanned procedure.

If you set up communication channels as part of the migration, provision backup channels that you can use to recover from a failure, so that you aren't locked out of your environment. For example, if you're setting up Partner Interconnect, you can also set up backup access through the public internet during the migration in case you experience any issues during provisioning and configuration.

Plan for gradual rollouts and deployments

To reduce the scope of issues that might occur during the migration, avoid large-scale changes, and design your migration plan to gradually deploy changes. For example, plan for gradual deployments and configuration changes.

When you plan gradual rollouts, minimize the number and the size of the changes in each rollout to lower the risk of unanticipated issues. After you identify and resolve issues in your first small rollout, you can make the subsequent rollouts for similar changes at larger scales.
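A gradual rollout can be sketched as a loop over increasingly large stages that stops at the first unhealthy stage. The stage percentages and the `apply_stage` and `is_healthy` callbacks are assumptions for illustration. In practice they would wrap your load balancer or deployment tooling and your monitoring checks.

```python
def gradual_rollout(stages, apply_stage, is_healthy):
    """Apply each stage in order and stop at the first unhealthy one.

    Returns (completed_stages, failed_stage). `failed_stage` is None when
    every stage passes its health check.
    """
    completed = []
    for percent in stages:
        apply_stage(percent)
        if not is_healthy(percent):
            return completed, percent
        completed.append(percent)
    return completed, None


# Hypothetical example: an issue appears only above 25% of traffic.
applied = []
completed, failed_at = gradual_rollout(
    stages=[1, 5, 25, 50, 100],
    apply_stage=applied.append,
    is_healthy=lambda percent: percent <= 25,
)
```

In this example, the rollout halts at the 50% stage. The issue affects at most half of the traffic, and the earlier stages show that the change itself works at smaller scales.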

Alert development and operations teams

To reduce the impact of issues that might occur during a migration, alert the teams that are responsible for any workload to migrate. Also alert the teams that are responsible for the infrastructure of both the source and target environments.

If your teams work in different time zones, ensure the following:

  • Your teams properly cover those time zones and they cover multiple consecutive shifts, because they might be unable to resolve issues during a single shift.
  • Your teams are prepared to collect detailed information about the issues that they might face. This collection provides the engineers on the next shift a complete understanding of what the previous shift did, and why.
  • Specific people in your teams are responsible for any given shift.

Remove proof-of-concept resources from the target production environment

As part of the assessment, you might have used the target environment to host experiments and proofs of concept. Before the migration, remove any resources that you created during those experiments and proofs of concept from the production area of the target environment.

You can keep resources in a non-production area of the target environment while the migration is in progress because they might help you to gather information about any issue that might arise during the migration. For example, to diagnose issues that affect your production workloads after the migration, you can compare the configuration and data logs of the production workload against the configuration and data logs of the proofs of concept and experiments.

After you complete the migration and you validate that the target environment works as expected, you can delete the resources in the non-production area of the target environment.

Define criteria to safely retire the source environment

To avoid the cost of running two environments indefinitely, define what conditions must be met for you to safely retire the source environment, such as the following:

  • All workloads, including their backups and high availability and disaster recovery mechanisms, are successfully migrated to the target environment.
  • The data migrated to the target environment is consistent, accessible, and usable.
  • The accuracy and completeness of the migrated data fulfill the defined standard.
  • Resources that remain in the source environment aren't dependencies for workloads that are out of the migration scope.
  • The performance of your workloads on the target environment fulfills your SLA targets.
  • Your monitoring systems report that there isn't any network traffic to the source environment that should be directed to the target environment.
  • After the workloads are running without issue in the target environment for a period that you define, you are confident that you no longer need the ability to fall back to the source environment.
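One way to make criteria like these actionable is to track them as an explicit checklist and retire the source environment only when nothing is outstanding. The criterion names and their values in the following sketch are hypothetical. You would derive them from your own validation and monitoring results.

```python
def unmet_retirement_criteria(criteria):
    """Return the names of the criteria that are not yet met, in input order."""
    return [name for name, met in criteria.items() if not met]


# Hypothetical status of each criterion at a point during the migration.
criteria = {
    "workloads_and_backups_migrated": True,
    "data_consistent_and_usable": True,
    "no_out_of_scope_dependencies": True,
    "performance_meets_sla": True,
    "no_traffic_to_source": False,
    "fallback_period_elapsed": False,
}

unmet = unmet_retirement_criteria(criteria)
safe_to_retire = not unmet
```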

Operations

To efficiently manage the source environment and the target environment during the migration, you need to engineer your operational processes as well.

Monitor your environments

To observe how your source and target environments are behaving and to help you diagnose issues as they occur, set up the following:

  • A monitoring system to gather metrics that are useful to your scenario.
  • A logging system to observe the flow of operations that is performed by your workloads and other components of your environments.
  • An alerting system that warns you before a problematic event occurs.

Google Cloud Observability supports integrated monitoring, logging, and alerting for your Google Cloud environment.

Because a workload and its dependencies span multiple environments, you might need to consider using multiple monitoring and alerting tools for different environments. Consider the timing of when you migrate the monitoring and alerting policies that support the workloads. For example, if your source environment is configured to alert when a particular server is down, the alert triggers when you intentionally turn down that server. The alert trigger is expected, but it's unhelpful behavior. As part of the migration, you need to continuously adjust the alerts for the source environment and reconfigure them for the target environment.
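The server turn-down example can be sketched as a check that suppresses notifications for resources inside a planned turn-down window. This is an illustration of the idea only: the function, the resource names, and the tuple layout are assumptions, and your monitoring tool likely has its own mechanism for muting alerts that you should prefer.

```python
from datetime import datetime


def should_notify(resource, alert_time, planned_turndowns):
    """Return False when the alert falls inside a planned turn-down window.

    `planned_turndowns` is a list of (resource, start, end) tuples.
    """
    for planned_resource, start, end in planned_turndowns:
        if planned_resource == resource and start <= alert_time <= end:
            return False
    return True


# Hypothetical plan: server-a is intentionally turned down from 10:00 to 11:00.
planned = [("server-a", datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 11, 0))]

expected_alert = should_notify("server-a", datetime(2024, 1, 10, 10, 30), planned)
unexpected_alert = should_notify("server-b", datetime(2024, 1, 10, 10, 30), planned)
```

Scoping the suppression to a specific resource and time window keeps alerts for unplanned failures active during the migration.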

Manage the migration

To manage the migration, you review the performance of the migration to gather information that you can use in a retrospective after the migration is complete. After you gather information, you use it to analyze the migration performance and to prepare data points about potential improvements to your environments.

For example, to start planning to manage the migration, consider the following questions:

  • How long did each step of the migration plan take?
  • Were there any steps of the migration plan that took more time to complete than anticipated?
  • Were there any missing steps or checks?
  • Did any adverse events occur during the migration?
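To answer the first two questions, you can record the planned and actual duration of each step and compare them in the retrospective. The step names and durations in the following sketch are made-up example data, and `overruns` is a hypothetical helper.

```python
def overruns(steps):
    """Return (name, minutes_over) for steps that took longer than planned.

    `steps` is a list of (name, planned_minutes, actual_minutes) tuples.
    The result is sorted with the largest overrun first.
    """
    late = [
        (name, actual - planned)
        for name, planned, actual in steps
        if actual > planned
    ]
    return sorted(late, key=lambda item: item[1], reverse=True)


# Example data only: (step, planned minutes, actual minutes).
recorded_steps = [
    ("export data", 60, 95),
    ("import data", 120, 110),
    ("switch DNS", 15, 20),
]

late_steps = overruns(recorded_steps)
```

Here the export step overran by 35 minutes and the DNS switch by 5 minutes, which points the retrospective at the export step first.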

What's next