Migration overview

This document describes the process of migrating your database to Spanner. We describe the stages of migration and the tools that we recommend for each stage, depending on your source database and other factors. The recommended tools include Google Cloud products as well as third-party commercial and open source tools. Together, these tools help you accelerate migrations and reduce risk.

Any Spanner migration involves the following core stages:

Assess the complexity of your migration.
Migrate your schema.
Load sample data.
Migrate your application.
Test and tune your performance.
Migrate the data.
Validate the migration.
Configure cutover and failover mechanisms.

Within these stages, your migration plan can vary widely depending on factors such as your database source and size, downtime requirements, application code complexity, sharding schema, custom functions or transformations, and failover and replication strategy.

Migration tools

We recommend using the following tools to assist you in various stages of your migration, depending on your source database and other factors. Some tools only support certain source databases. For some steps of the process, no tool is available, so you complete those steps manually.

Spanner migration tool (SMT) is an open source tool that can perform basic assessments, schema conversion, and data migrations.
Database Migration Assessment (DMA) offers a basic assessment to migrate PostgreSQL to Spanner.
Datastream is a Google Cloud service that lets you read change data capture (CDC) events and bulk data from a source database and write to a specified destination.
Dataflow is a Google Cloud service that helps you to write a large amount of data to Spanner more efficiently using templates.
Bulk data migration is a Dataflow template that lets you migrate large MySQL data sets directly to Spanner.
Minimal downtime migration uses Datastream and Dataflow to migrate:
- Existing data in your source database.
- Stream of changes that are made to your source database during the migration.
Data Validation Tool (DVT) is a standardized data validation method built by Google and supported by the open source community. You can integrate DVT into existing Google Cloud products.

Migration tools for MySQL source databases

If your source database is MySQL, then you can perform some of the initial migration stages using MySQL dump files. You must directly connect to your running source MySQL database to complete a production migration.

The following table recommends migration tools based on the migration stage and whether you're working using a dump file or directly connecting your source database:

Migration stage	Dump file	Direct connection to source database
Assessment	Use SMT with `mysqldump`.	Use SMT with `mysqldump`.
Schema conversion	Use SMT with `mysqldump`.	Use SMT to configure and convert schema.
Sample data load	If your sample dump file is less than 100 GB, use SMT in POC mode. If your sample dump file is more than 100 GB, export the file to Cloud SQL and perform a bulk migration. If your sample dump file is in a CSV, Avro, or Parquet file format, load the file to BigQuery and copy to Spanner using reverse ETL.	Perform a bulk migration.
Data migration	Not applicable	Perform a bulk migration, then perform a minimal downtime migration.
Data validation	Not applicable	Use DVT.
Failover configuration	Not applicable	Use SMT for reverse replication.

Migration tools for PostgreSQL source databases

If your source database uses PostgreSQL, then you can perform some of the migration stages using a PostgreSQL dump file. You must directly connect to your running source PostgreSQL database to complete the migration.

The following table recommends migration tools based on the migration stage and whether you're working with a dump file or directly connecting from your source database:

Migration stage	Dump file	Direct connection to source database
Assessment	Use SMT with `pg_dump`.	Use DMA.
Schema conversion	Use SMT with `pg_dump`.	Use SMT to configure and convert schema.
Sample data load	If your sample dump file is less than 100GB, use SMT in POC mode. If your sample dump file is more than 100GB, export the file to Cloud SQL and perform a minimal downtime migration. If your sample dump file is in a CSV, Avro, or Parquet file format, load the file to BigQuery and copy to Spanner using reverse ETL.	Perform a minimal downtime migration.
Data migration	Not applicable	Perform a minimal downtime migration.
Data validation	Not applicable	Use DVT.
Failover	Not applicable	Not applicable

Assess the complexity of your migration

To assess the scope and complexity of your migration and plan your approach, you need to gather data about your source database, including the following:

Query patterns
Amount of application logic that's dependent on database features such as stored procedures and triggers
Hardware requirements
Total cost of ownership (TCO)

Migrate your schema

Before migrating a schema to a Spanner schema, assess the compatibility between the schemas, and optimize your schema for Spanner. For example, you might want to change keys, drop or add indexes, or add or remove columns of existing tables. To optimize your schema for Spanner, see Schema design best practices and Recommended primary key migration strategies.

Spanner migration tool, an open source, community-maintained tool created by Google developers, automatically builds a Spanner schema from your source database schema. You can customize the schema using the Spanner migration tool schema assistant.

Spanner migration tool ingests schema and data from one of the following locations:

A dump file from a local location or Cloud Storage (MySQL, PostgreSQL, CSV)
Directly from the source database (MySQL, PostgreSQL)

Spanner migration tool performs the following functions for schema assessments, recommendations, and migrations:

Data type compatibility assessment and recommendations
Primary key editing and recommendations
Secondary index editing and recommendations
Interleaving table editing and recommendations
General Spanner schema design recommendations
Schema versioning
Collaborative schema modification

For more information about schema migrations with the Spanner migration tool, see the Spanner migration tool README.md file.

You also use the Spanner migration tool for data migration.

Load sample data

After you create a Spanner compatible schema, you can prepare your database to be tested using sample data. You can use the BigQuery reverse ETL workflow to load the sample data. For more information, see Load sample data.

Migrate your application

A database migration requires different drivers and libraries, as well as compensation for features that Spanner doesn't support. To optimize to Spanner strengths, you might need to change your code, application flows, and architecture.

Here are some of the changes required to migrate your application to Spanner:

Spanner doesn't support running user code at the database level, so you need to move any procedures and triggers stored at the database level into the application.
Use Spanner client libraries and object-relational mappers (ORMs). For more information, see Overview of APIs, client libraries, and ORM drivers.
If you need to translate queries, translate them manually or use other third-party tools.
Take note of partitioned DML, read-only transactions, commit timestamps, and read timestamps and how they can optimize application performance.

You also might need to make changes to transaction handling. There's no tooling available to help with this, so you need to complete this step manually. Keep the following in mind:

The limit of mutations per commit is 40,000. Each secondary index on a table is an additional mutation per row. To modify data using mutations, see Insert, update, and delete data using mutations. To modify a large amount of data, use partitioned DML.
For transaction isolation level, no handling is required because Spanner transactions are more isolated.
Because Spanner is linearizable, it handles consistency and locking by default.

Test and tune your schema and application performance

Performance tuning is an iterative process in which you evaluate metrics like CPU utilization and latency based on a subset of your data, adjust your schema and application to improve performance, and test again.

For example, in your schema, you might add or change an index, or change a primary key. In your application, you might batch writes, or you might merge or modify your queries.

For production traffic in particular, performance tuning is important to help avoid surprises. Performance tuning is more effective the closer the setup is to live production traffic throughput and data sizes.

To test and tune your schema and application performance, follow these steps:

Upload a subset of your data into a Spanner database. For more information, see Migrate your data.
Point the application to Spanner.
Verify the correctness by checking for basic flows.
Verify that performance meets your expectations by performing load tests on your application. For help identifying and optimizing your most expensive queries, see Detect query performance issues with query insights. In particular, the following factors can contribute to suboptimal query performance:
1. Inefficient queries: For information about writing efficient queries, see SQL best practices.
2. High CPU utilization: For more information, see Investigate high CPU utilization.
3. Locking: To reduce bottlenecks caused by transaction locking, see Identify transactions that might cause high latencies.
4. Inefficient schema design: If the schema isn't designed well, query optimization isn't very useful.
5. Hotspotting: Hotspots in Spanner limit write throughput, especially for high-QPS applications. To identify hotspots or antipatterns, check the Key Visualizer statistics from the Google Cloud console. For more information about avoiding hotspots, see Choose a primary key to prevent hotspots.
If you modify schema or indexes, repeat correctness and performance testing until you achieve satisfactory results.

For more information about fine-tuning your database performance, contact Spanner support.

Migrate your data

After optimizing your Spanner schema and migrating your application, you move your data into an empty production-sized Spanner database, and then switch over to the Spanner database.

Depending on your source database, you might be able to migrate your database with minimal downtime, or you might require prolonged downtime.

For both minimal-downtime migrations and migrations with prolonged downtime, we recommend using Dataflow and the Spanner migration tool.

The following table shows the differences between minimal-downtime migrations and migrations with more downtime, including supported sources, formats, size, and throughput.

	Minimal-downtime migration	Migration with downtime
Supported sources	MySQL, PostgreSQL	Any database that can export to CSV or Avro.
Supported data formats	Connect directly. See Directly connecting to a MySQL database.	MySQL, PostgreSQL, CSV, Avro
Supported database sizes	No limit	No limit
Max throughput	45 GB per hour	200 GB per hour

Minimal-downtime migration

Spanner supports minimal-downtime migrations from MySQL, PostgreSQL, and Oracle Database. A minimal-downtime migration consists of two components:

A consistent snapshot of all the data in the database
The stream of changes (inserts and updates) since that snapshot, referred to as change data capture (CDC)

While minimal-downtime migrations help protect your data, the process involves challenges, including the following:

Storing CDC data while the snapshot is migrated.
Writing the CDC data to Spanner while capturing the incoming CDC stream.
Ensuring that the migration of CDC data to Spanner is faster than the incoming CDC stream.

To manage a minimal-downtime migration, the Spanner migration tool orchestrates the following processes for you:

Sets up a Cloud Storage bucket to store CDC events on the source database while the snapshot migration progresses.
Sets up a Datastream job that moves the bulk load of CDC data and continuously streams incremental CDC data to the Cloud Storage bucket. You set up the source connection profile within Spanner migration tool.
Sets up the Dataflow job to migrate the CDC events into Spanner.

When Dataflow has copied most of the data, it stops writing to the source database and waits for the data to finish migrating. This results in a short downtime while Spanner catches up to the source database. Afterward, the application can be cut over to Spanner.

The following diagram shows this process:

Diagram shows the process of a minimal-downtime migration.

Migration with downtime

For databases other than MySQL, PostgreSQL, or Oracle Database, if the database can export to CSV or Avro, then you can migrate to the Spanner with downtime. We recommend using Dataflow or Spanner migration tool.

Migrations with downtime are recommended only for test environments or applications that can handle a few hours of downtime. On a live database, a migration with downtime can result in data loss.

To perform a downtime migration, follow these high-level steps:

Generate a dump file of the data from the source database.
Upload the dump file to Cloud Storage in a MySQL, PostgreSQL, Avro, or CSV dump format.
Load the dump file into Spanner using Dataflow or the Spanner migration tool.

Generating multiple small dump files makes it quicker to write to Spanner, as Spanner can read multiple dump files in parallel.

When generating a dump file from the source database, to generate a consistent snapshot of data, keep the following in mind:

To prevent the data from changing during the generation of the dump file, before you perform the dump, apply a read lock on the source database.
Generate the dump file using a read replica from the source database with replication disabled.

Recommended formats for bulk migration

Avro is the preferred format for a bulk migration to Spanner. If you're using Avro, consider the following:

To generate an Avro dump of the data, use a tool like DBeam. For more information about exporting to Avro, see Export data from a non-Spanner database to Avro files.
To import Avro data, use a Dataflow import job. For more information, see Import Avro files from non-Spanner databases to Spanner.

If you're using CSV, consider the following:

To generate a CSV dump of the data, use CSV generation supported by the source. If data contains new lines, use a custom line separator.
To import CSV data, use a Dataflow import job. You can create your own Dataflow import template or use a Google Cloud import template. For more information, see Dataflow data pipeline templates.

If you're using MySQL or PostgreSQL, you can use the Spanner migration tool.

For information about using custom scripts to load data into Spanner, see Performance guidelines for bulk loading.

Validate your data migration

Data validation is the process of comparing data from both the source and the destination tables to ensure they match.

Data Validation Tool (DVT) is an open source tool that can connect to data stores and perform checks between the source and Spanner. We recommend using it to perform basic validations as a part of your migration, such as the following:

Check that all tables were created and that all schema mappings are correct .
Match the count of rows for each table.
Extract random rows to verify correctness.
Validate your columns (count, sum, avg, min, max, group by).
Compare any cyclic redundancy checks or hash functions at the row level.

To perform more specific validations, build custom checks during migration.

Configure cutover and failover mechanisms

Migrations are often time consuming and complex. Build in fallbacks to avoid significant impact in case of an error during migration, allowing you to switch back to the source database with minimal downtime.

The current recommendation is to consume change streams to perform reverse replication, and write back to the source database through a stream like Pub/Sub or Cloud Storage.

Diagram shows cutover process.

The reverse replication needs to do the following:

Handle changes in data types or content.
Reverse any transformations performed during the migration.
Push the data to the appropriate destination, taking into account sharding schemes on source.