Use managed migration to migrate a self-managed MySQL metastore to Dataproc Metastore

Managed migration is an automated feature that helps you migrate data from a self-managed Hive Metastore to a Dataproc Metastore service, without any sizable down time (otherwise known as a flag day).

How managed migration works

To complete a managed migration, your service must run through two migration processes—start migration and complete migration. You can cancel a migration at any time with the cancel migration process. There are also a number of operational commands you can run, which aren't required to complete a migration. For example, list migration or delete migration.

As your service moves through this process, it also moves between various migration states and migration phases. These states and phases represent the processes that are occurring in the background. For example, the MIGRATING state indicates that your service is actively transferring data from your Cloud SQL database to Dataproc Metastore.

Managed migration requires three subnets to run a migration: a proxy_subnet, nat_subnet, and a reverse_proxy_subnet. To avoid any latency issues, configure the following:

  • proxy_subnet and nat_subnet in the same region as Cloud SQL.
  • reverse_proxy_subnet is in the same region of the Dataproc Metastore service.

For more information about the different types of subnets, see Purposes of subnets.

Managed migration flow

Start Migration

  1. You run a start migration process on a Dataproc Metastore service.
  2. Your Dataproc Metastore service connects to your Cloud SQL database, which is acting as the backend database for your self-managed Hive metastore. In this step, the Cloud SQL database is still the source of truth for your data.

  3. Dataproc Metastore runs a Change Data Capture (CDC) stream that copies data from the Cloud SQL database to the Dataproc Metastore database (Cloud Spanner).

    Once the start migration is successful, you can start routing workloads to Dataproc Metastore. At this point, Cloud SQL is still the source of truth for your data.

Complete migration

After you finish moving your workloads to Dataproc Metastore, you can complete the migration. When a complete migration process is called, the following occurs:

  • Dataproc Metastore transitions into a read-only mode until the complete migration process finishes.
  • The CDC stream transfers all in-flight data to Dataproc Metastore.
  • Dataproc Metastore connects to Spanner and disconnects from Cloud SQL. Dataproc Metastore now acts as the source of truth for your data.

Before you begin

To start a migration, you must set up or have access to the following services:

  • A Dataproc Metastore configured with the Spanner database type.
  • A Cloud SQL for MySQL database instance configured with Private IP.

    • The VPC network of the Cloud SQL instance has the required subnets.

    • Cloud SQL uses a database with a schema that is compatible with the Hive metastore version running on the Dataproc Metastore service it's copying data to.

    • Set up Cloud SQL to be used as a Datastream source. Create a username and password to connect Datastream to Cloud SQL.

Required Roles

To get the permissions that you need to create a Dataproc Metastore and start a managed migration, ask your administrator to grant you the following IAM roles:

  • To grant full access to all Dataproc Metastore resources, including setting IAM permissions: Dataproc Metastore Admin (roles/metastore.admin) on the Dataproc Metastore user account or service account
  • To grant full control of Dataproc Metastore resources: Dataproc Metastore Editor (roles/metastore.editor) on the Dataproc Metastore user account or service account
  • To create a managed migration: Migration Admin (roles/metastore.migrationAdmin) on the Dataproc Metastore [service agent][7]
  • To use the Cloud Storage object with Datastream:

For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.

Configure a managed migration

You can configure a migration using the Dataproc Metastore APIs.

A Dataproc Metastore service can only run a single migration at a time.

Start migration

When you start a migration, Dataproc Metastore connects to Cloud SQL and uses Cloud SQL as its backend database. During this process, Dataproc Metastore runs a pipeline that copies data from Cloud SQL to its own database (Spanner).

Dataproc Metastore continues to use Cloud SQL as its backend and replicates data until the complete migration process is called.

Considerations

  • A migration remains active until you complete the migration process. There isn't a deadline to complete your migration, for example, the migration can take 1 day, 30 days, or a year.

  • Scheduled backups are not restricted during a migration. However, the backup might be incomplete. To avoid any issues, disable any scheduled backups while the migration is in progress.

A start migration triggers the following state changes:

  • Dataproc Metastore moves to the MIGRATING state.
  • The migration execution state state moves to RUNNING.
  • The migration execution phase moves to REPLICATION.

Console

Get started

  1. In the Google Cloud console, open the Dataproc Metastore page:

    Open Dataproc Metastore

  2. On the Dataproc Metastore page, click the name of the service you want to migrate to.

    The Service detail page opens.

  3. At the top of the page, click Migrate Data.

    The Create migration page opens to the Connectivity tab and displays the Cloud SQL database configuration for Dataproc Metastore configuration settings.

Cloud SQL database configuration for DPMS

  1. In the Instance connection name, enter the instance connection name of the Cloud SQL database, in the following format: PROJECT_ID/LOCATION/CLOUDSQL_INSTANCE_ID.

  2. In the IP address field, enter the IP address required to connect to the Cloud SQL instance.

  3. In the Port field, enter 3306.

  4. In the Hive database name field, enter the name of the database being used as the backend of self-managed Hive metastore.

  5. In the Username field, enter the username that you use to connect Cloud SQL to the Hive Metastore.

  6. In the Password field, enter the password that you use to connect Cloud SQL to the Hive Metastore.

SOCKS5 Proxy service

  1. In the Proxy subnet field, enter the subnetwork used in the Cloud SQL VPC network. This subnet is used to deploy the intermediate SOCKS5 proxy.

  2. In the Nat subnet field, enter the Private Service Connect subnet that provides a connection from the Dataproc Metastore service to access to the intermediate proxy. The subnet size should have a prefix length of at least /29.

  3. Click Continue.

    The Change Data Capture (CDC) tab opens and displays the Cloud SQL database configuration for Datastream configuration settings.

Cloud SQL database configuration for data stream

  1. In the Username field, enter the username that you use to login to the Cloud SQL CDC used by Datastream.

  2. In the Password field, enter the password that you use to login to the Cloud SQL CDC used by Datastream.

  3. In the VPC network field, enter the network in the same VPC network as the Cloud SQL instance used by Datastream to establish a private connection to the CDC.

  4. In the Subnet IP range field, enter a subnet IP range of at least /29. Datastream uses this IP to establish peering to the VPC network.

  5. In the Reverse proxy subnet field, enter the subnetwork you created in the same VPC network as the Cloud SQL. Datastream uses this subnetwork. The subnetwork is used to host a reverse proxy connection for the Datastream CDC. The subnet must be configured in the same region as the Dataproc Metastore service.

GCS configuration

  1. For the Bucket ID, select the Cloud Storage path to store CDC data during the migration.

  2. In the Root path field, enter the root path inside the Cloud Storage bucket. The stream event data is written to this path.

  3. Click Create.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type:application/json" \
  -X POST -d \
  '{
    "migration_execution": {
      "cloud_sql_migration_config": {
        "cloud_sql_connection_config": {
          "instance_connection_name": INSTANCE_CONNECTION_NAME,
          "hive_database_name": "HIVE_DATABASE_NAME",
          "ip_address": "IP_ADDRESS",
          "port": 3306,
          "username": "CONNECTION_USERNAME",
          "password": "CONNECTION_PASSWORD",
          "proxy_subnet": "PROXY_SUBNET",
          "nat_subnet": "NAT_SUBNET"
        },
        "cdc_config": {
          "username": "CDC_USENAME",
          "password": "CDC_PASSWORD",
          "vpc_network": "VPC_NETWORK",
          "subnet_ip_range": "SUBNET_IP_RANGE",
          "reverse_proxy_subnet": "REVERSE_PROXY_SUBNET_ID",
          "bucket": "BUCKET_NAME",
          "root_path": "ROOT_PATH",
        }
      }
    }
}' \
  https://metastore.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:startMigration

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

Cloud SQL Migration configuration

  • INSTANCE_CONNECTION_NAME: the instance connection name for the Cloud SQL database, in the following format: PROJECT_ID/LOCATION/CLOUDSQL_INSTANCE_ID.
  • HIVE_DATABASE_NAME: the name of the self managed Hive database connected to Cloud SQL.
  • IP_ADDRESS: the IP address required to connect to the Cloud SQL instance.
  • CONNECTION_USERNAME: the username that you use to connect Cloud SQL to the Hive Metastore.
  • CONNECTION_PASSWORD the password that you use to connect Cloud SQL to the Hive Metastore
  • PROXY_SUBNET: the subnetwork used in the Cloud SQL VPC network. This subnetwork hosts an intermediate proxy to provide connectivity across transitive networks.
  • NAT_SUBNET: a Private Service Connect subnet that provides a connection from the Dataproc Metastore service to access to the intermediate proxy. The subnet size should have a prefix length of at least /29 and in the IPv4 range.

CDC configuration

  • CDC_USERNAME: the username that the Datastream service uses to login into Cloud SQL.
  • CDC_PASSWORD: the password that the Datastream service uses to login into Cloud SQL.
  • VPC_NETWORK: a network in the same VPC network as the Cloud SQL instance used by Datastream to establish a private connection to the CDC.
  • SUBNET_IP_RANGE: A subnet IP range of at least /29 used by Datastream to establish peering to the VPC network.
  • REVERSE_PROXY_SUBNET_ID: a subnetwork in the same VPC network as the Cloud SQL instance used by Datastream. The subnetwork is used to host a reverse proxy connection for the Datastream CDC. The subnet must be configured in the same region as the Dataproc Metastore service.
  • BUCKET_NAME: the Cloud Storage path to store CDC data during the migration.
  • ROOT_PATH: the root path inside the Cloud Storage bucket. The stream event data is written to this path.

Complete migration

When you complete a migration, Dataproc Metastore connects to Spanner and starts to use Spanner as its backend database.

A complete migration triggers the following state changes:

  • Dataproc Metastore moves back to the ACTIVE state.
  • The migration execution state moves to SUCCEEDED.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your completed managed migrations.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type:application/json" \
  -X POST -d '' \
  https://metastore.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:completeMigration

Replace the following:

  • SERVICE: The name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

Cancel migration

When you cancel a migration, Dataproc Metastore reverts any changes and starts using the Spanner database type as it's backend database. Any data that was transferred during the migration is deleted.

A cancel migration triggers the following state changes:

  • Dataproc Metastore moves back to the ACTIVE state.
  • The migration execution state moves to CANCELLED.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your canceled managed migrations.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type:application/json" \
   -X POST -d '' \
   https://metastore.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:cancelMigration

Replace the following:

  • SERVICE_NAME: The name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

Get migration details

Get details about a single managed migration.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your managed migrations.

    To get more migration details, click the name of a managed migration.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -X GET \
  https://metastore.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID

Replace the following:

  • SERVICE: The name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
  • MIGRATION_ID: The name or ID of your Dataproc Metastore migration.

List migrations

List managed migrations.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your managed migrations.

  3. Verify that the command listed the migrations.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -X GET \
  https://metastore.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID

Replace the following:

  • SERVICE: The name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

Delete migrations

Delete managed migrations.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -X DELETE \
   https://metastore.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
  • MIGRATION_ID: the name or ID of the Dataproc Metastore migration.

What's next