Use managed migration with a Dataproc Metastore service

This page shows you how to start and manage a Dataproc Metastore managed migration.

You can configure a migration using the Dataproc Metastore APIs.

Before you begin

Start migration

When you start a migration, Dataproc Metastore connects to Cloud SQL and uses Cloud SQL as its backend database. During this process, Dataproc Metastore runs a pipeline that copies data from Cloud SQL to its own database (Spanner).

Dataproc Metastore continues to use Cloud SQL as its backend and replicates data until you call the complete migration operation.

Before you start a migration, make sure you have set up the managed migration prerequisites.

Start migration considerations

  • A Dataproc Metastore service can only run a single migration at a time.

  • A migration remains active until you complete the migration process. There isn't a deadline to complete your migration; for example, the migration can take 1 day, 30 days, or a year.

  • Scheduled backups aren't restricted during a migration, but a backup taken while the migration runs might be incomplete. To avoid any issues, disable scheduled backups while the migration is in progress.

A start migration triggers the following state changes:

  • Dataproc Metastore moves to the MIGRATING state.
  • The migration execution state moves to RUNNING.
  • The migration execution phase moves to REPLICATION.

Console

Get started

  1. In the Google Cloud console, open the Dataproc Metastore page:

    Open Dataproc Metastore

  2. On the Dataproc Metastore page, click the name of the service you want to migrate to.

    The Service detail page opens.

  3. At the top of the page, click Migrate Data.

    The Create migration page opens to the Connectivity tab and displays the Cloud SQL database configuration for Dataproc Metastore settings.

Cloud SQL database configuration for DPMS

  1. In the Instance connection name field, enter the instance connection name of the Cloud SQL database, in the following format: project_id:region:instance_name.

  2. In the IP address field, enter the IP address required to connect to the Cloud SQL instance.

  3. In the Port field, enter 3306.

  4. In the Hive database name field, enter the name of the database used as the backend of the self-managed Hive Metastore.

  5. In the Username field, enter the username that you use to connect Cloud SQL to the Hive Metastore.

  6. In the Password field, enter the password that you use to connect Cloud SQL to the Hive Metastore.
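
The instance connection name in step 1 must have exactly three colon-separated parts. As a quick local sanity check (a sketch, not part of the migration flow; the sample values are illustrative), you can validate it before entering it:

```python
# Sketch: validate the project_id:region:instance_name format expected in
# step 1. The sample values are illustrative.
def is_valid_connection_name(name: str) -> bool:
    parts = name.split(":")
    return len(parts) == 3 and all(parts)

print(is_valid_connection_name("my-project:us-central1:hive-mysql"))  # True
print(is_valid_connection_name("my-project/us-central1/hive-mysql"))  # False
```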

SOCKS5 Proxy service

  1. In the Proxy Subnet field, enter a subnet of Regular type. The subnetwork must be present in the Cloud SQL VPC network. This subnet is used to deploy the intermediate SOCKS5 proxy service.

  2. In the Nat Subnet field, enter a subnet of Private Service Connect type. This subnetwork must be present in the Cloud SQL VPC network and is used to publish the SOCKS5 proxy service using Private Service Connect.

  3. Click Continue.

    The Change Data Capture (CDC) tab opens and displays the Cloud SQL database configuration for Datastream configuration settings.

Cloud SQL database configuration for Datastream

  1. In the Username field, enter the username that Datastream uses to log in to the Cloud SQL CDC.

  2. In the Password field, enter the password that Datastream uses to log in to the Cloud SQL CDC.

  3. In the VPC network field, enter a network in the same VPC network as the Cloud SQL instance. Datastream uses this network to establish a private connection to the CDC.

  4. In the Subnet IP range field, enter a subnet IP range of at least /29. Datastream uses this IP range to establish peering to the VPC network.

  5. In the Reverse proxy subnet field, enter the subnetwork that you created in the same VPC network as the Cloud SQL instance. Datastream uses this subnetwork to host a reverse proxy connection for the CDC. The subnet must be configured in the same region as the Dataproc Metastore service.

Cloud Storage configuration

  1. For the Bucket ID, select the Cloud Storage path to store CDC data during the migration.

  2. In the Root path field, enter the root path inside the Cloud Storage bucket. The stream event data is written to this path.

  3. Click Create.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type:application/json" \
  -X POST -d \
  '{
    "migration_execution": {
      "cloud_sql_migration_config": {
        "cloud_sql_connection_config": {
          "instance_connection_name": "INSTANCE_CONNECTION_NAME",
          "hive_database_name": "HIVE_DATABASE_NAME",
          "ip_address": "IP_ADDRESS",
          "port": 3306,
          "username": "CONNECTION_USERNAME",
          "password": "CONNECTION_PASSWORD",
          "proxy_subnet": "PROXY_SUBNET",
          "nat_subnet": "NAT_SUBNET"
        },
        "cdc_config": {
          "username": "CDC_USERNAME",
          "password": "CDC_PASSWORD",
          "vpc_network": "VPC_NETWORK",
          "subnet_ip_range": "SUBNET_IP_RANGE",
          "reverse_proxy_subnet": "REVERSE_PROXY_SUBNET_ID",
          "bucket": "BUCKET_NAME",
          "root_path": "ROOT_PATH"
        }
      }
    }
}' \
  https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:startMigration

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

Cloud SQL Migration configuration

  • INSTANCE_CONNECTION_NAME: the instance connection name for the Cloud SQL database, in the following format: PROJECT_ID:LOCATION:CLOUDSQL_INSTANCE_ID.
  • HIVE_DATABASE_NAME: the name of the self-managed Hive database connected to Cloud SQL.
  • IP_ADDRESS: the IP address required to connect to the Cloud SQL instance.
  • CONNECTION_USERNAME: the username that you use to connect Cloud SQL to the Hive Metastore.
  • CONNECTION_PASSWORD: the password that you use to connect Cloud SQL to the Hive Metastore.
  • PROXY_SUBNET: the subnetwork used in the Cloud SQL VPC network. This subnetwork hosts an intermediate proxy to provide connectivity across transitive networks.
  • NAT_SUBNET: a Private Service Connect subnet that provides a connection from the Dataproc Metastore service to the intermediate proxy. The subnet must have a prefix length of at least /29 and be in the IPv4 range.
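
The /29 and IPv4 requirements for the NAT subnet can be checked locally before you create the subnet. The following is a minimal sketch using Python's standard ipaddress module; the sample ranges are illustrative, not recommendations:

```python
# Sketch: check that a candidate NAT subnet range is IPv4 and at least as
# large as /29 (that is, a prefix length of 29 or shorter). The sample
# ranges below are illustrative.
import ipaddress

def nat_subnet_ok(cidr: str) -> bool:
    net = ipaddress.ip_network(cidr)
    return net.version == 4 and net.prefixlen <= 29

print(nat_subnet_ok("10.124.0.0/29"))  # True
print(nat_subnet_ok("10.124.0.0/30"))  # False: smaller than /29
```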

CDC configuration

  • CDC_USERNAME: the username that the Datastream service uses to log in to Cloud SQL.
  • CDC_PASSWORD: the password that the Datastream service uses to log in to Cloud SQL.
  • VPC_NETWORK: a network in the same VPC network as the Cloud SQL instance. Datastream uses this network to establish a private connection to the CDC.
  • SUBNET_IP_RANGE: a subnet IP range of at least /29 that Datastream uses to establish peering to the VPC network.
  • REVERSE_PROXY_SUBNET_ID: a subnetwork in the same VPC network as the Cloud SQL instance used by Datastream. The subnetwork is used to host a reverse proxy connection for the Datastream CDC. The subnet must be configured in the same region as the Dataproc Metastore service.
  • BUCKET_NAME: the Cloud Storage path to store CDC data during the migration.
  • ROOT_PATH: the root path inside the Cloud Storage bucket. The stream event data is written to this path.
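
Inline JSON bodies in curl are easy to mis-quote (for example, a stray trailing comma makes the request invalid). One way to avoid this is to assemble the body programmatically and write it to a file, then send it with `curl -d @request.json`. The following is a local sketch only; every value, including the subnet and network names, is an illustrative placeholder:

```python
# Sketch: build the startMigration request body with illustrative values
# and serialize it to a file for use with `curl -d @request.json`.
import json

body = {
    "migration_execution": {
        "cloud_sql_migration_config": {
            "cloud_sql_connection_config": {
                "instance_connection_name": "my-project:us-central1:hive-mysql",
                "hive_database_name": "hive_metastore",
                "ip_address": "10.0.0.5",
                "port": 3306,
                "username": "hive",
                "password": "example-password",
                "proxy_subnet": "proxy-subnet",  # illustrative name
                "nat_subnet": "nat-subnet",      # illustrative name
            },
            "cdc_config": {
                "username": "datastream",
                "password": "example-password",
                "vpc_network": "default",        # illustrative name
                "subnet_ip_range": "10.124.0.0/29",
                "reverse_proxy_subnet": "reverse-proxy-subnet",
                "bucket": "my-cdc-bucket",
                "root_path": "/cdc",
            },
        }
    }
}

# json.dump guarantees valid JSON (no trailing commas, correct quoting).
with open("request.json", "w") as f:
    json.dump(body, f, indent=2)
```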

Complete migration

When you complete a migration, Dataproc Metastore connects to Spanner and starts to use Spanner as its backend database.

A complete migration triggers the following state changes:

  • Dataproc Metastore moves back to the ACTIVE state.
  • The migration execution state moves to SUCCEEDED.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your completed managed migrations.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type:application/json" \
  -X POST -d '' \
   https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:completeMigration

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
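
Because startMigration, completeMigration, and cancelMigration all share the same resource path, a small helper can build each endpoint URL from the same identifiers. A sketch with placeholder values:

```python
# Sketch: build a migration endpoint URL. All identifiers are placeholders.
def migration_endpoint(project_id: str, location: str, service: str, verb: str) -> str:
    return (
        "https://metastore.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/services/{service}:{verb}"
    )

print(migration_endpoint("my-project", "us-central1", "my-service", "completeMigration"))
# https://metastore.googleapis.com/v1/projects/my-project/locations/us-central1/services/my-service:completeMigration
```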

Cancel migration

When you cancel a migration, Dataproc Metastore reverts any changes and starts using Spanner as its backend database again. Any data that was transferred during the migration is deleted.

A cancel migration triggers the following state changes:

  • Dataproc Metastore moves back to the ACTIVE state.
  • The migration execution state moves to CANCELLED.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your canceled managed migrations.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type:application/json" \
  -X POST -d '' \
    https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:cancelMigration

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

Get migration details

Get details about a single managed migration.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your managed migrations.

    To get more migration details, click the name of a managed migration.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -X GET \
   https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
  • MIGRATION_ID: the name or ID of your Dataproc Metastore migration.
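
The GET response is JSON, so the execution state and phase can be read from it directly. The following sketch parses an abbreviated, illustrative payload; real responses contain more fields:

```python
# Sketch: read the execution state and phase from an abbreviated,
# illustrative migrationExecutions GET response.
import json

response_body = """
{
  "name": "projects/my-project/locations/us-central1/services/my-service/migrationExecutions/my-migration",
  "state": "RUNNING",
  "phase": "REPLICATION"
}
"""

execution = json.loads(response_body)
print(execution["state"], execution["phase"])  # RUNNING REPLICATION
```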

List migrations

List managed migrations.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your managed migrations.

  3. Verify that your migrations are listed.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -X GET \
   https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.

Delete migrations

Delete managed migrations.

Console

  1. In the Google Cloud console, open the Dataproc Metastore page.

  2. At the top of the page, click Migrate Data.

    The Migrate Data page opens and displays your managed migrations.

  3. Select the migration and click Delete.

REST

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
   -X DELETE \
    https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID

Replace the following:

  • SERVICE: the name or ID of your Dataproc Metastore service.
  • PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
  • LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
  • MIGRATION_ID: the name or ID of the Dataproc Metastore migration.

What's next