Prerequisites for managed migration

This page shows you how to set up your Google Cloud project to prepare for a Dataproc Metastore managed migration.

Before you begin

Required roles

To get the permissions that you need to create a Dataproc Metastore and start a managed migration, ask your administrator to grant you the following IAM roles:

  • To grant full access to all Dataproc Metastore resources, including setting IAM permissions: Dataproc Metastore Admin (roles/metastore.admin) on the Dataproc Metastore user account or service account
  • To grant full control of Dataproc Metastore resources: Dataproc Metastore Editor (roles/metastore.editor) on the Dataproc Metastore user account or service account
  • To grant permission to start a migration: Migration Admin (roles/metastore.migrationAdmin) on the Dataproc Metastore service agent in the service project
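As an illustration, an administrator might grant one of these roles with a command like the following. This is a hedged sketch: PROJECT_ID and USER_EMAIL are placeholders for your own values, not names from this document.

```shell
# Hypothetical example: grant the Dataproc Metastore Editor role to a user.
# PROJECT_ID and USER_EMAIL are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --role "roles/metastore.editor" \
  --member "user:USER_EMAIL"
```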

For more information about granting roles, see Manage access.

You might also be able to get the required permissions through custom roles or other predefined roles.

Grant additional roles depending on your project settings

Depending on how your project is configured, you might need to add the following additional roles. Examples on how to grant these roles to the appropriate accounts are shown in the prerequisites section later on this page.

  • Grant the Network User (roles/compute.networkUser) role to the Dataproc Metastore service agent and the Google APIs Service Agent on the host project.
  • Grant the Network Admin (roles/compute.networkAdmin) role to the Datastream service agent on the host project.

If your Cloud SQL instance is in a different project than the Dataproc Metastore service project:

  • Grant the roles/cloudsql.client role and the roles/cloudsql.instanceUser role to the Dataproc Metastore service agent on the Cloud SQL instance project.
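For example, these cross-project grants might look like the following sketch, where CLOUD_SQL_PROJECT and SERVICE_PROJECT_NUMBER are placeholders for your Cloud SQL instance project ID and your service project number:

```shell
# Hedged sketch: grant Cloud SQL roles to the Dataproc Metastore service
# agent in the project that hosts the Cloud SQL instance.
# CLOUD_SQL_PROJECT and SERVICE_PROJECT_NUMBER are placeholders.
gcloud projects add-iam-policy-binding CLOUD_SQL_PROJECT \
  --role "roles/cloudsql.client" \
  --member "serviceAccount:service-SERVICE_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding CLOUD_SQL_PROJECT \
  --role "roles/cloudsql.instanceUser" \
  --member "serviceAccount:service-SERVICE_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com"
```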

If the Cloud Storage bucket for the Change-Data-Capture pipeline is in a different project than your Dataproc Metastore service project:

  • Make sure your Datastream service agent has the required permissions to write to the bucket. Typically, these are the roles/storage.objectViewer, roles/storage.objectCreator, and roles/storage.legacyBucketReader roles.
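One way to grant these bucket-level roles is with a command like the following sketch, where BUCKET_NAME and SERVICE_PROJECT_NUMBER are placeholders:

```shell
# Hedged sketch: grant a bucket-level role to the Datastream service agent.
# BUCKET_NAME and SERVICE_PROJECT_NUMBER are placeholders.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member "serviceAccount:service-SERVICE_PROJECT_NUMBER@gcp-sa-datastream.iam.gserviceaccount.com" \
  --role "roles/storage.objectViewer"
# Repeat for roles/storage.objectCreator and roles/storage.legacyBucketReader.
```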

Managed migration prerequisites

To facilitate data transfer, Dataproc Metastore uses proxies and a change data capture pipeline. It's important to understand how these work before starting a transfer.

Key terms

  • Service Project: A service project is the Google Cloud project where you created your Dataproc Metastore service.
  • Host Project: A host project is the Google Cloud project that holds your Shared VPC networks. One or more service projects can be linked to your host project to use these shared networks. For more information, see Shared VPC.

To set up the migration prerequisites, follow these steps:

  1. Enable the Datastream API in your service project.
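For example, you can enable the API with the following command, where SERVICE_PROJECT is a placeholder for your service project ID:

```shell
# Enable the Datastream API in the service project.
gcloud services enable datastream.googleapis.com --project=SERVICE_PROJECT
```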
  2. Grant the roles/metastore.migrationAdmin role to the Dataproc Metastore Service Agent in your service project.

    gcloud projects add-iam-policy-binding SERVICE_PROJECT --role "roles/metastore.migrationAdmin" --member "serviceAccount:service-SERVICE_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com"
    
  3. Add the following firewall rules.

    These firewall rules establish a connection between Dataproc Metastore and your private IP Cloud SQL instance.

    • A firewall rule to allow traffic from the health check probes to the network load balancer of the SOCKS5 proxy. For example:

      gcloud compute firewall-rules create RULE_NAME --direction=INGRESS --priority=1000 --network=CLOUD_SQL_NETWORK --allow=tcp:1080 --source-ranges=35.191.0.0/16,130.211.0.0/22
      

      Port 1080 is where the SOCKS5 proxy server is running.

    • A firewall rule to allow traffic from the load balancer to the SOCKS5 proxy managed instance group (MIG). For example:

      gcloud compute firewall-rules create RULE_NAME --direction=INGRESS --priority=1000 --network=CLOUD_SQL_NETWORK --action=ALLOW --rules=all --source-ranges=PROXY_SUBNET_RANGE
      
    • A firewall rule to allow traffic from the PSC service attachment to the load balancer. For example:

      gcloud compute firewall-rules create RULE_NAME --direction=INGRESS --priority=1000 --network=CLOUD_SQL_NETWORK --allow=tcp:1080 --source-ranges=NAT_SUBNET_RANGE
      

    • A firewall rule to allow Datastream to use the /29 CIDR IP range to create a private IP connection. For example:

      gcloud compute firewall-rules create RULE_NAME --direction=INGRESS --priority=1000 --network=CLOUD_SQL_NETWORK --action=ALLOW --rules=all --source-ranges=CIDR_RANGE
    

(Optional) Steps if you use a Shared VPC

Follow these steps if you use a Shared VPC setup.

For more details about a Shared VPC, see Service Project Admins.

  1. Grant the roles/compute.networkUser role to the Dataproc Metastore service agent and the Google APIs Service Agent on the host project.

    gcloud projects add-iam-policy-binding HOST_PROJECT --role "roles/compute.networkUser" --member "serviceAccount:service-SERVICE_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com"
    gcloud projects add-iam-policy-binding HOST_PROJECT --role "roles/compute.networkUser" --member "serviceAccount:SERVICE_PROJECT_NUMBER@cloudservices.gserviceaccount.com"
    
  2. Grant the roles/compute.networkAdmin role to the Datastream Service Agent on the host project.

    gcloud projects add-iam-policy-binding HOST_PROJECT --role "roles/compute.networkAdmin" --member "serviceAccount:service-SERVICE_PROJECT_NUMBER@gcp-sa-datastream.iam.gserviceaccount.com"
    

If you can't grant the roles/compute.networkAdmin role, create a custom role with the permissions listed in Shared VPC prerequisites.

  • These permissions are required to establish peering between the VPC network in the host project with Datastream at the start of the migration.

  • You can remove this role as soon as the migration starts. However, if you remove the role before the migration completes, Dataproc Metastore can't clean up the peering job, and you must clean it up yourself.

Proxy and pipeline considerations

Proxies

Dataproc Metastore uses a Cloud SQL Auth proxy chained to a SOCKS5 proxy to connect to your private IP Cloud SQL instance. The SOCKS5 proxy servers are exposed through a service attachment as shown in the architecture diagram on about managed migrations.

  • Because a NAT subnet can't have more than one service attachment, each migration requires a dedicated NAT subnet.

  • To avoid cross-region latency issues, provide subnets (for example, a proxy_subnet and a nat_subnet) that are in the same region as your Cloud SQL instance to host the SOCKS5 proxy.
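As a sketch, such subnets could be created as follows. All names and IP ranges here are placeholders, and the NAT subnet uses the Private Service Connect purpose required for a service attachment:

```shell
# Hedged sketch: create a subnet for the SOCKS5 proxy MIG and a dedicated
# NAT subnet for the PSC service attachment, both in the same region as
# the Cloud SQL instance. Names, network, region, and ranges are placeholders.
gcloud compute networks subnets create proxy_subnet \
  --network=CLOUD_SQL_NETWORK --region=REGION --range=10.1.0.0/24

gcloud compute networks subnets create nat_subnet \
  --network=CLOUD_SQL_NETWORK --region=REGION --range=10.2.0.0/29 \
  --purpose=PRIVATE_SERVICE_CONNECT
```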

Change data capture pipeline

For the change data capture pipeline, a connection between Datastream and your private IP Cloud SQL instance is established using VPC peering.

  • For each migration, a new private connection is created and a new peering connection is established.

  • The VPC network hosting the Cloud SQL instance has as many peering connections as there are active migrations. Make sure that your VPC network has the capacity to host all of the necessary peering connections.
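To check how many peering connections the network currently has, you can list them. CLOUD_SQL_NETWORK is a placeholder for the VPC network that hosts your Cloud SQL instance:

```shell
# List existing VPC peering connections on the Cloud SQL network to
# verify there is capacity for additional migrations.
gcloud compute networks peerings list --network=CLOUD_SQL_NETWORK
```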

What's next