This page shows you how to set up your Google Cloud project to prepare for a Dataproc Metastore managed migration.
Before you begin
Understand how managed migration works.
Set up or have access to the following services:
- A Dataproc Metastore service configured with the Spanner database type.
- A Cloud SQL for MySQL database instance configured with a private IP.
- A VPC network for the Cloud SQL instance that is configured to use the required subnets.
- A Cloud SQL database with a schema that is compatible with the Hive metastore version running on the Dataproc Metastore service that it copies data to.
- A Cloud SQL instance that contains the appropriate users to establish connectivity between Datastream and Dataproc Metastore, and between Dataproc Metastore and Cloud SQL.
Required roles
To get the permissions that you need to create a Dataproc Metastore and start a managed migration, ask your administrator to grant you the following IAM roles:
- To grant full access to all Dataproc Metastore resources, including setting IAM permissions: Dataproc Metastore Admin (roles/metastore.admin) on the Dataproc Metastore user account or service account
- To grant full control of Dataproc Metastore resources: Dataproc Metastore Editor (roles/metastore.editor) on the Dataproc Metastore user account or service account
- To grant permission to start a migration: Migration Admin (roles/metastore.migrationAdmin) on the Dataproc Metastore service agent in the service project
For more information about granting roles, see Manage access.
You might also be able to get the required permissions through custom roles or other predefined roles.
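As a sketch, granting one of these roles with the gcloud CLI looks like the following. The project ID and principal here are hypothetical placeholders; substitute your own values:

```shell
# Hypothetical placeholders; substitute your own project and principal.
PROJECT_ID="my-service-project"
MEMBER="user:migration-operator@example.com"

# Compose the binding command for the Dataproc Metastore Admin role and
# print it for review before running it.
GRANT_CMD="gcloud projects add-iam-policy-binding ${PROJECT_ID} --role roles/metastore.admin --member ${MEMBER}"
echo "${GRANT_CMD}"
```

Run the printed command (or drop the echo) in a shell where you're authenticated with sufficient permissions to modify IAM policy on the project.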
Grant additional roles depending on your project settings
Depending on how your project is configured, you might need to grant the following additional roles. Examples of how to grant these roles to the appropriate accounts are shown in the prerequisites section later on this page.
- Grant the Network User (roles/compute.networkUser) role to the Dataproc Metastore service agent and the Google APIs service agent on the host project.
- Grant the Network Admin (roles/compute.networkAdmin) role to the Datastream service agent on the host project.
If your Cloud SQL instance is in a different project than the Dataproc Metastore service project:
- Grant the roles/cloudsql.client and roles/cloudsql.instanceUser roles to the Dataproc Metastore service agent on the Cloud SQL instance project.
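A sketch of these cross-project grants, using hypothetical project values (the service agent address is derived from the service project number):

```shell
# Hypothetical values; replace with your own.
SERVICE_PROJECT_NUMBER="123456789012"
CLOUD_SQL_PROJECT="my-cloudsql-project"
AGENT="serviceAccount:service-${SERVICE_PROJECT_NUMBER}@gcp-sa-metastore.iam.gserviceaccount.com"

# Compose one binding command per role and print them for review.
CLIENT_CMD="gcloud projects add-iam-policy-binding ${CLOUD_SQL_PROJECT} --role roles/cloudsql.client --member ${AGENT}"
INSTANCE_USER_CMD="gcloud projects add-iam-policy-binding ${CLOUD_SQL_PROJECT} --role roles/cloudsql.instanceUser --member ${AGENT}"
echo "${CLIENT_CMD}"
echo "${INSTANCE_USER_CMD}"
```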
If the Cloud Storage bucket for the Change-Data-Capture pipeline is in a different project than your Dataproc Metastore service project:
- Make sure your Datastream service agent has the required permissions to write to the bucket. Typically, these are the roles/storage.objectViewer, roles/storage.objectCreator, and roles/storage.legacyBucketReader roles.
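A sketch of granting those three roles to the Datastream service agent on the bucket's project; the project ID and service agent address below are hypothetical placeholders:

```shell
# Hypothetical values; replace with your own.
BUCKET_PROJECT="my-bucket-project"
DATASTREAM_AGENT="serviceAccount:service-123456789012@gcp-sa-datastream.iam.gserviceaccount.com"

# Compose one binding command per storage role and print them for review.
BUCKET_CMDS=$(for ROLE in roles/storage.objectViewer roles/storage.objectCreator roles/storage.legacyBucketReader; do
  echo "gcloud projects add-iam-policy-binding ${BUCKET_PROJECT} --role ${ROLE} --member ${DATASTREAM_AGENT}"
done)
echo "${BUCKET_CMDS}"
```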
Managed migration prerequisites
To facilitate data transfer, Dataproc Metastore uses proxies and a change data capture pipeline. It's important to understand how these work before starting a transfer.
Key terms
- Service Project: A service project is the Google Cloud project where you created your Dataproc Metastore service.
- Host Project: A host project is the Google Cloud project that holds your Shared VPC networks. One or more service projects can be linked to your host project to use these shared networks. For more information, see Shared VPC.
- Enable the Datastream API in your service project.
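Enabling the API is a single command; the project ID below is a hypothetical placeholder:

```shell
# SERVICE_PROJECT_ID is a hypothetical placeholder; replace it with your own.
SERVICE_PROJECT_ID="my-service-project"
ENABLE_CMD="gcloud services enable datastream.googleapis.com --project=${SERVICE_PROJECT_ID}"
# Print the command for review before running it.
echo "${ENABLE_CMD}"
```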
Grant the roles/metastore.migrationAdmin role to the Dataproc Metastore service agent in your service project:

```shell
gcloud projects add-iam-policy-binding SERVICE_PROJECT \
    --role "roles/metastore.migrationAdmin" \
    --member "serviceAccount:service-SERVICE_PROJECT@gcp-sa-metastore.iam.gserviceaccount.com"
```
Add the following firewall rules to establish a connection between Dataproc Metastore and your private IP Cloud SQL instance:

- A firewall rule to allow traffic from the health check probes to the network load balancer of the SOCKS5 proxy. For example:

  ```shell
  gcloud compute firewall-rules create RULE_NAME \
      --direction=INGRESS --priority=1000 \
      --network=CLOUD_SQL_NETWORK \
      --allow=tcp:1080 \
      --source-ranges=35.191.0.0/16,130.211.0.0/22
  ```

  Port 1080 is where the SOCKS5 proxy server is running.

- A firewall rule to allow traffic from the load balancer to the SOCKS5 proxy managed instance group (MIG). For example:

  ```shell
  gcloud compute firewall-rules create RULE_NAME \
      --direction=INGRESS --priority=1000 \
      --network=CLOUD_SQL_NETWORK \
      --action=ALLOW --rules=all \
      --source-ranges=PROXY_SUBNET_RANGE
  ```

- A firewall rule to allow traffic from the PSC service attachment to the load balancer. For example:

  ```shell
  gcloud compute firewall-rules create RULE_NAME \
      --direction=INGRESS --priority=1000 \
      --network=CLOUD_SQL_NETWORK \
      --allow=tcp:1080 \
      --source-ranges=NAT_SUBNET_RANGE
  ```

- A firewall rule to allow Datastream to use the /29 CIDR IP range to create a private IP connection. For example:

  ```shell
  gcloud compute firewall-rules create RULE_NAME \
      --direction=INGRESS --priority=1000 \
      --network=CLOUD_SQL_NETWORK \
      --action=ALLOW --rules=all \
      --source-ranges=CIDR_RANGE
  ```
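After creating the rules, you can verify them by listing the firewall rules on the Cloud SQL network. This is a sketch; the network name is a hypothetical placeholder:

```shell
# NETWORK is a hypothetical placeholder for the Cloud SQL VPC network name.
NETWORK="cloud-sql-network"
LIST_CMD="gcloud compute firewall-rules list --filter=network:${NETWORK}"
# Print the command for review before running it.
echo "${LIST_CMD}"
```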
(Optional) Steps if you use a Shared VPC
Follow these steps if you use a Shared VPC setup.
For more details about a Shared VPC, see Service Project Admins.
Grant the roles/compute.networkUser role to the Dataproc Metastore service agent and the Google APIs service agent on the host project:

```shell
gcloud projects add-iam-policy-binding HOST_PROJECT \
    --role "roles/compute.networkUser" \
    --member "serviceAccount:service-SERVICE_PROJECT@gcp-sa-metastore.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding HOST_PROJECT \
    --role "roles/compute.networkUser" \
    --member "serviceAccount:SERVICE_PROJECT@cloudservices.gserviceaccount.com"
```
Grant the roles/compute.networkAdmin role to the Datastream service agent on the host project:

```shell
gcloud projects add-iam-policy-binding HOST_PROJECT \
    --role "roles/compute.networkAdmin" \
    --member "serviceAccount:service-SERVICE_PROJECT@gcp-sa-datastream.iam.gserviceaccount.com"
```
If you can't grant the roles/compute.networkAdmin role, create a custom role with the permissions listed in Shared VPC prerequisites.
These permissions are required to establish peering between the VPC network in the host project with Datastream at the start of the migration.
You can remove this role as soon as the migration has started. However, if you remove the role before the migration is complete, Dataproc Metastore can't clean up the peering job. In that case, you must clean up the job yourself.
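Cleaning up a leftover peering yourself can be sketched as follows; the network and peering names are hypothetical, so look up the real peering name first with `gcloud compute networks peerings list`:

```shell
# Hypothetical placeholders; look up the real peering name with
# "gcloud compute networks peerings list --network=NETWORK" first.
NETWORK="cloud-sql-network"
PEERING="datastream-peering"
DELETE_CMD="gcloud compute networks peerings delete ${PEERING} --network=${NETWORK}"
# Print the command for review before running it.
echo "${DELETE_CMD}"
```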
Proxy and pipeline considerations
Proxies
Dataproc Metastore uses a Cloud SQL Auth Proxy chained to a SOCKS5 proxy to connect to your private IP Cloud SQL instance. The SOCKS5 proxy servers are exposed through a service attachment, as shown in the architecture diagram in About managed migrations.
Because a NAT subnet can't have more than one service attachment, each migration requires a dedicated NAT subnet.
To avoid cross-region latency issues, provide subnets that are in the same region as your Cloud SQL instance to host the SOCKS5 proxy, for example, proxy_subnet and nat_subnet.
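Creating the two subnets can be sketched as follows. The names, IP ranges, and region are hypothetical, and the assumption here is that the NAT subnet backing a PSC service attachment must be created with `--purpose=PRIVATE_SERVICE_CONNECT`:

```shell
# Hypothetical names, ranges, and region; the region must match your
# Cloud SQL instance's region.
NETWORK="cloud-sql-network"
REGION="us-central1"
PROXY_CMD="gcloud compute networks subnets create proxy-subnet --network=${NETWORK} --region=${REGION} --range=10.1.0.0/24"
# A NAT subnet that backs a PSC service attachment is created with
# --purpose=PRIVATE_SERVICE_CONNECT.
NAT_CMD="gcloud compute networks subnets create nat-subnet --network=${NETWORK} --region=${REGION} --range=10.2.0.0/29 --purpose=PRIVATE_SERVICE_CONNECT"
# Print the commands for review before running them.
echo "${PROXY_CMD}"
echo "${NAT_CMD}"
```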
Change data capture pipeline
For the change data capture pipeline, a connection between Datastream and the private IP Cloud SQL instance is established using VPC peering.
For each migration, a new private connection is created and a new peering connection is established.
The VPC network hosting the Cloud SQL instance has as many peering connections as there are active migrations. Make sure that your VPC network has the capacity to host all of the necessary peering connections.
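To check how many peering connections the network already holds, you can list them; the network name below is a hypothetical placeholder:

```shell
# NETWORK is a hypothetical placeholder; each active migration adds one
# VPC peering connection to this network.
NETWORK="cloud-sql-network"
PEERINGS_CMD="gcloud compute networks peerings list --network=${NETWORK}"
# Print the command for review before running it.
echo "${PEERINGS_CMD}"
```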