Migrate Dataproc Metastore metadata to BigQuery metastore

This document explains how to use the BigQuery metastore migration tool, which helps you move database, table, and partition metadata from a Dataproc Metastore service to BigQuery metastore.

Before you begin

  1. Enable billing for your Google Cloud project. Learn how to check if billing is enabled on a project.
  2. Enable the BigQuery and Dataflow APIs.

    Enable the APIs

  3. Optional: Learn more about the following:

Required roles

To get the permissions that you need to run the migration tool, ask your administrator to grant you the following IAM roles:

  • Create databases, tables, and partitions in BigQuery: BigQuery Data Editor (roles/bigQuery.dataEditor) on the Dataflow service account that's running the migration job.
  • Grant read access to Dataproc Metastore metadata on a gRPC endpoint: Metadata Viewer role (roles/metastore.metadataViewer) on the Dataflow service account that's running the migration job.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

How the migration tool works

The migration tool starts a Dataflow job that extracts metadata from Dataproc Metastore and ingests it into BigQuery metastore.

You can't access Dataproc Metastore data directly from BigQuery metastore. This migration process is necessary to run workloads or jobs on existing metadata. If you don't use this tool, then you must manually extract the metadata from Dataproc Metastore and ingest it into BigQuery metastore.

Considerations

  • The migration tool supports Dataproc Metastore instances that use either the gRPC or Thrift endpoint protocol.
  • Each time you run the migration tool, the Dataflow job makes a full copy of the metadata from Dataproc Metastore to BigQuery metastore.

Limitations

The migration tool doesn't migrate metadata unrelated to database and table partitions. If the tool can't migrate a database or table, it logs a message in the Cloud Logging instance of the project that contains the Dataflow job.

After the error is logged, the Dataflow job continues processing other tables and databases in the pipeline.

Run a migration with the tool

To run a migration with the tool, create a Dataflow flex template job by running the following gcloud dataflow flex-template command.

gcloud dataflow flex-template run JOB_NAME" \
   --template-file-gcs-location "gs://bigquery-metastore-migration/dpms_to_bqms_migration.json" \
   --parameters bigQueryMetastoreProjectId="DESTINATION_BIGQUERY_PROJECT_ID" \
   --parameters endpointUri="METASTORE_URI" \
   --parameters bigQueryMetastoreDatabaseLocation=DATABASE_LOCATION \
  [--network=NETWORK] \
  [--subnetwork=SUBNETWORK]

Replace the following:

  • JOB_NAME: the name of the Dataflow job to run the migration.
  • DESTINATION_BIGQUERY_PROJECT_ID: the ID of the BigQuery project where the Dataflow job writes data.
  • METASTORE_URI: the URI of Dataproc Metastore service.
  • DATABASE_LOCATION: the location that BigQuery metastore writes data to. For example, if you set this value to US, all BigQuery resources created by the tool are stored in this location.
  • NETWORK: optional: the network to run the Dataflow job. Only required for Dataproc Metastore services that use a Thrift endpoint.
  • SUBNETWORK: optional: the subnetwork to run the Dataflow job. Only required for Dataproc Metastore services that use a Thrift endpoint.

What's next