Importing metadata into a service

An import migrates metadata stored in an external Hive metastore, backed by a relational database, to Dataproc Metastore. Supported import sources are Avro files and MySQL dump files. Avro-based imports are supported for Hive versions 2.3.6 and 3.1.2.

This page explains how to import metadata into an existing Dataproc Metastore service. Note that you must first place the data in a Cloud Storage bucket, as described in the following sections.

Before you begin

  • Most gcloud metastore commands require a location. You can specify the location by using the --location flag or by setting the default location.

    For best results, use Cloud Storage buckets that are located in the same region as your Dataproc Metastore service. Although Dataproc Metastore doesn't enforce region restrictions, co-located resources and global resources perform better. For example, a global bucket is fine for any service region, but an EU multi-region bucket doesn't work well with a us-central1 service. Cross-region access results in higher latency, lack of regional failure isolation, and charges for cross-region network bandwidth.
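
For example, you might set a default Dataproc Metastore location once and create an import bucket in the same region as your service. The following is a minimal sketch; the region and bucket name are placeholders:

    # Set a default location so you can omit --location from later commands.
    gcloud config set metastore/location us-central1

    # Create a Cloud Storage bucket in the same region as the service.
    gsutil mb -l us-central1 gs://example-metastore-import-bucket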

Access control

  • To import metadata, you must be granted an IAM role containing the metastore.imports.create IAM permission. The Dataproc Metastore specific roles roles/metastore.admin and roles/metastore.editor include import permission.

  • You can give import permission to users or groups by using the roles/owner and roles/editor legacy roles.

  • The Dataproc Metastore service agent (service-CUSTOMER_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com) and the user importing the metadata must have the following read permission to the Cloud Storage bucket used for the import (see the example after this list):

    • For MySQL, they must have storage.objects.get permission on the Cloud Storage object (SQL dump file) used for the import.

    • For Avro, they must have storage.objects.get permission on the Cloud Storage bucket used for the import.

  • If you're using VPC Service Controls, then you can only import data from a Cloud Storage bucket that resides in the same service perimeter as the Dataproc Metastore service.
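
For example, one way to grant the read permission is to give both the service agent and the importing user the Storage Object Viewer role, which includes storage.objects.get, on the import bucket. The following is a minimal sketch; the project number, user, and bucket name are placeholders:

    # Grant read access to the Dataproc Metastore service agent.
    gsutil iam ch \
        serviceAccount:service-123456789@gcp-sa-metastore.iam.gserviceaccount.com:roles/storage.objectViewer \
        gs://example-metastore-import-bucket

    # Grant read access to the user who performs the import.
    gsutil iam ch \
        user:importer@example.com:roles/storage.objectViewer \
        gs://example-metastore-import-bucket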

For more information, see Dataproc Metastore IAM and access control.

Importing metadata into a service

To import metadata into a service, enter import parameters on the Service detail page in the Cloud Console, use the gcloud tool, or call the Dataproc Metastore API services.metadataImports.create method.

While an import is in progress, you can't update the service. However, you can still use the service during the import.

Preparing the import

Before you can import metadata stored in an external Hive metastore database into Dataproc Metastore, you must first prepare the import:

  1. Create a database dump of your external database, as either Avro files or a MySQL dump file.

  2. Move the files into Cloud Storage. Make note of the Cloud Storage path. You'll need it to perform the import.

    • For Avro, move the files into a Cloud Storage folder. There must be a file for each Hive table, even if the table is empty. The file names must follow the format <table-name>.avro, where <table-name> is in all caps, for example, AUX_TABLE.avro. You can find the table names in the SQL dump files.

    • For MySQL, move the file into a Cloud Storage bucket.

Dataproc Metastore performs a Hive metadata schema validation when you start the import. For MySQL, it verifies the tables in the SQL dump file; for Avro, it verifies the file names. If any table is missing, the import fails with an error message describing the missing table. You can use the Dataproc Metastore Toolkit as a reference for the correct Hive metadata schemas.
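
For example, if your external Hive metastore is backed by a MySQL database that you can reach directly, the preparation might look like the following. This is a minimal sketch, assuming the resulting dump matches the Hive schema that Dataproc Metastore expects; the host, database, and bucket names are placeholders:

    # Dump the external Hive metastore database to a local SQL file.
    mysqldump --host=metastore-db.example.com --user=hive --password \
        hive_metastore > hive_metastore_dump.sql

    # Copy the dump file into Cloud Storage and note the path for the import.
    gsutil cp hive_metastore_dump.sql \
        gs://example-metastore-import-bucket/hive_metastore_dump.sql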

Now that you've prepared the file, you can import it into a Dataproc Metastore service:

Console

  1. In the Cloud Console, open the Dataproc Metastore page:

    Open Dataproc Metastore in the Cloud Console

  2. On the Dataproc Metastore page, click the name of the service you'd like to import metadata into. The Service detail page opens.

  3. At the top of the page, click the Import button. The Import dialog opens.

  4. Enter the Import name.

  5. Select the Source.

  6. Browse to and select the Cloud Storage URI. This is the Cloud Storage path to the import data: the SQL dump file for MySQL or the folder that contains the Avro files.

  7. Optional: Enter a Description of the import. You can edit this later on the Service detail page.

  8. Click the Import button to update the service.

  9. Verify that you have returned to the Service detail page, and that your import appears under the Import history on the Import/Export tab.

gcloud

  1. Run the following gcloud metastore services import gcs command to update a service:

    gcloud metastore services import gcs SERVICE \
        --import-id=IMPORT_ID \
        --location=LOCATION \
        --description=DESCRIPTION \
        --dump-type=DUMP_TYPE \
        --database-dump=DATABASE_DUMP
    

    Replace the following:

    • SERVICE: The name of the service.
    • IMPORT_ID: The ID of this metadata import.
    • LOCATION: The Google Cloud region in which the service is located.
    • DESCRIPTION: Optional: The import description. You can edit this later by using gcloud metastore services imports update IMPORT.
    • DUMP_TYPE: The type of database dump: mysql or avro. Defaults to mysql.
    • DATABASE_DUMP: The Cloud Storage path to the database dump (the SQL dump file for MySQL or the folder containing the Avro files). The path must begin with gs://.
  2. Verify that the import was successful.
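
For example, a filled-in command and a follow-up check might look like the following. This is a minimal sketch; the service, import, region, and bucket names are placeholders, and it assumes the gcloud metastore services imports list subcommand is available in your gcloud version:

    # Start a MySQL dump import into the service.
    gcloud metastore services import gcs example-service \
        --import-id=import-20240101 \
        --location=us-central1 \
        --description="Initial Hive metastore import" \
        --dump-type=mysql \
        --database-dump=gs://example-metastore-import-bucket/hive_metastore_dump.sql

    # List the service's imports to check that the import succeeded.
    gcloud metastore services imports list \
        --service=example-service \
        --location=us-central1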

REST

Follow the API instructions to import metadata into a service by using the APIs Explorer.
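
For reference, a create request roughly takes the following shape. This is a hedged sketch of the services.metadataImports.create call; the request body fields are assumptions based on the v1 MetadataImport resource, and the project, region, service, import, and bucket names are placeholders:

    # Create a metadata import by calling the REST API directly.
    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -d '{"databaseDump": {"gcsUri": "gs://example-metastore-import-bucket/hive_metastore_dump.sql", "type": "MYSQL"}}' \
        "https://metastore.googleapis.com/v1/projects/example-project/locations/us-central1/services/example-service/metadataImports?metadataImportId=import-20240101"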

When the import succeeds, the service automatically returns to the active state. If the import fails, the service rolls back to its previous healthy state.

To view a service's import history, refer to the Import/Export tab on the Service detail page in the Cloud Console.

Import caveats

  • Importing updates the entire service's metadata. Any existing metadata is overwritten.

  • Your database files must be either Avro files or a MySQL dump file. If your metastore uses a different database, such as PostgreSQL, you must convert the dump to either Avro files or a MySQL dump file. Import supports MySQL dumps obtained from a Dataproc cluster by using native SQL.

  • Avro based imports are supported for Hive versions 2.3.6 and 3.1.2.

  • Dataproc Metastore doesn't transform the database content during import and subsequently doesn't handle file migration. Therefore, if you move your data to a different location, then you must manually update your table data locations and schema in Dataproc Metastore.

  • Using the API, you can create, list, describe, and update imports, but you can't delete them. However, deleting the service itself deletes all nested imports under that service.
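
For example, you can inspect and update a service's existing imports with the gcloud tool. The following is a minimal sketch, assuming the imports describe and imports update subcommands are available in your gcloud version and using placeholder names:

    # Show the details and state of an existing import.
    gcloud metastore services imports describe import-20240101 \
        --service=example-service \
        --location=us-central1

    # Update the description of an existing import.
    gcloud metastore services imports update import-20240101 \
        --service=example-service \
        --location=us-central1 \
        --description="Initial Hive metastore import (verified)"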

Common failures

  • The Hive metastore and Dataproc Metastore versions aren't compatible. The import may fail if the two versions don't match. For more information, see the version policy.

  • The MySQL dump is missing Hive tables or has an incorrect schema. The import fails if any tables are missing or if the schema is incorrect.

  • The Avro import is missing Hive table files. There must be a file for each Hive table, even if the table is empty; otherwise, the import fails.

  • The Dataproc Metastore service agent (service-CUSTOMER_PROJECT_NUMBER@gcp-sa-metastore.iam.gserviceaccount.com) or the user importing the metadata doesn't have the following read permission:

    • For MySQL, they don't have storage.objects.get permission on the Cloud Storage object (SQL dump file) used for the import.

    • For Avro, they don't have storage.objects.get permission on the Cloud Storage bucket used for the import.

  • Your database file is too large, and the import takes more than one hour to complete.

What's next