Dataplex metadata

This guide describes Dataplex metadata and how you can use Dataplex APIs to manage it.

Overview

Dataplex scans structured and semi-structured data assets within data lakes to discover table metadata, and scans unstructured data, such as images and text, to discover fileset metadata. Discovered table and fileset metadata is stored in table and fileset entities. You use the Dataplex Metadata API to view, edit, and delete table and fileset entity metadata, or to create your own table or fileset entity metadata. Dataplex metadata is available in Data Catalog for searching and tagging. Table metadata is available in Dataproc Metastore and BigQuery for querying and analytics processing.

Dataplex APIs

This section summarizes the Dataplex APIs and the key resources associated with them.

Control plane API

The Dataplex control plane API allows for the creation and management of the lake, zone, and asset resources.

  • Lake: A Dataplex service instance that allows managing storage resources across projects within an organization.

  • Zone: A logical grouping of assets within a lake. Use multiple zones within a lake to organize data based on readiness, workload, or organization structure.

  • Assets: Storage resources, with data stored in Cloud Storage buckets or BigQuery datasets, that are attached to a zone within a lake.
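These resources are addressed by hierarchical resource names. The following sketch shows how the names nest; the project, location, and resource IDs are hypothetical placeholders:

```python
# Sketch: building Dataplex resource names for the lake/zone/asset hierarchy.
# All IDs below are hypothetical placeholders.

def lake_name(project: str, location: str, lake: str) -> str:
    return f"projects/{project}/locations/{location}/lakes/{lake}"

def zone_name(project: str, location: str, lake: str, zone: str) -> str:
    return f"{lake_name(project, location, lake)}/zones/{zone}"

def asset_name(project: str, location: str, lake: str, zone: str, asset: str) -> str:
    return f"{zone_name(project, location, lake, zone)}/assets/{asset}"

print(asset_name("my-project", "us-central1", "my-lake", "raw-zone", "sales-bucket"))
# projects/my-project/locations/us-central1/lakes/my-lake/zones/raw-zone/assets/sales-bucket
```

The entity and partition resources described in the next section extend this same hierarchy.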

Metadata API

Use the Dataplex Metadata API to create and manage metadata within table and fileset entities and partitions. Entities and partitions are created by Dataplex, which scans data assets within a lake, or by the user. Entities and partitions maintain references to associated assets and physical storage locations.

Key concepts

  1. Table entity: Metadata for structured data with well-defined schemas. Table entities are uniquely identified by entity ID and data location. Table entity metadata is queryable in BigQuery and Dataproc Metastore:

    • Cloud Storage objects: Metadata for Cloud Storage objects, which are accessed via the Cloud Storage APIs.
    • BigQuery tables: Metadata for BigQuery tables, which are accessed via BigQuery APIs.
  2. Fileset entity: Metadata about unstructured, typically schema-less, data. Filesets are uniquely identified by entity ID and data location. Each fileset has a data format.

  3. Partitions: Metadata for a subset of data within a table or fileset entity, identified by a set of key/value pairs and a data location.

Try the API

Use the Dataplex lakes.zones.entities and lakes.zones.partitions API reference documentation pages to view the parameters and fields associated with each API. Use the Try this API panel that accompanies the reference documentation for each API method to make API requests using different parameters and fields. You can construct, view, and submit your requests without the need to generate credentials, and then view responses returned by the service.

The following sections provide information to help you understand and use the Dataplex Metadata APIs.

Entities

List entities

Add filter query parameters to the list entities request URL to limit the list of entities returned by the service.
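For example, a list-entities request URL with a filter can be composed as follows. This is a sketch: the project, lake, and zone IDs and the filter expression are hypothetical, and the supported filter syntax is documented in the lakes.zones.entities.list reference.

```python
from urllib.parse import urlencode

# Sketch: composing a list-entities request URL with query parameters.
# The resource IDs and the filter expression are hypothetical placeholders.
base = ("https://dataplex.googleapis.com/v1/projects/my-project/"
        "locations/us-central1/lakes/my-lake/zones/raw-zone/entities")
params = urlencode({"view": "TABLES", "filter": "id=starts_with(sales)"})
print(f"{base}?{params}")
```

Note that urlencode percent-encodes the reserved characters inside the filter value, so the expression can be passed as-is.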

Get entity

By default, the Get Entity response contains basic entity metadata. To retrieve additional schema metadata, add the view query parameter to the request URL.

Compatibility details: While Dataplex metadata is centrally registered in the metadata API, only entity table metadata that is compatible with BigQuery and Apache Hive Metastore is published to BigQuery and Dataproc Metastore. The Get Entity API returns a CompatibilityStatus message, which indicates if table metadata is compatible with BigQuery and Hive Metastore, and if not, the reason for the incompatibility.

Update entity

Use this API to edit entity metadata, including whether you or Dataplex will manage entity metadata.

  • This API performs a full replacement of all mutable Entity fields. Immutable Entity fields are ignored if specified in an update request.
  • Specify a value for all mutable Entity fields, including all schema fields, even if the values are not being changed.
  • You must supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the entity's etag in the response.
  • Updating schema fields: You can update the table schema discovered by Dataplex to improve its accuracy:
    • If the entity is a fileset, leave all schema fields empty.
    • To define a repeated field, set the mode to REPEATED. To define a struct field, set the type to RECORD.
    • You can set the schema's userManaged field to specify whether you or Dataplex manages table metadata. The default setting is Dataplex managed. If userManaged is set to true, this setting is included in the information returned from an entities.get request if EntityView is set to SCHEMA or FULL.
  • Updating partition fields:
    • For non-Hive style partitioned data, Dataplex discovery auto-generates partition keys. For example, for the data path `gs://root/2020/12/31`, partition keys `p0`, `p1`, and `p2` are generated. To make querying more intuitive, you can update `p0`, `p1`, and `p2` to `year`, `month`, and `day`.
    • If you update the partition style to HIVE style, the partition field is immutable.
  • Updating other metadata fields: You can update the auto-generated mimeType, CompressionFormat, CsvOptions, and JsonOptions fields to aid Dataplex discovery. Dataplex discovery uses the new values on its next run.
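The rules above can be sketched as a read-modify-write update. This assumes a `current` dict shaped like an entities.get response with view=FULL; the etag value and the subset of fields shown are illustrative, not the complete Entity resource:

```python
import copy

# Sketch: read-modify-write update of an entity. `current` stands in for
# the response of entities.get with view=FULL; the etag is hypothetical.
current = {
    "etag": "abc123",  # must be carried over into the update request
    "schema": {
        "userManaged": False,
        "partitionFields": [
            {"name": "p0", "type": "STRING"},
            {"name": "p1", "type": "STRING"},
            {"name": "p2", "type": "STRING"},
        ],
    },
}

update = copy.deepcopy(current)          # start from ALL current mutable fields
update["schema"]["userManaged"] = True   # take over schema management
# Rename the auto-generated partition keys to intuitive names.
for field, name in zip(update["schema"]["partitionFields"],
                       ["year", "month", "day"]):
    field["name"] = name

print([f["name"] for f in update["schema"]["partitionFields"]])
# ['year', 'month', 'day']
```

Starting from a deep copy of the current entity ensures that unchanged mutable fields are still present in the request body, which the full-replacement semantics require.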

Create entity

Use the entities.create API to create table or fileset metadata entities. Populate the required and relevant optional fields, or let the Dataplex discovery service fill in optional fields.
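As a sketch, a minimal table-entity request body might be assembled like this. All IDs, paths, and values are hypothetical placeholders, and the authoritative list of required fields is in the entities.create reference:

```python
import json

# Sketch of a table-entity create request body. All IDs, paths, and values
# are hypothetical; see the entities.create reference for required fields.
entity = {
    "id": "sales",                               # entity ID (hypothetical)
    "type": "TABLE",                             # TABLE or FILESET
    "asset": "sales-bucket",                     # asset the data belongs to
    "dataPath": "gs://my-bucket/sales",          # physical storage location
    "system": "CLOUD_STORAGE",
    "format": {"mimeType": "application/x-parquet"},
    "schema": {"userManaged": True},
}
print(json.dumps(entity, indent=2))
```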

Delete entity

  • You must supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the entity's etag in the response.

If underlying data for a table or fileset in a raw zone is deleted, the table or fileset metadata is deleted automatically upon the next discovery scan. If underlying data for a table in a curated zone is deleted, the table metadata isn't deleted; instead, a missing-data action is reported. To resolve this, explicitly delete the table metadata entity via the metadata API.

Partitions

List partitions

Add filter query parameters to the list partitions request URL to limit the list of partitions returned by the service.

Examples:

  • ?filter="Country=US AND State=CA AND City=Sunnyvale"
  • ?filter="year < 2000 AND month > 12 AND Date > 10"
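Because the filter value contains spaces and `=` characters, it must be URL-encoded when placed in the query string, for example:

```python
from urllib.parse import urlencode

# Encode the first partition filter from the examples above as a query string.
query = urlencode({"filter": "Country=US AND State=CA AND City=Sunnyvale"})
print(query)
# filter=Country%3DUS+AND+State%3DCA+AND+City%3DSunnyvale
```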

Get partition

To get a partition, you must complete the request URL by appending the partition key values to the end of the URL, formatted to read as "partitions/value1/value2/…./value10".

Example: If a partition has the values {Country=US, State=CA, City=Sunnyvale}, the get request URL should end with "/partitions/US/CA/Sunnyvale".

Important: The appended URL values must be double encoded. For example, url_encode(url_encode(value)) can be used to encode the values "US:CA" and "CA#Sunnyvale" so that the request URL ends with "/partitions/US%253ACA/CA%2523Sunnyvale". The name field in the response retains the encoded format.
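The double encoding can be reproduced with the Python standard library, using the values from the example above:

```python
from urllib.parse import quote

def double_encode(value: str) -> str:
    # Encode twice so that reserved characters inside partition values
    # survive URL parsing on the server side; safe="" also encodes "/".
    return quote(quote(value, safe=""), safe="")

print(double_encode("US:CA"))         # US%253ACA
print(double_encode("CA#Sunnyvale"))  # CA%2523Sunnyvale
```

The first pass turns ":" into "%3A" and "#" into "%23"; the second pass encodes the "%" itself, yielding "%253A" and "%2523".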

Create partition

Use the partitions.create API to create a customized partition for your data source. You must specify the required location field with a Cloud Storage path.

Delete partition

Complete the request URL by appending partition key values to the end of the request URL, formatted to read as "partitions/value1/value2/…./value10".

Example: If a partition has the values {Country=US, State=CA, City=Sunnyvale}, the request URL should end with "/partitions/US/CA/Sunnyvale".

Important: The appended URL values must conform to RFC-1034 or they must be double encoded, for example, "US:/CA#/Sunnyvale" as "US%253A/CA%2523/Sunnyvale".

Accessing metadata in Apache Spark

The example in this section creates a Dataproc cluster running Spark 3.x (Dataproc image version 2.0).

  • Create the cluster only after the Dataproc Metastore service instance is associated with the Dataplex lake, so that the cluster can rely on the Hive Metastore endpoint to access Dataplex metadata.

  • Metadata managed within Dataplex can be accessed via standard interfaces, such as Hive Metastore, to power Spark queries. The queries run on the Dataproc cluster.

  • For Parquet data, set the Spark property spark.sql.hive.convertMetastoreParquet to false to avoid execution errors.

  1. Run the following commands to create a Dataproc cluster, specifying the Dataproc Metastore service associated with the Dataplex lake.

    GRPC_ENDPOINT=$(gcloud metastore services describe SERVICE_ID \
      --location LOCATION \
      --format "value(endpointUri)" | cut -c9-)
    
    WHDIR=$(gcloud metastore services describe SERVICE_ID \
      --location LOCATION \
      --format "value(hiveMetastoreConfig.configOverrides.'hive.metastore.warehouse.dir')")
    
    METASTORE_VERSION=$(gcloud metastore services describe SERVICE_ID \
      --location LOCATION \
      --format "value(hiveMetastoreConfig.version)")
    
    # This command creates a cluster with default settings. You can customize
    # it as needed. The --optional-components, --initialization-actions,
    # --metadata, and --properties flags are used to connect with
    # the associated metastore.
    gcloud dataproc clusters create CLUSTER_ID \
      --project PROJECT \
      --region LOCATION \
      --scopes "https://www.googleapis.com/auth/cloud-platform" \
      --image-version 2.0-debian10 \
      --optional-components=DOCKER \
      --initialization-actions "gs://metastore-init-actions/metastore-grpc-proxy/metastore-grpc-proxy.sh" \
      --metadata "proxy-uri=$GRPC_ENDPOINT,hive-version=$METASTORE_VERSION" \
      --properties "hive:hive.metastore.uris=thrift://localhost:9083,hive:hive.metastore.warehouse.dir=$WHDIR"
    
  2. Run DDL queries to explore the metadata; run Spark queries to query data.

    a. Open an SSH session on the Dataproc cluster's master node.

      VM_ZONE=$(gcloud dataproc clusters describe CLUSTER_ID \
        --project PROJECT \
        --region LOCATION \
        --format "value(config.gceClusterConfig.zoneUri)")
      gcloud compute ssh CLUSTER_ID-m --project PROJECT --zone $VM_ZONE
    

    b. At the master node command prompt, open a new Python REPL.

    python3
    

    c. List databases. Each Dataplex zone within the lake maps to a metastore database.

    import pyspark.sql as sql
    
    session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
    
    df = session.sql("SHOW DATABASES")
    df.show()
    

    d. List tables in one of the zones.

    import pyspark.sql as sql
    
    session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
    
    df = session.sql("SHOW TABLES IN zone_id")
    df.show()
    

    e. Query the data in one of the tables.

    import pyspark.sql as sql
    
    session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
    
    # Modify the SQL statement to retrieve or filter on table columns.
    df = session.sql("SELECT columns FROM zone_id.table_id WHERE query LIMIT 10")
    df.show()