Dataplex metadata

This guide describes Dataplex metadata and how you can use Dataplex APIs to manage it.

Dataplex scans the following:

  • Structured and semi-structured data assets within data lakes, to extract table metadata into table entities
  • Unstructured data, such as images and texts, to extract fileset metadata into fileset entities

You can use the Dataplex Metadata API to do the following:

  • View, edit, and delete table and fileset entity metadata
  • Create your own table or fileset entity metadata

You can analyze Dataplex metadata using the following:

  • Data Catalog for searching and tagging
  • Dataproc Metastore and BigQuery for table metadata querying and analytics processing

Dataplex APIs

This section summarizes the Dataplex APIs and the key resources associated with them.

Control plane API

The Dataplex control plane API allows for the creation and management of the lake, zone, and asset resources.

  • Lake: A Dataplex service instance that allows managing storage resources across projects within an organization.

  • Zone: A logical grouping of assets within a lake. Use multiple zones within a lake to organize data based on readiness, workload, or organization structure.

  • Assets: Storage resources, with data stored in Cloud Storage buckets or BigQuery datasets, that are attached to a zone within a lake.

Metadata API

Use the Dataplex Metadata API to create and manage metadata within table and fileset entities and partitions. Dataplex scans data assets, either in a lake or provided by you, to create entities and partitions. Entities and partitions maintain references to associated assets and physical storage locations.

Key concepts

Table entity:

Metadata for structured data with well-defined schemas. Table entities are uniquely identified by entity ID and data location. Table entity metadata is queryable in BigQuery and Dataproc Metastore:

  • Cloud Storage objects: Metadata for Cloud Storage objects, which are accessed through the Cloud Storage APIs.
  • BigQuery tables: Metadata for BigQuery tables, which are accessed through the BigQuery APIs.

Fileset entity:

Metadata about unstructured, typically schema-less, data. Filesets are uniquely identified by entity ID and data location. Each fileset has a data format.

Partitions:

Metadata for a subset of data within a table or fileset entity, identified by a set of key-value pairs and a data location.

Try the API

Use the Dataplex lakes.zones.entities and lakes.zones.partitions API reference documentation pages to view the parameters and fields associated with each API. Use the Try this API panel that accompanies the reference documentation for each API method to make API requests using different parameters and fields. You can construct, view, and submit your requests without the need to generate credentials, and then view responses returned by the service.

The following sections provide information to help you understand and use the Dataplex Metadata APIs.

Entities

To limit the list of entities returned by the service, add filter query parameters to the list entities request URL.

By default, the Get Entity response contains basic entity metadata. To retrieve additional schema metadata, add the view query parameter to the request URL.
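As a minimal sketch of how these query parameters attach to a request URL, the following helper builds list and get URLs. The base URL, the resource path layout, and the helper name entities_url are assumptions made for illustration, and the filter expression is a placeholder; check the API reference for the exact parameters that each method accepts.

```python
from urllib.parse import urlencode

# Assumed REST base URL and resource path layout; verify against the API reference.
BASE = "https://dataplex.googleapis.com/v1"


def entities_url(project, location, lake, zone, entity_id=None, **params):
    """Build a list-entities URL, or a get-entity URL when entity_id is given.

    Keyword arguments become URL-encoded query parameters (for example,
    view or filter).
    """
    url = (
        f"{BASE}/projects/{project}/locations/{location}"
        f"/lakes/{lake}/zones/{zone}/entities"
    )
    if entity_id is not None:
        url = f"{url}/{entity_id}"
    return f"{url}?{urlencode(params)}" if params else url


# Get one entity together with its schema metadata.
get_url = entities_url(
    "my-project", "us-central1", "my-lake", "my-zone", "orders", view="SCHEMA"
)

# List entities with a placeholder filter expression (hypothetical syntax).
list_url = entities_url(
    "my-project", "us-central1", "my-lake", "my-zone", filter="id=orders"
)
```

The helper only constructs URLs. Sending the request still requires an authorized HTTP client.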

Compatibility details: While Dataplex metadata is centrally registered in the Metadata API, only table entity metadata that is compatible with BigQuery and Apache Hive Metastore is published to BigQuery and Dataproc Metastore. The Get Entity API returns a CompatibilityStatus message, which indicates whether table metadata is compatible with BigQuery and Hive Metastore, and if not, the reason for the incompatibility.
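The following sketch shows one way to read the compatibility information from a Get Entity response that has already been parsed into a dictionary. The JSON field names used here (compatibility, hiveMetastore, bigquery, compatible, reason) are assumptions about the response shape, and the sample response is invented for illustration.

```python
def incompatibility_reasons(entity: dict) -> dict:
    """Map each system the table is NOT compatible with to the reported reason.

    Field names are assumed, not confirmed by this guide. A missing
    "compatible" flag is treated as False.
    """
    status = entity.get("compatibility", {})
    reasons = {}
    for system in ("hiveMetastore", "bigquery"):
        info = status.get(system, {})
        if not info.get("compatible", False):
            reasons[system] = info.get("reason", "no reason given")
    return reasons


# Invented sample response fragment.
sample = {
    "compatibility": {
        "hiveMetastore": {"compatible": True},
        "bigquery": {"compatible": False, "reason": "unsupported column name"},
    }
}
problems = incompatibility_reasons(sample)
```

An empty result means that the table metadata is reported as compatible with both systems.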

Use the entities.update API to edit entity metadata, including whether you or Dataplex manages the entity metadata.

  • This API performs a full replacement of all mutable Entity fields. Immutable Entity fields are ignored if you specify them in an update request.
  • Specify a value for all mutable Entity fields, including all schema fields, even if the values are not being changed.
  • Supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the etag of the entity in the response.
  • Updating schema fields: You can update the table schema discovered by Dataplex to improve its accuracy:
    • If the entity is a fileset, leave all schema fields empty.
    • To define a repeated field, set the mode to REPEATED. To define a struct field, set the type to RECORD.
    • You can set the userManaged field of the schema to specify whether you or Dataplex manages table metadata. The default setting is Dataplex managed. If userManaged is set to true, this setting is included in the information returned from an entities.get request if EntityView is set to SCHEMA or FULL.
  • Updating partition fields:
    • For non-Hive style partitioned data, Dataplex discovery auto-generates partition keys. For example, for the data path gs://root/2020/12/31, partition keys p0, p1, and p2 are generated. To make querying more intuitive, you can update p0, p1, and p2 to year, month, and day respectively.
    • If you update the partition style to HIVE style, the partition field is immutable.
  • Updating other metadata fields: You can update auto-generated mimeType, CompressionFormat, CsvOptions, and JsonOptions fields to aid Dataplex discovery. Dataplex discovery will use new values on its next run.
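The update rules above can be sketched as a function that takes the entity returned by entities.get and produces a full replacement body with renamed partition keys. The schema.partitionFields layout and the sample entity are assumptions made for illustration. The function copies every field, including the etag, because the update replaces all mutable fields.

```python
import copy


def rename_partition_keys(entity: dict, new_names: dict) -> dict:
    """Return a full replacement body with auto-generated partition keys renamed.

    entity is the parsed entities.get response. Every other field is copied
    unchanged, and the input dictionary is not modified.
    """
    if "etag" not in entity:
        raise ValueError("fetch the entity first; the update request needs its etag")
    body = copy.deepcopy(entity)
    # Assumed layout: schema.partitionFields is a list of {"name", "type"} items.
    for field in body.get("schema", {}).get("partitionFields", []):
        field["name"] = new_names.get(field["name"], field["name"])
    return body


# Invented entity for data laid out as gs://root/2020/12/31.
fetched = {
    "etag": "abc123",
    "schema": {
        "fields": [{"name": "amount", "type": "DOUBLE", "mode": "NULLABLE"}],
        "partitionFields": [
            {"name": "p0", "type": "STRING"},
            {"name": "p1", "type": "STRING"},
            {"name": "p2", "type": "STRING"},
        ],
    },
}
update_body = rename_partition_keys(
    fetched, {"p0": "year", "p1": "month", "p2": "day"}
)
```

Starting from the fetched entity rather than an empty body avoids dropping values that are not being changed.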

Use the entities.create API to create table or fileset metadata entities. Populate the required and relevant optional fields, or let the Dataplex discovery service fill in optional fields.

  • To delete an entity, supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the etag of the entity in the response.

If the underlying data for a table or fileset in a raw zone is deleted, the table or fileset metadata is deleted automatically on the next Discovery scan. If the underlying data for a table in a curated zone is deleted, the table metadata isn't deleted. Instead, a missing data action is reported. To resolve this issue, explicitly delete the table metadata entity through the metadata API.

Partitions

To limit the list of partitions returned by the service, add filter query parameters to the list partitions request URL.

Examples:

  • ?filter="Country=US AND State=CA AND City=Sunnyvale"
  • ?filter="year < 2000 AND month > 12 AND Date > 10"
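A filter like the ones above can be assembled from individual conditions. The following is a minimal sketch: the helper name partition_filter_param is hypothetical, and it assumes that the surrounding double quotes shown in the examples are part of the parameter value and must be URL-encoded with the rest of the expression.

```python
from urllib.parse import quote


def partition_filter_param(conditions):
    """Build a ?filter=... query string from (key, operator, value) triples.

    Conditions are joined with AND, wrapped in double quotes as in the
    examples, and then URL-encoded once.
    """
    expression = " AND ".join(f"{key}{op}{value}" for key, op, value in conditions)
    return "?filter=" + quote(f'"{expression}"', safe="")


query = partition_filter_param(
    [("Country", "=", "US"), ("State", "=", "CA"), ("City", "=", "Sunnyvale")]
)
```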

To get a partition, you must complete the request URL by appending the partition key values to the end of the URL, formatted to read as partitions/value1/value2/…./value10.

Example: If a partition has the values {Country=US, State=CA, City=Sunnyvale}, the get request URL should end with /partitions/US/CA/Sunnyvale.

Important: The appended URL values must be double encoded. For example, url_encode(url_encode(value)) can be used to encode "US:CA/CA#Sunnyvale" so that the request URL ends with /partitions/US%253ACA/CA%2523Sunnyvale. The name field in the response retains the encoded format.
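The double encoding can be sketched with the Python standard library. The helper name partition_path is hypothetical. Each value is percent-encoded twice with no characters treated as safe, so that characters such as : and # inside a value cannot be mistaken for URL syntax.

```python
from urllib.parse import quote


def partition_path(values):
    """Return the partitions/... URL suffix with each value double encoded."""
    encoded = (quote(quote(value, safe=""), safe="") for value in values)
    return "partitions/" + "/".join(encoded)


plain = partition_path(["US", "CA", "Sunnyvale"])
# → "partitions/US/CA/Sunnyvale"

special = partition_path(["US:CA", "CA#Sunnyvale"])
# → "partitions/US%253ACA/CA%2523Sunnyvale"
```

Values that contain only letters and digits pass through unchanged, which is why the simpler example needs no visible encoding.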

To create a customized partition for your data source, use the partitions.create API. Specify the required location field with a Cloud Storage path.

Complete the request URL by appending partition key values to the end of the request URL, formatted to read as partitions/value1/value2/…./value10.

Example: If a partition has the values {Country=US, State=CA, City=Sunnyvale}, the request URL should end with /partitions/US/CA/Sunnyvale.

Important: The appended URL values must conform to RFC 1034 or they must be double encoded. For example, encode US:/CA#/Sunnyvale as US%253A/CA%2523/Sunnyvale.
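As a rough sketch of the request body for partitions.create, the following helper checks that the location is a Cloud Storage path before building the body. The values field name, the helper name partition_body, and the example bucket path are assumptions made for illustration. Only the required location field comes from this guide.

```python
def partition_body(values, location):
    """Build a partitions.create request body as a plain dictionary.

    Raises ValueError if location is not a Cloud Storage (gs://) path.
    """
    if not location.startswith("gs://"):
        raise ValueError("location must be a Cloud Storage path, for example gs://bucket/prefix")
    # "values" is an assumed field name holding the partition key values in order.
    return {"values": list(values), "location": location}


body = partition_body(
    ["US", "CA", "Sunnyvale"],
    "gs://my-bucket/Country=US/State=CA/City=Sunnyvale",  # hypothetical path
)
```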

What's next