Dataplex metadata


This guide describes Dataplex metadata and how you can use Dataplex APIs to manage it.

Overview

Dataplex scans the following:

  • structured and semi-structured data assets within data lakes, to extract table metadata into table entities
  • unstructured data, such as images and texts, to extract fileset metadata into fileset entities

You can use the Dataplex Metadata API to do either of the following:

  • view, edit, and delete table and fileset entity metadata
  • create your own table or fileset entity metadata

You can also analyze Dataplex metadata through either of the following:

  • Data Catalog, for searching and tagging
  • Dataproc Metastore and BigQuery, for table metadata querying and analytics processing.

Dataplex APIs

This section summarizes the Dataplex APIs and their key resources.

Control plane API

The Dataplex control plane API allows for the creation and management of the lake, zone, and asset resources.

  • Lake: A Dataplex service instance that allows managing storage resources across projects within an organization.

  • Zone: A logical grouping of assets within a lake. Use multiple zones within a lake to organize data based on readiness, workload, or organization structure.

  • Assets: Storage resources, with data stored in Cloud Storage buckets or BigQuery datasets, that are attached to a zone within a lake.
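
For reference, these resources nest as lake, then zone, then asset in their resource names. The following minimal sketch (with hypothetical project, location, and resource IDs) shows the naming pattern used throughout the Dataplex APIs.

    # Hypothetical IDs; only the naming pattern matters here.
    project = "my-project"
    location = "us-central1"

    lake = f"projects/{project}/locations/{location}/lakes/my-lake"
    zone = f"{lake}/zones/my-raw-zone"
    asset = f"{zone}/assets/my-bucket-asset"

    print(lake)   # projects/my-project/locations/us-central1/lakes/my-lake
    print(zone)   # .../zones/my-raw-zone
    print(asset)  # .../assets/my-bucket-asset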

Metadata API

Use the Dataplex Metadata API to create and manage metadata within table and fileset entities and partitions. Dataplex scans data assets, either in a lake or provided by you, to create entities and partitions. Entities and partitions maintain references to associated assets and physical storage locations.

Key concepts

  1. Table entity: Metadata for structured data with well-defined schemas. Table entities are uniquely identified by entity ID and data location. Table entity metadata is queryable in BigQuery and Dataproc Metastore. Table entities describe either of the following:

    • Cloud Storage objects: Metadata for Cloud Storage objects, which are accessed via the Cloud Storage APIs.
    • BigQuery tables: Metadata for tables in BigQuery datasets, which are accessed via the BigQuery APIs.
  2. Fileset entity: Metadata about unstructured, typically schema-less, data. Filesets are uniquely identified by entity ID and data location. Each fileset has a data format.

  3. Partitions: Metadata for a subset of data within a table or fileset entity, identified by a set of key/value pairs and a data location.

Try the API

Use the Dataplex lakes.zones.entities and lakes.zones.partitions API reference documentation pages to view the parameters and fields associated with each API. Use the Try this API panel that accompanies the reference documentation for each API method to make API requests using different parameters and fields. You can construct, view, and submit your requests without the need to generate credentials, and then view responses returned by the service.

The following sections provide information to help you understand and use the Dataplex Metadata APIs.

Entities

List entities

Add filter query parameters to the list entities request URL to limit the list of entities returned by the service.
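
For example, the following minimal sketch lists the table entities in a zone over the v1 REST endpoint, passing view and filter query parameters. The project, location, lake, and zone IDs are placeholders, and the filter value is only an assumed illustration; check the lakes.zones.entities reference for the exact filter syntax.

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder parent zone -- replace with your own IDs.
    PARENT = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone")

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # List table entities; the filter value shown is illustrative only.
    response = session.get(
        f"https://dataplex.googleapis.com/v1/{PARENT}/entities",
        params={"view": "TABLES", "filter": "id=starts_with('orders')"},
    )
    response.raise_for_status()
    for entity in response.json().get("entities", []):
        print(entity.get("id"), entity.get("type"))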

Get entity

By default, the Get Entity response contains basic entity metadata. To retrieve additional schema metadata, add the view query parameter to the request URL.
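
As a minimal sketch, assuming the v1 REST endpoint and placeholder IDs, a get request for the schema view might look like the following; SCHEMA and FULL are the view values referenced later in this guide, and the compatibility field name is an assumption based on the CompatibilityStatus message described below.

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder entity name -- replace with your own IDs.
    ENTITY = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone/entities/my-table")

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # Request schema metadata in addition to the basic entity metadata.
    response = session.get(
        f"https://dataplex.googleapis.com/v1/{ENTITY}",
        params={"view": "SCHEMA"},
    )
    response.raise_for_status()
    entity = response.json()
    print(entity.get("schema"))
    print(entity.get("compatibility"))  # assumed field for CompatibilityStatus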

Compatibility details: While Dataplex metadata is centrally registered in the metadata API, only entity table metadata that is compatible with BigQuery and Apache Hive Metastore is published to BigQuery and Dataproc Metastore. The Get Entity API returns a CompatibilityStatus message, which indicates if table metadata is compatible with BigQuery and Hive Metastore, and if not, the reason for the incompatibility.

Update entity

Use this API to edit entity metadata, including whether you or Dataplex will manage entity metadata.

  • This API performs a full replacement of all mutable Entity fields. Immutable Entity fields are ignored if they are specified in an update request.
  • Specify a value for all mutable Entity fields, including all schema fields, even if the values are not being changed.
  • You must supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the entity's etag in the response. A sketch of a full update request follows this list.
  • Updating schema fields: You can update the table schema discovered by Dataplex to improve its accuracy:
    • If the entity is a fileset, leave all schema fields empty.
    • To define a repeated field, set the mode to REPEATED. To define a struct field, set the type to RECORD.
    • You can set the schema's userManaged field to specify whether you or Dataplex manages table metadata. The default setting is Dataplex managed. If userManaged is set to true, this setting is included in the information returned from an entities.get request if EntityView is set to SCHEMA or FULL.
  • Updating partition fields:
    • For non-Hive style partitioned data, Dataplex discovery auto-generates partition keys. For example, for the data path `gs://root/2020/12/31`, partition keys `p0`, `p1`, and `p2` are generated. To make querying more intuitive, you can update `p0`, `p1`, and `p2` to `year`, `month`, and `day`.
    • If you update the partition style to HIVE style, the partition field becomes immutable.
  • Updating other metadata fields: You can update the auto-generated mimeType, CompressionFormat, CsvOptions, and JsonOptions fields to aid Dataplex discovery. Dataplex discovery uses the new values on its next run.
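
The following minimal sketch shows a full update over the v1 REST endpoint, assuming the update maps to an HTTP PUT of the complete entity. The IDs are placeholders; the entity body is taken from the get response so that every mutable field and the etag are included, and only the schema's userManaged field is changed.

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder entity name -- replace with your own IDs.
    ENTITY = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone/entities/my-table")
    URL = f"https://dataplex.googleapis.com/v1/{ENTITY}"

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # 1. Fetch the full entity so all mutable fields and the etag are present.
    entity = session.get(URL, params={"view": "FULL"}).json()

    # 2. Change only what you need; here the schema is marked user managed.
    entity["schema"]["userManaged"] = True

    # 3. Send the complete entity back; update is a full replacement.
    response = session.put(URL, json=entity)
    response.raise_for_status()
    print(response.json().get("etag"))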

Create entity

Use the entities.create API to create table or fileset metadata entities. Populate the required and relevant optional fields, or let the Dataplex discovery service fill in optional fields.
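
As a rough sketch, with placeholder IDs and a hypothetical CSV table, a create request might look like the following. The field names are drawn from the lakes.zones.entities reference rather than a verified minimal set, so confirm the required fields for your case before relying on them.

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder parent zone -- replace with your own IDs.
    PARENT = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone")

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # Hypothetical table entity for CSV files under an existing asset.
    entity = {
        "id": "orders",
        "type": "TABLE",
        "asset": "my-bucket-asset",
        "dataPath": "gs://my-bucket/orders",
        "system": "CLOUD_STORAGE",
        "format": {"mimeType": "text/csv"},
        "schema": {
            "userManaged": True,
            "fields": [{"name": "order_id", "type": "STRING",
                        "mode": "REQUIRED"}],
        },
    }

    response = session.post(
        f"https://dataplex.googleapis.com/v1/{PARENT}/entities", json=entity)
    response.raise_for_status()
    print(response.json().get("name"))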

Delete entity

  • You must supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the entity's etag in the response.

If the underlying data for a table or fileset in a raw zone is deleted, the table or fileset metadata is deleted automatically upon the next Discovery scan. If the underlying data for a table in a curated zone is deleted, the table metadata isn't deleted; instead, a missing data action is reported. To resolve this, explicitly delete the table metadata entity through the metadata API.
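
A minimal sketch of the delete call, assuming the v1 REST endpoint, placeholder IDs, and that the etag is passed as a query parameter:

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder entity name -- replace with your own IDs.
    ENTITY = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone/entities/my-table")
    URL = f"https://dataplex.googleapis.com/v1/{ENTITY}"

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # Fetch the current etag, then pass it on the delete request.
    etag = session.get(URL).json()["etag"]
    response = session.delete(URL, params={"etag": etag})
    response.raise_for_status()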

Partitions

List partitions

Add filter query parameters to the list partitions request URL to limit the list of partitions returned by the service.

Examples:

  • ?filter="Country=US AND State=CA AND City=Sunnyvale"
  • ?filter="year < 2000 AND month > 12 AND Date > 10"

Get partition

To get a partition, you must complete the request URL by appending the partition key values to the end of the URL, formatted to read as "partitions/value1/value2/…./value10".

Example: if a partition has the values {Country=US, State=CA, City=Sunnyvale}, the get request URL should end with "/partitions/US/CA/Sunnyvale".

Important: The appended URL values must be double encoded. For example, url_encode(url_encode(value)) can be used to encode "US:CA" and "CA#Sunnyvale" so that the request URL ends with "/partitions/US%253ACA/CA%2523Sunnyvale". The name field in the response retains the encoded format.
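
The following minimal sketch builds the double-encoded request URL in Python for the same example values; the entity name is a placeholder.

    from urllib.parse import quote

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder entity name -- replace with your own IDs.
    ENTITY = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone/entities/my-table")

    # Each partition value is URL-encoded twice before being appended.
    values = ["US:CA", "CA#Sunnyvale"]
    encoded = [quote(quote(v, safe=""), safe="") for v in values]
    url = (f"https://dataplex.googleapis.com/v1/{ENTITY}/partitions/"
           + "/".join(encoded))
    # .../partitions/US%253ACA/CA%2523Sunnyvale

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)
    response = session.get(url)
    response.raise_for_status()
    print(response.json().get("name"))  # retains the encoded format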

Create partition

Use the partitions.create API to create a customized partition for your data source. You must specify the required location field with a Cloud Storage path.
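
As a sketch with placeholder IDs, and assuming the partition body carries the key values plus the required location field:

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder entity name -- replace with your own IDs.
    ENTITY = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone/entities/my-table")

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)

    # One value per partition key, plus the required Cloud Storage location.
    partition = {
        "values": ["US", "CA", "Sunnyvale"],
        "location": "gs://my-bucket/my-table/US/CA/Sunnyvale",
    }

    response = session.post(
        f"https://dataplex.googleapis.com/v1/{ENTITY}/partitions",
        json=partition)
    response.raise_for_status()
    print(response.json().get("name"))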

Delete partition

Complete the request URL by appending partition key values to the end of the request URL, formatted to read as "partitions/value1/value2/…./value10".

Example: if a partition has the values {Country=US, State=CA, City=Sunnyvale}, the request URL should end with "/partitions/US/CA/Sunnyvale".

Important: The appended URL values must conform to RFC-1034 or they must be double encoded, for example, "US:/CA#/Sunnyvale" as "US%253A/CA%2523/Sunnyvale".
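
The delete call mirrors the get call: build the encoded partition path and issue an HTTP DELETE. A minimal sketch with placeholder IDs:

    from urllib.parse import quote

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholder entity name -- replace with your own IDs.
    ENTITY = ("projects/my-project/locations/us-central1/"
              "lakes/my-lake/zones/my-zone/entities/my-table")

    # Double encode each value; plain values such as these are unchanged.
    values = ["US", "CA", "Sunnyvale"]
    encoded = [quote(quote(v, safe=""), safe="") for v in values]
    url = (f"https://dataplex.googleapis.com/v1/{ENTITY}/partitions/"
           + "/".join(encoded))

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    session = AuthorizedSession(credentials)
    response = session.delete(url)
    response.raise_for_status()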

Accessing metadata in Apache Spark

The example in this section creates a Dataproc cluster running Spark 2.x.

  • Create the cluster after the Dataproc Metastore service instance has been associated with the Dataplex lake, so that the cluster can rely on the Hive Metastore endpoint to access Dataplex metadata.

  • Metadata managed within Dataplex can be accessed via standard interfaces, such as Hive Metastore, to power Spark queries. The queries run on the Dataproc cluster.

  • For Parquet data, set the Spark property spark.sql.hive.convertMetastoreParquet to false to avoid execution errors, as shown in the sketch that follows.
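
For example, you can set the property when you build the Spark session (or pass it with --conf on spark-submit); this sketch uses the same PySpark pattern as the steps that follow.

    import pyspark.sql as sql

    # Setting convertMetastoreParquet to false makes Spark use the Hive SerDe
    # for Parquet tables instead of its built-in Parquet reader.
    session = (sql.SparkSession.builder
               .config("spark.sql.hive.convertMetastoreParquet", "false")
               .enableHiveSupport()
               .getOrCreate())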

  1. Run the following commands to create a Dataproc cluster, specifying the Dataproc Metastore service associated with the Dataplex lake.

    GRPC_ENDPOINT=$(gcloud metastore services describe SERVICE_ID \
      --location LOCATION \
      --format "value(endpointUri)" | cut -c9-)
    
    WHDIR=$(gcloud metastore services describe SERVICE_ID \
      --location LOCATION \
      --format "value(hiveMetastoreConfig.configOverrides.'hive.metastore.warehouse.dir')")
    
    METASTORE_VERSION=$(gcloud metastore services describe SERVICE_ID \
      --location LOCATION \
      --format "value(hiveMetastoreConfig.version)")
    
    # This command creates a cluster with default settings. You can customize
    # it as needed. The --optional-components, --initialization-actions,
    # --metadata, and --properties flags are used to connect with
    # the associated metastore.
    gcloud dataproc clusters create CLUSTER_ID \
      --project PROJECT \
      --region LOCATION \
      --scopes "https://www.googleapis.com/auth/cloud-platform" \
      --image-version 2.0-debian10 \
      --optional-components=DOCKER \
      --initialization-actions "gs://metastore-init-actions/metastore-grpc-proxy/metastore-grpc-proxy.sh" \
      --metadata "proxy-uri=$GRPC_ENDPOINT,hive-version=$METASTORE_VERSION" \
      --properties "hive:hive.metastore.uris=thrift://localhost:9083,hive:hive.metastore.warehouse.dir=$WHDIR"
    
  2. Run Spark SQL queries to explore the metadata and to query the data.

    1. Open an SSH session on the Dataproc cluster's master node.

      VM_ZONE=$(gcloud dataproc clusters describe CLUSTER_ID \
        --project PROJECT \
        --region LOCATION \
        --format "value(config.gceClusterConfig.zoneUri)")
      gcloud compute ssh CLUSTER_ID-m --project PROJECT --zone $VM_ZONE
      
    2. At the master node command prompt, open a new Python REPL.

      python3
      
    3. List databases. Each Dataplex zone within the lake maps to a metastore database.

      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("SHOW DATABASES")
      df.show()
      
    4. List tables in one of the zones.

      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("SHOW TABLES IN ZONE_ID")
      df.show()
      
    5. Query the data in one of the tables.

      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      # Modify the SQL statement to retrieve or filter on table columns.
      df = session.sql("SELECT COLUMNS FROM ZONE_ID.TABLE_ID WHERE QUERY LIMIT 10")
      df.show()
      
  3. Run DDL queries to create tables and partitions in Dataplex metadata using Apache Spark.

    For more information about the supported data types, file formats, and row formats, see Supported values.

    1. Before you create a table, create a Dataplex asset that maps to the Cloud Storage bucket containing the underlying data. For more information, see Add a bucket.

    2. Create a table. Parquet, ORC, AVRO, CSV, and JSON tables are supported.

      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("CREATE TABLE ZONE_ID.TABLE_ID (COLUMNS DATA_TYPE) PARTITIONED BY (COLUMN) STORED AS FILE_FORMAT ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'gs://MY_GCP_BUCKET/TABLE_ID' TBLPROPERTIES('dataplex.entity.partition_style' = 'HIVE_COMPATIBLE')")
      df.show()
      
    3. Alter the table.

      Dataplex does not allow you to alter the location of a table or edit the partition columns for a table. Altering a table does not automatically set userManaged to true.

      In Spark SQL, you can rename a table, add columns, and set the file format of a table.

      For example:

      • Rename the table.
      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("ALTER TABLE OLD_TABLE_NAME RENAME TO NEW_TABLE_NAME")
      df.show()
      
      • Add columns.
      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("ALTER TABLE TABLE_NAME ADD COLUMN (COLUMN_NAME DATA_TYPE"))
      df.show()
      
      • Set the file format.
      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("ALTER TABLE TABLE_NAME SET FILEFORMAT FILE_FORMAT")
      df.show()
      
    4. Drop the table.

      Dropping the table from Dataplex's metadata API doesn't delete the underlying data in Cloud Storage.

      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("DROP TABLE ZONE_ID.TABLE_ID")
      df.show()
      
    5. Add a partition.

      Dataplex does not allow altering a partition once created. However, the partition can be dropped.

      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("ALTER TABLE ZONE_ID.TABLE_ID ADD PARTITION (COLUMN1=VALUE1) PARTITION (COLUMN2=VALUE2)")
      df.show()
      

      You can add multiple partitions with the same partition key and different partition values, as shown in the preceding example.

    6. Drop a partition.

      import pyspark.sql as sql
      
      session = sql.SparkSession.builder.enableHiveSupport().getOrCreate()
      
      df = session.sql("ALTER TABLE ZONE_ID.TABLE_ID DROP PARTITION (COLUMN=VALUE)")
      df.show()
      

Supported values

The supported data types are defined as follows:

  • Primitive: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, DOUBLE PRECISION, STRING, BINARY, TIMESTAMP, DECIMAL, DATE
  • Array: ARRAY < DATA_TYPE >
  • Structure: STRUCT < COLUMN : DATA_TYPE >

The supported file formats are defined as follows:

  • TEXTFILE
  • ORC
  • PARQUET
  • AVRO
  • JSONFILE

For more information about the file formats, see Storage Formats.

The supported row formats are defined as follows:

  • DELIMITED [FIELDS TERMINATED BY CHAR]
  • SERDE SERDE_NAME [WITH SERDEPROPERTIES (PROPERTY_NAME=PROPERTY_VALUE, PROPERTY_NAME=PROPERTY_VALUE, ...)]