This guide describes Dataplex metadata and how you can use Dataplex APIs to manage it.
Overview
Dataplex scans the following:
- structured and semi-structured data assets within data lakes, to extract table metadata into table entities
- unstructured data, such as images and text, to extract fileset metadata into fileset entities
You can use the Dataplex Metadata API to do either of the following:
- view, edit, and delete table and fileset entity metadata
- create your own table or fileset entity metadata
You can also analyze Dataplex metadata through either of the following:
- Data Catalog, for searching and tagging
- Dataproc Metastore and BigQuery, for table metadata querying and analytics processing
Dataplex APIs
This section summarizes the Dataplex APIs and the key resources associated with them.
Control plane API
The Dataplex control plane API allows for the creation and management of the lake, zone, and asset resources.
Lake: A Dataplex service instance that allows managing storage resources across projects within an organization.
Zone: A logical grouping of assets within a lake. Use multiple zones within a lake to organize data based on readiness, workload, or organization structure.
Asset: A storage resource, with data stored in a Cloud Storage bucket or BigQuery dataset, that is attached to a zone within a lake.
Metadata API
Use the Dataplex Metadata API to create and manage metadata within table and fileset entities and partitions. Dataplex scans data assets, either in a lake or provided by you, to create entities and partitions. Entities and partitions maintain references to associated assets and physical storage locations.
Key concepts
Table entity: Metadata for structured data with well-defined schemas. Table entities are uniquely identified by entity ID and data location. Table entity metadata is queryable in BigQuery and Dataproc Metastore:
- Cloud Storage objects: Metadata for Cloud Storage objects, which are accessed via the Cloud Storage APIs.
- BigQuery tables: Metadata for BigQuery datasets, which are accessed via BigQuery APIs.
Fileset entity: Metadata about unstructured, typically schema-less, data. Filesets are uniquely identified by entity ID and data location. Each fileset has a data format.
Partitions: Metadata for a subset of data within a table or fileset entity, identified by a set of key/value pairs and a data location.
Try the API
Use the Dataplex lakes.zones.entities and lakes.zones.partitions API reference documentation pages to view the parameters and fields associated with each API. Use the Try this API panel that accompanies the reference documentation for each API method to make API requests using different parameters and fields. You can construct, view, and submit your requests without the need to generate credentials, and then view responses returned by the service.
The following sections provide information to help you understand and use the Dataplex Metadata APIs.
Entities
List entities
Add a filter query parameter to the list entities request URL to limit the list of entities returned by the service.
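As a sketch, a list entities request URL with a filter could be assembled as follows. The project, location, lake, and zone names are hypothetical, and the filter expression is illustrative; see the entities.list reference for the supported filter fields and syntax.

```python
from urllib.parse import urlencode

# Hypothetical resource names; substitute your own project, location,
# lake, and zone. The v1 Metadata API endpoint shape is assumed here.
base = ("https://dataplex.googleapis.com/v1/projects/my-project"
        "/locations/us-central1/lakes/my-lake/zones/my-zone/entities")

# Illustrative filter expression; urlencode percent-encodes it for the URL.
params = {"filter": "id=starts_with('sales')"}

url = f"{base}?{urlencode(params)}"
print(url)
```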
Get entity
By default, the Get Entity
response contains basic entity
metadata. To retrieve additional schema metadata, add the
view
query parameter to the request URL.
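For instance, assuming a v1-style entity resource name (hypothetical here), the view parameter could be appended like this; EntityView values such as SCHEMA and FULL are described in the reference:

```python
from urllib.parse import urlencode

# Hypothetical entity resource path.
entity = ("https://dataplex.googleapis.com/v1/projects/my-project"
          "/locations/us-central1/lakes/my-lake/zones/my-zone"
          "/entities/sales_orders")

# Request schema metadata in addition to the basic fields.
url = f"{entity}?{urlencode({'view': 'SCHEMA'})}"
print(url)
```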
Compatibility details: While Dataplex metadata is centrally registered in the metadata API, only entity table metadata that is compatible with BigQuery and Apache Hive Metastore is published to BigQuery and Dataproc Metastore. The Get Entity API returns a CompatibilityStatus message, which indicates whether table metadata is compatible with BigQuery and Hive Metastore and, if not, the reason for the incompatibility.
Update entity
Use this API to edit entity metadata, including whether you or Dataplex will manage entity metadata.
- This API performs a full replacement of all mutable Entity fields. Immutable Entity fields are ignored if they are specified in an update request.
- Specify a value for all mutable Entity fields, including all schema fields, even if the values are not being changed.
- You must supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the entity's etag in the response.
- Updating schema fields: You can update the table schema discovered by Dataplex to improve its accuracy:
- If the entity is a fileset, leave all schema fields empty.
- To define a repeated field, set the mode to REPEATED. To define a struct field, set the type to RECORD.
- You can set the schema's userManaged field to specify whether you or Dataplex manages table metadata. The default setting is Dataplex managed. If userManaged is set to true, this setting is included in the information returned from an entities.get request if EntityView is set to SCHEMA or FULL.
- Updating partition fields:
- For non-Hive style partitioned data, Dataplex discovery auto-generates partition keys. For example, for the data path `gs://root/2020/12/31`, partition keys `p0`, `p1`, and `p2` are generated. To make querying more intuitive, you can update `p0`, `p1`, and `p2` to `year`, `month`, and `day`.
- If you update the partition style to HIVE style, the partition field becomes immutable.
- Updating other metadata fields: You can update the auto-generated mimeType, CompressionFormat, CsvOptions, and JsonOptions fields to aid Dataplex discovery. Dataplex discovery uses the new values on its next run.
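Putting the rules above together, a full-replacement update body might look like the following sketch. The field names follow the Entity resource as described in this guide; all concrete values (the etag, asset, paths, and column names) are hypothetical.

```python
import json

# Sketch of an entities.update request body. Because the API performs a
# full replacement, every mutable field is specified, including schema
# fields whose values are unchanged.
body = {
    "etag": "example-etag",   # hypothetical; taken from an entities.get response
    "id": "sales_orders",
    "type": "TABLE",
    "asset": "my-asset",
    "dataPath": "gs://my-bucket/sales",
    "system": "CLOUD_STORAGE",
    "format": {"format": "CSV", "mimeType": "text/csv"},
    "schema": {
        "userManaged": True,  # you, rather than Dataplex, manage the schema
        "fields": [
            {"name": "year", "type": "STRING", "mode": "REQUIRED"},
            # A struct (RECORD) field that repeats:
            {"name": "items", "type": "RECORD", "mode": "REPEATED",
             "fields": [
                 {"name": "sku", "type": "STRING", "mode": "NULLABLE"},
             ]},
        ],
    },
}
print(json.dumps(body, indent=2))
```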
Create entity
Use the entities.create API to create table or fileset metadata entities. Populate the required and relevant optional fields, or let the Dataplex discovery service fill in optional fields.
Delete entity
- You must supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the entity's etag in the response.
If the underlying data for a table or fileset in a raw zone is deleted, the table or fileset metadata is deleted automatically upon the next discovery scan. If the underlying data for a table in a curated zone is deleted, the table metadata isn't deleted correspondingly; instead, a missing data action is reported. To resolve this, explicitly delete the table metadata entity via the metadata API.
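A delete request URL carrying the etag could be sketched as follows. The resource name and etag value are hypothetical, and the exact parameter placement is described in the entities.delete reference.

```python
from urllib.parse import urlencode

# Hypothetical entity resource path.
entity = ("https://dataplex.googleapis.com/v1/projects/my-project"
          "/locations/us-central1/lakes/my-lake/zones/my-zone"
          "/entities/sales_orders")

etag = "example-etag"  # hypothetical; obtained from a prior entities.get
url = f"{entity}?{urlencode({'etag': etag})}"
print(url)  # send as an HTTP DELETE request
```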
Partitions
List partitions
Add a filter query parameter to the list partitions request URL to limit the list of partitions returned by the service.
Examples:
?filter="Country=US AND State=CA AND City=Sunnyvale"
?filter="year < 2000 AND month > 12 AND Date > 10"
Get partition
To get a partition, you must complete the request URL by appending the partition key values to the end of the URL, formatted to read as "partitions/value1/value2/…./value10".
Example: if a partition has the values {Country=US, State=CA, City=Sunnyvale}, the get request URL should end with "/partitions/US/CA/Sunnyvale".
Important: The appended URL values must be double encoded. For example, url_encode(url_encode(value)) can be used to encode "US:CA/CA#Sunnyvale" so that the request URL ends with "/partitions/US%253ACA/CA%2523Sunnyvale". The name field in the response retains the encoded format.
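The double encoding described above can be reproduced with two passes of standard percent-encoding; this sketch uses Python's urllib and the example values from this section:

```python
from urllib.parse import quote

def double_encode(value: str) -> str:
    """Percent-encode a partition value twice so that reserved
    characters such as ':' and '#' survive in the request path."""
    once = quote(value, safe="")      # "US:CA" -> "US%3ACA"
    return quote(once, safe="")       # "US%3ACA" -> "US%253ACA"

values = ["US:CA", "CA#Sunnyvale"]
suffix = "/partitions/" + "/".join(double_encode(v) for v in values)
print(suffix)  # /partitions/US%253ACA/CA%2523Sunnyvale
```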
Create partition
Use the partitions.create API to create a customized partition for your data source. You must specify the required location field with a Cloud Storage path.
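As a minimal sketch, a partitions.create request body could look like this. The bucket path and key values are hypothetical; the field names follow the Partition resource in the reference.

```python
import json

# Illustrative partitions.create request body: the partition key values
# and the required Cloud Storage location of the partition's data.
body = {
    "values": ["US", "CA", "Sunnyvale"],
    "location": "gs://my-bucket/sales/US/CA/Sunnyvale",
}
print(json.dumps(body, indent=2))
```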
Delete partition
Complete the request URL by appending partition key values to the end of the request URL, formatted to read as "partitions/value1/value2/…./value10".
Example: if a partition has the values {Country=US, State=CA, City=Sunnyvale}, the request URL should end with "/partitions/US/CA/Sunnyvale".
Important: The appended URL values must conform to RFC 1034, or they must be double encoded, for example, "US:/CA#/Sunnyvale" as "US%3A/CA%23/Sunnyvale".
What's next?
- Learn more about accessing metadata in Apache Spark.