This guide describes Dataplex metadata and how you can use Dataplex APIs to manage it.
Overview
Dataplex scans the following:
- Structured and semi-structured data assets within data lakes, to extract table metadata into table entities
- Unstructured data, such as images and texts, to extract fileset metadata into fileset entities
You can use the Dataplex Metadata API to do the following:
- View, edit, and delete table and fileset entity metadata
- Create your own table or fileset entity metadata
You can analyze Dataplex metadata using the following:
- Data Catalog for searching and tagging
- Dataproc Metastore and BigQuery for table metadata querying and analytics processing
Dataplex APIs
This section summarizes the Dataplex APIs and the key resources associated with them.
Control plane API
The Dataplex control plane API allows for the creation and management of the lake, zone, and asset resources.
Lake: A Dataplex service instance that allows managing storage resources across projects within an organization.
Zone: A logical grouping of assets within a lake. Use multiple zones within a lake to organize data based on readiness, workload, or organization structure.
Assets: Storage resources, with data stored in Cloud Storage buckets or BigQuery datasets, that are attached to a zone within a lake.
Metadata API
Use the Dataplex Metadata API to create and manage metadata within table and fileset entities and partitions. Dataplex scans data assets, either in a lake or provided by you, to create entities and partitions. Entities and partitions maintain references to associated assets and physical storage locations.
Key concepts
- Table entity:
Metadata for structured data with well-defined schemas. Table entities are uniquely identified by entity ID and data location. Table entity metadata is queryable in BigQuery and Dataproc Metastore:
- Cloud Storage objects: Metadata for Cloud Storage objects, which are accessed through the Cloud Storage APIs.
- BigQuery tables: Metadata for BigQuery tables, which are accessed through the BigQuery APIs.
- Fileset entity:
Metadata about unstructured, typically schema-less, data. Filesets are uniquely identified by entity ID and data location. Each fileset has a data format.
- Partitions:
Metadata for a subset of data within a table or fileset entity, identified by a set of key-value pairs and a data location.
Try the API
Use the Dataplex lakes.zones.entities and lakes.zones.partitions API reference documentation pages to view the parameters and fields associated with each API. Use the Try this API panel that accompanies the reference documentation for each API method to make API requests using different parameters and fields. You can construct, view, and submit your requests without the need to generate credentials, and then view responses returned by the service.
The following sections provide information to help you understand and use the Dataplex Metadata APIs.
Entities
To limit the list of entities returned by the service, add filter query parameters to the list entities request URL.
By default, the Get Entity response contains basic entity metadata. To retrieve additional schema metadata, add the view query parameter to the request URL.
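As a minimal sketch of the two requests described above, the following Python builds a list entities URL with a filter and a Get Entity URL with the view parameter. It only constructs URLs and sends nothing. The base URL, the resource path layout, the project, lake, zone, and entity names, the filter expression id=orders, and the TABLES view value are assumptions for illustration; check them against the lakes.zones.entities reference before use.

```python
from urllib.parse import urlencode

# Assumed REST base URL for the Dataplex v1 API.
BASE = "https://dataplex.googleapis.com/v1"


def entities_url(project, location, lake, zone, entity_id=None, **params):
    """Build a list-entities URL, or a get-entity URL when entity_id is given."""
    path = (f"{BASE}/projects/{project}/locations/{location}"
            f"/lakes/{lake}/zones/{zone}/entities")
    if entity_id is not None:
        path += f"/{entity_id}"
    # urlencode percent-encodes reserved characters in the parameter values.
    query = urlencode(params)
    return f"{path}?{query}" if query else path


# List request narrowed with a filter (hypothetical names and filter expression).
list_url = entities_url("my-project", "us-central1", "my-lake", "my-zone",
                        view="TABLES", filter="id=orders")

# Get request that asks for schema metadata in addition to the basic fields.
get_url = entities_url("my-project", "us-central1", "my-lake", "my-zone",
                       entity_id="orders", view="SCHEMA")

print(list_url)
print(get_url)
```

Keeping the query parameters as keyword arguments means the same helper covers both the filter and view cases without hand-assembled query strings.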
Compatibility details: While Dataplex metadata is centrally registered in the metadata API, only table entity metadata that is compatible with BigQuery and Apache Hive Metastore is published to BigQuery and Dataproc Metastore. The Get Entity API returns a CompatibilityStatus message, which indicates if table metadata is compatible with BigQuery and Hive Metastore, and if not, the reason for the incompatibility.
Use the entities.update API to edit entity metadata, including whether you or Dataplex will manage entity metadata.
- This API performs a full replacement of all mutable Entity fields. Immutable Entity fields are ignored if you specify them in an update request.
- Specify a value for all mutable Entity fields, including all schema fields, even if the values are not being changed.
- Supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the etag of the entity in the response.
- Updating schema fields: You can update the table schema discovered by Dataplex to improve its accuracy:
- If the schema is a fileset, leave all schema fields empty.
- To define a repeated field, set the mode to REPEATED. To define a struct field, set the type to RECORD.
- You can set the userManaged field of the schema to specify whether you or Dataplex manages table metadata. The default setting is Dataplex managed. If userManaged is set to true, this setting is included in the information returned from an entities.get request if EntityView is set to SCHEMA or FULL.
- Updating partition fields:
- For non-Hive style partitioned data, Dataplex discovery auto-generates partition keys. For example, for the data path gs://root/2020/12/31, partition keys p0, p1, and p2 are generated. To make querying more intuitive, you can update p0, p1, and p2 to year, month, and day respectively.
- If you update the partition style to HIVE style, the partition field is immutable.
- Updating other metadata fields: You can update auto-generated mimeType, CompressionFormat, CsvOptions, and JsonOptions fields to aid Dataplex discovery. Dataplex discovery will use new values on its next run.
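The update rules above (full replacement, etag required, partition keys renamed from p0, p1, p2) can be sketched as a small helper that prepares a request body from a previously fetched entity. This is an illustration under assumptions, not the service's required workflow: the JSON field names (etag, schema, partitionFields, userManaged) reflect the Entity resource as understood here, and the sample entity is hypothetical and trimmed to the fields the sketch touches.

```python
import copy


def rename_partition_keys(entity, renames):
    """Return a full replacement body with partition keys renamed.

    `entity` is assumed to be the JSON dict returned by an entities.get
    request that included schema metadata. `renames` maps old partition
    key names to new ones.
    """
    # Copy everything so that all mutable fields are resent, changed or not.
    body = copy.deepcopy(entity)
    if "etag" not in body:
        raise ValueError("entity has no etag; fetch it with entities.get first")
    for field in body.get("schema", {}).get("partitionFields", []):
        field["name"] = renames.get(field["name"], field["name"])
    return body


# Hypothetical entity, trimmed to the fields used in this sketch.
fetched = {
    "id": "events",
    "etag": "abc123",
    "schema": {
        "userManaged": True,
        "partitionFields": [
            {"name": "p0", "type": "STRING"},
            {"name": "p1", "type": "STRING"},
            {"name": "p2", "type": "STRING"},
        ],
    },
}

update_body = rename_partition_keys(
    fetched, {"p0": "year", "p1": "month", "p2": "day"}
)
print([f["name"] for f in update_body["schema"]["partitionFields"]])
```

Working on a deep copy keeps the fetched entity intact, which makes it easy to compare the two versions before sending the update.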
Use the entities.create API to create table or fileset metadata entities. Populate the required and relevant optional fields, or let the Dataplex discovery service fill in optional fields.
When you delete an entity, supply the etag field. You can obtain the etag by first submitting an entities.get request, which returns the etag of the entity in the response.
If the underlying data for a table or fileset in a raw zone is deleted, the table or fileset metadata is deleted automatically on the next Discovery scan. If the underlying data for a table in a curated zone is deleted, the table metadata isn't deleted. Instead, a missing data action is reported. To resolve this issue, explicitly delete the table metadata entity through the metadata API.
Partitions
To limit the list of partitions returned by the service, add filter query parameters to the list partitions request URL.
Examples:
?filter="Country=US AND State=CA AND City=Sunnyvale"
?filter="year < 2000 AND month > 12 AND Date > 10"
To get a partition, you must complete the request URL by appending the partition key values to the end of the URL, formatted to read as partitions/value1/value2/…./value10.
Example: if a partition has values {Country=US, State=CA, City=Sunnyvale}, the get request URL should end with /partitions/US/CA/Sunnyvale.
Important: The appended URL values must be double encoded. For example, url_encode(url_encode(value)) can be used to encode "US:CA/CA#Sunnyvale" so that the request URL ends with /partitions/US%253ACA/CA%2523Sunnyvale. The name field in the response retains the encoded format.
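The double encoding described above can be sketched in Python with the standard library's urllib.parse.quote. The helper name partition_path is hypothetical. It encodes each partition value twice and joins the results into the URL suffix.

```python
from urllib.parse import quote


def partition_path(values):
    """Join partition values into a URL suffix, double-encoding each one."""
    # safe="" makes quote() encode "/" as well, so a slash inside a value
    # is never mistaken for a separator between values.
    encoded = [quote(quote(value, safe=""), safe="") for value in values]
    return "/partitions/" + "/".join(encoded)


print(partition_path(["US", "CA", "Sunnyvale"]))
# /partitions/US/CA/Sunnyvale
print(partition_path(["US:CA", "CA#Sunnyvale"]))
# /partitions/US%253ACA/CA%2523Sunnyvale
```

Values made up only of letters and digits pass through unchanged, so the same helper works for both plain and special-character partition values.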
To create a customized partition for your data source, use the partitions.create API. Specify the required location field with a Cloud Storage path.
Complete the request URL by appending partition key values to the end of the request URL, formatted to read as partitions/value1/value2/…./value10.
Example: if a partition has values {Country=US, State=CA, City=Sunnyvale}, the request URL should end with /partitions/US/CA/Sunnyvale.
Important: The appended URL values must conform to RFC-1034, or they must be double encoded. For example, US:/CA#/Sunnyvale is double encoded as US%253A/CA%2523/Sunnyvale.
What's next
- Learn more about accessing metadata in Apache Spark.