Dataplex glossary

Dataplex unifies the end-to-end analytics journey with centralized management of data and services. This glossary defines terms used within the management system.

Glossary list

Action

An issue that the user can act on. For example:

  • Security policy propagation failed due to a non-existent security group provided by the user.
  • A managed resource cannot be accessed by Dataplex.
  • A discovery job failed for a reason that the user can remedy. This can be due to user data issues, such as invalid data formats, incompatible schemas across partitions, or inconsistent partition naming.

Dataplex generates actions automatically. It clears some actions automatically when it detects that the user has resolved the underlying issue. Other actions must be explicitly marked as resolved by the user.

For example, after addressing discovery actions, the user should call the Dataplex API to mark the actions as resolved so that the discovery system can unpause and schedule an immediate discovery run.
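The two resolution paths described above can be sketched as a small model. This is an illustration only: the names `Action`, `Resolution`, `refresh`, `mark_resolved`, and `discovery_paused` are hypothetical and are not part of the Dataplex API, and the rule that discovery stays paused while any action is unresolved is an assumption drawn from the example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Resolution(Enum):
    AUTOMATIC = "automatic"  # cleared once the issue is detected as fixed
    MANUAL = "manual"        # must be explicitly marked as resolved


@dataclass
class Action:
    issue: str
    resolution: Resolution
    resolved: bool = False


def refresh(action: Action, issue_still_present: bool) -> None:
    """Clear an automatically resolvable action once its issue is gone."""
    if action.resolution is Resolution.AUTOMATIC and not issue_still_present:
        action.resolved = True


def mark_resolved(action: Action) -> None:
    """Explicit user step; stands in for the API call mentioned in the text."""
    action.resolved = True


def discovery_paused(actions: Iterable[Action]) -> bool:
    """Assumption: discovery stays paused while any action is unresolved."""
    return any(not a.resolved for a in actions)
```

In this sketch, fixing the underlying data is not enough for a `MANUAL` action: `discovery_paused` keeps returning `True` until `mark_resolved` is called.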

Asset

An asset represents a single managed resource (a bucket or dataset) in Dataplex. It also holds the configuration for the managed resource and for the subsystems (such as discovery and policy administration) that act on it.

BigQuery

BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time.

With BigQuery, there's no infrastructure to set up or manage, letting you focus on finding meaningful insights using standard SQL and taking advantage of flexible pricing models across on-demand and flat-rate options.

Data

User data inside a managed resource. For example, Cloud Storage objects in a bucket or BigQuery table rows in a dataset. In the case of Cloud Storage, objects are immutable units of user data. In the case of a BigQuery dataset, the rows inside the child tables are considered user data.

Data Catalog

Data Catalog is a fully managed and scalable metadata management service that allows organizations to quickly discover, manage, and understand all their data in Google Cloud.

Dataplex Service Account

An internally managed Google Cloud service account that performs actions on behalf of Dataplex. For example, the discovery system and the policy administration system use the service account's credentials.

The service account needs various IAM permissions on user-managed resources and projects to do its job. Some are granted automatically when Dataplex is activated on a project. Others (for example, the permissions needed to attach a bucket from a different project) must be granted manually by the user.

Dataproc Metastore

Dataproc Metastore is a fully managed, highly available, autoscaled, autohealing, OSS-native metastore service that greatly simplifies technical metadata management. The Dataproc Metastore service is based on Apache Hive metastore and serves as a critical component of enterprise data lakes.

Discovery

Subsystem responsible for crawling user data and extracting metadata.

Entry group

An entry group is a set of logically related entries together with the Identity and Access Management (IAM) policies that specify which users can create, edit, and view entries within it.

Fileset

A fileset is an entry within a user-created entry group. A fileset is defined by one or more file patterns that specify a set of one or more Cloud Storage files. Fileset entries can be used to organize and discover Cloud Storage files, and to add metadata to them.
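As a rough illustration of how file patterns select a set of files, the sketch below matches object paths against patterns with Python's `fnmatch` module. The function name `files_in_fileset` and the bucket paths are hypothetical. Python's wildcard rules are a stand-in here and may differ from the service's own pattern syntax; for instance, `*` in `fnmatch` also matches `/`.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def files_in_fileset(patterns: Iterable[str], object_paths: Iterable[str]) -> List[str]:
    """Return the paths that match at least one of the file patterns."""
    patterns = list(patterns)
    return [
        path
        for path in object_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Hypothetical bucket contents, for illustration only.
paths = [
    "gs://example-bucket/sales/2023.csv",
    "gs://example-bucket/sales/readme.txt",
    "gs://example-bucket/ops/2023.csv",
]
selected = files_in_fileset(["gs://example-bucket/sales/*.csv"], paths)
```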

Lake

A lake is a centralized repository for managing enterprise data that is distributed across many cloud projects in the organization and stored in a variety of storage services, such as Cloud Storage and BigQuery. The resources attached to a lake are referred to as managed resources. Data within these managed resources can be structured or unstructured.

A lake provides data admins with tools to organize, secure, and manage their data at scale, and provides data scientists and data engineers with an integrated experience to easily search, discover, analyze, and transform data and associated metadata.

Logs

Cloud Logging (formerly Stackdriver) logs provided by Dataplex that users can use to gain insight into the workings of their lake, debug issues, and set alerts. For example, logs that:

  • Surface actions that need attention
  • Surface metadata changes
  • Surface a summary of job runs
  • Surface discovery job actions (files read, written, etc.)

Metadata

Information extracted from the user data by the discovery system. For example, Cloud Storage bucket name, BigQuery dataset properties, schema of child BigQuery tables, etc.

There are two types of metadata:

  • Technical metadata such as schema
  • Operational metadata such as data stats (total object count and size in Cloud Storage)

Metrics

Cloud Monitoring (formerly Stackdriver) metrics that Dataplex exposes through a public API. Users can use these metrics to set up alerts or to visualize them in graphs. See Dataplex Cloud Monitoring for more information on specific Dataplex metrics.

Propagation

Changing certain resource configurations initiates a background, asynchronous process to reconcile the state of managed resources with what the user specified. For example, a security configuration specified on a lake must be propagated to the IAM policies of potentially thousands of managed resources (buckets and datasets) under that lake. This does not happen immediately when the API is invoked. This process is referred to as propagation.

The relevant status fields reflect the status of the propagation, and errors are surfaced as actions.
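The reconciliation idea can be sketched as a loop that compares each managed resource with the desired configuration, applies the change where they differ, and collects failures so they can be surfaced as actions. Everything here is hypothetical (`propagate`, `PropagationResult`, the `"OK"` and `"ERROR"` states, and the `apply_policy` callback); the sketch also runs synchronously, whereas the process described above runs in the background.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PropagationResult:
    state: str                                        # "OK" or "ERROR" (hypothetical values)
    actions: List[str] = field(default_factory=list)  # one entry per failed resource


def propagate(
    desired_policy: str,
    resources: Dict[str, str],
    apply_policy: Callable[[str, str], str],
) -> PropagationResult:
    """Bring every resource's policy in line with desired_policy."""
    errors = []
    for name, current in resources.items():
        if current == desired_policy:
            continue  # already reconciled, nothing to do
        try:
            resources[name] = apply_policy(name, desired_policy)
        except Exception as exc:  # a failure becomes a user-visible action
            errors.append(f"{name}: {exc}")
    return PropagationResult("ERROR" if errors else "OK", errors)
```

Because resources that already match are skipped, the loop can be rerun safely after the user fixes whatever caused an error.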

Resource

Dataplex Resource

Google Cloud resources defined by the Dataplex service, such as lakes, data zones, and assets.

Child Resource

A child of a managed resource. For example, Cloud Storage objects or BigQuery tables, routines, and models. Dataplex does not administer policies on child resources directly. However, a child resource's effective policy is influenced by what it inherits from its parent.

Managed Resource

Google Cloud resources that can be administered and discovered via Dataplex. Currently, these are Cloud Storage buckets and BigQuery datasets. A managed resource can belong to a different project than the lake, but it must belong to the same organization.

Spec

User-provided specification. For example:

  • Security spec specifies security configuration for lake/zone/asset.
  • Resource spec for an asset specifies a pointer to the managed resource (bucket/dataset).
  • Discovery spec specifies discovery configuration for an asset.

Status

Represents the status of the user-provided spec. For example:

  • Security status represents the status of the propagation of security policy (such as a security spec) to the underlying buckets/datasets.
  • Resource status represents the status of the managed resource (ok / not found / permission denied, etc.) which is specified in the resource spec.
  • Discovery status represents the status of the discovery job, which is driven by discovery specs.
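The relationship between a spec (what the user asked for) and a status (what the system observed) can be illustrated with the resource spec and resource status. The names below (`ResourceSpec`, `resource_status`) and the plain string status values are hypothetical and only mirror the states listed above; they are not the Dataplex API.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class ResourceSpec:
    name: str  # pointer to the managed resource (bucket or dataset)


def resource_status(spec: ResourceSpec, existing: Set[str], readable: Set[str]) -> str:
    """Derive a status for the resource that the spec points to."""
    if spec.name not in existing:
        return "not found"
    if spec.name not in readable:
        return "permission denied"
    return "ok"


# Hypothetical inventory: which resources exist, and which can be accessed.
existing = {"bucket-a", "bucket-b"}
readable = {"bucket-a"}
```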

Table

A logical table (rows and columns) with a well-defined schema (column names and types) that is backed by data (or a subset thereof) in a managed resource. For example, a table may be backed by a subset of Cloud Storage objects in a Cloud Storage bucket or by a BigQuery table in a BigQuery dataset.

  • Tables as a first class concept are surfaced in Dataproc Metastore, Data Catalog, and BigQuery (metadata registration). Tables won't be surfaced downstream if discovery or publishing to the downstream system is not enabled. For example, tables discovered from user data in Cloud Storage won't be surfaced to BigQuery if publishing to BigQuery is not enabled.
  • Tables are discovered by the discovery system and cannot be created by the user.
  • Table names are generated to be short and meaningful so that they're easy to query. A name contains three parts: [Prefix_]table root path[_Sequence number].
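To make the three-part naming pattern concrete, the sketch below assembles a name from an optional prefix, a table root path, and an optional sequence number. The function `table_name` is hypothetical, and the details are assumptions rather than the actual generation rules: the glossary does not say how path characters are normalized or when a prefix or sequence number is added.

```python
import re
from typing import Optional


def table_name(root_path: str, prefix: str = "", sequence: Optional[int] = None) -> str:
    """Compose [Prefix_]table root path[_Sequence number]."""
    # Assumption: runs of characters outside [A-Za-z0-9_] become one underscore.
    root = re.sub(r"[^A-Za-z0-9_]+", "_", root_path.strip("/"))
    parts = [prefix] if prefix else []
    parts.append(root)
    if sequence is not None:
        parts.append(str(sequence))
    return "_".join(parts)
```

Under these assumptions, `table_name("sales/orders")` gives `sales_orders`, and `table_name("orders", prefix="raw", sequence=2)` gives `raw_orders_2`.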

Zone

A logical container of one or more data resources created within a lake. A data zone can be used to model the business units within an organization (for example, sales vs. operations). Data zones can also model stages of the data journey or readiness for consumption.

Raw Zone

A data zone that contains data that needs further processing before it is considered generally ready for consumption and analytics workloads.

Curated Zone

A data zone that contains data that is considered to be ready for broader consumption and analytics workloads. Curated structured data stored in Cloud Storage must use one of the supported file formats (Parquet, Avro, or ORC) and be organized in a Hive-compatible directory layout.
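The two requirements for curated structured data can be approximated with a simple path check. This sketch makes assumptions that the glossary does not state: `is_curated_compatible` is a hypothetical helper, the file format is guessed from the file extension, and "Hive-compatible" is taken to mean that every directory between the table root and the file has the form key=value.

```python
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".parquet", ".avro", ".orc"}


def is_curated_compatible(object_path: str, table_root: str) -> bool:
    """Check the file extension and key=value partition directories."""
    path = PurePosixPath(object_path)
    try:
        relative = path.relative_to(table_root)
    except ValueError:
        return False  # the object is not under the table root
    partition_dirs = relative.parts[:-1]  # everything except the file name
    hive_style = all(
        d.count("=") == 1 and all(d.split("=")) for d in partition_dirs
    )
    return path.suffix.lower() in ALLOWED_SUFFIXES and hive_style
```

For example, `sales/orders/year=2023/month=01/part-0.parquet` passes for the table root `sales/orders`, while `sales/orders/2023/01/part-0.parquet` does not because its directories are not in key=value form.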
