Dataplex is a data fabric that unifies distributed data and automates data management and governance for that data.
Dataplex lets you do the following:
- Build a domain-specific data mesh across data that's stored in multiple Google Cloud projects, without any data movement.
- Consistently govern and monitor data with a single set of permissions.
- Discover and curate metadata across various silos using catalog capabilities. For more information, see Dataplex Catalog overview.
- Securely query metadata by using BigQuery and open source tools, such as SparkSQL, Presto, and HiveQL.
- Run data quality and data lifecycle management tasks, including serverless Spark tasks.
- (Deprecated) Explore data using fully managed, serverless Spark environments with simple access to notebooks and SparkSQL queries.
Why use Dataplex?
Enterprises have data that's distributed across data lakes, data warehouses, and data marts. Using Dataplex, you can do the following:
- Discover data
- Curate data
- Unify data without any data movement
- Organize data based on your business needs
- Centrally manage, monitor, and govern data
Dataplex lets you standardize and unify metadata, security policies, governance, classification, and data lifecycle management across this distributed data.
How Dataplex works
Dataplex manages data in a way that doesn’t require data movement or duplication. As you identify new data sources, Dataplex harvests the metadata for both structured and unstructured data, using built-in data quality checks to enhance integrity.
Dataplex automatically registers all metadata in a unified metastore. You can access data and metadata using various services and tools including the following:
- Google Cloud services, such as BigQuery, Dataproc Metastore, and Data Catalog.
- Open source tools, such as Apache Spark and Presto.
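As a minimal sketch of that access path, the following Python snippet queries a table that Dataplex discovery has published to BigQuery. The project, dataset, and table names are hypothetical placeholders; Dataplex typically surfaces a zone's discovered tables through a BigQuery dataset.

```python
from google.cloud import bigquery

# Placeholder project; the dataset and table names below stand in for a
# zone-backed BigQuery dataset and a discovered table.
client = bigquery.Client(project="my-project")

query = """
    SELECT *
    FROM `my-project.sales_raw_zone.orders`
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))
```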
Terminology
Dataplex abstracts away the underlying data storage systems by using the following constructs:
Lake: A logical construct representing a data domain or business unit. For example, to organize data based on group usage, you can set up a lake for each department (for example, Retail, Sales, Finance).
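As an illustration, here is a minimal sketch of creating such a lake with the google-cloud-dataplex Python client. The project ID, region, and lake ID are placeholder assumptions.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Placeholder project and region; lakes are regional resources.
parent = "projects/my-project/locations/us-central1"

operation = client.create_lake(
    parent=parent,
    lake_id="sales",
    lake=dataplex_v1.Lake(display_name="Sales"),
)
lake = operation.result()  # create_lake returns a long-running operation
print(lake.name)
```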
Zone: A subdomain within a lake, which is useful for categorizing data by the following:
- Stage: For example, landing, raw, curated data analytics, and curated data science.
- Usage: For example, data contract.
- Restrictions: For example, security controls and user access levels.
Zones are of two types: raw and curated.
Raw zone: Contains data that is in its raw format and not subject to strict type-checking.
Curated zone: Contains data that is cleaned, formatted, and ready for analytics. The data is columnar, Hive-partitioned, and stored in Parquet, Avro, or ORC files, or in BigQuery tables. Data undergoes type-checking: for example, CSV files are prohibited because they don't perform as well for SQL access.
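A hedged sketch of creating a zone with the Python client follows; it assumes the sales lake from the earlier example, and the IDs are placeholders. Swapping Zone.Type.RAW for Zone.Type.CURATED creates a curated zone instead.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
lake_name = "projects/my-project/locations/us-central1/lakes/sales"  # placeholder

# A raw zone for unvalidated data; enabling discovery lets Dataplex
# harvest metadata from attached assets.
zone = dataplex_v1.Zone(
    type_=dataplex_v1.Zone.Type.RAW,
    resource_spec=dataplex_v1.Zone.ResourceSpec(
        location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION,
    ),
    discovery_spec=dataplex_v1.Zone.DiscoverySpec(enabled=True),
)
operation = client.create_zone(parent=lake_name, zone_id="sales-raw", zone=zone)
print(operation.result().name)
```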
Asset: Maps to data stored in either Cloud Storage or BigQuery. You can map data stored in separate Google Cloud projects as assets into a single zone.
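Here is a sketch of attaching a Cloud Storage bucket as an asset, again with placeholder names; note that the bucket can live in a different project than the zone.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
zone_name = (
    "projects/my-project/locations/us-central1/lakes/sales/zones/sales-raw"
)  # placeholder

# The bucket sits in a separate project from the Dataplex lake and zone.
asset = dataplex_v1.Asset(
    resource_spec=dataplex_v1.Asset.ResourceSpec(
        type_=dataplex_v1.Asset.ResourceSpec.Type.STORAGE_BUCKET,
        name="projects/other-data-project/buckets/sales-raw-bucket",
    ),
)
operation = client.create_asset(
    parent=zone_name, asset_id="sales-raw-orders", asset=asset
)
print(operation.result().name)
```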
Entity: Represents metadata for structured and semi-structured data (table) and unstructured data (fileset).
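Entities are created by discovery rather than by hand. The following sketch lists the table entities in a zone with the metadata client, under the same placeholder names.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.MetadataServiceClient()
zone_name = (
    "projects/my-project/locations/us-central1/lakes/sales/zones/sales-raw"
)  # placeholder

# TABLES covers structured and semi-structured data; FILESETS covers
# unstructured data.
request = dataplex_v1.ListEntitiesRequest(
    parent=zone_name,
    view=dataplex_v1.ListEntitiesRequest.EntityView.TABLES,
)
for entity in client.list_entities(request=request):
    print(entity.id, entity.system)
```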
Common use cases
This section describes common use cases for Dataplex.
A domain-centric data mesh
With this type of data mesh, data is organized into multiple domains within an enterprise: for example, Sales, Customers, and Products. Ownership of the data can be decentralized. Data consumers can subscribe to data from different domains. For example, data scientists and data analysts can pull from different domains to accomplish business objectives such as machine learning and business intelligence.
In the following diagram, domains are represented by Dataplex lakes and owned by separate data producers. Data producers own creation, curation, and access control in their domains. Data consumers can then request access to the lakes (domains) or zones (subdomains) for their analysis.
In this case, data stewards need to retain a holistic view of the entire data landscape.
This diagram includes the following elements:
- Dataplex: A mesh of multiple data domains.
- Domain: Lakes for sales, customers, and product data.
- Zone within a domain: For individual teams or to provide managed data contracts.
- Assets: Data stored in either a Cloud Storage bucket or a BigQuery dataset, which can exist in a separate Google Cloud project from your Dataplex mesh.
You can extend this scenario by breaking down the data within zones into raw and curated layers. To do this, create a zone for each permutation of domain and data layer, as in the sketch after this list:
- Sales raw
- Sales curated
- Customers raw
- Customers curated
- Products raw
- Products curated
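Assuming the per-domain lakes sketched earlier, one way to script those six zones with the Python client is a simple nested loop (all names are placeholders):

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
location = "projects/my-project/locations/us-central1"  # placeholder

# One raw and one curated zone per domain lake.
for domain in ("sales", "customers", "products"):
    for zone_type in (dataplex_v1.Zone.Type.RAW, dataplex_v1.Zone.Type.CURATED):
        layer = "raw" if zone_type == dataplex_v1.Zone.Type.RAW else "curated"
        zone = dataplex_v1.Zone(
            type_=zone_type,
            resource_spec=dataplex_v1.Zone.ResourceSpec(
                location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION,
            ),
        )
        operation = client.create_zone(
            parent=f"{location}/lakes/{domain}",
            zone_id=f"{domain}-{layer}",
            zone=zone,
        )
        print(operation.result().name)
```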
Data tiering based on readiness
Another common use case is when your data is initially accessible only to data engineers and is later refined and made available to data scientists and analysts. In this case, you can set up a lake to have the following:
- A raw zone for the data that the engineers can access.
- A curated zone for the data that's available to the data scientists and analysts.
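The following is a sketch of wiring up that access split, assuming the lake and zones already exist. The group names are hypothetical, and roles/dataplex.dataReader is used here as the read-level Dataplex role; adjust the bindings to your own access model.

```python
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
base = "projects/my-project/locations/us-central1/lakes/analytics"  # placeholder

# Hypothetical groups: engineers read the raw zone; analysts and data
# scientists read the curated zone.
grants = {
    f"{base}/zones/raw": "group:data-engineers@example.com",
    f"{base}/zones/curated": "group:data-analysts@example.com",
}
for resource, member in grants.items():
    policy = client.get_iam_policy(request={"resource": resource})
    policy.bindings.add(role="roles/dataplex.dataReader", members=[member])
    client.set_iam_policy(request={"resource": resource, "policy": policy})
```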