Enterprises have data that is often distributed across data lakes, data warehouses, and data marts. Dataplex Universal Catalog is a data fabric that unifies distributed data and eases data governance by applying logical constructs to different data assets.
Dataplex Universal Catalog abstracts away the underlying data storage systems by using the following constructs: lakes, zones, assets, and entities.
Lakes
A lake is a logical construct representing a data domain or business unit. For example, to organize data based on group usage, you can set up a lake for each department, such as retail, sales, or finance.
Zones
A zone is a subdomain within a lake. Zones are useful for categorizing data by the following:
- Stage: for example, landing, raw, curated data analytics, and curated data science
- Usage: for example, data contract
- Restrictions: for example, security controls and user access levels
There are two types of zones:
- Raw zone: contains data that is in its raw format and not subject to strict type-checking.
- Curated zone: contains data that is cleaned, formatted, and ready for analytics. The data is columnar, Hive-partitioned, and stored in Parquet, Avro, or ORC files, or in BigQuery tables. Data undergoes type-checking, for example, to prohibit CSV files, which don't perform as well for SQL access.
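The difference between the two zone types can be illustrated with a minimal sketch. It is not how the service performs its checks. The function name `allowed_in_zone`, the zone type strings, and the idea of judging a file by its extension are assumptions made for illustration only.

```python
from pathlib import PurePosixPath

# File formats the text names as acceptable in a curated zone.
# Mapping formats to file extensions is an assumption of this sketch.
CURATED_EXTENSIONS = {".parquet", ".avro", ".orc"}


def allowed_in_zone(zone_type: str, object_name: str) -> bool:
    """Return True if an object may be placed in a zone of the given type.

    Raw zones accept any format. Curated zones accept only the
    structured file formats listed above, so CSV files are rejected.
    """
    if zone_type == "RAW":
        return True
    if zone_type == "CURATED":
        suffix = PurePosixPath(object_name).suffix.lower()
        return suffix in CURATED_EXTENSIONS
    raise ValueError(f"unknown zone type: {zone_type!r}")
```

With this sketch, `allowed_in_zone("CURATED", "sales/orders.csv")` is rejected, while the same file is accepted in a raw zone.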
Assets
An asset maps to data stored in either Cloud Storage or BigQuery. You can map data stored in separate Google Cloud projects as assets into a single zone.
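The way lakes, zones, and assets nest can be sketched as a small data model. This is an illustrative in-memory model, not the client library. The class names, field names, project IDs, and bucket and dataset names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    project: str        # Google Cloud project that holds the data (hypothetical IDs below)
    resource_type: str  # illustrative label: "STORAGE_BUCKET" or "BIGQUERY_DATASET"
    resource: str       # bucket or dataset name


@dataclass
class Zone:
    name: str
    zone_type: str      # "RAW" or "CURATED"
    assets: list[Asset] = field(default_factory=list)


@dataclass
class Lake:
    name: str
    zones: list[Zone] = field(default_factory=list)

    def projects(self) -> set[str]:
        """Return every project that contributes an asset to this lake."""
        return {asset.project for zone in self.zones for asset in zone.assets}


# One lake per department; a single zone can hold assets from different projects.
sales = Lake("sales", zones=[
    Zone("raw", "RAW", assets=[
        Asset("clickstream", "ingest-project", "STORAGE_BUCKET", "example-clickstream"),
    ]),
    Zone("curated", "CURATED", assets=[
        Asset("orders", "analytics-project", "BIGQUERY_DATASET", "orders_ds"),
        Asset("returns", "finance-project", "BIGQUERY_DATASET", "returns_ds"),
    ]),
])
```

Here the curated zone contains two assets that live in separate projects, which mirrors the cross-project mapping described above.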
Entities
An entity represents metadata for structured and semi-structured data (for example, a table) and unstructured data (for example, a fileset).
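Because each construct is nested inside the one above it, an entity can be addressed by a hierarchical path. The helper below is a sketch that assumes the `projects/.../locations/.../lakes/.../zones/.../entities/...` layout for resource names; the function name and the example IDs are hypothetical, and the exact format should be checked against the API reference.

```python
def entity_name(project: str, location: str, lake: str, zone: str, entity: str) -> str:
    """Build a hierarchical resource path for an entity (assumed layout)."""
    return (
        f"projects/{project}/locations/{location}"
        f"/lakes/{lake}/zones/{zone}/entities/{entity}"
    )


# Hypothetical IDs, used only to show how the path is composed.
path = entity_name("my-project", "us-central1", "sales", "curated", "orders")
```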
What's next
- Organize your data into lakes and zones.
- Secure your lake.
- View discovered metadata by using the Google Cloud console.
- View discovered metadata by using the API.