About lakes and zones

Enterprises have data that is often distributed across data lakes, data warehouses, and data marts. Dataplex Universal Catalog is a data fabric that unifies distributed data and eases data governance by applying logical constructs to different data assets.

Dataplex Universal Catalog abstracts away the underlying data storage systems by using the following constructs: lakes, zones, assets, and entries.

Lakes

A lake is a logical construct representing a data domain or business unit. For example, to organize data based on group usage, you can set up a lake for each department, such as retail, sales, or finance.

Zones

A zone is a subdomain within a lake that you can use to categorize data by the following:

  • Stage: for example, landing, raw, curated data analytics, and curated data science
  • Usage: for example, data contract
  • Restrictions: for example, security controls and user access levels

Zones are of two types:

  • Raw zone: contains data that is in its raw format and not subject to strict type-checking.

  • Curated zone: contains data that is cleaned, formatted, and ready for analytics. The data is columnar, Hive-partitioned, and stored in Parquet, Avro, or ORC files, or in BigQuery tables. Data undergoes type-checking, for example, to prohibit the use of CSV files because they don't perform as well for SQL access.
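The difference between the two zone types can be illustrated with a minimal Python sketch. This is not the Dataplex Universal Catalog API: the names `ZoneType`, `CURATED_FORMATS`, and `is_format_allowed` are hypothetical, and the format labels are assumptions chosen to mirror the formats listed above.

```python
from enum import Enum


class ZoneType(Enum):
    """Hypothetical labels for the two zone types described above."""
    RAW = "raw"
    CURATED = "curated"


# Formats the text lists for curated zones. The lowercase labels are
# illustrative assumptions, not identifiers from any real API.
CURATED_FORMATS = frozenset({"parquet", "avro", "orc", "bigquery"})


def is_format_allowed(zone_type: ZoneType, data_format: str) -> bool:
    """Return True if data in `data_format` fits a zone of `zone_type`."""
    if zone_type is ZoneType.RAW:
        # Raw zones are not subject to strict type-checking.
        return True
    # Curated zones accept only the structured formats listed above.
    return data_format.lower() in CURATED_FORMATS


if __name__ == "__main__":
    for zone_type, fmt in [
        (ZoneType.RAW, "csv"),
        (ZoneType.CURATED, "csv"),
        (ZoneType.CURATED, "Parquet"),
    ]:
        print(zone_type.value, fmt, is_format_allowed(zone_type, fmt))
```

In this sketch a CSV file is accepted in a raw zone but rejected in a curated zone, which matches the type-checking example in the curated zone description.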

Assets

An asset maps to data stored in either Cloud Storage or BigQuery. You can map data stored in separate Google Cloud projects as assets into a single zone.
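As a rough mental model of how assets from separate projects can sit in one zone, consider the following Python sketch. It is a plain data model for illustration only, not the Dataplex Universal Catalog client library: the classes `Asset` and `Zone`, their fields, and the project and resource names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

# The two storage systems an asset can map to, per the text above.
# The string labels are illustrative assumptions.
SUPPORTED_STORAGE = frozenset({"cloud_storage", "bigquery"})


@dataclass(frozen=True)
class Asset:
    """Hypothetical record that maps a name to stored data."""
    name: str
    project: str   # Google Cloud project that owns the data
    storage: str   # "cloud_storage" or "bigquery"
    resource: str  # bucket or dataset name (made up for this example)


@dataclass
class Zone:
    """Hypothetical zone that groups assets, possibly across projects."""
    name: str
    assets: List[Asset] = field(default_factory=list)

    def attach(self, asset: Asset) -> None:
        """Add an asset, rejecting unsupported storage systems."""
        if asset.storage not in SUPPORTED_STORAGE:
            raise ValueError(f"unsupported storage: {asset.storage}")
        self.assets.append(asset)

    def projects(self) -> Set[str]:
        """Return the distinct projects that contribute assets."""
        return {asset.project for asset in self.assets}


if __name__ == "__main__":
    zone = Zone("sales-raw")
    zone.attach(Asset("orders", "project-a", "cloud_storage", "orders-bucket"))
    zone.attach(Asset("returns", "project-b", "bigquery", "returns_dataset"))
    print(sorted(zone.projects()))
```

The zone holds references to data rather than the data itself, which is why assets owned by different projects can be grouped under the same zone.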

Entries

An entry represents metadata for structured and semi-structured data (for example, a table) and for unstructured data (for example, a fileset).

What's next