This document describes what Dataplex zones are and how to add them to your Dataplex lake.
Overview
Dataplex zones are named entities within a Dataplex lake. They are logical groupings of unstructured, semi-structured, and structured data, consisting of multiple assets, such as Cloud Storage buckets, BigQuery datasets, and BigQuery tables.
A lake can include one or more zones. While a zone can only be part of one lake, it might contain assets that point to resources that are part of projects outside of its parent project.
You can select configurations for a zone in Dataplex. There are two types of zones that you can choose from: raw and curated.
Raw zones
Raw zones store structured data, semi-structured data such as CSV files and JSON files, and unstructured data in any format from external sources. Raw zones are useful for staging raw data before performing any transformations. Data can be stored in Cloud Storage buckets or BigQuery datasets.
Raw zones support bucket-level or dataset-level granularity for read and write permissions. There are no restrictions on the type of data that can be stored in raw zones.
Curated zones
Curated zones store structured data. Data can be stored in Cloud Storage buckets or BigQuery datasets.
Supported formats for Cloud Storage buckets include Parquet, Avro, and ORC. Curated zones are useful for staging data that requires processing before being used for analysis, or for serving data that is ready for analysis.
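As a rough illustration of the format rule, the following sketch checks whether a Cloud Storage object looks acceptable for a zone type. It assumes that the file extension reflects the format and treats the three formats named above as the full list, which might not be exhaustive. The `allowed_in_zone` helper is hypothetical and isn't part of Dataplex.

```python
from pathlib import PurePosixPath

# Formats that this document lists for Cloud Storage data in curated zones.
# Assumption: the file extension reflects the actual format.
CURATED_FORMATS = {".parquet", ".avro", ".orc"}


def allowed_in_zone(object_name: str, zone_type: str) -> bool:
    """Return True if the object looks acceptable for the given zone type."""
    if zone_type == "RAW":
        return True  # raw zones place no restriction on the data type
    if zone_type == "CURATED":
        return PurePosixPath(object_name).suffix.lower() in CURATED_FORMATS
    raise ValueError(f"unknown zone type: {zone_type!r}")
```

A check like this could run before you move staged files from a raw zone into a curated zone.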
For BigQuery tables, you must have a well-defined schema and Hive-style partitions. When you provide a schema for a given table in a curated zone, the data should conform to the schema defined for the table without schema drift. This means that the data should be compatible with the schema defined for the table, and new partitions shouldn't have a schema that conflicts with the table schema.
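The schema drift requirement can be pictured with a minimal sketch that compares a new partition's schema against the table schema. It assumes that schemas are plain mappings from column name to type name and that only a type mismatch on a shared column counts as a conflict. The way Dataplex evaluates compatibility might differ, and the `conflicts` helper is hypothetical.

```python
def conflicts(table_schema: dict[str, str], partition_schema: dict[str, str]) -> list[str]:
    """List the columns whose type in the partition differs from the table."""
    return sorted(
        name
        for name, col_type in partition_schema.items()
        if name in table_schema and table_schema[name] != col_type
    )


table = {"id": "INT64", "amount": "FLOAT64"}
drifted = conflicts(table, {"id": "STRING", "amount": "FLOAT64"})
```

An empty result means that the partition doesn't conflict with the table schema under these assumptions.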
Curated zones support Cloud Storage bucket-level or BigQuery dataset-level granularity for read and write permissions.
Before you begin
Before you can add zones to a lake, you must have a lake. If you haven't already, create a lake.
Most gcloud dataplex commands require a location. You can specify the location by setting the --location flag.
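The location matters because it forms part of the resource name of the lake and its zones. The following sketch builds those names, assuming the usual `projects/.../locations/.../lakes/...` layout. The helper names and the example IDs are hypothetical.

```python
def lake_name(project: str, location: str, lake: str) -> str:
    """Build the full resource name of a lake."""
    return f"projects/{project}/locations/{location}/lakes/{lake}"


def zone_name(project: str, location: str, lake: str, zone: str) -> str:
    """Build the full resource name of a zone inside a lake."""
    return f"{lake_name(project, location, lake)}/zones/{zone}"


example = zone_name("my-project", "us-central1", "my-lake", "raw-zone")
```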
Required roles
To get the permission that you need to add a zone, ask your administrator to grant you the Dataplex Administrator (roles/dataplex.admin) IAM role on the project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the dataplex.lakes.create permission, which is required to add a zone.
You might also be able to get this permission with custom roles or other predefined roles.
Add a zone
You can add multiple zones to your lake. You add zones one at a time, but you can continue to use your lake while a zone is being created.
To add a zone to an existing lake, follow these steps:
Console
In the Google Cloud console, go to Dataplex.
Navigate to the Manage view.
In the Manage view, click the name of the lake you'd like to add a zone to.
In the Zones tab, click Add zone.
Enter a Display name for your zone.
Click the Type menu. Choose Raw Zone or Curated Zone. Learn more about supported zone types.
Optional: Enter a description.
Under Data locations, select either Regional or Multi-regional. You can't change this setting later, and you can't mix single-region and multi-region data in the same zone.
Optional: Enable metadata discovery, which lets Dataplex automatically scan and extract metadata from the data in your zone:
Click Discovery settings.
Make sure Enable metadata discovery is selected.
Optional: Under Include patterns, list the files to include in the discovery scans.
Optional: Under Exclude patterns, list the files to exclude from the discovery scans. If you enter both include and exclude patterns, exclude patterns are applied first.
Click the Repeats menu and select a frequency. If you select Custom, in the Schedule field, enter a job schedule. Otherwise, the Schedule value is automatically filled for you.
Click the Timezone menu and select a timezone.
Click Create.
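The order in which the discovery settings apply include and exclude patterns can be sketched as follows. This is a minimal illustration that uses Python's `fnmatch` as a stand-in for glob matching. The pattern syntax that Dataplex accepts might differ, and the `select_for_discovery` helper is hypothetical.

```python
from fnmatch import fnmatch


def select_for_discovery(paths, include=(), exclude=()):
    """Filter paths, applying exclude patterns before include patterns."""
    selected = []
    for path in paths:
        if any(fnmatch(path, pattern) for pattern in exclude):
            continue  # exclude patterns are applied first
        if include and not any(fnmatch(path, pattern) for pattern in include):
            continue  # when include patterns are set, unmatched paths are skipped
        selected.append(path)
    return selected


paths = ["data/a.csv", "data/tmp/b.csv", "data/c.json"]
scanned = select_for_discovery(paths, include=["*.csv"], exclude=["*/tmp/*"])
```

Because exclusion runs first, a file that matches both an include pattern and an exclude pattern is left out of the scan.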
REST
To add a zone, use the lakes.zones.create method.
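As a sketch of what such a call could look like, the following code prepares an HTTP request for the method without sending it. The endpoint, the `zoneId` query parameter, and the `displayName`, `type`, and `resourceSpec.locationType` fields are assumptions based on the public REST reference, so verify them there before you rely on them. The project, lake, and zone IDs and the `ACCESS_TOKEN` placeholder are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://dataplex.googleapis.com/v1"  # assumed service endpoint


def build_create_zone_request(project, location, lake, zone_id, zone_type,
                              display_name, location_type, access_token):
    """Prepare (but do not send) a lakes.zones.create request."""
    parent = f"projects/{project}/locations/{location}/lakes/{lake}"
    url = f"{API_ROOT}/{parent}/zones?{urlencode({'zoneId': zone_id})}"
    body = {
        "displayName": display_name,
        "type": zone_type,  # assumed values: "RAW" or "CURATED"
        # assumed values: "SINGLE_REGION" or "MULTI_REGION"
        "resourceSpec": {"locationType": location_type},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_zone_request(
    "my-project", "us-central1", "my-lake", "raw-zone",
    "RAW", "Raw zone", "SINGLE_REGION", access_token="ACCESS_TOKEN",
)
# urllib.request.urlopen(request) would send it; that step is omitted here.
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before any call reaches the API.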
It might take a few minutes for the zone to be created.
When the zone creation succeeds, the zone automatically enters the active state. If it fails, then the lake is rolled back to its previous state.
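Because creation can take a few minutes, a script might poll until the zone becomes active. The following sketch shows one way to do that. It assumes that the zone state is reported as a string such as `"CREATING"` or `"ACTIVE"`, and `get_state` is a hypothetical callable that you supply to fetch the current state. The sketch doesn't detect a failed creation, which a real script would also need to handle.

```python
import time


def wait_until_active(get_state, timeout_s=600.0, interval_s=10.0, sleep=time.sleep):
    """Poll get_state() until it returns "ACTIVE" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == "ACTIVE":
            return True
        if time.monotonic() >= deadline:
            return False  # gave up before the zone became active
        sleep(interval_s)
```

The `sleep` parameter is injectable so that the loop can be exercised in tests without real delays.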
After you create your zone, you can map data stored in Cloud Storage buckets and BigQuery datasets as assets to your zone. For more information, see Add an asset.
What's next
- Learn how to manage buckets.
- Learn how to create a lake.
- Learn more about Cloud Audit Logs.