This document provides guidance and best practices for using Dataplex.
Choose a project for your lake
When you select the project in which to host your lake, consider the following factors:
The project must belong to the same VPC Service Controls perimeter as the data destined to be within the lake.
The lake service account requires admin permissions on the desired Cloud Storage buckets or BigQuery datasets. Dataplex creates external tables in BigQuery for tables discovered in Cloud Storage. Dataplex also makes BigQuery table metadata, and the tables discovered in Cloud Storage buckets, available in a Dataproc Metastore. The Dataproc Metastore is located within the data lake project.
Cloud Storage settings and limitations
Region: Dataplex currently supports single-region and multi-region buckets in some Google Cloud regions.
Storage class: Cloud Storage buckets of all storage classes are supported (Standard, Nearline, Coldline, Archive). Accessing or scanning Nearline, Coldline, or Archive data may incur additional data retrieval costs.
Bucket ACL: Dataplex supports Cloud Storage buckets with uniform access controls only. Fine-grained access controls are currently not supported.
Requester Pays: Cloud Storage buckets with the Requester Pays feature enabled are currently not supported.
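The settings above can be checked before a bucket is attached to a lake. The following is a minimal sketch in plain Python, not a Dataplex or Cloud Storage API: the `BucketSettings` class and the `check_bucket` function are hypothetical names introduced for illustration, and the caller is assumed to populate the fields from the bucket's actual configuration. The region check is left out because the set of supported regions is not listed here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Storage classes for which reads and scans may carry retrieval charges.
RETRIEVAL_COST_CLASSES = {"NEARLINE", "COLDLINE", "ARCHIVE"}


@dataclass
class BucketSettings:
    """Hypothetical snapshot of the bucket settings discussed in this section."""
    name: str
    storage_class: str    # for example "STANDARD" or "NEARLINE"
    uniform_access: bool  # True if uniform bucket-level access is enabled
    requester_pays: bool  # True if Requester Pays is enabled


def check_bucket(settings: BucketSettings) -> Tuple[List[str], List[str]]:
    """Return (unsupported, warnings) for a bucket you plan to add to a lake."""
    unsupported: List[str] = []
    warnings: List[str] = []
    if not settings.uniform_access:
        unsupported.append("fine-grained access controls are not supported")
    if settings.requester_pays:
        unsupported.append("Requester Pays is not supported")
    if settings.storage_class.upper() in RETRIEVAL_COST_CLASSES:
        warnings.append("accessing or scanning data may incur retrieval costs")
    return unsupported, warnings


if __name__ == "__main__":
    candidate = BucketSettings("example-bucket", "COLDLINE", True, False)
    unsupported, warnings = check_bucket(candidate)
    print("unsupported:", unsupported)
    print("warnings:", warnings)
```

Separating unsupported settings from cost warnings keeps the two kinds of finding distinct: the first must be resolved, the second only needs to be budgeted for.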
Security and permissions guidance
Dataplex requires the Dataplex service accounts to be added as administrative service accounts on managed buckets and datasets.
Dataplex enables analysts to access Cloud Storage buckets and BigQuery datasets across many projects. To enable this access, the Dataplex service accounts must be added to these projects with administrative controls.
For discovery, Dataplex adds the Dataproc Metastore service account to the Cloud Storage buckets. If you have your own Dataproc Metastore service, you will likely want to configure the Dataplex lake to use it. This option is available when you create your lake.
If you choose to add a Cloud Storage bucket with fine-grained access to a lake, Dataplex provides full access to that bucket through the lake, because Dataplex permissions are propagated to all objects in the bucket. If you require fine-grained access, split the data in your bucket into multiple buckets.
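One way to plan such a split is to group objects by their top-level prefix, so that each group can live in its own bucket with its own permissions. The sketch below is only a planning aid under that assumption: `plan_bucket_split` and the `<source>-<prefix>` naming scheme are hypothetical, the function does not move any data, and the generated names are not validated against Cloud Storage bucket naming rules.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def plan_bucket_split(object_names: Iterable[str], source_bucket: str) -> Dict[str, List[str]]:
    """Group object names by top-level prefix into hypothetical target buckets."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for name in object_names:
        prefix, sep, _rest = name.partition("/")
        # Objects at the bucket root have no prefix, so they stay in the source bucket.
        target = f"{source_bucket}-{prefix}" if sep else source_bucket
        plan[target].append(name)
    return dict(plan)


if __name__ == "__main__":
    objects = ["finance/q1.csv", "hr/staff.csv", "finance/q2.csv", "readme.txt"]
    for bucket, names in plan_bucket_split(objects, "data").items():
        print(bucket, names)
```

If access boundaries in your data do not follow top-level prefixes, a different grouping key would be needed.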