This document describes how to create a Dataplex lake. You can create a lake in any of the regions that support Dataplex.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataplex, Dataproc, Dataproc Metastore, Data Catalog, BigQuery, and Cloud Storage APIs.
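If you prefer the command line, these APIs can also be enabled with the gcloud CLI. This is a minimal sketch that assumes the gcloud CLI is installed and configured for your project; the arguments are the standard service identifiers for the APIs listed above:
gcloud services enable \
    dataplex.googleapis.com \
    dataproc.googleapis.com \
    metastore.googleapis.com \
    datacatalog.googleapis.com \
    bigquery.googleapis.com \
    storage.googleapis.com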
Access control
To create and manage your lake, make sure that you have one of the predefined roles roles/dataplex.admin or roles/dataplex.editor granted. For more information, see Grant a single role.
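For example, one of these roles can be granted at the project level with the gcloud CLI. This is a minimal sketch in which PROJECT_ID and USER_EMAIL are placeholders for your own values:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/dataplex.admin"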
To attach a Cloud Storage bucket from another project to your lake, grant the Dataplex service account an administrator role on the bucket by running the following command:
gcloud alpha dataplex lakes authorize \
    --project PROJECT_ID_OF_LAKE \
    --storage-bucket-resource BUCKET_NAME
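As an illustration, with a hypothetical lake project my-lake-project and a hypothetical bucket my-data-bucket, the command would be:
gcloud alpha dataplex lakes authorize \
    --project my-lake-project \
    --storage-bucket-resource my-data-bucket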
Create a metastore
You can access Dataplex metadata through Hive Metastore in Spark queries by associating a Dataproc Metastore service instance with your Dataplex lake. The associated instance must be gRPC-enabled and run Hive Metastore version 3.1.2 or later.
Create a Dataproc Metastore service (see the gcloud sketch after these steps).
Configure the Dataproc Metastore service instance to expose a gRPC endpoint (instead of the default Thrift Metastore endpoint):
curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://metastore.googleapis.com/v1beta/projects/PROJECT_ID/locations/LOCATION/services/SERVICE_ID?updateMask=hiveMetastoreConfig.endpointProtocol" \
    -d '{"hiveMetastoreConfig": {"endpointProtocol": "GRPC"}}'
View the gRPC endpoint:
gcloud metastore services describe SERVICE_ID \
    --project PROJECT_ID \
    --location LOCATION \
    --format "value(endpointUri)"
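As a sketch of the first step, the Dataproc Metastore service itself can be created with the gcloud CLI. SERVICE_ID, PROJECT_ID, and LOCATION are placeholders, and the tier and Hive Metastore version shown are example choices; the gRPC endpoint is then enabled with the PATCH request shown above:
gcloud metastore services create SERVICE_ID \
    --project PROJECT_ID \
    --location LOCATION \
    --tier DEVELOPER \
    --hive-metastore-version 3.1.2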
Create a lake
Console
In the Google Cloud console, go to Dataplex.
Navigate to the Manage view.
Click Create.
Enter a Display name.
The lake ID is automatically generated for you. If you prefer, you can provide your own ID. See Resource naming convention.
Optional: Enter a Description.
Specify the Region in which to create the lake.
For lakes created in a given region (for example, us-central1), you can attach both single-region (us-central1) data and multi-region (us) data, depending on the zone settings.
Optional: Add labels to your lake.
Optional: In the Metastore section, click the Metastore service menu, and select the service you created in the Before you begin section.
Click Create.
gcloud
To create a lake, use the gcloud alpha dataplex lakes create command:
gcloud alpha dataplex lakes create LAKE \
    --location=LOCATION \
    --labels=k1=v1,k2=v2,k3=v3 \
    --metastore-service=METASTORE_SERVICE
Replace the following:
- LAKE: the name of the new lake
- LOCATION: the Google Cloud region of the lake
- k1=v1,k2=v2,k3=v3: the labels to apply, if any
- METASTORE_SERVICE: the Dataproc Metastore service, if created
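For example, a minimal invocation with hypothetical values, a lake named sales-lake in us-central1 with a single label and no metastore, might look like the following:
gcloud alpha dataplex lakes create sales-lake \
    --location=us-central1 \
    --labels=env=dev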
REST
To create a lake, use the lakes.create method.
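As a rough sketch, the method can be called with curl. This assumes the standard lakes.create URL pattern with a lakeId query parameter and a placeholder display name; see the API reference for the full set of request fields:
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/lakes?lakeId=LAKE_ID" \
    -d '{"displayName": "My lake"}'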
What's next?
- Learn how to Add zones to a lake.
- Learn how to Attach assets to a zone.
- Learn how to secure your lake.
- Learn how to manage your lake.