This guide shows you how to create a Dataplex lake by using the Google Cloud console, the gcloud CLI, or the lakes.create API method.
You can create your lake in any of the regions that support Dataplex.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataplex, Dataproc, Dataproc Metastore, Data Catalog, BigQuery, and Cloud Storage APIs.
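The API-enablement step can also be done from the command line. The following is a minimal sketch, assuming the gcloud services enable command and the service names shown in the script; my-project is a hypothetical placeholder, and the run helper is introduced here only for illustration. By default the script prints the command instead of executing it.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholder; replace with your own project ID.
PROJECT_ID="${PROJECT_ID:-my-project}"
# Commands are only printed by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Enable the six APIs named in the prerequisites list.
enable_dataplex_apis() {
  run gcloud services enable \
    dataplex.googleapis.com \
    dataproc.googleapis.com \
    metastore.googleapis.com \
    datacatalog.googleapis.com \
    bigquery.googleapis.com \
    storage.googleapis.com \
    --project="$PROJECT_ID"
}

enable_dataplex_apis
```

Run it with DRY_RUN=0 once the printed command looks right for your project.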
Access control
Make sure you have the predefined role roles/dataplex.admin or roles/dataplex.editor granted to you so that you can create and manage your lake. Follow the steps in the IAM documentation for granting roles.
To attach a Cloud Storage bucket from another project to your lake, grant the Dataplex service account an administrator role on the bucket by running the following command:
gcloud alpha dataplex lakes authorize \
    --project PROJECT_ID_OF_LAKE \
    --storage-bucket-resource BUCKET_NAME
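For the role grant described at the start of this section, a project administrator could use the gcloud CLI instead of the console. This is a minimal sketch, assuming the role is granted at the project level with gcloud projects add-iam-policy-binding; the project ID and the member user:alice@example.com are hypothetical placeholders, and the run helper is illustrative. The script prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders; replace with your own values.
PROJECT_ID="${PROJECT_ID:-my-project}"
MEMBER="${MEMBER:-user:alice@example.com}"
# Either roles/dataplex.admin or roles/dataplex.editor.
ROLE="${ROLE:-roles/dataplex.editor}"
# Commands are only printed by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Bind the chosen Dataplex role to the member on the project.
grant_dataplex_role() {
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="$MEMBER" \
    --role="$ROLE"
}

grant_dataplex_role
```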
Create a metastore
You can access Dataplex metadata using Hive Metastore in Spark queries by associating a Dataproc Metastore service instance with your Dataplex lake. You need to have a gRPC-enabled Dataproc Metastore (version 3.1.2 or higher) associated with the Dataplex lake.
Create a Dataproc Metastore service.
Configure the Dataproc Metastore service instance to expose a gRPC endpoint (instead of the default Thrift Metastore endpoint). Run the following update API request:
curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://metastore.googleapis.com/v1beta/projects/PROJECT_ID/locations/LOCATION/services/SERVICE_ID?updateMask=hiveMetastoreConfig.endpointProtocol" \
    -d '{"hiveMetastoreConfig": {"endpointProtocol": "GRPC"}}'
View the gRPC endpoint. Run the following command:
gcloud metastore services describe SERVICE_ID \
    --project PROJECT_ID \
    --location LOCATION \
    --format "value(endpointUri)"
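The first step in this section, creating the Dataproc Metastore service, is described without a command. The following is a rough sketch of how it might look with the gcloud CLI. It assumes that gcloud metastore services create accepts a --hive-metastore-version flag (check gcloud metastore services create --help for your SDK version); the service ID, project, and region are hypothetical placeholders. The script prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders; replace with your own values.
PROJECT_ID="${PROJECT_ID:-my-project}"
LOCATION="${LOCATION:-us-central1}"
SERVICE_ID="${SERVICE_ID:-my-metastore}"
# Dataplex requires Hive Metastore version 3.1.2 or higher.
HIVE_VERSION="${HIVE_VERSION:-3.1.2}"
# Commands are only printed by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Assumption: the --hive-metastore-version flag is available in your gcloud release.
create_metastore_service() {
  run gcloud metastore services create "$SERVICE_ID" \
    --project="$PROJECT_ID" \
    --location="$LOCATION" \
    --hive-metastore-version="$HIVE_VERSION"
}

create_metastore_service
```

The gRPC endpoint still needs to be configured afterwards, as described in the steps above.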
Create a Dataplex lake
The following steps show you how to create a Dataplex lake.
Console
Go to Dataplex in the Google Cloud console.
Navigate to the Manage view.
Click Create.
Enter a Display name.
The lake ID is automatically generated for you. If you prefer, you can provide your own ID. See Resource naming convention.
Optional: Enter a Description.
Specify the Region in which to create the lake.
For lakes created in a given region (for example, us-central1), both single-region (us-central1) data and multi-region (us multi-region) data can be attached, depending on the zone settings.
Optional: Add labels to your lake.
Optional: In the Metastore section, click the Metastore service drop-down, and select the service you created in the Create a metastore section.
Click Create.
gcloud
Use the following gcloud alpha dataplex lakes create command to create a lake:
gcloud alpha dataplex lakes create LAKE \
    --location=LOCATION \
    --labels=k1=v1,k2=v2,k3=v3 \
    --metastore-service=METASTORE_SERVICE
Replace the following:
- LAKE: The name of the new lake.
- LOCATION: The Google Cloud region in which to create the lake.
- k1=v1,k2=v2,k3=v3: The labels used (if any).
- METASTORE_SERVICE: The Dataproc Metastore service, if one was created.
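After the create command returns, you may want to confirm that the lake exists. This is a minimal sketch, assuming the matching gcloud alpha dataplex lakes describe command; the lake name, project, and region are hypothetical placeholders, and the script prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders; replace with your own values.
PROJECT_ID="${PROJECT_ID:-my-project}"
LOCATION="${LOCATION:-us-central1}"
LAKE="${LAKE:-my-lake}"
# Commands are only printed by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Show the lake's details to confirm that it was created.
describe_lake() {
  run gcloud alpha dataplex lakes describe "$LAKE" \
    --project="$PROJECT_ID" \
    --location="$LOCATION"
}

describe_lake
```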
REST
Follow the API instructions to create a lake by using the APIs Explorer.
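If you prefer to call the lakes.create method directly instead of using the APIs Explorer, the request can be sketched with curl. This example assumes the v1 endpoint of the Dataplex API, with the lake ID passed as the lakeId query parameter; the project, region, lake ID, display name, and label are hypothetical placeholders. The script prints the request unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders; replace with your own values.
PROJECT_ID="${PROJECT_ID:-my-project}"
LOCATION="${LOCATION:-us-central1}"
LAKE_ID="${LAKE_ID:-my-lake}"
DISPLAY_NAME="${DISPLAY_NAME:-My lake}"
# The request is only printed by default; set DRY_RUN=0 to send it.
DRY_RUN="${DRY_RUN:-1}"

# Build the lakes.create URL for the chosen project, region, and lake ID.
lake_create_url() {
  printf 'https://dataplex.googleapis.com/v1/projects/%s/locations/%s/lakes?lakeId=%s' \
    "$PROJECT_ID" "$LOCATION" "$LAKE_ID"
}

# Minimal request body; the display name and label are illustrative values.
lake_create_body() {
  printf '{"displayName": "%s", "labels": {"k1": "v1"}}' "$DISPLAY_NAME"
}

# Print the request in dry-run mode, otherwise send it with curl.
create_lake() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'POST %s\n%s\n' "$(lake_create_url)" "$(lake_create_body)"
  else
    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "$(lake_create_url)" \
      -d "$(lake_create_body)"
  fi
}

create_lake
```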
What's next?
- Learn how to organize your data into lakes and zones.
- Add zones to your lake.
- Attach assets to your zones.
- Learn how to secure your lake.
- Learn how to manage your lake.