Create a Dataplex lake

This document describes how to create a Dataplex lake. You can create a lake in any of the regions that support Dataplex.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataplex, Dataproc, Dataproc Metastore, Data Catalog, BigQuery, and Cloud Storage APIs.

    Enable the APIs

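
As an alternative to the console, the APIs listed above can be enabled from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The function name enable_lake_apis is hypothetical, and the service names are the ones commonly used for these products, so check them against your project before relying on them.

```shell
# Sketch: enable the APIs this guide lists with a single gcloud call.
# enable_lake_apis is a hypothetical helper name, not part of gcloud.
enable_lake_apis() {
  project_id="$1"   # the project that will hold the lake
  gcloud services enable \
    dataplex.googleapis.com \
    dataproc.googleapis.com \
    metastore.googleapis.com \
    datacatalog.googleapis.com \
    bigquery.googleapis.com \
    storage.googleapis.com \
    --project "$project_id"
}
```

To use it, run enable_lake_apis PROJECT_ID, replacing PROJECT_ID with your project ID.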

Access control

  1. To create and manage your lake, make sure you have been granted one of the predefined roles roles/dataplex.admin or roles/dataplex.editor. For more information, see Grant a single role.

  2. To attach a Cloud Storage bucket from another project to your lake, grant the Dataplex service account an administrator role on the bucket by running the following command:

    gcloud alpha dataplex lakes authorize \
    --project PROJECT_ID_OF_LAKE \
    --storage-bucket-resource BUCKET_NAME
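
For the role requirement in step 1, a project-level grant could look like the sketch below. It assumes you are allowed to change the project's IAM policy. The function name grant_dataplex_role and the default role choice are illustrative assumptions, not something this guide prescribes.

```shell
# Sketch: grant one of the predefined Dataplex roles on a project.
# grant_dataplex_role is a hypothetical helper name.
grant_dataplex_role() {
  project_id="$1"                      # project that holds the lake
  member="$2"                          # for example user:alice@example.com
  role="${3:-roles/dataplex.editor}"   # assumed default; pass roles/dataplex.admin if needed
  gcloud projects add-iam-policy-binding "$project_id" \
    --member "$member" \
    --role "$role"
}
```

For example, grant_dataplex_role PROJECT_ID user:alice@example.com roles/dataplex.admin grants the administrator role to a single user.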
    

Create a metastore

You can access Dataplex metadata using Hive Metastore in Spark queries by associating a Dataproc Metastore service instance with your Dataplex lake. You need to have a gRPC-enabled Dataproc Metastore (version 3.1.2 or higher) associated with the Dataplex lake.

  1. Create a Dataproc Metastore service.

  2. Configure the Dataproc Metastore service instance to expose a gRPC endpoint (instead of the default Thrift Metastore endpoint):

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://metastore.googleapis.com/v1beta/projects/PROJECT_ID/locations/LOCATION/services/SERVICE_ID?updateMask=hiveMetastoreConfig.endpointProtocol" \
    -d '{"hiveMetastoreConfig": {"endpointProtocol": "GRPC"}}'
    
  3. View the gRPC endpoint:

    gcloud metastore services describe SERVICE_ID \
      --project PROJECT_ID \
      --location LOCATION \
      --format "value(endpointUri)"
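
Switching the endpoint protocol is an update to the service, so the endpoint might not be usable immediately. The sketch below polls the service until it reports the ACTIVE state and then prints the endpoint URI. It assumes the service resource exposes a state field with an ACTIVE value. The function name wait_for_metastore, the attempt count, and the 10-second interval are arbitrary choices for illustration.

```shell
# Sketch: wait for a Dataproc Metastore service to become ACTIVE,
# then print its endpoint URI. wait_for_metastore is a hypothetical helper.
wait_for_metastore() {
  service_id="$1"
  project_id="$2"
  location="$3"
  attempts="${4:-30}"   # assumed default: give up after 30 polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(gcloud metastore services describe "$service_id" \
      --project "$project_id" \
      --location "$location" \
      --format "value(state)")
    if [ "$state" = "ACTIVE" ]; then
      # Same describe call as in the step above, printing only the endpoint.
      gcloud metastore services describe "$service_id" \
        --project "$project_id" \
        --location "$location" \
        --format "value(endpointUri)"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "service $service_id did not become ACTIVE" >&2
  return 1
}
```

Run it as wait_for_metastore SERVICE_ID PROJECT_ID LOCATION. A non-zero exit status means the service did not reach ACTIVE within the allowed attempts.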
    

Create a lake

Console

  1. In the Google Cloud console, go to Dataplex.

    Go to Dataplex

  2. Navigate to the Manage view.

  3. Click Create.

  4. Enter a Display name.

  5. The lake ID is automatically generated for you. If you prefer, you can provide your own ID. See Resource naming convention.

  6. Optional: Enter a Description.

  7. Specify the Region in which to create the lake.

    For lakes created in a given region (for example, us-central1), you can attach both single-region (us-central1) data and multi-region (us multi-region) data depending on the zone settings.

  8. Optional: Add labels to your lake.

  9. Optional: In the Metastore section, click the Metastore service menu, and select the service you created in the Create a metastore section.

  10. Click Create.

gcloud

To create a lake, use the gcloud alpha dataplex lakes create command:

    gcloud alpha dataplex lakes create LAKE \
      --location=LOCATION \
      --labels=k1=v1,k2=v2,k3=v3 \
      --metastore-service=METASTORE_SERVICE

Replace the following:

  • LAKE: the name of the new lake
  • LOCATION: the Google Cloud region in which to create the lake
  • k1=v1,k2=v2,k3=v3: the labels to apply to the lake, if any
  • METASTORE_SERVICE: the Dataproc Metastore service to associate with the lake, if you created one
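
Because the labels and the metastore service are optional, a script might add those flags only when values are supplied. The following is a minimal sketch of that idea in POSIX shell. The function name create_lake and its argument order are assumptions made for this example.

```shell
# Sketch: build the lake creation command, adding optional flags only
# when a value is given. create_lake is a hypothetical helper name.
create_lake() {
  lake="$1"
  location="$2"
  labels="${3:-}"      # optional, for example k1=v1,k2=v2
  metastore="${4:-}"   # optional Dataproc Metastore service

  # Start with the required arguments, then append the optional ones.
  set -- alpha dataplex lakes create "$lake" "--location=$location"
  if [ -n "$labels" ]; then
    set -- "$@" "--labels=$labels"
  fi
  if [ -n "$metastore" ]; then
    set -- "$@" "--metastore-service=$metastore"
  fi
  gcloud "$@"
}
```

For example, create_lake LAKE LOCATION creates a lake with no labels or metastore, and create_lake LAKE LOCATION "" METASTORE_SERVICE attaches a metastore without labels.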

REST

To create a lake, use the lakes.create method.
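
A call to lakes.create could be sketched with curl as shown below, following the same pattern as the Dataproc Metastore request earlier on this page. The v1 endpoint path, the lakeId query parameter, and the displayName field reflect the Dataplex REST API as commonly documented, so treat them as assumptions to verify against the API reference. The function name create_lake_rest is hypothetical.

```shell
# Sketch: create a lake through the REST API with curl.
# create_lake_rest is a hypothetical helper name.
# Note: display_name is inserted into JSON as-is, so it must not
# contain double quotes or backslashes.
create_lake_rest() {
  project_id="$1"
  location="$2"
  lake_id="$3"
  display_name="$4"
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataplex.googleapis.com/v1/projects/${project_id}/locations/${location}/lakes?lakeId=${lake_id}" \
    -d "{\"displayName\": \"${display_name}\"}"
}
```

Run it as create_lake_rest PROJECT_ID LOCATION LAKE_ID "DISPLAY_NAME". The response is expected to describe a long-running operation rather than the finished lake.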

What's next?