Use Data exploration workbench

Data exploration workbench in Dataplex (Explore) lets you interactively query fully governed data with one-click access to Spark SQL scripts and Jupyter notebooks. Explore lets you collaborate across teams with built-in publishing, sharing, and searching of coding assets.

Explore provisions, scales, and manages the serverless infrastructure required to run your Spark SQL scripts and notebooks using user credentials. You can operationalize your work with serverless scheduling from the workbench.

This document describes how to use the Explore features in Dataplex.

Costs

Dataplex offers Explore at the premium processing tier.

Terminology

This document uses the following terms:

Environment

An environment provides serverless compute resources for your Spark SQL queries and notebooks to run within a lake. A Dataplex administrator creates and manages environments.

Administrators can authorize one or more users to run queries and notebooks on the configured environment by granting them the Dataplex Developer role or associated IAM permissions.

Session

When an authorized user chooses an environment to run their queries and notebooks, Dataplex uses the specified environment configuration to create a user-specific active session. Depending on the environment configuration, if a session is not in use, it automatically terminates.

It takes a couple of minutes to start a new session per user. Once a session starts, it runs subsequent queries and notebooks for the same user. A session is active for a maximum of 10 hours.

For an environment, Dataplex creates only one session per user, which is shared by both Spark SQL scripts and Jupyter notebooks.

Dataplex uses user credentials within a session to run operations, such as querying the data from Cloud Storage and BigQuery.

Node

A node specifies the compute capacity in an environment configuration. One node maps to 4 Data Compute Units (DCU), which is comparable to 4 vCPUs and 16 GB of RAM.

Default environment

You can create one default environment per lake with the ID default. A default environment must use a default configuration. A default configuration consists of the following:

  • Compute capacity of one node
  • Primary disk size of 100 GB
  • Auto session shutdown (auto shutdown time) set to 10 minutes of idle time
  • The sessionSpec.enableFastStartup parameter, which is set to true by default. When this parameter is set to true, Dataplex pre-provisions sessions for this environment so that they are readily available, which reduces the initial session startup time.
  • A fast startup session is a single-node session that Dataplex charges at premium processing SKU rates, the same as a regular session. A maximum of one always-on session is available for fast startup, and it incurs costs even when not in use. Dataplex keeps this pre-created session alive for 10 hours, shuts it down, and then creates a new session.

If you don't explicitly select an environment, and you have already set up a default environment, Dataplex uses the default environment to create sessions.

SQL script

A SQL script is a Spark SQL script that's saved as content within a Dataplex lake. You can save the script within a lake and share it with other principals. Also, you can schedule it to run as a batch serverless Spark job in Dataplex. Dataplex enables out-of-the-box Spark SQL access to tables that map to data in Cloud Storage and BigQuery.

Notebook

A Python 3 notebook is a Jupyter notebook that you save as content in a Dataplex lake. You can save a notebook as content within a lake and share it with other principals, or schedule it to run as a Dataproc Serverless Spark batch job in Dataplex.

For data in BigQuery, you can access BigQuery tables directly through Spark without using the %%bigquery magic command.
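
For example, the following notebook cell is a minimal sketch, assuming the session provides a preconfigured spark object, that the table is mapped into the lake through a zone, and that the Spark BigQuery connector is available in the session; zone_id, table_id, and the project and dataset names are placeholders.

  # Query a table that Dataplex maps into the lake (placeholders: zone_id, table_id).
  df = spark.sql("SELECT * FROM zone_id.table_id LIMIT 100")
  df.show()

  # Alternatively, read a BigQuery table through the Spark BigQuery connector,
  # assuming the connector is available in the session environment.
  bq_df = (
      spark.read.format("bigquery")
      .option("table", "project_id.dataset_id.table_id")
      .load()
  )
  bq_df.printSchema()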

Before you begin

Before you begin, link your lake to Dataproc Metastore and grant the required roles.

Link your lake to Dataproc Metastore (DPMS)

To use Explore, do the following:

  • Associate a gRPC-enabled Dataproc Metastore (DPMS) instance in version 3.1.2 or later with the Dataplex lake.
  • Make sure that you have a Dataproc Metastore and an environment mapped to your lake.

Learn how to set up Dataproc Metastore with Dataplex to access metadata in Spark.

Required roles

Depending on the actions that you plan to perform, you need one or more of the following IAM roles. All environments in a lake inherit the permissions granted at the lake level.

Dataplex IAM roles:

  • Dataplex Viewer
  • Dataplex Developer
  • Dataplex Metadata Reader
  • Dataplex Data Reader

Additional roles:

Logging

To understand the usage of Explore, see the following documents:

Known limitations

This section describes the known limitations of Explore.

  • Explore is available for lakes in the following regions:

    • asia-northeast1
    • asia-southeast1
    • europe-west1
    • europe-west2
    • us-central1
    • us-east1
    • us-west1
  • You can use up to 10 environments per region in a project. For information about increasing the quota limit, see Working with quotas.

  • You can create environments with a maximum of 150 nodes. The session length for individual user sessions is restricted to 10 hours.

  • Spark SQL scripts can only query data within a given lake. If you want to query data in a different lake, you must switch to that lake and select an environment within that lake.

  • After you undelete a project, Dataplex doesn't restore content resources, such as SQL scripts or notebooks. Proceed with caution when deleting a project with Explore content resources.

  • When scheduling a notebook, if the environment has custom packages, then you can schedule the notebook only using the gcloud CLI. For more information, see Schedule notebooks with custom packages.

  • If you delete an environment before deleting the scripts and notebooks, you cannot access the Explore page. Therefore, make sure that you delete the scripts and notebooks before deleting an environment in Explore.

  • Explore sessions don't support the Hadoop Distributed File System (HDFS). Don't store any user data in an Explore session, because it is deleted when the session ends.

  • The maximum size limit for a notebook or a SQL script is 1 MB.

Create an environment

  1. In the Google Cloud console, go to the Dataplex Manage Lakes page.

    Go to Dataplex

  2. Select a Dataplex lake for which you would like to create an environment.

  3. Click the Environments tab.

  4. Click Create environment.

  5. In the Display name field, enter a name for your environment.

  6. In the Environment ID field, enter a unique ID.

  7. Optional: Enter a description for the new environment.

  8. In the Configure compute pane, specify the following:

    1. Number of nodes: The number of nodes to be provisioned for user sessions created for this environment.
    2. Maximum number of nodes: The maximum number of nodes to which Dataplex can autoscale the user sessions associated with this environment.
    3. Primary disk size: The disk size associated with each provisioned node.
    4. Auto shutdown time: The idle time after which Dataplex automatically shuts down user sessions associated with this environment. You can set a minimum of 10 minutes and a maximum of 60 minutes.
  9. In the Software packages (optional) pane, you can specify additional Python packages, JAR files, and Spark properties to install on user sessions provisioned for this environment.

    When you create an environment and provide a Cloud Storage path for Java JAR files or Python packages, make sure that the Cloud Dataplex Service Agent has the required permissions to access those Cloud Storage files so that Dataplex can install the JARs or packages. For one way to grant that access, see the sketch after these steps.

  10. Click Create.
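
If the JAR files or packages live in a bucket that the service agent can't read, the following minimal sketch grants bucket-level read access with the google-cloud-storage client. It assumes the service agent uses the service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com format and that roles/storage.objectViewer is sufficient for your setup; the bucket name and project number are placeholders, and you can grant the same role in the console or with gcloud instead.

  from google.cloud import storage

  # Placeholders: your bucket and the Dataplex service agent for your project
  # number (verify the exact account on the IAM page of the console).
  BUCKET_NAME = "my-packages-bucket"
  SERVICE_AGENT = (
      "serviceAccount:service-123456789012@gcp-sa-dataplex.iam.gserviceaccount.com"
  )

  client = storage.Client()
  bucket = client.bucket(BUCKET_NAME)

  # Grant read access to the objects that hold the JAR files or Python packages.
  policy = bucket.get_iam_policy(requested_policy_version=3)
  policy.bindings.append(
      {"role": "roles/storage.objectViewer", "members": {SERVICE_AGENT}}
  )
  bucket.set_iam_policy(policy)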

Notes

  • A node maps to 4 Data Compute Units (DCU), which is comparable to 4 vCPUs and 16 GB of RAM.

  • You can create an environment with one node, or with three or more nodes.

  • If you're a lake administrator, you can set up environments ahead of time, enabling users to run their workloads using the pre-specified configurations.

  • Although environments can be shared with multiple users, Dataplex creates a separate session per user using the environment configuration.

Create a default environment

See the configuration requirements for a default environment.

Console

  1. Open Dataplex in the Google Cloud console.

    Go to Dataplex

  2. Navigate to the Manage view.

  3. Select a Dataplex lake.

  4. Click the Environments tab.

  5. Click Create default environment.

gcloud

To create a default environment with fast startup enabled, run the following command:

gcloud dataplex environments create default --project=PROJECT_ID --lake=LAKE_ID --location=REGION --os-image-version=latest --session-enable-fast-startup

Explore data using Spark SQL workbench

To explore BigQuery and Cloud Storage data, use Spark SQL scripts.

Create and save a script

  1. In the Google Cloud console, go to the Dataplex Explore page.

  2. In the Explore view, select the lake containing the data assets you want to explore.

  3. In the resource browser, expand the lake. This displays the following folders:

    • Data: Contains all the databases and tables in the DPMS instance connected to your lake, including Hudi, Iceberg, and Delta Lake tables.
    • Notebooks: Contains all the notebooks created in the selected lake.
    • Spark SQL Scripts: Contains all the Spark SQL scripts created in the selected lake.
  4. Expand Data, and select the required database and the table.

  5. To use a sample query, click QUERY. The Spark SQL workbench auto-populates a new tab with a sample query.

  6. To create a new script, in the Spark SQL Editor, click New script, and enter your queries.

  7. To save the script, select Save > Save script.

Run a script

  1. In the Spark SQL Editor, click the tab with the query you want to run.

  2. Click Select environment. Select the environment in which you want to run the query. If you don't select an environment, Dataplex uses the default environment to create a session per user.

    You can run multiple Spark SQL queries in the same script by separating the queries with semicolons.

  3. Click Run.

  4. Use the drop-down list to view the Query history results for each query in the script.

Schedule a script

You can schedule a script to run as a Dataplex Task. For more information, see Create and manage schedules for SQL scripts.

Share a script

You can share a script with others in the organization using IAM permissions:

  1. In the Explore view, click the Spark SQL script you want to share.

  2. In the More menu, click Share.

  3. Review the permissions. Add or remove viewer, editor, and administrator permissions for the shared script.

After you share a script, users with view or edit permissions at the lake level can navigate to the lake and work on the shared script.

Explore BigQuery and Cloud Storage data using Spark SQL

For any BigQuery dataset that is added as an asset to a zone, Dataplex enables direct Spark SQL access to all the tables in that dataset. You can query data in Dataplex using Spark SQL scripts or notebooks. For example:

 select * from ZONE_ID.TABLE_ID

If your assets map to Cloud Storage buckets in the same zone, Dataplex provides a unified list of tables that you can query using Spark.
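
For example, the following minimal sketch, run from a notebook session in the same lake (the equivalent SHOW and SELECT statements also work in a Spark SQL script), lists what Dataplex exposes to Spark and queries one of the tables. It assumes the session provides a preconfigured spark object, and zone_id and table_id are placeholders.

  # List the databases (zones) and tables that Dataplex exposes to Spark.
  spark.sql("SHOW DATABASES").show()
  spark.sql("SHOW TABLES IN zone_id").show()

  # Query one of the listed tables, as in the SQL example above.
  spark.sql("SELECT * FROM zone_id.table_id LIMIT 10").show()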

Explore data using notebooks

This section describes how to create, schedule, share, import, and export notebooks.

Create and save a notebook

  1. In the Google Cloud console, go to the Dataplex Explore page.

  2. In the Explore view, select a lake.

  3. Expand the lake and click the Notebooks folder.

  4. Click New notebook.

  5. In the Notebook path field, provide the name of the notebook.

  6. Optional: In the Description field, provide a description for the new notebook.

  7. Optional: Add labels.

  8. Click Create notebook.

  9. To open the created notebook, click Open notebook.

  10. When creating or opening your notebook, select the environment in which you want Dataplex to create a user session. Make sure that you select an environment with packages that you trust.

    If you don't select an environment, Dataplex uses the default environment. If you don't have an environment, create one. For more information, see Create an environment.

    You can now explore your data by writing Python code and saving the notebook after exploration. Later, you can preview the saved notebook and examine its output without creating a session or running the code. For a starting point, see the sketch after these steps.
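
The following first cell is a minimal sketch of that exploration flow, assuming the session provides a preconfigured spark object; zone_id and table_id are placeholders for a table in your lake.

  # Load a lake table into a DataFrame and take a first look at it.
  df = spark.sql("SELECT * FROM zone_id.table_id")
  df.printSchema()

  # Basic summary statistics for the numeric columns.
  df.describe().show()

  # Preview a handful of rows before deeper analysis.
  df.show(10)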

Schedule a notebook

You can schedule a notebook to run as a Dataplex Task. For more information, see Create and manage schedules for notebooks.

Share a notebook

You can share a notebook with others in the organization using IAM permissions:

  1. In the Explore view, click the Notebooks folder.

  2. Select the Jupyter notebook that you want to share.

  3. Click Share.

  4. Review the permissions. Add or remove viewer, editor, and administrator permissions for this notebook.

    After you share a notebook, users with view or edit permissions at the lake level can navigate to the lake and work on the shared notebook.

Import a notebook

You can import a notebook from a Cloud Storage bucket:

  1. In the Explore view, click the Notebooks folder.

  2. Click Import.

  3. Navigate to the Cloud Storage bucket that contains the notebook that you want to import.

  4. Select the notebook, provide a name, and click Import.

    The imported notebook is created in the Notebooks folder. You can open, edit, share, and schedule the imported notebook.

Export a notebook

You can export a notebook to a Cloud Storage bucket so that it can be used by others in the organization with IAM permissions.

  1. In the Explore view, click the Notebooks folder.

  2. Select the notebook that you want to export.

  3. Click the menu and click Export.

  4. Enter the Cloud Storage path where you want to export the notebook.

  5. Click Export notebook.

What's next