Use Data exploration workbench

Data exploration workbench in Dataplex (Explore) helps you interactively query fully governed, high-quality data with one-click access to Spark SQL scripts and Jupyter notebooks. It lets you collaborate across teams with built-in publishing, sharing, and searching of coding assets.

Explore provisions, scales, and manages the serverless infrastructure required to run your Spark SQL scripts and notebooks using user credentials. You can operationalize your work with one-click serverless scheduling from the workbench.

This document describes how to use the Explore features in Dataplex.

Costs

Explore is charged according to the Dataplex premium processing tier.

Terminology

The following terms are used in this document:

Environment

An environment provides serverless compute resources for your Spark SQL queries and notebooks to run within a lake. Environments are created and managed by a Dataplex administrator.

Administrators can authorize one or more users to run queries and notebooks on the configured environment by granting them the Developer role or associated IAM permissions.

Session

When an authorized user chooses an environment to run their queries and notebooks, Dataplex uses the specified environment configuration to create a user-specific active session. Depending on the environment configuration, a session automatically terminates if it's not used.

It takes a couple of minutes to start a new session per user. Once a session is active, it is used to run subsequent queries and notebooks for the same user. A session is active for a maximum of 10 hours.

For an environment, only one session is created per user, which is shared by both Spark SQL scripts and Jupyter notebooks.

Dataplex uses user credentials within a session to run operations, such as querying data from Cloud Storage and BigQuery.

Node

A node specifies the compute capacity in an environment configuration. One node maps to 4 Data Compute Units (DCU), which is comparable to 4 vCPUs and 16 GB of RAM.

Default environment

You can create one default environment per lake with the ID default. A default environment must use a default configuration. A default configuration consists of the following:

  • Compute capacity of one node.
  • Primary disk size of 100 GB.
  • Auto session shutdown (auto shutdown time) set to 10 minutes of idle time.
  • The sessionSpec.enableFastStartup parameter, which is set to true by default. When this parameter is set to true, Dataplex pre-provisions the sessions for this environment so that they are readily available, which reduces the initial session startup time.
  • A fast startup session is a single-node session that is charged at Dataplex Premium Processing SKU rates, just like a regular session. A maximum of one always-on session is available for fast startup, and it is charged even when not in use. Dataplex keeps this pre-created session alive for 10 hours, shuts it down, and then creates a new session.

If you don't explicitly choose an environment and you have set up a default environment beforehand, Dataplex uses the default environment to create sessions.

SQL script

A SQL script is a Spark SQL script that's saved as content within a lake in Dataplex. You can share the script with other principals and schedule it to run as a serverless Spark batch job in Dataplex. Dataplex enables out-of-the-box Spark SQL access to tables that map to data in Cloud Storage and BigQuery.

Notebook

A notebook is a Python 3 Jupyter notebook that's saved as content within a lake in Dataplex. You can share a saved notebook with other principals, or schedule it to run as a Dataproc Serverless Spark batch job in Dataplex.

For data in BigQuery, you can access BigQuery tables directly through Spark without using the %%bigquery magic command.
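For example, if a BigQuery dataset is attached as an asset to a zone, you can read its tables with Spark SQL from a notebook cell. The following is a minimal sketch: ZONE_ID and TABLE_ID are placeholders for your zone and table, and it assumes a SparkSession is available in the Explore session.

from pyspark.sql import SparkSession

# Reuse the session's SparkSession if one is already running.
spark = SparkSession.builder.getOrCreate()

# Read a BigQuery-backed table that Dataplex maps into the lake's metastore.
df = spark.sql("SELECT * FROM ZONE_ID.TABLE_ID LIMIT 10")
df.show()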

Before you begin

Link your lake to Dataproc Metastore

To use Explore, you must associate a gRPC-enabled Dataproc Metastore instance running version 3.1.2 or later with the Dataplex lake. Without a Dataproc Metastore (DPMS) instance and an environment mapped to your lake, you can't use the features of Explore.

Learn how to set up Dataproc Metastore with Dataplex to access metadata in Spark.

Required roles

Depending on the actions that you plan to perform, you need the following IAM roles. Permissions granted at the lake level are inherited by all environments within that lake.

Dataplex IAM roles:

  • Dataplex Viewer
  • Dataplex Developer
  • Dataplex Metadata Reader
  • Dataplex Data Reader

Additional roles:

Logging

To understand the usage of Explore, see the following documents:

Known limitations

This section describes the known limitations of Explore.

  • Explore is available for lakes in the following regions:

    • asia-northeast1
    • asia-southeast1
    • europe-west1
    • europe-west2
    • us-central1
    • us-east1
    • us-west1
  • A standard quota of 10 environments per region per project applies. For information about increasing the quota limit, see Working with quotas.

  • You can create environments with a maximum of 150 nodes. Individual user sessions are limited to 10 hours.

  • Spark SQL scripts can only query data within a given lake. If you want to query data in a different lake, you must switch to that lake and choose an environment within that lake.

  • Content resources are not restored after undeleting a project. Any data in the form of saved SQL scripts and notebooks is lost. Proceed with caution when deleting a project that has Explore content resources.

  • When scheduling a notebook, if the environment has custom packages, then you can schedule the notebook only using the gcloud CLI. For more information, see Schedule notebooks with custom packages.

  • If you delete an environment before deleting its scripts and notebooks, you can't access the Explore page. Therefore, make sure that you delete the scripts and notebooks before you delete an environment in Explore.

  • Using Hadoop Distributed File System (HDFS) in Explore sessions is not supported. Don't store any user data in an Explore session because it gets deleted when the session ends.

  • The maximum size limit for a notebook or a SQL script is 1 MB.

Create an environment

  1. In the Google Cloud console, go to the Dataplex Manage Lakes page.

    Go to Dataplex

  2. Choose the Dataplex lake for which you want to create an environment.

  3. Click the Environments tab.

  4. Click Create environment.

  5. In the Display name field, enter a name for your environment.

  6. In the Environment ID field, enter a unique ID.

  7. Optional: Enter a description for the new environment.

  8. In the Configure compute pane, specify the following:

    1. Number of nodes: The number of nodes to provision for user sessions created for this environment.
    2. Maximum number of nodes: The maximum number of nodes that Dataplex autoscales to in user sessions associated with this environment.
    3. Primary disk size: The disk size associated with each provisioned node.
    4. Auto shutdown time: The idle time after which Dataplex automatically shuts down user sessions associated with this environment. You can set a minimum of 10 minutes and a maximum of 60 minutes.
  9. In the Software packages (optional) pane, you can specify additional Python packages, JAR files, and Spark properties to install on user sessions provisioned for this environment.

    When you create an environment and provide a Cloud Storage file path for Java JAR files or Python packages, make sure that the Cloud Dataplex Service Agent has the required permissions to access those Cloud Storage files so that Dataplex can install the JARs or packages.

  10. Click Create.

Notes

  • A node maps to 4 Data Compute Units (DCU), which is comparable to 4 vCPUs and 16 GB of RAM.

  • You can create an environment with one node, or with three or more nodes.

  • If you're a lake administrator, you can set up environments ahead of time, enabling users to run their workloads using the pre-specified configurations.

  • Although environments can be shared with multiple users, Dataplex creates a separate session per user using the environment configuration.
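If you want to create environments programmatically rather than through the console, a minimal sketch using the google-cloud-dataplex Python client library might look like the following. The field names mirror the Environment resource (infrastructureSpec, sessionSpec); treat the exact client classes and parameters shown here as assumptions and verify them against the client library reference.

from google.cloud import dataplex_v1
from google.protobuf import duration_pb2

client = dataplex_v1.DataplexServiceClient()

environment = dataplex_v1.Environment(
    display_name="exploration-env",  # hypothetical name
    infrastructure_spec=dataplex_v1.Environment.InfrastructureSpec(
        compute=dataplex_v1.Environment.InfrastructureSpec.ComputeResources(
            node_count=1,      # one node = 4 DCUs (4 vCPUs, 16 GB of RAM)
            max_node_count=3,  # upper bound for session autoscaling
            disk_size_gb=100,
        ),
        os_image=dataplex_v1.Environment.InfrastructureSpec.OsImageRuntime(
            image_version="latest",
        ),
    ),
    session_spec=dataplex_v1.Environment.SessionSpec(
        # Shut down idle user sessions after 10 minutes.
        max_idle_duration=duration_pb2.Duration(seconds=600),
    ),
)

operation = client.create_environment(
    parent="projects/PROJECT_ID/locations/REGION/lakes/LAKE_ID",
    environment_id="exploration-env",
    environment=environment,
)
print(operation.result())  # blocks until the long-running operation completes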

Create a default environment

See the configuration requirements for a default environment.

Console

  1. Open Dataplex in the Google Cloud console.

    Go to Dataplex

  2. Navigate to the Manage view.

  3. Choose a Dataplex lake.

  4. Click the Environments tab.

  5. Click Create default environment.

gcloud

To create a default environment with fast startup enabled, run the following command:

gcloud dataplex environments create default \
    --project=PROJECT_ID --lake=LAKE_ID --location=REGION \
    --os-image-version=latest --session-enable-fast-startup

Explore data using Spark SQL workbench

Create and save a script

  1. In the Google Cloud console, go to the Dataplex Explore page.

  2. In the Explore view, choose the lake containing the data assets you wish to explore.

  3. In the Spark SQL Editor, click New script, and type in your queries.

  4. To save the script, click Save > Save script.

    In the Explore view, expand the lake to access the tables, saved queries, and saved notebooks available in that lake.

Run a script

  1. In the Spark SQL Editor, click the tab with the query you want to run.

  2. Click Select environment. Choose the environment in which you want to run the query. If you don't select an environment, Dataplex uses the default environment to create a session per user.

    You can run multiple Spark SQL queries in the same script by separating the queries with semicolons.

  3. Click Run.

  4. Use the Query history drop-down list to view the results for each of the queries in the script.

Schedule a script

You can schedule a script to run as a Dataplex Task. For more information, see Create and manage schedules for SQL scripts.

Share a script

You can share a script with others in the organization using IAM permissions:

  1. In the Explore view, click the Spark SQL script you want to share.

  2. In the More menu, click Share.

  3. Review the permissions. Add or remove viewer, editor, and admin permissions for the shared script.

After you share a script, users with view or edit permissions at the lake level can navigate to the lake and work on the shared script.

Explore BigQuery and Cloud Storage data using Spark SQL

For any BigQuery dataset that is added as an asset to a zone, Dataplex enables direct Spark SQL access to all the tables in that dataset. You can query data in Dataplex using Spark SQL scripts or notebooks. For example:

 select * from ZONE_ID.TABLE_ID

If your assets map to Cloud Storage buckets in the same zone, Dataplex provides a unified list of tables that you can query using Spark.

Explore data using notebooks

Create and save a notebook

  1. In the Google Cloud console, go to the Dataplex Explore page.

  2. In the Explore view, choose a lake.

  3. Expand the lake and click the Notebooks folder.

  4. Click New notebook.

  5. In the Notebook path field, provide the name of the notebook.

  6. Optional: In the Description field, provide a description for the new notebook.

  7. Optional: Add labels.

  8. Click Create notebook. A notebook is now created.

  9. To open the created notebook, click Open notebook.

  10. When creating or opening your notebook, select an environment in which you want Dataplex to create a user session. Make sure that you select an environment with packages that you trust.

    If you don't select an environment, Dataplex uses the default environment. If you don't have an environment, create one. For more information, see Create an environment.

    You can now explore your data by writing Python code and saving the notebook after exploration. Later, you can preview the created notebook and examine its output without creating a session and running the code.
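For example, a first exploration cell might read one of the lake's tables into a DataFrame and summarize it. The following is a minimal sketch: ZONE_ID and TABLE_ID are placeholders for your zone and table, the category column is hypothetical, and it assumes a SparkSession is available in the Explore session.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the session's SparkSession if one is already running.
spark = SparkSession.builder.getOrCreate()

# Load a table that Dataplex exposes through the lake's metastore.
df = spark.table("ZONE_ID.TABLE_ID")

# Inspect the schema and a sample of rows.
df.printSchema()
df.show(5)

# Example aggregation; replace `category` with a column from your table.
df.groupBy("category").count().orderBy(F.desc("count")).show()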

Schedule a notebook

You can schedule a notebook to run as a Dataplex Task. For more information, see Create and manage schedules for notebooks.

Share a notebook

You can share a notebook with others in the organization using IAM permissions:

  1. In the Explore view, click the Notebooks folder.

  2. Select the Jupyter notebook you want to share.

  3. Click Share.

  4. Review the permissions. Add or remove viewer, editor, and admin permissions for this notebook.

    After you share a notebook, users with view or edit permissions at the lake level can navigate to the lake and work on the shared notebook.

Import a notebook

You can import a notebook from a Cloud Storage bucket:

  1. In the Explore view, click the Notebooks folder.

  2. Click Import.

  3. Navigate to the Cloud Storage bucket that contains the notebook you want to import.

  4. Select the notebook, provide a name, and click Import.

    The imported notebook is created in the Notebooks folder. You can open, edit, share, and schedule the imported notebook.

Export a notebook

You can export a notebook to a Cloud Storage bucket so that it can be used by others in the organization with IAM permissions.

  1. In the Explore view, click the Notebooks folder.

  2. Select the notebook that you want to export.

  3. Click the menu and click Export.

  4. Enter the Cloud Storage path where you want to export the notebook.

  5. Click Export notebook.

What's next