Dataplex Explore provides a fully managed, serverless data exploration experience that lets you explore your data using Apache SparkSQL queries and Jupyter notebooks. Dataplex provisions, scales, and manages the serverless infrastructure required to run your SparkSQL queries and notebooks using user credentials. You can also save, share, and schedule your queries and notebooks from the same interface.
This document describes how to use the data exploration features in Dataplex.
Terminology
The following terms are used throughout this document.
An environment provides the serverless compute resources that your SparkSQL queries and notebooks need to run within a lake. Environments are created and managed by a Dataplex administrator.
When an authorized user chooses an environment in which to run their queries and notebooks, Dataplex uses the specified environment configuration to create a user-specific, active session.
If a user runs both an Apache SparkSQL query and a notebook, both run in the same session if that session is available for both operations.
Dataplex uses user credentials within a session to run operations such as querying the data from Cloud Storage and BigQuery.
A node specifies the compute capacity in an environment configuration. One node maps to 4 Data Compute Units (DCU), which is comparable to 4 vCPUs and 16 GB of RAM.
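The node-to-capacity mapping described above can be sketched as a simple calculation. The 4 DCU / 4 vCPU / 16 GB figures come from this document; the function itself is purely illustrative, not a Dataplex API:

```python
# Illustrative only: converts a node count into the approximate
# capacity described in this document (1 node = 4 DCU, which is
# comparable to 4 vCPUs and 16 GB of RAM).

def environment_capacity(nodes: int) -> dict:
    """Return the approximate compute capacity for a node count."""
    return {
        "dcus": nodes * 4,
        "vcpus": nodes * 4,
        "ram_gb": nodes * 16,
    }

# A 3-node environment maps to 12 DCUs (~12 vCPUs, 48 GB of RAM).
print(environment_capacity(3))  # {'dcus': 12, 'vcpus': 12, 'ram_gb': 48}
```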
You can create one default environment per lake with the ID default. A default environment must use a default configuration. A default configuration consists of the following:
- Compute capacity of 1 node
- Primary disk size of 100 GB
- Auto session shutdown (auto shutdown time) set to 10 minutes of idle time
- The sessionSpec.enableFastStartup parameter, which is set to true by default. When this parameter is set to true, Dataplex pre-provisions the sessions for this environment so that they are readily available, which reduces the initial session startup time.
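Summarized as data, the default configuration looks roughly like the following. Only sessionSpec.enableFastStartup is a real parameter name from this document; the other keys are illustrative labels, not actual API field names:

```python
# Illustrative summary of the default environment configuration.
# Only "sessionSpec.enableFastStartup" is a real parameter name;
# the other keys are labels for the values listed above.
DEFAULT_ENVIRONMENT = {
    "nodes": 1,                         # compute capacity of 1 node
    "primary_disk_gb": 100,             # primary disk size of 100 GB
    "auto_shutdown_idle_minutes": 10,   # idle time before auto shutdown
    "sessionSpec.enableFastStartup": True,  # sessions are pre-provisioned
}

print(DEFAULT_ENVIRONMENT["sessionSpec.enableFastStartup"])  # True
```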
Dataplex uses the default environment to create sessions if you do not explicitly choose an environment.
A SQL query is a SparkSQL query that is saved as content in Dataplex within a lake. You can save the query within a lake and share it with other IAM users. You can also schedule it to run as a batch serverless Spark job in Dataplex. Dataplex provides out-of-the-box SparkSQL access to tables that map to data in Cloud Storage and BigQuery.
A Python 3 notebook is a Jupyter notebook that is saved as content in Dataplex within a lake. You can save a notebook as content within a lake and share it with other IAM users, or schedule it to run as a batch serverless Spark job in Dataplex.
For data in BigQuery, you can access BigQuery tables directly through Spark without using the %%bigquery magic command.
Before you begin
You need to have a gRPC-enabled Dataproc Metastore (version 3.1.2 or higher) associated with the Dataplex lake in which you want to use Explore. Learn how to set up the Dataproc Metastore with Dataplex to access metadata in Spark.
You need the following IAM roles, depending on the actions you plan to perform:
- Dataplex Administrator: Provides permissions to manage environments (create, edit, delete) and grant other IAM users (Editor, Developer, and Administrator) privileges.
- Dataplex Editor: Provides permissions to manage (create, edit, delete) environments.
- Dataplex Developer: Provides permissions to run SparkSQL queries and notebooks in authorized environments.
Permissions granted at the lake level are inherited by all environments within that lake.
Known limitations
This section describes the known limitations of Explore while it is in Preview.
Explore is available for lakes in the
The data exploration capabilities in Explore are not supported in projects with a VPC Service Controls perimeter enabled.
It can take 2.5 to 3.5 minutes to start a new session per user. Once a session is active, it is used to run subsequent queries and notebooks for the same user.
Tip: Create a default environment with fast startup enabled.
You can create environments with a maximum of 6 nodes. Individual user sessions are limited to 120 minutes. These limits will be relaxed after Explore becomes generally available.
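The Preview limits above can be expressed as a small validation sketch. The function name and structure are hypothetical; only the 6-node and 120-minute limits come from this document:

```python
MAX_NODES = 6              # Preview limit on environment size
MAX_SESSION_MINUTES = 120  # Preview limit on session length

def check_preview_limits(nodes: int, session_minutes: int) -> list:
    """Return a list of Preview-limit violations (empty if none)."""
    problems = []
    if nodes > MAX_NODES:
        problems.append(f"environments are limited to {MAX_NODES} nodes")
    if session_minutes > MAX_SESSION_MINUTES:
        problems.append(f"sessions are limited to {MAX_SESSION_MINUTES} minutes")
    return problems

print(check_preview_limits(nodes=8, session_minutes=90))
```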
SparkSQL queries can only query data within a given lake. If you want to query data in a different lake, you need to switch to that lake and choose an environment within that lake.
Sharing environments is not supported in the Cloud console. However, you can share environments with the following gcloud CLI command:
gcloud dataplex environments add-iam-policy-binding ENVIRONMENT_NAME --project=PROJECT_ID --lake=LAKE_ID --location=REGION --member="user:USER_EMAIL" --role="roles/dataplex.developer"
Sharing SparkSQL queries and notebooks is not supported.
Scheduling notebooks is not yet supported.
Create an environment
In the Cloud console, go to the Dataplex page:
Navigate to the Manage view.
Choose a Dataplex lake for which you would like to create an environment.
Click the Environments tab.
Click Create environment.
Under Display name, enter a name for your environment.
Under Environment ID, enter a unique ID.
Under Configure compute, you can specify the following:
- Initial number of nodes: The number of nodes that will be provisioned for user sessions created for this environment.
- Maximum number of nodes: The maximum number of nodes that Dataplex will autoscale in the user sessions associated with this environment.
- Primary disk size: The disk size associated with each provisioned node.
- Auto shutdown time: The idle time after which Dataplex will automatically shut down user sessions associated with this environment. You can set a minimum of 10 minutes and a maximum of 60 minutes.
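A hedged sketch of the constraints implied by these fields follows. The helper is hypothetical, not a Dataplex API; the 10-to-60-minute auto shutdown range comes from this document, and the initial-versus-maximum node check reflects how the two node settings relate:

```python
def validate_compute_config(initial_nodes: int, max_nodes: int,
                            auto_shutdown_minutes: int) -> list:
    """Check the compute settings described above; return any violations."""
    errors = []
    if initial_nodes > max_nodes:
        # The initial node count seeds sessions; autoscaling grows
        # them up to the maximum, so initial cannot exceed maximum.
        errors.append("initial nodes cannot exceed maximum nodes")
    if not 10 <= auto_shutdown_minutes <= 60:
        errors.append("auto shutdown time must be between 10 and 60 minutes")
    return errors

print(validate_compute_config(2, 4, 30))  # []
```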
Under Software packages, you can specify additional Python packages, jar files, and Spark properties to install on user sessions provisioned for this environment.
A node maps to 4 Data Compute Units (DCU), which is comparable to 4 vCPUs and 16 GB of RAM.
You can create an environment with 1 node, or with 3 or more nodes.
If you are a lake administrator, you can set up environments ahead of time, enabling users to run their workloads using these pre-specified configurations.
Although environments can be shared with multiple users, Dataplex creates a separate session per user using the environment configuration.
Create a default environment
Open Dataplex in the Cloud console.
Navigate to the Manage view.
Choose a Dataplex lake.
Click the Environments tab.
Click Create default environment.
To create a default environment with fast startup enabled, run the following command:
gcloud dataplex environments create default --project=PROJECT_ID --lake=LAKE_ID --location=REGION --os-image-version=latest --session-enable-fast-startup
Explore data using SparkSQL scripts
Create and save a script
In the Explore view, under New query, type in your query.
Click Save query.
Enter a query path.
Choose a lake from the dropdown.
To explore the data and saved queries available in your lake, expand the data explorer under Explore. You can see the tables available under Zones as well as saved queries and notebooks available in your lake.
Run the script
In the Explore view, click the tab with the query you'd like to run.
Click Select environment. Choose the environment in which you would like to run the query. If you don't select an environment, Dataplex uses the default environment to create a session per user.
To run queries, you must be granted the Developer role on the environment.
Schedule a script
You can schedule a script to run as a Dataplex Task.
Explore data using Notebooks
In the Explore view, expand the data explorer under Explore.
Choose a lake.
Under the lake, click the Notebooks folder.
(Optional) Click Select environment. Select an environment in which Dataplex creates a user session to create or open your Notebook. If you don't select an environment, Dataplex uses the default environment.
Click Create notebook, or select an existing notebook and click Open Notebook.
If you're creating a new Notebook, enter a Notebook path and an optional Description. Click Create notebook.
Explore BigQuery data using SparkSQL
For any BigQuery dataset that is added as an asset to a zone, Dataplex enables direct SparkSQL access to all the tables in that dataset. You can query data in Dataplex using SparkSQL queries or notebooks. For example:
select * from ZONE_ID.TABLE_ID
If your assets map to Cloud Storage buckets in the same zone, Dataplex provides a unified list of tables that you can query through Spark.
You can use the REST APIs to access the features outlined above. See the Dataplex reference docs.
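For example, listing a lake's environments over REST follows the standard Google Cloud resource URL pattern. The URL shape below is an assumption based on that pattern; verify it against the Dataplex reference docs before relying on it:

```python
def environments_url(project: str, location: str, lake: str) -> str:
    """Build the (assumed) v1 REST URL for listing a lake's environments."""
    return (
        "https://dataplex.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/lakes/{lake}/environments"
    )

print(environments_url("my-project", "us-central1", "my-lake"))
```

An authenticated GET request to this URL would return the environments in the lake; authentication and response handling are omitted here.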