This page shows how to resolve issues with the Data exploration workbench in Dataplex.
Database not found
When you run a Spark query from the SQL workbench or a Jupyter notebook, the following error occurs:
Script failed in execution.
org.apache.spark.sql.catalyst.parser.ParseException:
No viable alternative at input `zone-name`(line 1, pos24)
Dataplex zone names are mapped to Hive-compatible database names, which can be queried using Spark. Dataplex zone names can contain a hyphen (-), whereas Hive database names cannot. Therefore, hyphens in the Dataplex zone names are mapped to underscores (_) in the Hive database names.
To resolve this issue, follow these steps:
Get a list of available databases:
show databases
Review the list of returned database names and make sure that you're querying the correct database name.
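For example, with a hypothetical zone named sales-zone, the corresponding Hive database is sales_zone, so the query must reference the underscore form:
select * from sales_zone.TABLE_NAME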
Table not found
When you run a Spark query from the SQL workbench or a Jupyter notebook, the following error occurs:
Script failed in execution.
org.apache.spark.sql.AnalysisException: Table or view not found
Dataplex discovers the metadata for BigQuery and Cloud Storage assets and makes it accessible through Dataproc Metastore (DPMS). Spark queries that you run from the SQL workbench or Jupyter notebooks connect to DPMS to retrieve table metadata.
To resolve this issue, follow these steps:
Get the list of available tables:
show tables in DATABASE_NAME
Make sure that you are querying the correct table name.
If the table name contains uppercase letters, set spark.sql.caseSensitive to true in the environment configuration.
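If you want to confirm the behavior before updating the environment, you can also set the property for the current session; this is a minimal Spark SQL sketch, assuming the session allows runtime configuration changes:
SET spark.sql.caseSensitive=true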
Permission errors
Spark queries fail with permission errors. For example:
HiveException
TTransportException
To use the Explore features in Dataplex, you must be granted the required roles and permissions on the Dataplex resources and underlying assets.
To resolve the permission issue, follow these steps:
- Ensure that you are granted the required roles and permissions for using the Data exploration workbench.
- Ensure that you have read permissions on the underlying Cloud Storage and BigQuery assets.
- For custom packages, ensure that the Cloud Dataplex Service Agent has read permissions on the Cloud Storage bucket configured in the environment.
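For the last item, a minimal gcloud sketch for granting that access; BUCKET_NAME and PROJECT_NUMBER are placeholders, and the service agent address and role shown here are assumptions to verify for your project:
# Grant the Dataplex service agent read access on the bucket that holds the custom packages.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"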
Unable to delete lake containing scripts or notebooks
When you delete a lake that is used for Dataplex Explore and the lake contains scripts or notebooks, the following error occurs:
Failed to delete `projects/locations/region/lakes/lakename` since it has child
resources.
Dataplex Explore requires at least one environment to be present before the resource browser is available.
To resolve this issue, use one of the following workarounds:
- Use the gcloud CLI commands to delete scripts and notebooks from the lake, and then delete the lake (see the sketch after this list).
- Create a temporary environment which enables the resource browser. Delete all the scripts and notebooks, followed by the temporary environment and the lake.
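For the first workaround, a minimal gcloud sketch; the gcloud dataplex content commands and flags are assumptions to confirm with gcloud dataplex content --help, and LAKE_NAME, REGION, and CONTENT_ID are placeholders:
# List the scripts and notebooks (content items) in the lake.
gcloud dataplex content list --lake=LAKE_NAME --location=REGION
# Delete each content item, then delete the lake.
gcloud dataplex content delete CONTENT_ID --lake=LAKE_NAME --location=REGION
gcloud dataplex lakes delete LAKE_NAME --location=REGION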
Job aborted
When you run a Spark query, the job aborts if there is a critical error.
To resolve this issue, refer to the error message to identify the root cause of the issue, and fix it.
TTransportException when querying Iceberg tables
When you query a wide Iceberg table, a TTransportException occurs.
Iceberg has a known issue with Spark 3.1, which is the version available on the Dataproc 2.0 images that Dataplex Explore uses.
To resolve this issue, add an extra projection in the SELECT query.
For example:
SELECT a,b,c, 1 AS dummy FROM ICEBERG_TABLE
In this example, 1 AS dummy is the extra projection. For more information, see the issue details page.
Lakes don't appear in the Explore resource browser
Explore is available for lakes only in the us-central1, europe-west2, europe-west1, us-east1, us-west1, asia-southeast1, and asia-northeast1 regions. Lakes that belong to any other region don't appear in the Explore resource browser.
Unable to get started with Dataplex Explore
In the Google Cloud console, on the Dataplex page, when you click Explore, the following message is displayed:
In order to use Dataplex Explore to query data in Cloud Storage and BigQuery
using open source applications (ex: SparkSQL), connect a metastore. Get started
by setting up one. If DPMS is already attached to the lake and you are seeing
this page, please check the DPMS logs for any possible issues.
Explore works only if a lake has a Dataproc Metastore (DPMS) configured and at least one environment set up.
To resolve this issue, link your lake to Dataproc Metastore.
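A minimal gcloud sketch for attaching an existing DPMS instance to the lake; the --metastore-service flag is an assumption to verify, and the resource names are placeholders:
# Attach an existing Dataproc Metastore service to the lake.
gcloud dataplex lakes update LAKE_NAME \
  --location=REGION \
  --metastore-service=projects/PROJECT_ID/locations/REGION/services/METASTORE_ID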
Quota restrictions
When you create an environment, you might see quota-related errors.
To resolve this issue, review the following quotas before creating an environment:
- You can create 10 environments per lake.
- You can create environments with a maximum of 150 nodes.
- The session length for individual user sessions is restricted to 10 hours.
Session startup time is long
It takes 2.5-3.5 minutes to start a new session per user. Once a session is active, it's used to run subsequent queries and notebooks for the same user.
To reduce the session startup time, create a default environment with fast startup enabled.
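A minimal gcloud sketch for creating that environment; the name default and the --session-enable-fast-startup and --os-image-version flags are assumptions to verify with gcloud dataplex environments create --help:
# Create a default environment with fast startup so sessions are pre-provisioned.
gcloud dataplex environments create default \
  --lake=LAKE_NAME \
  --location=REGION \
  --os-image-version=latest \
  --session-enable-fast-startup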
Unable to schedule notebooks containing custom Python packages
In the Google Cloud console, when you schedule a notebook that contains custom Python packages, the following error occurs:
Selected environment ENVIRONMENT_NAME has additional Python
packages configured. These packages will not be available in the default runtime
for the scheduled notebook when scheduling in the Console. To make the required
additional Python packages available in the runtime, please create Notebook
Schedule using gcloud command instead, referencing a container image with
required additional packages.
You cannot schedule a notebook in the Google Cloud console if the environment has custom Python packages.
To resolve this issue, use the gcloud CLI to schedule notebooks containing custom packages.
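A minimal gcloud sketch, assuming the schedule is created as a Dataplex notebook task; the flags shown here (--notebook, --container-image, and the trigger options) are assumptions to verify with gcloud dataplex tasks create --help, and all values are placeholders:
# Schedule the notebook as a Dataplex task that runs in a container image
# containing the required additional Python packages.
gcloud dataplex tasks create TASK_ID \
  --lake=LAKE_NAME \
  --location=REGION \
  --trigger-type=RECURRING \
  --trigger-schedule="0 */6 * * *" \
  --notebook=gs://BUCKET_NAME/path/to/notebook.ipynb \
  --container-image=REGION-docker.pkg.dev/PROJECT_ID/REPO/IMAGE:TAG \
  --execution-service-account=SERVICE_ACCOUNT_EMAIL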