Migrate Dataplex Explore to BigQuery Studio

Dataplex is discontinuing support for Explore. This document outlines the steps for migrating Dataplex Explore resources to BigQuery Studio. You can migrate your Spark SQL and JupyterLab Notebook content to BigQuery Studio, a unified data exploration platform.

Deprecated features

For questions or clarifications, reach out to the Explore team at dataplex-explore-support@google.com.

Before you begin

  • Enable the BigQuery and BigQuery Studio APIs.


Notebook content

If you have notebooks in Explore that run serverlessly in a JupyterLab instance, you get the same experience in BigQuery Studio after migrating.

BigQuery Studio offers a notebook interface powered by Colab Enterprise, which provides several advantages over JupyterLab notebooks. You can still write, save, and run your notebooks in a serverless manner in BigQuery Studio. Additionally, you can benefit from the Colab Enterprise integrated cloud environment with powerful GPUs and TPUs, real-time collaboration, sharing and access control through Google Drive, automatic saving, pre-installed libraries, free usage with quotas, built-in widgets and extensions, and integration with other Google services like BigQuery and Cloud Storage.

Spark SQL content

Dataplex Discovery registers discovered tables in BigQuery and Dataproc Metastore. Depending on where the tables are registered, use one of the following migration options.

  • Tables are registered in both Dataproc Metastore and BigQuery: if the Spark SQL script interacts with Dataplex-discovered tables through Dataproc Metastore, then you can directly query those tables from BigQuery.
  • Tables are registered only in Dataproc Metastore: if the Spark SQL script interacts with tables that aren't available in BigQuery, then you need to set up BigQuery Studio integration with Dataproc Metastore. Dataproc Metastore provides two types of endpoints: Thrift and gRPC. For more information about how to find the endpoint protocol, see Find your endpoint URI value. Then, set up BigQuery Studio integration by using the steps in the following sections.
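Because the setup steps differ by endpoint protocol, a notebook can check the URI prefix before choosing a connection method. The following is a minimal sketch rather than part of any Google Cloud library: the function name `endpoint_protocol` and the sample URIs are hypothetical, and the check relies only on the prefixes described in this document (`thrift://` for Thrift, `https://` for gRPC).

```python
def endpoint_protocol(endpoint_uri: str) -> str:
    """Classify a Dataproc Metastore endpoint URI by its prefix.

    Returns "thrift" for thrift:// URIs and "grpc" for https:// URIs.
    Raises ValueError for anything else.
    """
    uri = endpoint_uri.strip().lower()
    if uri.startswith("thrift://"):
        return "thrift"
    if uri.startswith("https://"):
        return "grpc"
    raise ValueError(f"Unrecognized endpoint URI prefix: {endpoint_uri!r}")


# Hypothetical sample URIs, for illustration only.
print(endpoint_protocol("thrift://10.0.0.5:9083"))
print(endpoint_protocol("https://example-host:443"))
```

A result of `thrift` points to the Thrift section that follows, and `grpc` points to the proxy-based steps in the gRPC section.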

Connect to a Thrift-based Dataproc Metastore

A Thrift-based endpoint starts with thrift://. To connect to a Thrift-based Dataproc Metastore, pass the Thrift endpoint URI in the SparkSession configuration, as in the following sample:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Dataproc Metastore Connection")
    .config(
        "spark.hadoop.hive.metastore.uris",
        "thrift://IP_ADDRESS:9083",
    )
    .enableHiveSupport()
    .getOrCreate()
)
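If the IP address comes from a variable rather than a hard-coded string, it can help to validate it before building the session. The following sketch assumes an IPv4 address and the port 9083 used in the sample above; the helper name `thrift_metastore_conf` is hypothetical and not part of PySpark.

```python
import ipaddress

# Spark configuration key used in the sample above.
HIVE_METASTORE_URIS_KEY = "spark.hadoop.hive.metastore.uris"


def thrift_metastore_conf(ip_address: str, port: int = 9083) -> dict:
    """Build the Spark configuration entry for a Thrift metastore endpoint.

    Raises ValueError if the address is not valid IPv4 or the port is
    outside the range 1-65535.
    """
    ipaddress.IPv4Address(ip_address)  # raises a ValueError subclass if invalid
    if not 0 < port < 65536:
        raise ValueError(f"Invalid port: {port}")
    return {HIVE_METASTORE_URIS_KEY: f"thrift://{ip_address}:{port}"}


# Hypothetical address, for illustration only.
conf = thrift_metastore_conf("10.0.0.5")
print(conf)
```

Each key and value in the returned dictionary can be passed to `SparkSession.builder.config(key, value)`. After the session is created, running `spark.sql("SHOW DATABASES").show()` is one quick way to check that the metastore is reachable.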

Connect to a gRPC-based Dataproc Metastore

A gRPC-based endpoint starts with https://. Spark can't connect directly to non-Thrift endpoints. Instead, you must run a proxy service that converts requests from Thrift to gRPC. To connect to a gRPC-based Dataproc Metastore service, follow these steps in your BigQuery Studio notebook:

  1. Download the latest version of the Hive Metastore (HMS) proxy JAR file to the notebook runtime by running the following command:

    # Download the latest HMS proxy JAR file to the current directory.
    !gsutil cp gs://metastore-init-actions/metastore-grpc-proxy/hms-proxy-3.1.2-v0.0.46.jar .
    
  2. Start the HMS proxy.

    %%bash
    # Dataproc Metastore URI, including the port number but without the "https://" prefix.
    METASTORE_URI=METASTORE_URI
    # HMS Proxy JAR path.
    JAR_PATH=JAR_PATH
    # Hive version supported by Dataproc Metastore (DPMS).
    HIVE_VERSION=3.1.2
    
    # Start the HMS Proxy.
    java -jar ${JAR_PATH} --conf proxy.mode=thrift proxy.uri=${METASTORE_URI} thrift.listening.port=9083 hive.version=${HIVE_VERSION} google.credentials.applicationdefault.enabled=true proxy.grpc.ssl.upstream.enabled=true > /tmp/hms.logs 2>&1 &
    
  3. Connect the Spark session to the local HMS proxy.

    from pyspark.sql import SparkSession
    
    spark = (
      SparkSession.builder.appName("Dataproc Metastore Connection")
      .config(
          "spark.hadoop.hive.metastore.uris",
          "thrift://localhost:9083",
      )
      .enableHiveSupport()
      .getOrCreate()
    )
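Two small helpers can make these steps less error-prone. Step 2 expects the metastore URI without the `https://` prefix, and because the proxy starts in the background, the Spark session in step 3 can fail if it is created before the proxy is listening. The following is a minimal sketch using only the Python standard library. The function names, the timeout values, and the sample host are hypothetical, and the sketch assumes the proxy listens on localhost port 9083 as configured in step 2.

```python
import socket
import time


def strip_https_prefix(endpoint_uri: str) -> str:
    """Return the endpoint URI without a leading "https://" prefix."""
    prefix = "https://"
    uri = endpoint_uri.strip()
    if uri.lower().startswith(prefix):
        return uri[len(prefix):]
    return uri


def wait_for_port(host: str = "localhost", port: int = 9083,
                  timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds.

    Returns True once a connection is accepted, or False if none
    succeeds before timeout_s seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Hypothetical endpoint, for illustration only.
print(strip_https_prefix("https://example-host:443"))
```

One way to use this is to call `wait_for_port()` in a notebook cell between steps 2 and 3. If it returns `False`, the proxy output in /tmp/hms.logs is the first place to look.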
    

Session resources

A session resource refers to a user-specific active session. Migration of session resources is not supported.

Environment resources

An environment provides serverless compute resources for your Spark SQL queries and notebooks to run within a lake. Because BigQuery Studio provides a serverless environment for running SQL queries and notebooks, migration of environment resources is not supported.

Schedule a task with content resources

You can schedule queries in BigQuery Studio.