Dataproc Hub Overview

Dataproc Hub lets you take advantage of Vertex AI Workbench and Dataproc to run interactive ML and data processing tasks at scale using Jupyter notebooks and the Hadoop and Spark ecosystem.

Dataproc Hub notebooks are administrator-curated, single-user notebooks that run on a Dataproc JupyterLab cluster created in the user's project.

  • Dataproc Hub leverages JupyterHub to:

    • Bring consistency across the organization by enabling administrators to create a curated list of notebook templates for different groups of data and ML users.
    • Accelerate notebook creation by providing data and ML users with pre-configured environments that match their software and hardware requirements.
  • Dataproc Hub provides separate interfaces for administrators and users:

    • Administrators use the Dataproc→Workbench→User-Managed Notebooks page in the Google Cloud console to create Dataproc Hub instances. Each hub instance contains a predefined set of notebook environments defined by YAML cluster configuration files.
    • Data and ML users use the Notebooks→Instances UI in the Google Cloud console to select a predefined notebook environment and spawn a notebook server on their Dataproc cluster.
      • Users without console access can open the Dataproc Hub instance URL provided by the administrator in a web browser to spawn a Dataproc cluster.
  • Dataproc Hub use cases:

    • Data and ML users are organized in groups with common software and hardware requirements (users can be placed in multiple groups)
    • Restricted Dataproc console access: Users do not have access to Dataproc in the Google Cloud console
  • Dataproc Hub features:

    • Predefined user environments
    • Cluster and notebook isolation: members of a group are not provided easy access to clusters and notebooks of members in other groups
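
The relationship between groups, users, and predefined environments described above can be sketched as a small data model. This is an illustration only, not the Dataproc Hub implementation or API: the class names, the `environments_for` method, the group and user names, and the `gs://example-bucket/...` paths are all hypothetical. It assumes each environment is identified by a name and backed by one YAML cluster configuration file, and that a user who belongs to several groups sees the combined environments of those groups.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NotebookEnvironment:
    """One predefined environment backed by a YAML cluster configuration file."""
    name: str
    cluster_config_yaml: str  # hypothetical location of the YAML file


@dataclass
class HubInstance:
    """Hypothetical model of a hub instance curated by an administrator."""
    # Group name -> environments the administrator offers to that group.
    environments_by_group: dict[str, list[NotebookEnvironment]] = field(default_factory=dict)
    # User -> groups the user belongs to (a user can be in multiple groups).
    groups_by_user: dict[str, set[str]] = field(default_factory=dict)

    def environments_for(self, user: str) -> list[NotebookEnvironment]:
        """Return the environments a user can pick from, without duplicates."""
        seen: set[str] = set()
        result: list[NotebookEnvironment] = []
        # Sort group names so the result order is deterministic.
        for group in sorted(self.groups_by_user.get(user, set())):
            for env in self.environments_by_group.get(group, []):
                if env.name not in seen:
                    seen.add(env.name)
                    result.append(env)
        return result


# Hypothetical example data: two groups with one environment each.
spark_small = NotebookEnvironment("spark-small", "gs://example-bucket/configs/spark-small.yaml")
ml_gpu = NotebookEnvironment("ml-gpu", "gs://example-bucket/configs/ml-gpu.yaml")

hub = HubInstance(
    environments_by_group={
        "data-engineering": [spark_small],
        "ml-research": [ml_gpu],
    },
    groups_by_user={
        "alice": {"data-engineering"},
        "bob": {"data-engineering", "ml-research"},
    },
)

alice_choices = [env.name for env in hub.environments_for("alice")]
bob_choices = [env.name for env in hub.environments_for("bob")]
unknown_choices = hub.environments_for("carol")
```

In this sketch, a user in one group is offered only that group's environments, a user in two groups is offered both sets, and a user who belongs to no group is offered nothing, which mirrors the isolation between groups listed above.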

For more information