Extending Vertex AI Workbench user-managed notebooks to Dataproc and Google Kubernetes Engine

This document helps system administrators and data engineers choose the best approach to running Jupyter notebooks on Google Cloud. The document assumes that you are familiar with Jupyter notebooks and Dataproc Hub.

This document is part of a series that is primarily aimed at administrators who build Jupyter notebook environments for data science research.

Introduction

Companies of all sizes gather an ever-growing quantity of data. Making sense of it requires both human and technical resources. End users such as data scientists or data analysts often interact directly with that data through tools such as text editors, integrated development environments (IDEs), or notebook environments. System administrators build the infrastructure that supports these users in their investigations.

End users generally want to do the following:

  • Run interactive tasks at scale to facilitate development and debugging.
  • Run data tasks and machine learning tasks from the same notebook.
  • Have access to environments that match their needs in order to enhance productivity.
  • Keep notebooks separated from the processing infrastructure so that they don't lose work.
  • Quickly start an environment that matches their hardware and software requirements.

System administrators generally want to do the following:

  • Centrally control notebook environments to facilitate management.
  • Automate the deployment of the platform that manages notebook lifecycles in order to minimize operational overhead.
  • Provide users with limited but sufficient access so that they can do their jobs.
  • Enable users to run their jobs in isolation to provide flexibility to the end user.
  • Maximize resource utilization to minimize costs and overhead.

This document gives a general overview of notebook options on Google Cloud. The other documents in this series focus on how to customize and deploy JupyterHub on Google Cloud. JupyterHub helps organizations manage Jupyter notebook environments in a multi-user context.

Terminology

This series uses the following terms and products:

  • Product: an official and proprietary offering provided and supported by Google Cloud.
  • Solution: architecture and open-source software that uses several Google Cloud products.
  • Administrator: a person who manages resources that manage notebooks.
  • End user: a person who runs interactive computational tasks by using notebooks. For the purposes of this series, end users are primarily data scientists.
  • Hub: a UI for end users to customize and launch a new notebook server or to access the Jupyter interface of an existing notebook server.
  • Endpoint: a URL that an end user uses to access a notebook UI.
  • JupyterHub: a multi-user server that serves Jupyter notebooks to multiple users and manages the creation, proxying, and lifecycle of notebook servers.
  • Notebook server profile: a configuration created by an administrator that defines the hardware and software setup for the notebook environment.
  • Spawner: a process that's responsible for starting a single-user notebook server.

Deciding which product to use for deploying notebook-based platforms

The following decision tree helps you determine which product or solution to use when you deploy your notebook-based development platform.

Decision tree to help determine which product or solution to use to deploy a notebook-based development platform.

The tree illustrated in the diagram leads you through a set of decisions that help you determine what product or solution to use to deploy a notebook platform on Google Cloud.

The first question is whether you need to run the Apache Spark SDK in a distributed environment. The rest of the decisions depend on your answer to this question, as follows.

No, you don't need to run Apache Spark in a distributed environment

In that case, do you need to use the Apache Beam SDK on a single instance and then run code in Dataflow?

  • If you do need to use the Beam SDK, use user-managed notebooks that use the Apache Beam Direct Runner to provide an interactive research environment on a single machine. For details, see the Apache Beam on user-managed notebooks documentation.
  • If you don't need to use the Beam SDK, do you need to manage processing profiles centrally?
    • If yes, use GKE Hub Extended.
    • If no, do you need to consolidate cost? If yes, use GKE Hub Extended. If no, use user-managed notebooks.

Yes, you do need to use Apache Spark in a distributed environment

In that case, do you need to manage Dataproc cluster profiles centrally?

  • If you need to manage cluster profiles centrally, do you need advanced Dataproc Hub customization? If yes, use Dataproc Hub Extended. If no, use Dataproc Hub.
  • If you don't need to manage cluster profiles centrally, use Dataproc Notebooks.
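The decision tree can also be expressed as a short function. The following is a minimal sketch that only restates the branches described in this section. The function name and parameter names are hypothetical and chosen for illustration.

```python
def choose_notebook_platform(
    needs_distributed_spark: bool,
    needs_beam_sdk: bool = False,
    needs_central_profiles: bool = False,
    needs_cost_consolidation: bool = False,
    needs_advanced_hub_customization: bool = False,
) -> str:
    """Return the product or solution that the decision tree leads to."""
    if needs_distributed_spark:
        # Spark branch: the only remaining questions concern Dataproc cluster profiles.
        if not needs_central_profiles:
            return "Dataproc Notebooks"
        if needs_advanced_hub_customization:
            return "Dataproc Hub Extended"
        return "Dataproc Hub"

    # Non-Spark branch: Beam on a single instance comes first.
    if needs_beam_sdk:
        return "User-managed notebooks"
    # Central profile management or cost consolidation both lead to GKE.
    if needs_central_profiles or needs_cost_consolidation:
        return "GKE Hub Extended"
    return "User-managed notebooks"
```

For example, `choose_notebook_platform(True, needs_central_profiles=True)` follows the Spark branch and returns "Dataproc Hub".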

Choosing an infrastructure

By default, you can run notebook servers for Jupyter notebooks on the following Google Cloud products:

  • Dataproc: users run notebook servers through the Jupyter component on a Spark-enabled Dataproc cluster. This option is useful if you need to use Spark in a distributed environment.
  • User-managed notebooks: users run notebook servers on a single Compute Engine instance. This option is useful if the work can run on a single instance.

If your processing framework does not use Spark in a distributed environment, you should first explore user-managed notebooks. If user-managed notebooks does not meet your requirements, consider Google Kubernetes Engine (GKE) options.

When you know whether you need Dataproc, the next step is to decide how much flexibility end users should have.

Choosing a hub

A hub typically lets administrators define how notebook environments look and lets end users manage the lifecycle of their notebook environment. Google Cloud uses the Cloud Console as a default hub. Although the Cloud Console provides flexibility to end users, some companies need additional administrative options to centrally manage notebook profiles. Reasons include the following:

  • Minimize costs and maximize resource usage by gathering notebook environments on a limited number of machines.
  • Enforce security by providing sandboxed environments to end users.
  • Minimize repetitive tasks by predefining notebook environments that end users can start by using just a few clicks.
  • Establish consistency across the organization by using templates for notebook environments.

If you don't want end users to work in the Cloud Console for any of these reasons, you can choose one of the following options:

  • If you use Dataproc, you can use Dataproc Hub, a product that hosts JupyterHub on a user-managed notebooks instance and that spawns notebook environments on Dataproc. Because Dataproc Hub is based on open source technology, you can customize it.
  • If you don't use Dataproc, Google Cloud doesn't provide a managed way to run JupyterHub. However, for this scenario, you can use GKE Hub Extended, a lightweight solution that runs on GKE.

The following sections of this document describe each of these options.

Dataproc Hub Extended

If your company is on Google Cloud and uses Spark, you can use Dataproc to process data at scale and to run training and inference. With Dataproc Hub, you can centrally manage and standardize cluster configurations while providing end users with the environment that they need. To do this, you use cluster configurations, which are declarative YAML files that define the following:

  • Hardware profiles for the Dataproc clusters
  • Software profiles for the notebook environment
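To make the split between hardware and software profiles concrete, the following sketch models one cluster configuration as a Python dictionary rather than as a YAML file. It assumes that the field names mirror the Dataproc cluster resource (for example, `masterConfig`, `workerConfig`, and `softwareConfig`). All values are placeholders, and the `split_profiles` helper is hypothetical. Verify the field names against a configuration exported from a real cluster before you rely on them.

```python
# Illustrative cluster configuration. All values are placeholders.
cluster_config = {
    "config": {
        # Hardware profile: machine types and instance counts.
        "masterConfig": {"numInstances": 1, "machineTypeUri": "n1-standard-4"},
        "workerConfig": {"numInstances": 2, "machineTypeUri": "n1-standard-4"},
        # Software profile: image version and the Jupyter component.
        "softwareConfig": {
            "imageVersion": "2.1-debian11",
            "optionalComponents": ["JUPYTER"],
        },
    }
}

HARDWARE_KEYS = ("masterConfig", "workerConfig")


def split_profiles(cluster: dict) -> tuple:
    """Split a cluster configuration into (hardware, software) sections."""
    config = cluster["config"]
    hardware = {key: config[key] for key in HARDWARE_KEYS if key in config}
    software = config.get("softwareConfig", {})
    return hardware, software


hardware, software = split_profiles(cluster_config)
```

An administrator would curate several such configurations, and each one would appear as a choice for the end user.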

For example, you might use Dataproc Hub for the following scenario:

  1. An end user creates their own notebook environment based on a list of configurations that are curated by the administrator.
  2. When an environment is running from the notebook, the end user interactively explores data at scale by using PySpark. Users can also run ML tasks with TensorFlow or PyTorch.

By default, the Dataproc Hub architecture looks similar to the following:

Dataproc Hub architecture.

Dataproc Hub uses the following Google Cloud products and open source software:

  • Dataproc hosts the notebook server.
  • User-managed notebooks provides a Compute Engine instance to run JupyterHub and to help provide secure and identified access to the UI using the Inverting Proxy.
  • Inverting Proxy server is an open source tool that receives and forwards incoming requests to the appropriate backend. Dataproc Hub uses a Google-managed version of the Inverting Proxy server.
  • Inverting Proxy agent is an open source tool that runs beside JupyterHub on a Compute Engine instance and matches requests and responses between the backend and the client.
  • JupyterHub authenticator for Google Cloud proxies (gcp-proxies-authenticator) provides transparent user identification using headers that are provided by the Inverting Proxy. The code is available on the JupyterHub authenticator for Google Cloud proxies repository on GitHub.
  • JupyterHub is installed on a user-managed notebooks Compute Engine instance to leverage the Inverting Proxy agent service.
  • Dataproc Spawner provides a form to the end user so they can create notebook servers on Dataproc in a selected zone. The code is available on the Dataproc Spawner repository on GitHub.
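The idea of header-based identification can be illustrated with a short sketch. This is not the gcp-proxies-authenticator code. The header name and the `identify_user` function are hypothetical, and the real header depends on the proxy that you use.

```python
from typing import Mapping, Optional

# Hypothetical header name used only for this sketch.
USER_HEADER = "X-Example-Proxy-User-Id"


def identify_user(headers: Mapping[str, str]) -> Optional[str]:
    """Return the user identity set by the proxy, or None if it is absent."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == USER_HEADER.lower():
            user = value.strip()
            return user or None
    return None
```

An approach like this is only safe when the backend is reachable exclusively through the proxy. Otherwise, a client could set the header itself.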

In some cases, you might need to extend the base version of Dataproc Hub, either to modify the overall architecture or to adapt some code to your needs. Because Dataproc Hub is based on open source software, you can extend the product's capabilities.

GKE Hub Extended

GKE Hub Extended helps consolidate costs by centralizing computing infrastructure. The following architecture shows how end users can create their own sandboxed environments on a GKE infrastructure.

Architecture for how end users can create their own sandboxed environments on a GKE infrastructure.

In this architecture, administrators use notebook server profiles to provide a list of allowed environments to a team of end users. Profiles are declarative JSON objects that define the following:

  • Hardware profiles for the Pods
  • Software profiles for the notebook environment

In the diagram, profile templates are next to the IT Admin as custom images. When an end user chooses an image, a notebook server starts on a GKE node. In the diagram, these servers are represented by the user-* boxes.

As the diagram shows, GKE Hub Extended is a lightweight solution that uses the following Google Cloud products and open source software:

  • Inverting Proxy server is an open source tool that receives and forwards incoming requests to the appropriate backend. GKE Hub uses a Google-managed version of the Inverting Proxy server.
  • Inverting Proxy agent is an open source tool that runs on GKE as its own deployment and that matches requests and responses between the backend and the client. There is one agent that runs beside the JupyterHub deployment to help provide secure access to the JupyterHub UI. The task of routing users to notebook servers is handled by JupyterHub. For more information, see the README file of the Inverting Proxy repository on GitHub.
  • JupyterHub authenticator for Google Cloud proxies (gcp-proxies-authenticator) provides transparent user identification through headers that are provided by the Inverting Proxy. The code is available on the JupyterHub authenticator for Google Cloud proxies repository on GitHub.
  • KubeSpawner provides a form to end users so that they can create notebook servers on the same Kubernetes cluster that hosts JupyterHub. KubeSpawner also provides a way for administrators to manage notebook server profiles through a profile_list parameter. The code is available on the JupyterHub Kubernetes Spawner repository on GitHub.
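As an illustration of notebook server profiles, the following sketch defines two profiles in the shape that KubeSpawner's `profile_list` option expects, as far as this series assumes. The image names, resource sizes, and the `check_profiles` helper are hypothetical.

```python
# Two illustrative notebook server profiles. Image names are hypothetical.
profile_list = [
    {
        "display_name": "Small CPU environment",
        "description": "2 vCPU, 4 GB of memory, base data science image",
        "default": True,
        "kubespawner_override": {
            "image": "gcr.io/example-project/notebook-base:latest",
            "cpu_limit": 2,      # hardware profile for the Pod
            "mem_limit": "4G",
        },
    },
    {
        "display_name": "Large memory environment",
        "description": "8 vCPU, 32 GB of memory, machine learning image",
        "kubespawner_override": {
            "image": "gcr.io/example-project/notebook-ml:latest",
            "cpu_limit": 8,
            "mem_limit": "32G",
        },
    },
]


def check_profiles(profiles: list) -> list:
    """Run basic sanity checks and return the profile display names."""
    names = [profile["display_name"] for profile in profiles]
    if len(names) != len(set(names)):
        raise ValueError("profile display names must be unique")
    if sum(1 for profile in profiles if profile.get("default")) > 1:
        raise ValueError("at most one profile can be the default")
    return names


profile_names = check_profiles(profile_list)
```

In a JupyterHub configuration file, the list would typically be assigned with `c.KubeSpawner.profile_list = profile_list`.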

The GKE Hub folder in the repository serves as an example to show how to set up the environment that's described in this section. You can fork the parent repository and extend the code for GKE Hub.

Common considerations for choosing a notebook

The following considerations are important when you decide how to set up a notebook infrastructure for experimentation:

  • Access to interfaces: how end users access the product UI, including authentication and network access. This decision impacts security.
  • Relationship between users and notebook servers: whether one user can use several notebook servers at a time or whether the user is limited to one notebook server. This decision often impacts productivity and costs.
  • Processing framework: whether you have legacy requirements or specific technology requirements for computing. Spark is an example of a technology requirement. This decision often impacts productivity.
  • Underlying infrastructure: which Google Cloud infrastructure product to use and whether users share the infrastructure. This decision impacts scalability and costs.

The rest of this document discusses options to address each of those considerations.

Access to the web interface

End users access notebooks and related web interfaces through an endpoint URL. Endpoints have different characteristics:

  • Connectivity: whether the endpoint is exposed either privately (such as through notebooks.corp.example.com) or publicly (such as through notebooks.example.com).
  • Network security: whether a resource in a specific network can access the endpoint.
  • Authentication: how the system verifies the identity of users when they access the endpoint.
  • Domain: whether the endpoint is accessible through a custom domain or through a Google-provided domain. You manage custom domains through Cloud DNS or a similar solution.
  • Hub: which technology can route users to their notebook server.

Most of those considerations depend on a combination of which notebook platform you use and which proxy is attached to it. In the architectures that are described in this series, you can choose from the following proxies:

  • IAP: a product that manages access to web applications such as JupyterHub running in Compute Engine or on GKE.
  • Inverting Proxy: an open source solution that includes a reverse proxy server and an agent. This series uses a server that's managed by Google.
  • Component Gateway: a product managed by Dataproc that combines both the Inverting Proxy and Apache Knox, an API gateway, to help provide secure access to Dataproc web endpoints.

To simplify domain management, Google Cloud primarily uses the Inverting Proxy for products such as Jupyter notebooks that need to expose a web UI. The Inverting Proxy uses a Google-managed domain.

The following table lists the proxy, authentication method, domain, and hub for each Google Cloud notebook product or solution.

Platform               | Proxy             | Authentication            | Domain         | Hub
Dataproc Jupyter       | Component Gateway | Google-managed            | Proxy-provided | Cloud Console
Dataproc Hub           | Inverting Proxy   | gcp-proxies-authenticator | Proxy-provided | JupyterHub-based
Dataproc Hub Extended  | Inverting Proxy   | gcp-proxies-authenticator | Proxy-provided | JupyterHub-based
User-managed notebooks | Inverting Proxy   | Google-managed            | Proxy-provided | Cloud Console
GKE Hub Extended       | Inverting Proxy   | gcp-proxies-authenticator | Proxy-provided | JupyterHub-based

Dataproc Hub Extended and GKE Hub Extended can also use IAP and a custom domain. The ai-notebook-extended GitHub repository gives an example of how to achieve this for Dataproc Hub Extended on managed instance groups.

Relationship between users and notebook servers

Depending on the product that you use, the relationship between notebooks and users can vary, as shown in the following table.

Tool                   | User:Notebook relationship
Dataproc Jupyter       | N:N
Dataproc Hub           | 1:1
Dataproc Hub Extended  | 1:1 (you can customize the spawner code for 1:N)
User-managed notebooks | N:N
GKE Hub                | 1:N (you can set up the spawner for 1:1)

The implications of the relationships in the preceding table are the following:

  • If you use Dataproc Jupyter, any end user who has access to a Component Gateway URL can access the notebook that's attached to that URL.
  • If you use Dataproc Hub, an end user who has access to an Inverting Proxy URL can access the hub that's attached to that URL. But the user cannot access other users' notebook servers. By default, a user can start only one single-user notebook server at a time.
  • If you use Dataproc Hub Extended, an end user who has access to an Inverting Proxy URL can access the hub that's attached to that URL. But the user cannot access other users' notebook servers. By default, a user can start only one single-user notebook server at a time, but you can customize the spawner code to change this behavior.
  • If you use user-managed notebooks, any end user who has authenticated access to an Inverting Proxy URL can access the notebook that's attached to that URL.
  • If you use the GKE Hub solution, end users who have access to an Inverting Proxy URL can access the hub that's attached to that URL. But the users can see only their own notebook servers. If you enable the relevant option, users can run several single-user notebook servers at a time.
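The difference between a 1:1 and a 1:N relationship can be modeled in a few lines. The following sketch is a simplified model of the policy, not JupyterHub's implementation. It assumes that the hub exposes a switch for multiple servers per user and an optional per-user limit, which in JupyterHub correspond, to our understanding, to the `allow_named_servers` and `named_server_limit_per_user` options. The function name is hypothetical.

```python
def may_start_another_server(
    running_servers: int,
    allow_named_servers: bool = False,
    named_server_limit: int = 0,
) -> bool:
    """Return True if a user may start one more notebook server.

    A limit of 0 means that no limit applies.
    """
    if not allow_named_servers:
        # 1:1 relationship: one server per user at a time.
        return running_servers == 0
    if named_server_limit == 0:
        # 1:N relationship without a cap.
        return True
    # 1:N relationship with a cap on concurrent servers.
    return running_servers < named_server_limit
```

With the defaults, a user who already runs one server can't start a second one. With `allow_named_servers=True` and a limit of 2, the user can run two servers at a time.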

You should be aware of the following:

  • Accessing the UI behind an Inverting Proxy URL requires the user to have the serviceAccountUser role for the service account of the instance that hosts the agent.
  • Accessing an application that's protected by IAP requires the user to have the proper permissions at the IAP level.
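For the Inverting Proxy case, access comes down to an IAM binding on the service account. The following sketch shows what such a binding looks like in an IAM policy document, assuming the standard policy shape of a `bindings` list with `role` and `members` fields. The helper function and the member value are hypothetical. The function only edits a dictionary in memory. To apply the policy to a real service account, you need a separate call to the IAM API or to the gcloud CLI.

```python
import copy

ROLE = "roles/iam.serviceAccountUser"


def grant_service_account_user(policy: dict, member: str) -> dict:
    """Return a copy of an IAM policy in which `member` is bound to ROLE."""
    new_policy = copy.deepcopy(policy)
    bindings = new_policy.setdefault("bindings", [])
    for binding in bindings:
        # Reuse an existing unconditional binding for the role if there is one.
        if binding.get("role") == ROLE and "condition" not in binding:
            if member not in binding["members"]:
                binding["members"].append(member)
            return new_policy
    bindings.append({"role": ROLE, "members": [member]})
    return new_policy


updated = grant_service_account_user({"bindings": []}, "user:alice@example.com")
```

Because the function returns a copy, the original policy stays unchanged and both versions can be compared before any change is applied.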

Summary

Google Cloud offers multiple options to run notebooks. This document series helps you make the right decision. Use these guidelines:

  1. If you do not need to centrally manage notebook server profiles, and if end users can work on a single instance, use user-managed notebooks and the Inverting Proxy.
  2. If you do not need to centrally manage notebook server profiles, and if end users need Apache Spark in a distributed environment, use Dataproc Notebooks through the Dataproc Jupyter component and Component Gateway.
  3. If you need to centrally manage notebook server profiles, and if end users need Spark in a distributed environment, use Dataproc Hub and the Inverting Proxy.
  4. If you need to customize Dataproc Hub, use the Dataproc Hub solution from GitHub as a base.
  5. If you need to centrally manage user environments and you want to use Kubernetes, build on top of the GKE Hub example solution. For information about setting up GKE Hub with the Inverting Proxy, see Tutorial: Spawning notebook servers on Google Kubernetes Engine (GKE) in this series.

One of the main advantages of using Dataproc Hub Extended and GKE Hub Extended is that you can fully customize them. For example, you can run one hub on GKE and spawn notebooks where needed, whether on Dataproc or GKE. If you need help with customizations that are not in the GitHub repository, contact your local Google Cloud team to discuss how Google Cloud can help.

What's next