Manage upgrades and dependencies for user-managed notebooks: Overview

This document describes the design of a repeatable and scalable end-to-end process for managing upgrades and dependencies for Vertex AI Workbench user-managed notebooks. The intended audience is IT administrators who want to implement and use a well-defined process to let data scientists consistently update their Deep Learning VM instances and notebook dependencies. The document assumes that you're familiar with the technologies that the process uses, such as Vertex AI Workbench, Deep Learning Containers, Docker, Git, and Cloud Build.

At every stage of the process, the procedures discussed in this document provide ways to help your organization recover quickly from issues and to give data scientists uninterrupted access to working versions of the images. The process produces images that are thoroughly tested using automated tests and optional manual tests.

This document describes the use case and workflow for updating user-managed notebooks instances. An accompanying document, Manage upgrades and dependencies for user-managed notebooks: Process, provides details about how to set up and use the process that's described in this document.

The code and artifacts for a pipeline that's part of the process described in this document are in the managed-environments GitHub repository.

Use case

Data scientists use Jupyter notebooks (created as Vertex AI Workbench user-managed notebooks instances) for experiments and exploration tasks. User-managed notebooks let users take advantage of the interactive computing approach in these notebooks to write and execute code, visualize results, and share insights.

Vertex AI Workbench user-managed notebooks instances are hosted in a virtual machine (VM) that has several preinstalled frameworks and tools. For example, the VM typically includes Python 3, a version of TensorFlow or PyTorch, and Conda. Data scientists might install additional software on a VM, such as the cloudml-hypertune library that's used to report training metrics.
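For example, a data scientist might add that library from a terminal on the instance with a command like the following, shown here only as an illustration:

pip install cloudml-hypertune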

If users modify the VM with many frameworks and tools, it can become difficult for an organization to keep track of all the software that a notebook depends on. These types of changes can make it hard for a team to share the notebook with other teams.

In addition, when data scientists need to take advantage of new hardware or upgrade dependencies, frameworks, or tools, they need to consider the following:

  • Should they create a new VM that has the latest tools and then reinstall the dependencies for the notebook on that VM?
  • How can they be sure that all of their notebooks will continue working after the upgrade?
  • Are other team members or teams using the same dependencies and hardware versions to ensure a consistent experimentation and development platform?
  • How do they troubleshoot issues?

Because of these complexities, organizations might decide to delay upgrades and not to work with the latest version of tools, frameworks, and hardware. However, delaying upgrades can result in VMs that have outdated or incompatible tooling and dependencies within teams and across the organization. If an organization does decide to upgrade but doesn't rigorously test the new tools and dependencies, notebooks might be broken.

To solve these issues, you can use the process described in this document: a repeatable, end-to-end process for upgrades and dependency management for Vertex AI Workbench user-managed notebooks. The process uses continuous integration and continuous delivery (CI/CD) principles. It's scalable to any number of user-managed notebooks instances and to any number of teams in an organization, with minimal overhead.

Process concepts

This section describes key components of the upgrades and dependencies management process.

Deep Learning VM instances and images

Organizations are increasingly turning to artificial intelligence (AI) to improve, scale, and accelerate their decision-making processes. Vertex AI Workbench is a single development environment for the entire data-science workflow. It provides managed and user-managed notebooks. These notebooks help data scientists and ML developers experiment, develop, and deploy AI models into production.

Vertex AI Workbench lets you create user-managed notebooks instances that run on VM instances that are prepackaged with JupyterLab. These VM instances, known as Deep Learning VM instances, are preconfigured to suit your choice of frameworks and processor.

Vertex AI Workbench creates Deep Learning VM instances from Deep Learning VM images or Deep Learning Containers images. We recommend that you use Deep Learning Containers images because containers provide a straightforward way to upgrade, configure, and redeploy a notebook; you only need to edit the container's Dockerfile. We don't recommend using Deep Learning VM images directly, because upgrading then requires you to deploy a new VM from the new VM image, reinstall your dependencies, and create a custom VM image. Therefore, this document discusses how to use Deep Learning Containers images.

The process outlined in this document creates container images using the docker build command. Each time you upgrade the Deep Learning Containers image or change dependencies, you create a new image version.
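For example, whether you run the command yourself or a build pipeline runs it for you, the build step might look like the following sketch. The image path and tag are placeholders; later sections of this document describe the naming conventions and where the version ID comes from. The sketch assumes that the Dockerfile is in the current directory:

docker build -t gcr.io/PROJECT_ID/IMAGE_NAME:VERSION_ID .
docker push gcr.io/PROJECT_ID/IMAGE_NAME:VERSION_ID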

Roles required for the process

The following roles are involved in the upgrade and dependency management process:

  • Owners: IT administrators or users who are responsible for upgrading Deep Learning Containers image versions, adding dependencies, starting the build pipelines, and resolving issues.

    In this document, we assume that you are an owner.

  • Testers: A subset of users who test new image versions before those versions are available to the larger user population.

  • Users: Deep Learning VM end users, like data scientists and machine learning developers, who are experimenting, developing, and deploying models into production.

Depending on your organization, owners might be part of the same team as the users or might be part of a more centralized IT team that's in charge of images. In both cases, owners need a solid understanding of the process and the tools involved. Close communication between owners and other users is critical.

Environments used in the process

An environment is a set of attributes that describe a single image version. Deploying an image version into an environment means creating or updating the environment's attributes to point to that version.

The upgrade and dependency management process defines the following environments:

  • Staging: This is the first environment in which you as owner deploy an image version. As part of the deployment, automated tests run. Testers also manually test the image in the staging environment. Manual tests help prevent unexpected issues that aren't accounted for in the automated tests.
  • Production: After automated and manual tests have passed, owners deploy the image version into the production environment, where it's available to the larger user population.
  • Fallback: The fallback environment points to the last known working version, so that notebooks can have close to zero downtime if there are issues with the production environment.

Depending on the strategy and size of the organization and projects, you might consider the staging or fallback environments as optional.

Naming conventions

We suggest that you adopt an image naming convention that's easy for all users in a team to remember and that includes the name of the experiment or task. For example, teams might use a name pattern of TEAM.TASK. A natural language processing (NLP) team that works on Keras models for batch predictions might then use the name nlp.keras-batch for the images for those experiments. The image name must conform to the Docker container image naming conventions, which include using only lowercase letters, digits, periods, underscores, and dashes.

A team can have multiple images for different experiments or tasks, and images can be shared with other teams that have the appropriate permissions.

An image repository stores several versions of the same image. To identify a specific image version, we recommend that you use a unique version ID in the image's Docker tag.

Continuing the nlp.keras-batch image example, you might create the following names:

  • Image name: nlp.keras-batch
  • Image repository: gcr.io/nlp-gcp-project1/nlp.keras-batch
  • Image version: nlp.keras-batch:b2cc9f405d29e64e1229af2b95e7c9ccca86e218
  • Version ID: b2cc9f405d29e64e1229af2b95e7c9ccca86e218 (The next section explains where you get this value.)

Each of the environments (staging, production, and fallback) must have an environment ID. Environment IDs must be immutable so that users can remember them easily and create Deep Learning VM instances without needing to know which image version the environment points to. Therefore, we recommend an environment ID naming convention that consists of the image name and an environment suffix. For example, the three environments for the nlp.keras-batch image might be nlp.keras-batch.staging, nlp.keras-batch.production, and nlp.keras-batch.fallback.
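One way to picture an environment, for illustration only, is as an additional Docker tag in the image repository that points to a specific image version; the accompanying document describes the mechanism that the pipeline actually uses. In that picture, pointing the staging environment at a new version might look like the following sketch:

gcloud container images add-tag \
    gcr.io/nlp-gcp-project1/nlp.keras-batch:b2cc9f405d29e64e1229af2b95e7c9ccca86e218 \
    gcr.io/nlp-gcp-project1/nlp.keras-batch:nlp.keras-batch.staging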

Source control

As an owner, you ensure that a working image version is available to users so that they can create Deep Learning VM instances. It's therefore essential that you have effective disaster recovery measures so that you can minimize the recovery time objective (RTO) and recovery point objective (RPO). That way, if an issue does occur, you can point users to the fallback environment as described earlier in the Environments section.

If the image version in the fallback environment is itself faulty, owners need a quick way to deploy an earlier working version. This is possible if you have access to the full history of changes, along with the corresponding version ID.

We recommend that you use the Git version control system, which provides a history of changes and which can automatically generate unique identifiers.

The Git repository that's associated with this document includes a Dockerfile, image build scripts, and test notebooks. The Dockerfile defines the Deep Learning Containers image version and notebook dependencies. When you commit changes to the repository, Git automatically generates a commit ID, which is a 40-character SHA-1 hash (also known as a commit SHA). You get this commit SHA by running the git log command. The following listing shows a sample commit SHA (b2cc9f4...).

commit b2cc9f405d29e64e1229af2b95e7c9ccca86e218 (HEAD -> production)
Author: Dana <dana@example.com>
Date:   Tue Feb 2 15:44:32 2021 +0100

    Added dependency: cloudml-hypertune
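If you want to capture the SHA of the latest commit in a script, for example so that you can reuse it as the version ID, the git rev-parse command is a convenient alternative to git log. The following sketch assumes that you run it from the repository root:

VERSION_ID=$(git rev-parse HEAD)
echo "${VERSION_ID}"

The output is the full 40-character commit SHA, such as b2cc9f405d29e64e1229af2b95e7c9ccca86e218.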

We recommend that you use the commit SHA as the version ID (as explained in the previous section) and that you store it in the image version Docker tag. The following code example shows a gcloud command that gets the image tags for an image whose version ID is the commit SHA:

gcloud container images list-tags gcr.io/nlp-gcp-project1/nlp.keras-batch

The output is similar to the following:

DIGEST        TAGS                                      TIMESTAMP
e37b5ebeb88f  b2cc9f405d29e64e1229af2b95e7c9ccca86e218  2021-02-02T15:56:29

In addition to providing change history and commit SHAs, Git also provides branches. You can combine branches with Cloud Build triggers to provide the mechanism to start a new image version build.
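For example, if the repository is hosted on GitHub, you might create a trigger for the staging branch with a command like the following. This is a sketch: the trigger name, repository owner and name, and build config filename are assumptions rather than values from the accompanying repository:

gcloud builds triggers create github \
    --name="build-staging-image" \
    --repo-owner="your-github-org" \
    --repo-name="your-notebooks-repo" \
    --branch-pattern="^staging$" \
    --build-config="cloudbuild.yaml"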

When you push changes to the staging branch, a Cloud Build trigger starts the staging pipeline, which builds an image version and then deploys that version to the staging environment.

After the staging environment has been tested, you merge the changes into the production branch. When you push the production branch to the remote repository, another Cloud Build trigger starts the production pipeline, which deploys the image version into the production environment. The next section describes the triggering and build mechanism in detail.
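For example, the branch workflow might look like the following sketch, assuming that the staging and production branches already exist and that the remote is named origin:

# Commit the change and push it to the staging branch to start the staging build.
git checkout staging
git commit -am "Add dependency: cloudml-hypertune"
git push origin staging

# After the staging environment has been tested, merge into production
# to start the production build.
git checkout production
git merge staging
git push origin production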

Process workflow

This section discusses how the concepts introduced in the previous section interact in the end-to-end process for upgrading and managing notebook dependencies. The following diagram shows a high-level view of the expected path for this end-to-end process.

Expected flow for managing upgrades and dependencies for notebooks.

The diagram describes four stages in the flow.

  1. Push an image to the staging environment.

    As an owner, you push the image to the staging environment, and then automatic tests run. This stage incorporates the following events and tasks:

    1. Google Cloud publishes a new Deep Learning Containers image, or the team wants to add new dependencies or new automated tests.
    2. You create a new Dockerfile or use an existing one, like the example file in the managed-environments GitHub sample. The Dockerfile specifies an existing Deep Learning Containers parent image.
    3. You add tests and dependencies and then push the changes to a short-lived topic Git branch called staging.
    4. The Git push triggers the staging pipeline in Cloud Build.
    5. The pipeline builds the new image version and then runs the automated tests. If the tests succeed, the pipeline creates or overwrites the staging environment and points the environment to the newly created image version.

      For more information and implementation details, see Update the Deep Learning Containers and dependencies in the accompanying document.

  2. Test the staging image.

    Testers create Deep Learning VM instances on Vertex AI Workbench using the staging environment. When testers have made sure that all of their jobs and exploratory tasks are running smoothly, they report back to you, the owner. For information about the command that creates the Deep Learning VM and about the technical steps behind the scenes, see Run manual tests in the accompanying document.

  3. Push the image to the production environment.

    You push the image to the production environment. At this point in the process, both automated and manual tests have passed. This stage incorporates the following events and tasks:

    1. You merge the staging branch into a production branch and push the changes to the remote repository.
    2. The push triggers the production pipeline in Cloud Build.
    3. Cloud Build saves the previous production environment, if it exists, as the fallback environment. It also creates or overwrites the production environment and points the environment to the tested image version.

    For more information about implementation details, see Push an image to production in the accompanying document.

  4. Use the production image.

    The new image is now available in production. In this final phase of the process, users can start using the new image by creating a Deep Learning VM instance on Vertex AI Workbench in the production environment. For more information, see Use a production image in the accompanying document.
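For illustration, creating a user-managed notebooks instance from a container image might look like the following sketch. The instance name, zone, and machine type are placeholders, and CONTAINER_TAG stands for whatever tag identifies the environment that you want to use (testers would use the staging environment instead); the exact commands are described in the accompanying document:

gcloud notebooks instances create example-nlp-keras-batch \
    --location=us-central1-a \
    --machine-type=n1-standard-4 \
    --container-repository=gcr.io/nlp-gcp-project1/nlp.keras-batch \
    --container-tag=CONTAINER_TAG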

Alternate workflows

The process workflow shows the expected path, in which all tests pass. In practice, unexpected issues can occur, and your organization needs a robust process to account for them. The following diagram shows scenarios that diverge from the expected path.

Alternative flow when issues are encountered.

The diagram describes issues that might occur at various stages of the flow.

  1. Push an image to the staging environment.

    Issue: Failed automated tests

    1. When you push the staging branch to the remote Git repository, Cloud Build runs the automated tests.
    2. When the automated tests don't succeed, Cloud Build stops the pipeline and presents a failure message, along with logs pointing to the specific line and notebook that caused the issue. The error might be caused by an incompatibility with the new Deep Learning Containers image, a missing or incompatible dependency, or issues in a test notebook.
    3. If there's a problem because of the Deep Learning Containers image, then you might choose a different Deep Learning Containers image or wait until a new Deep Learning Containers image version is available.
    4. If the issue is a missing dependency or a failed test, then you change the code that's causing the issue.
    5. Finally, you commit the fix and push it to the remote Git repository, and the process starts again from the first step in the expected path.
  2. Test the staging image.

    Issue: Failed manual tests

    1. If a tester detects an issue that wasn't identified during automated testing, they report the issue to you, the owner.
    2. You fix the issue by adding a dependency or by changing the Deep Learning Containers image.
    3. If possible, you add a new test to the test notebooks to ensure that the automated testing in subsequent builds catches this error.
    4. You commit the fix and push it, and the process starts again from the first step in the expected path.
    5. The testers discard the faulty Deep Learning VM instance that uses the staging environment. If necessary, they can create a Deep Learning VM instance with the existing production environment.
  3. Push the image to the production environment.

    No tests run during this step.

  4. Use the production image.

    Issue: One or more problems with Deep Learning VM instances from the production environment

    1. At this stage, users create their Deep Learning VM instances from the production environment. If a problem is detected at this point, it's critical that users can revert to a stable image version as soon as possible. The fallback environment provides that quick path back and helps meet a minimal RTO.
    2. Users who experience an issue can discard the faulty Deep Learning VM and create a new one using the fallback environment without intervention from you.
    3. At any time, you can overwrite the current production environment with the fallback environment, fix the issue, and start the process from the first step in the expected path. For more information, see Revert to the fallback image in the accompanying document.
    4. If the fallback environment isn't available or is itself faulty, then you must find the last known good image version in the repository and overwrite the production environment by using that version's commit SHA, as shown in the sketch after this list. For more information, see Revert to a previous image in the accompanying document.
    5. You fix the issue and restart the process from the first step in the expected path.
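For example, to find the commit SHA of an earlier, known-good version and to confirm that a matching image version still exists, you might run commands like the following sketch, which reuses the illustrative image path from earlier in this document (replace COMMIT_SHA with the SHA that you found). The deployment step itself is described in the accompanying document:

# List recent commits on the production branch to find a known-good commit SHA.
git log --oneline production

# Confirm that an image version tagged with that commit SHA exists.
gcloud container images list-tags gcr.io/nlp-gcp-project1/nlp.keras-batch \
    --filter="tags:COMMIT_SHA"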

What's next