Manage upgrades and dependencies for user-managed notebooks: Process

This document describes the tasks for creating a repeatable and scalable end-to-end process for managing upgrades and dependencies for Vertex AI Workbench user-managed notebooks. The intended audience is IT administrators who want to implement a process that lets data scientists consistently update their Deep Learning VM instances and notebook dependencies.

An accompanying document, Manage upgrades and dependencies for user-managed notebooks: Overview, describes the use case for this process and the overall workflow. We recommend that you read that document first.

The code and artifacts for a pipeline that's part of the process described in this document are in the managed-environments GitHub repository.

Process tasks and who performs them

As noted in the overview, the process involves three roles: owners (IT administrators), testers, and users (data scientists). The document assumes the following:

  • You are an owner.
  • You are setting up the process.
  • You will communicate with testers and users about their tasks.

The following table shows which role is responsible for each task.

Section                                                      Role responsible
Set up the pipelines                                         Owners
Automate the pipelines                                       Owners
Update the Deep Learning Containers image and dependencies   Owners
Run manual tests                                             Testers
Push an image to production                                  Owners
Use a production image                                       Users
Use a fallback image                                         Users
Revert to the fallback image                                 Owners
Revert to a previous image                                   Owners

Set up the pipelines

Owners are responsible for setting up the pipelines that implement the update and dependency management process.

The following screenshot shows the files that are provided in the GitHub sample repository to implement the process. As an owner, you can adapt the files according to your business and technical needs.

A file listing for the files in the GitHub repository.

The code in the repository defines a pair of Cloud Build pipelines: one for staging and one for production. The YAML files that are shown in the screenshot are the build config files in which you define the steps for each pipeline.

Each pipeline has a corresponding shell script that's called from the pipeline steps. The scripts manage the notebook environments, the commit SHAs, and logging. They provide a clean separation between the pipeline definition and the programmatic instructions that are needed in some pipeline steps. The internals of the config files and of the shell scripts are explained in later sections of this document.

The repository has a subdirectory that contains Jupyter test notebooks. These are the notebooks that Cloud Build runs automatically when you push a change into the staging branch. In the sample repository, the subdirectory is called test_notebooks, but you can use any name; just be sure that you specify it in the dir parameter for the automated testing step in the staging config file.

The code provides a sample Dockerfile, which specifies the Deep Learning Containers image that's used as the base for building your image, along with any dependencies to install. The sample Dockerfile uses a GPU-enabled container with TensorFlow, and it installs the setuptools dependency by using the following commands:

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4

RUN pip install --user --upgrade setuptools

The repository also contains a .gitignore file that lets you exclude files from the Git repository.

Automate the pipelines

Owners are responsible for automating the process. After you clone and configure your pipeline files, you as an owner set up the mechanism to run the pipelines automatically when you make changes to the Dockerfile or to the notebook tests. To automate the pipelines, you use Git remote branches and Cloud Build triggers.

  1. In the Google Cloud console, activate Cloud Shell.

  2. Clone the managed-environments GitHub sample repository:

    git clone https://github.com/gclouduniverse/managed-environments.git
    
  3. Remove any Git configuration from the cloned repository:

    cd managed-environments
    rm -fr .git
    
  4. Initialize your local Git repository by following the instructions in Adding an existing project to GitHub using the command line. However, for the name of your initial branch, use production instead of main, as in the following example:

    git init -b production
    
  5. Add the files that were mentioned earlier in Set up the pipelines and push your first pipeline commit to the remote repository that you created in GitHub:

    git add .
    git commit -m "First commit"
    git remote add origin REMOTE_URL
    git push -u origin production
    

    Replace REMOTE_URL with the URL of the Git repository.

  6. Create a branch called staging:

    git checkout -b staging
    
  7. Create the production and staging pipelines by following the instructions in Creating and managing build triggers.

    The names of the branches correspond to the pipeline environments.

Cloud Build triggers are activated when you push changes to a particular remote branch. Therefore, you automate the pipelines by creating two triggers that perform the following tasks:

  • One trigger starts the production pipeline when you push a commit to the production branch.
  • Another trigger starts the staging pipeline when you push a commit to the staging branch.

The build configuration files that you need for creating the triggers are cloudbuild-prod.yaml and cloudbuild-staging.yaml.
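
For example, after you connect your GitHub repository to Cloud Build, you might create the two triggers from the command line. The trigger names, repository owner, and repository name in the following sketch are placeholders; adapt them to your own setup:

# Trigger for the production pipeline.
gcloud builds triggers create github \
    --name=notebook-images-production \
    --repo-owner=GITHUB_OWNER \
    --repo-name=managed-environments \
    --branch-pattern="^production$" \
    --build-config=cloudbuild-prod.yaml

# Trigger for the staging pipeline.
gcloud builds triggers create github \
    --name=notebook-images-staging \
    --repo-owner=GITHUB_OWNER \
    --repo-name=managed-environments \
    --branch-pattern="^staging$" \
    --build-config=cloudbuild-staging.yaml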

Update the Deep Learning Containers image and dependencies

Google regularly publishes new versions of Deep Learning Containers images. In addition, as notebooks evolve, users need to add new dependencies. You as an owner might decide to use a newly published Deep Learning Containers image, or you might add dependencies that are requested by your users. You can make these changes in the Dockerfile.
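
For example, to move to a newer TensorFlow image and add a dependency that users have requested, you might edit the Dockerfile as in the following sketch. The image tag and the added package are examples only; check the currently available Deep Learning Containers image tags before you change the FROM line:

# Base image: a GPU-enabled Deep Learning Containers image with TensorFlow (example tag).
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-6

# Keep the existing dependency upgrade.
RUN pip install --user --upgrade setuptools

# Example of a user-requested dependency.
RUN pip install --user cloudml-hypertune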

If you add dependencies, we recommend that you add notebook tests to cover the new dependencies. Your users are best positioned to provide a streamlined version of a notebook that can be used for testing.

When you're satisfied with the changes to the Dockerfile, push the changes to the remote staging branch. To push the local staging branch the first time and make it track a remote branch of the same name, you run the following command:

git push --set-upstream origin staging
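
For subsequent updates, a typical cycle looks like the following; the commit message is only an example:

git checkout staging
# Edit the Dockerfile or the test notebooks, then stage and commit the changes.
git add Dockerfile test_notebooks/
git commit -m "Upgraded base image and added cloudml-hypertune"
# The push starts the staging pipeline through the Cloud Build trigger.
git push origin staging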

The following diagram shows the sequence of steps that occur after you push a branch in order to update the Deep Learning Containers image and dependencies:

Steps to update a Deep Learning Containers image and dependencies.

The diagram describes the following flow:

  1. After the push, the trigger for the staging pipeline starts a Docker build that produces a new image version, using the commit SHA as the version ID.
  2. The pipeline pushes the new image version to Container Registry by using the docker push command.
  3. The pipeline runs the automated tests and verifies that they completed successfully.
  4. If the automated tests succeed, the pipeline creates or overwrites the staging environment in Vertex AI Workbench and points the environment to the new image version.
  5. You verify the status of your build and examine any errors in the Cloud Build log.

The cloudbuild-staging.yaml file in GitHub implements the steps that are described in the preceding sequence. The file includes substitutions, which are variables in Cloud Build config files, as shown in the following snippet:

substitutions:
  _IMAGE_NAME: team1.task1

In this case, _IMAGE_NAME is the name of the image that you defined as described in the image names and version IDs section of the overview document, such as nlp.keras-batch.

The ${COMMIT_SHA} and ${PROJECT_ID} variables in the cloudbuild-staging.yaml file are default substitutions that are provided by Cloud Build. The pipeline configuration file uses them to specify the image repository and to pass them as parameters for the shell script, as shown in the following snippet:

pip install gcloud-notebook-training && \
time find -name '*.ipynb' -print0 |\
  xargs --null --max-procs=0 --max-args=1 -I {} \
    gcloud-notebook-training \
      --input-notebook "{}" \
      --container-uri gcr.io/${PROJECT_ID}/${_IMAGE_NAME}:${COMMIT_SHA}

The code relies on the open source gcloud-notebook-training framework to run the test notebooks against the new image version. To parallelize test execution, the --max-procs=0 parameter lets the xargs command run each test notebook in a separate process. Because the tests run in parallel, the total test execution time is approximately the runtime of the longest-running test notebook.
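
If one of the automated tests fails, it can help to rerun a single notebook through the same framework from Cloud Shell. The following sketch reuses only the flags that appear in the preceding pipeline snippet; the notebook path, project ID, and image tag are placeholders:

pip install gcloud-notebook-training

gcloud-notebook-training \
    --input-notebook test_notebooks/failing_test.ipynb \
    --container-uri gcr.io/PROJECT_ID/nlp.keras-batch:COMMIT_SHA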

After the automated tests have succeeded, the staging pipeline runs the publish-staging-env.sh shell script to create or overwrite the staging environment. The staging environment has the same ID as the image name, but the ID has a .staging suffix. Following the earlier example, the staging environment ID is nlp.keras-batch.staging. For more information, see Naming conventions in the overview document.
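
The following is a minimal sketch of the kind of commands such a publish script runs, assuming the same gcloud notebooks environments flags that appear later in this document. The actual script in the repository might differ in its details and error handling, and the placeholders (_IMAGE_NAME, PROJECT_ID, COMMIT_SHA, and LOCATION) are passed in from the pipeline or your configuration:

ENV_ID="${_IMAGE_NAME}.staging"

# Remove the existing staging environment, if there is one.
gcloud notebooks environments delete "${ENV_ID}" \
    --location="${LOCATION}" --quiet || true

# Create the staging environment and point it to the image version that was just built.
gcloud notebooks environments create "${ENV_ID}" \
    --location="${LOCATION}" \
    --container-repository="gcr.io/${PROJECT_ID}/${_IMAGE_NAME}" \
    --container-tag="${COMMIT_SHA}" \
    --display-name="${ENV_ID}"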

Run manual tests

Running manual tests is the responsibility of the testers. After the staging pipeline finishes running, owners request that testers create Deep Learning VM instances to perform manual testing.

Testers need the environment ID and they must have permissions to create a Deep Learning VM instance. They don't need to be concerned with image families, Deep Learning Containers image versions, details about commit SHAs, or image repositories.

To create instances using the staging environment, testers run the following command in Cloud Shell:

gcloud notebooks instances create notebook-vm-test-1 \
    --location=us-central1-a \
    --environment=nlp.keras-batch.staging

The following diagram shows the sequence of steps for running manual tests:

Steps to run manual tests.

The diagram describes the following flow:

  1. When a tester runs the gcloud command listed earlier, Vertex AI Workbench gets the commit SHA and the image repository from the staging environment and pulls the image from Container Registry.
  2. Using the image, Vertex AI Workbench creates a Deep Learning VM instance and provides a JupyterLab link to the tester.
  3. The tester runs selected notebooks on the user-managed notebooks instance and communicates the results to you. (One way to retrieve the JupyterLab link from the command line is sketched after these steps.)
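
After the instance is running, a tester can also retrieve the JupyterLab link from the command line instead of from the console. The following sketch assumes that the instance exposes the link in the proxyUri field of its description; verify the field name in your own describe output:

gcloud notebooks instances describe notebook-vm-test-1 \
    --location=us-central1-a \
    --format="value(proxyUri)"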

Push an image to production

As an owner, you can push the image to the production environment. You do this after Cloud Build runs the automated tests, after the testers have finished testing the new image from the staging environment, and after the testers have indicated that the image works. The process of pushing the image to production makes the image available to all users.

The following diagram shows the sequence of steps for pushing an image to production:

Steps to push image to production.

The diagram describes the following flow:

  1. You merge the changes for the latest staging image from the staging Git branch into your local production branch. (The merge and push commands are sketched after this flow description.)

    If the changes on the staging branch include a single commit, we recommend a fast-forward merge (using the --ff option) to avoid an unnecessary merge commit. This approach keeps the history of the production branch uncluttered.

    If the changes on the staging branch include more than one commit, we recommend a no-fast-forward merge (using the --no-ff option) to create a merge commit. The merge commit groups the commits that are related to a single logical change while retaining the history of the individual commits.

    Whether you use a fast-forward merge or a no-fast-forward merge, the code in the sample ensures that the version ID remains unchanged and that it corresponds to the SHA of the last commit that was pushed into the remote staging branch. The code ignores the SHA of merge commits.

  2. After you merge the changes, you push them to the remote production branch, which triggers a production build in Cloud Build. The build pipeline is defined in the cloudbuild-prod.yaml config file. You must make sure that the _IMAGE_NAME substitution variable has the same value as the one that's defined in the staging pipeline. The production pipeline has a single step to call the publish-prod-env.sh script, which does the following:

    1. Gets the commit SHA from the staging environment.
    2. Gets the commit SHA from the production environment.
    3. Creates or overwrites the fallback environment using the production commit SHA.
    4. Creates or overwrites the production environment using the staging commit SHA.

The script points the fallback environment to the image version that was previously in production, and points the production environment to the image version that is currently in staging.

The production environment has the same ID as the image name but with a .production suffix. In this example, the production environment ID is nlp.keras-batch.production.
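
The following commands sketch steps 1 and 2 of the preceding flow. Use the merge option that matches the number of commits on the staging branch:

git checkout production

# Single commit on staging: a fast-forward merge keeps the history uncluttered.
git merge --ff staging

# Several related commits on staging: create a merge commit instead.
# git merge --no-ff staging

# Pushing to the remote production branch starts the production pipeline.
git push origin production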

Use a production image

After the production pipeline finishes, users can create Deep Learning VM instances for their daily tasks with the new image version. Users who know the environment ID and have sufficient permissions can create a Deep Learning VM instance in the production environment. They don't need to be concerned with image families, Deep Learning Containers image versions, details about commit SHAs, or image repositories.

To perform this task, users run the following command in Cloud Shell:

gcloud notebooks instances create notebook-vm-1 \
    --location=us-central1-a \
    --environment=nlp.keras-batch.production

The following diagram shows the steps that are involved in creating a Deep Learning VM in production:

Steps to create a Deep Learning VM instance in production.

The steps shown in the diagram are similar to those in the staging environment as explained earlier in Run manual tests. The main difference is that Vertex AI Workbench retrieves the SHA from the production environment instead of from the staging environment.

Use a fallback image

The fallback environment points to the last known working version. If the production version fails, users can immediately create a Deep Learning VM instance by using the fallback environment. To perform this task, users run the following command in Cloud Shell:

gcloud notebooks instances create notebook-vm-1 \
    --location=us-central1-a \
    --environment=nlp.keras-batch.fallback

The fallback environment has a .fallback suffix. In order to create a Deep Learning VM instance using an environment, users must know the environment ID and they must have the notebooks.instances.create permission.

Users don't need to request a working image from the owners in order to create a Deep Learning VM instance by using a fallback environment. Therefore, this approach minimizes the recovery time objective (RTO) and prevents owners from becoming a bottleneck during incidents with the production environment.
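
How you grant the notebooks.instances.create permission depends on your organization's IAM conventions. One option is a narrow custom role, as in the following sketch; the role ID and member are examples only, and creating an instance typically also requires permissions on related resources, such as permission to use the instance's service account:

# Create a custom role that contains only the permission mentioned above.
gcloud iam roles create notebookInstanceCreator \
    --project=PROJECT_ID \
    --title="Notebook instance creator" \
    --permissions=notebooks.instances.create

# Grant the role to a data scientist.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:dana@example.com" \
    --role="projects/PROJECT_ID/roles/notebookInstanceCreator"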

Revert to the fallback image

If there's an issue with the image in the production environment, you as an owner can overwrite the production environment with the attributes of the fallback environment in order to prevent users from creating additional faulty Deep Learning VM instances. (A combined script for the following steps is sketched after the procedure.)

  1. In Cloud Shell, get the fallback commit SHA for the image that's in the current fallback environment:

    gcloud notebooks environments describe IMAGE_NAME.fallback \
        --location=LOCATION
    

    Replace the following:

    • IMAGE_NAME: the name of the fallback image to get the commit SHA for.
    • LOCATION: the Google Cloud location of the environment.
  2. Delete the current production environment:

    gcloud notebooks environments delete IMAGE_NAME.production \
        --location=LOCATION
    
  3. Create the new production environment using the fallback commit SHA:

    gcloud notebooks environments create IMAGE_NAME.production \
        --location=LOCATION \
        --container-repository=gcr.io/PROJECT_ID/IMAGE_NAME \
        --container-tag=FALLBACK_SHA \
        --display-name=IMAGE_NAME.production
    

    Replace the following:

    • PROJECT_ID: the ID of the Google Cloud project that contains the image.
    • FALLBACK_SHA: the fallback SHA that you got earlier.
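
The following sketch combines the preceding steps into a single sequence. It assumes that the environment's image tag is exposed in the containerImage.tag field of the describe output; verify the field name in your own output before you rely on it:

# Get the commit SHA that the fallback environment points to.
FALLBACK_SHA=$(gcloud notebooks environments describe IMAGE_NAME.fallback \
    --location=LOCATION \
    --format="value(containerImage.tag)")

# Replace the production environment with one that points to the fallback image version.
gcloud notebooks environments delete IMAGE_NAME.production \
    --location=LOCATION --quiet

gcloud notebooks environments create IMAGE_NAME.production \
    --location=LOCATION \
    --container-repository=gcr.io/PROJECT_ID/IMAGE_NAME \
    --container-tag="${FALLBACK_SHA}" \
    --display-name=IMAGE_NAME.production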

Revert to a previous image

If there's an issue with the production image and the fallback image isn't usable or doesn't exist, you need to retrieve a known working image. The owner's Git repository keeps a history of the changes that correspond to each of the image versions.

  1. Search for a working commit by listing the previous commits and filtering the commit messages for specific keywords, for example by running git log --oneline or git log --grep with a keyword such as the name of a dependency.

  2. After you find a working commit, make a note of the commit SHA.

    The following listing shows the SHAs of two commits into the production branch.

    commit b2cc9f405d29e64e1229af2b95e7c9ccca86e218 (HEAD -> production)
    Author: Dana <dana@example.com>
    Date:   Tue Feb 2 15:44:32 2021 +0100

        Added dependency: cloudml-hypertune

    commit dea830d86bd14e96ff2c8c0d6b7411309b976799
    Author: Taylor <taylor@example.com>
    Date:   Thu Jan 7 16:20:07 2021 +0100

        Upgraded from tf-gpu.1-14 to tf-gpu.1-15
    
  3. In Cloud Shell, make sure that the image version with that particular SHA as a tag is available in Container Registry:

    gcloud container images list-tags gcr.io/PROJECT_ID/IMAGE_NAME | \
        grep WORKING_VERSION_COMMIT_SHA
    

    Replace the following:

    • PROJECT_ID: the ID of the Google Cloud project that contains the image.
    • IMAGE_NAME: the name of the image.
    • WORKING_VERSION_COMMIT_SHA: the SHA of a working commit that you got earlier in the procedure.

    If the tag is available in Container Registry, the command returns the image digest, tag, and timestamp.

    The commit SHA is the common identifier that links your Git commits with your image versions. Container Registry keeps all of the image versions that you publish unless you delete them.

  4. Overwrite the production environment:

    gcloud notebooks environments delete IMAGE_NAME.production \
        --location=LOCATION
    
    gcloud notebooks environments create IMAGE_NAME.production \
        --location=LOCATION \
        --container-repository=gcr.io/PROJECT_ID/IMAGE_NAME \
        --container-tag=WORKING_VERSION_COMMIT_SHA \
        --display-name=IMAGE_NAME.production
    

    Replace LOCATION with the Google Cloud location of the environment.

  5. If the image version isn't available in Container Registry, rebuild it by using the code in your Git repository.

After you've overwritten the production environment, the next step is to revert to a previous commit in Git. There are multiple ways to do this; the following sections describe two approaches. Choose the approach that matches your objective.

Approach 1: Temporary solution while you fix the code

You might not want to revert the code in your staging and production branches to a known working version, but you need to provide a previous image version to your users as a temporary solution. In that case, we recommend that you create a temporary branch and its corresponding trigger.

  1. In Cloud Shell, create a temporary branch named temp that points to the commit SHA of the working version:

    git checkout -b temp WORKING_VERSION_COMMIT_SHA
    

    Replace WORKING_VERSION_COMMIT_SHA with the SHA of a working image version that you got earlier.

  2. Perform one of the following tasks:

    • If you want to overwrite your staging environment, create a Cloud Build trigger that starts the staging pipeline when you commit to the temp branch.
    • If you don't want to overwrite your staging environment, copy the pipeline config YAML file and shell script, modify the environment suffix from staging to temp, and use the new config for the trigger definition.
  3. Push your changes to the remote repository:

    git push --set-upstream origin temp
    

    This command triggers a build that produces a new image version in the staging environment that corresponds to the Git commit that you chose.

  4. Replace the faulty production environment and point it to the newly built image version using the gcloud notebooks environments delete and create commands as described in Revert to a previous image.

Approach 2: Revert your current code

You can revert the code in your staging and production branches to a known working version.

  1. In Cloud Shell, revert the code:

    git checkout staging
    git revert --no-commit WORKING_VERSION_COMMIT_SHA..HEAD
    

    Replace WORKING_VERSION_COMMIT_SHA with the SHA of a working version that you got earlier.

    These commands stage a revert of every change between HEAD and the selected known working commit. The --no-commit flag prevents Git from creating one commit for each reverted commit, so that you can create a single commit for the whole range (see the sketch after these steps). The reverted commits remain in your Git history.

  2. Commit the reverted changes, push the commit to the staging branch in the remote repository, and then follow the process described in Update the Deep Learning Containers image and dependencies.
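
For example, a minimal sequence for this step looks like the following; the commit message is only an example:

# Create the single revert commit that the --no-commit flag left staged.
git commit -m "Revert to WORKING_VERSION_COMMIT_SHA"

# Pushing to the remote staging branch starts the staging pipeline.
git push origin staging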

What's next