Manage upgrades and dependencies for user-managed notebooks: Process
This document describes the tasks for creating a repeatable and scalable end-to-end process for managing upgrades and dependencies for Vertex AI Workbench user-managed notebooks. The intended audience is IT administrators who want to implement a process that lets data scientists consistently update their Deep Learning VM instances and notebook dependencies.
An accompanying document, Manage upgrades and dependencies for user-managed notebooks: Overview, describes the use case for this process and the overall workflow. We recommend that you read that document first.
The code and artifacts for a pipeline that's part of the process described in this document are in the `managed-environments` GitHub repository.
Process tasks and who performs them
As noted in the overview, the process involves three roles: owners (IT administrators), testers, and users (data scientists). The document assumes the following:
- You are an owner.
- You are setting up the process.
- You will communicate with testers and users about their tasks.
The following table shows which role is responsible for each task.
Section | Role responsible |
---|---|
Set up the pipelines | Owners |
Automate the pipelines | Owners |
Update the Deep Learning Containers image and dependencies | Owners |
Run manual tests | Testers |
Push an image to production | Owners |
Use a production image | Users |
Use a fallback image | Users |
Revert to the fallback image | Owners |
Revert to a previous image | Owners |
Set up the pipelines
Owners are responsible for setting up the pipelines that implement the update and dependency management process.
The following screenshot shows the files that are provided in the GitHub sample repository to implement the process. As an owner, you can adapt the files according to your business and technical needs.
The code in the repository defines a pair of Cloud Build pipelines: one for staging and one for production. The YAML files that are shown in the screenshot are the build config files in which you define the steps for each pipeline.
Each pipeline has a corresponding shell script that's called from the pipeline steps. The scripts manage the notebook environments, the commit SHAs, and logging. They provide a clean separation between the pipeline definition and the programmatic instructions that are needed in some pipeline steps. The internals of the config files and of the shell scripts are explained in later sections of this document.
The repository has a subdirectory that contains Jupyter test notebooks. These are the notebooks that Cloud Build runs automatically when you push a change to the staging branch. In the screenshot, the subdirectory is called `test_notebooks`, but you can use any name; just be sure that you specify it in the `dir` parameter for the automated testing step in the staging config file.
The code provides a sample Dockerfile, which includes the Deep Learning Containers image that's used as the basis to build your image and any dependencies. The sample Dockerfile uses a GPU-enabled container with TensorFlow, and it installs the setuptools dependency by using the following commands:
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4
RUN pip install --user --upgrade setuptools
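Before you wire Dockerfile changes into the pipeline, you can optionally build and smoke-test the image locally or in Cloud Shell. The following is a minimal sketch, not part of the sample repository; the image tag and the import check are illustrative:

```
# Build a throwaway image from the sample Dockerfile in the current directory.
docker build -t notebook-image-smoke-test .

# Sanity check: run Python inside the image and confirm that the upgraded
# dependency imports cleanly. Overriding the entrypoint keeps the check
# independent of the base image's default startup behavior.
docker run --rm --entrypoint python notebook-image-smoke-test \
    -c "import setuptools; print(setuptools.__version__)"
```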
The repository also contains a `.gitignore` file that lets you exclude files from the Git repository.
Automate the pipelines
Owners are responsible for automating the process. After you clone and configure your pipeline files, you as an owner set up the mechanism to run the pipelines automatically when you make changes to the Dockerfile or to the notebook tests. To automate the pipelines, you use Git remote branches and Cloud Build triggers.
1. In the Google Cloud console, activate Cloud Shell.

2. Clone the `managed-environments` GitHub sample repository:

       git clone https://github.com/gclouduniverse/managed-environments.git

3. Remove any Git configuration from the cloned repository:

       cd managed-environments
       rm -fr .git

4. Initialize your local Git repository by following the instructions in Adding an existing project to GitHub using the command line. However, for the name of your initial branch, use `production` instead of `main`, as in the following example:

       git init -b production

5. Add the files that were mentioned earlier in Set up the pipelines and push your first pipeline commit to the remote repository that you created in GitHub:

       git add .
       git commit -m "First commit"
       git remote add origin REMOTE_URL
       git push -u origin production

    Replace REMOTE_URL with the URL of the Git repository.

6. Create a branch called `staging`:

       git checkout -b staging

7. Create the production and staging pipelines by following the instructions in Creating and managing build triggers. The names of the branches correspond to the pipeline environments.
Cloud Build triggers are activated when you push changes to a particular remote branch. Therefore, you automate the pipelines by creating two triggers that perform the following tasks:

- One trigger starts the production pipeline when you push a commit to the `production` branch.
- Another trigger starts the staging pipeline when you push a commit to the `staging` branch.

The build configuration files that you need for creating the triggers are `cloudbuild-prod.yaml` and `cloudbuild-staging.yaml`.
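You can create the triggers in the Google Cloud console as described in that page, or from the command line. The following commands are a sketch that assumes your GitHub repository is already connected to Cloud Build; the trigger names and the repository owner are placeholders:

```
# Trigger for the staging pipeline: fires on pushes to the staging branch.
gcloud builds triggers create github \
    --name="staging-pipeline" \
    --repo-owner="GITHUB_OWNER" \
    --repo-name="managed-environments" \
    --branch-pattern="^staging$" \
    --build-config="cloudbuild-staging.yaml"

# Trigger for the production pipeline: fires on pushes to the production branch.
gcloud builds triggers create github \
    --name="production-pipeline" \
    --repo-owner="GITHUB_OWNER" \
    --repo-name="managed-environments" \
    --branch-pattern="^production$" \
    --build-config="cloudbuild-prod.yaml"
```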
Update the Deep Learning Containers image and dependencies
Google regularly publishes new versions of Deep Learning Containers images. In addition, as notebooks evolve, users need to add new dependencies. You as an owner might decide to use a newly published Deep Learning Containers image, or you might add dependencies that are requested by your users. You can make these changes in the Dockerfile.
If you add dependencies, we recommend that you add notebook tests to cover the new dependencies. Your users are best positioned to provide a streamlined version of a notebook that can be used for testing.
When you're satisfied with the changes to the Dockerfile, push the changes to the remote `staging` branch. To push the local `staging` branch the first time and make it track a remote branch of the same name, you run the following command:

    git push --set-upstream origin staging
The following diagram shows the sequence of steps that occur after you push a branch in order to update the Deep Learning Containers image and dependencies:
The diagram describes the following flow:
- After the push, the trigger for the staging pipeline starts a Docker build to get a new image version, using the commit SHA as the version ID.
- The trigger uses the Docker `push` command to push the created image version into Container Registry.
- The trigger runs the automated tests and verifies that they completed successfully.
- If the automated tests run successfully, the trigger creates or overwrites the staging environment in Vertex AI Workbench. The trigger points the environment to the image version.
- You verify the status of your build and examine any errors in the Cloud Build log.
The `cloudbuild-staging.yaml` file in GitHub implements the steps that are described in the preceding sequence. The file includes substitutions, which are variables in Cloud Build config files, as shown in the following snippet:

    substitutions:
      _IMAGE_NAME: team1.task1
In this case, `_IMAGE_NAME` is the name of the image that you defined as described in the image names and version IDs section of the overview document, such as `nlp.keras-batch`.
The `${COMMIT_SHA}` and `${PROJECT_ID}` variables in the `cloudbuild-staging.yaml` file are default substitutions that are provided by Cloud Build. The pipeline configuration file uses them to specify the image repository and to pass them as parameters for the shell script, as shown in the following snippet:

    pip install gcloud-notebook-training && \
    time find -name '*.ipynb' -print0 |\
      xargs --null --max-procs=0 --max-args=1 -I {} \
      gcloud-notebook-training \
        --input-notebook "{}" \
        --container-uri gcr.io/${PROJECT_ID}/${_IMAGE_NAME}:${COMMIT_SHA}
The code relies on the open source gcloud-notebook-training framework to run the automated test notebooks. To parallelize test execution, the `--max-procs=0` parameter lets the `xargs` command run each test notebook in a separate process. Because the tests run in parallel, the total test execution time is approximately the runtime of the longest-running test notebook.
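If a test fails and you want to debug a single notebook outside the pipeline, you can run the same gcloud-notebook-training command from Cloud Shell against a specific image version. This is a sketch; the notebook path, project, image name, and tag are placeholders:

```
# Install the same open source test runner that the staging pipeline uses.
pip install gcloud-notebook-training

# Run one test notebook against a specific image version (tagged with a commit SHA).
gcloud-notebook-training \
    --input-notebook "test_notebooks/smoke_test.ipynb" \
    --container-uri "gcr.io/PROJECT_ID/nlp.keras-batch:COMMIT_SHA"
```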
After the automated tests have succeeded, the staging pipeline runs the `publish-staging-env.sh` shell script to create or overwrite the staging environment. The staging environment has the same ID as the image name, but with a `.staging` suffix. Following the earlier example, the staging environment ID is `nlp.keras-batch.staging`. For more information, see Naming conventions in the overview document.
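The authoritative logic is in `publish-staging-env.sh` in the sample repository. Conceptually, the create-or-overwrite step resembles the following sketch, which recreates the staging environment so that it points at the image version that was just built; the variable names here are illustrative:

```
# Remove the existing staging environment, if there is one, so that it can be recreated.
gcloud notebooks environments delete "${_IMAGE_NAME}.staging" \
    --location="${LOCATION}" --quiet || true

# Recreate the staging environment, pointing it at the image version
# that is tagged with the current commit SHA.
gcloud notebooks environments create "${_IMAGE_NAME}.staging" \
    --location="${LOCATION}" \
    --container-repository="gcr.io/${PROJECT_ID}/${_IMAGE_NAME}" \
    --container-tag="${COMMIT_SHA}" \
    --display-name="${_IMAGE_NAME}.staging"
```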
Run manual tests
Running manual tests is the responsibility of the testers. After the staging pipeline finishes running, owners request that testers create Deep Learning VM instances to perform manual testing.
Testers need the environment ID and they must have permissions to create a Deep Learning VM instance. They don't need to be concerned with image families, Deep Learning Containers image versions, details about commit SHAs, or image repositories.
To create instances using the staging environment, testers run the following command in Cloud Shell:
gcloud notebooks instances create notebook-vm-test-1 \
--location=us-central1-a \
--environment=nlp.keras-batch.staging
The following diagram shows the sequence of steps for running manual tests:
The diagram describes the following flow:
- When a tester runs the `gcloud` command listed earlier, Vertex AI Workbench gets the commit SHA and the image repository from the staging environment and pulls the image from Container Registry.
- Using the image, Vertex AI Workbench creates a Deep Learning VM instance and provides a JupyterLab link to the tester.
- The tester runs selected user-managed notebooks and communicates the results to you.
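Testers can also confirm which image version the staging environment currently points to. The following command is a sketch that assumes the environment from the earlier example and that the environment was created in us-central1-a; adjust the name and location to match your setup:

```
# Show the container repository and tag (commit SHA) behind the staging environment.
gcloud notebooks environments describe nlp.keras-batch.staging \
    --location=us-central1-a
```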
Push an image to production
As an owner, you can push the image to the production environment. You do this after Cloud Build runs the automated tests, after the testers have finished testing the new image from the staging environment, and after the testers have indicated that the image works. The process of pushing the image to production makes the image available to all users.
The following diagram shows the sequence of steps for pushing an image to production:
The diagram describes the following flow:
1. You merge the changes for the last staging image in the `staging` Git branch into your local `production` branch.

    If the changes on the `staging` branch include a single commit, we recommend a fast-forward merge (using the `--ff` option) to avoid an unnecessary merge commit. Using this approach keeps the history of the `production` branch uncluttered.

    If the changes on the `staging` branch include more than one commit, we recommend a no-fast-forward merge (using the `--no-ff` option) to create a merge commit. The no-fast-forward merge groups the commits that are related to a single logical change, but retains the history of the individual commits.

    Whether you use a fast-forward merge or a no-fast-forward merge, the code in the sample ensures that the version ID remains unchanged and that it corresponds to the SHA of the last commit that was pushed into the remote `staging` branch. The code ignores the SHA of merge commits. (Example merge commands are shown at the end of this section.)

2. After you merge the changes, you push them to the remote `production` branch, which triggers a production build in Cloud Build. The build pipeline is defined in the `cloudbuild-prod.yaml` config file. You must make sure that the `_IMAGE_NAME` substitution variable has the same value as the one that's defined in the staging pipeline. The production pipeline has a single step that calls the `publish-prod-env.sh` script, which does the following:

    - Gets the commit SHA from the staging environment.
    - Gets the commit SHA from the production environment.
    - Creates or overwrites the fallback environment using the production commit SHA.
    - Creates or overwrites the production environment using the staging commit SHA.
The script makes the fallback environment point to the image version in the previous production environment and causes the production environment to point to the image version in the previous staging environment.
The production environment has the same ID as the image name, but with a `.production` suffix. In this example, the production environment ID is `nlp.keras-batch.production`.
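For reference, the merge-and-push sequence described in the preceding flow might look like the following sketch. It assumes a single commit on the staging branch; replace `--ff` with `--no-ff` when the branch contains more than one commit:

```
# Switch to the local production branch and make sure it's up to date.
git checkout production
git pull origin production

# Merge the tested changes from the staging branch (fast-forward merge).
git merge --ff staging

# Push to the remote production branch, which starts the production pipeline.
git push origin production
```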
Use a production image
After the production pipeline finishes, users can create Deep Learning VM instances for their daily tasks with the new image version. Users who know the environment ID and have sufficient permissions can create a Deep Learning VM instance in the production environment. They don't need to be concerned with image families, Deep Learning Containers image versions, details about commit SHAs, or image repositories.
To perform this task, users run the following command in Cloud Shell:
gcloud notebooks instances create notebook-vm-1 \
--location=us-central1-a \
--environment=nlp.keras-batch.production
The following diagram shows the steps that are involved in creating a Deep Learning VM in production:
The steps shown in the diagram are similar to those in the staging environment as explained earlier in Run manual tests. The main difference is that Vertex AI Workbench retrieves the SHA from the production environment instead of from the staging environment.
Use a fallback image
The fallback environment points to the last known working version. If the production version fails, users can immediately create a Deep Learning VM instance using the fallback environment. To perform this task, users can run the following command in Cloud Shell:
gcloud notebooks instances create notebook-vm-1 \
--location=us-central1-a \
--environment=nlp.keras-batch.fallback
The fallback environment has a `.fallback` suffix. In order to create a Deep Learning VM instance using an environment, users must know the environment ID and they must have the `notebooks.instances.create` permission.
Users do not need to request a working image from the owners in order to create a Deep Learning VM instance using a fallback environment. Therefore, this approach minimizes the recovery time objective (RTO) and prevents owners from being a bottleneck in case of incidents with the production environment.
Revert to the fallback image
If there's an issue with the image in the production environment, you as an owner can overwrite the production environment with the attributes of the fallback environment in order to prevent users from creating additional faulty Deep Learning VM instances.
In Cloud Shell, get the fallback commit SHA for the image that's in the current fallback environment:
gcloud notebooks environments describe IMAGE_NAME.fallback \
    --location=LOCATION
Replace the following:
- IMAGE_NAME: the name of the fallback image to get the commit SHA for.
- LOCATION: the Google Cloud location of the environment.
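If you script this procedure, you can capture the fallback SHA in a shell variable instead of copying it from the describe output. The field path in the following sketch assumes that the environment resource exposes a containerImage block with a tag field; verify the path against your own describe output:

```
# Capture the commit SHA (container tag) that the fallback environment points to.
# The containerImage.tag field path is an assumption; check your describe output.
FALLBACK_SHA="$(gcloud notebooks environments describe IMAGE_NAME.fallback \
    --location=LOCATION \
    --format='value(containerImage.tag)')"
echo "${FALLBACK_SHA}"
```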
Delete the current production environment:
gcloud notebooks environments delete IMAGE_NAME.production \
    --location=LOCATION
Create the new production environment using the fallback commit SHA:
gcloud notebooks environments create IMAGE_NAME.production \
    --location=LOCATION \
    --container-repository=gcr.io/PROJECT_ID/IMAGE_NAME \
    --container-tag=FALLBACK_SHA \
    --display-name=IMAGE_NAME.production
Replace the following:
- PROJECT_ID: the ID of the Google Cloud project that contains the image.
- FALLBACK_SHA: the fallback SHA that you got earlier.
Revert to a previous image
If there's an issue with the production image and the fallback image isn't usable or doesn't exist, you need to retrieve a known working image. The owner's Git repository keeps a history of the changes that correspond to each of the image versions.
Search for a working commit by listing the previous commits and filtering the commit messages with specific keywords. Use one of the following methods:
- Run the `git log` command and filter the results.
- Use your GitHub repository page.
- Use a graphical Git tool like Atlassian SourceTree.
After you find a working commit, make a note of the commit SHA.
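For example, to search the history of the production branch from the command line, you can combine git log with a keyword filter; the keyword here is illustrative:

```
# List commits on the production branch whose messages mention an upgrade.
git log --oneline --grep="Upgraded" production

# Or page through the full history of the branch.
git log production
```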
The following listing shows the SHAs of two commits into the `production` branch:

    commit b2cc9f405d29e64e1229af2b95e7c9ccca86e218 (HEAD -> production)
    Author: Dana <dana@example.com>
    Date:   Tue Feb 2 15:44:32 2021 +0100

        Added dependency: cloudml-hypertune

    commit dea830d86bd14e96ff2c8c0d6b7411309b976799
    Author: Taylor <taylor@example.com>
    Date:   Thu Jan 7 16:20:07 2021 +0100

        Upgraded from tf-gpu.1-14 to tf-gpu.1-15
In Cloud Shell, make sure that the image version with that particular SHA as a tag is available in Container Registry:
gcloud container images list-tags gcr.io/PROJECT_ID/IMAGE_NAME | \
    grep WORKING_VERSION_COMMIT_SHA
Replace the following:
- PROJECT_ID: the ID of the Google Cloud project that contains the image.
- IMAGE_NAME: the name of the image.
- WORKING_VERSION_COMMIT_SHA: the SHA of a working commit that you got earlier in the procedure.
If the tag is available in Container Registry, the command returns the image digest, tag, and timestamp.
The commit SHA is the common identifier that links your Git commits with your image versions. Container Registry keeps all of the image versions that you publish unless you delete them.
Overwrite the production environment:
gcloud notebooks environments delete IMAGE_NAME.production \
    --location=LOCATION

gcloud notebooks environments create IMAGE_NAME.production \
    --location=LOCATION \
    --container-repository=gcr.io/PROJECT_ID/IMAGE_NAME \
    --container-tag=WORKING_VERSION_COMMIT_SHA \
    --display-name=IMAGE_NAME.production
Replace LOCATION with the Google Cloud location of the environment.
If the image version isn't available in Container Registry, rebuild it by using the code in your Git repository.
After you've overwritten the production environment, the next step is to revert to a previous commit in Git. There are multiple ways to do this. The following sections of this document describe two approaches. You choose the approach based on which of the following options describes your objective:
- You need a temporary solution while you fix the issues in the current code.
- You want to revert the current code to a known working version.
Approach 1: Temporary solution while you fix the code
You might not want to revert the code in your `staging` and `production` branches to a known working version, but you need to provide a previous image version to your users as a temporary solution. In that case, we recommend that you create a temporary branch and its corresponding trigger.
In Cloud Shell, create a temporary branch named `temp` that points to the commit SHA of the working version:

    git checkout -b temp WORKING_VERSION_COMMIT_SHA

Replace WORKING_VERSION_COMMIT_SHA with the SHA of a working image version that you got earlier.
Perform one of the following tasks:

- If you want to overwrite your staging environment, create a Cloud Build trigger that starts the staging pipeline when you commit to the `temp` branch.
- If you don't want to overwrite your staging environment, copy the pipeline config YAML file and shell script, modify the environment suffix from `staging` to `temp`, and use the new config for the trigger definition (see the sketch after this procedure).
Push your changes to the remote repository:
    git push --set-upstream origin temp
This command triggers a build that produces a new image version in the staging environment that corresponds to the Git commit that you chose.
Replace the faulty production environment and point it to the newly built image version using the `gcloud notebooks environments delete` and `create` commands as described in Revert to a previous image.
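If you choose the second option in the preceding list (not overwriting the staging environment), the copy-and-rename step might look like the following hypothetical sketch. It assumes that the `.staging` environment suffix and the staging script name appear literally in the copied files; adjust the substitutions to match how your copies actually reference them:

```
# Copy the staging pipeline definition and its shell script.
cp cloudbuild-staging.yaml cloudbuild-temp.yaml
cp publish-staging-env.sh publish-temp-env.sh

# Point the copies at a .temp environment suffix instead of .staging, and
# make the copied config call the copied script. Both substitutions assume
# the originals reference these names literally.
sed -i 's/\.staging/.temp/g' publish-temp-env.sh
sed -i 's/publish-staging-env\.sh/publish-temp-env.sh/g' cloudbuild-temp.yaml
```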
Approach 2: Revert your current code
You can revert the code in your staging and production branches to a known working version.
In Cloud Shell, revert the code:

    git checkout staging
    git revert --no-commit WORKING_VERSION_COMMIT_SHA..HEAD

Replace WORKING_VERSION_COMMIT_SHA with the SHA of a working version that you got earlier.

These commands revert the changes from HEAD back to the selected known working commit. The `--no-commit` flag stages all of the reverted changes so that you can create a single commit, instead of creating one commit for each reverted commit. The original changes remain in your Git history.

Push your commit to the `staging` branch in the remote repository, and then follow the process described earlier in Update the Deep Learning Containers image and dependencies.
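Because the revert is staged but not yet committed, you create the single revert commit yourself and then push it. A minimal sketch, with an illustrative commit message:

```
# Create one commit that contains all of the reverted changes.
git commit -m "Revert to known working version WORKING_VERSION_COMMIT_SHA"

# Push to the remote staging branch, which starts the staging pipeline.
git push origin staging
```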
What's next
- Create a derivative container.
- Learn how to train in a container using Google Kubernetes Engine.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.