Let Deep Learning VMs and Jupyter notebooks burn the midnight oil for you: robust and automated training with Papermill

February 27, 2019

Gonzalo Gasca Meza

Developer Programs Engineer, Google

Viacheslav Kovalevskyi

Software Engineer, Google

In the past several years, Jupyter notebooks have become a convenient way of experimenting with machine learning datasets and models, as well as sharing training processes with colleagues and collaborators. Oftentimes your notebook will take a long time to complete its execution. If the instance keeps running after an extended training session has finished, you continue to incur charges even though you are no longer using the Compute Engine resources.

This post will explain how to execute a Jupyter Notebook in a simple and cost-efficient way.

We’ll explain how to deploy a TensorFlow Deep Learning VM image that launches a Jupyter notebook and executes it with Papermill, an open source project from nteract. Once the notebook has finished executing, the Compute Engine instance that hosts your Deep Learning VM image will automatically terminate.

The components of our system:

First, Jupyter Notebooks

The Jupyter Notebook is an open-source, web-based, interactive environment for creating and sharing IPython notebook (.ipynb) documents that contain live code, equations, visualizations, and narrative text. This platform supports data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Next, Deep Learning Virtual Machine (VM) images

The Deep Learning Virtual Machine images are a set of Debian 9-based Compute Engine virtual machine disk images that are optimized for data science and machine learning tasks. All images include common ML frameworks and tools installed from first boot, and can be used out of the box on instances with GPUs to accelerate your data processing tasks. You can launch Compute Engine instances pre-installed with popular ML frameworks like TensorFlow, PyTorch, or scikit-learn, and even add Cloud TPU and GPU support with a single click.

And now, Papermill

Papermill is a library for parametrizing, executing, and analyzing Jupyter Notebooks. It lets you spawn multiple notebooks with different parameter sets and execute them concurrently. Papermill can also help collect and summarize metrics from a collection of notebooks.

Papermill also permits you to read or write data from many different locations. Thus, you can store your output notebook on a different storage system that provides higher durability and easy access in order to establish a reliable pipeline. Papermill recently added support for Google Cloud Storage buckets, and in this post we will show you how to put this new functionality to use.
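To make the idea of running one notebook with several parameter sets more concrete, here is a minimal sketch. It assumes a hypothetical bucket layout (gs://my-bucket/runs) and a notebook parameter named batch_size, and it only prints the papermill commands unless you set DRY_RUN=0, so treat it as an illustration rather than a finished tool.

```shell
# Hypothetical bucket and notebook names, used for illustration only.
INPUT_NOTEBOOK="gs://my-bucket/input.ipynb"
OUTPUT_PREFIX="gs://my-bucket/runs"

# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Build a distinct output path for each batch size.
output_path() {
  printf '%s/output_bs%s.ipynb' "$OUTPUT_PREFIX" "$1"
}

# Run (or print) one papermill execution for the given batch size.
run_notebook() {
  out="$(output_path "$1")"
  if [ "$DRY_RUN" = "1" ]; then
    echo "papermill $INPUT_NOTEBOOK $out -p batch_size $1"
  else
    papermill "$INPUT_NOTEBOOK" "$out" -p batch_size "$1"
  fi
}

# Start one background job per parameter set, then wait for all of them.
run_all() {
  for bs in "$@"; do
    run_notebook "$bs" &
  done
  wait
}

run_all 32 64 128
```

Each run writes to its own output notebook, so concurrent executions do not overwrite each other's results.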

Installation

Submit a Jupyter notebook for execution

The following commands start execution of a Jupyter notebook stored in a Cloud Storage bucket:

# Compute Engine instance parameters
export IMAGE_FAMILY="tf-latest-cu100"
export ZONE="us-central1-b"
export INSTANCE_NAME="notebook-executor"
export INSTANCE_TYPE="n1-standard-8"
# Notebook parameters
export INPUT_NOTEBOOK_PATH="gs://my-bucket/input.ipynb"
export OUTPUT_NOTEBOOK_PATH="gs://my-bucket/output.ipynb"
export PARAMETERS_FILE="params.yaml" # Optional
export PARAMETERS="-p batch_size 128 -p epochs 40"  # Optional
# -f reads parameters from a YAML file; -y expects an inline YAML string
export STARTUP_SCRIPT="papermill ${INPUT_NOTEBOOK_PATH} ${OUTPUT_NOTEBOOK_PATH} -f ${PARAMETERS_FILE} ${PARAMETERS}"

gcloud compute instances create "$INSTANCE_NAME" \
        --zone="$ZONE" \
        --image-family="$IMAGE_FAMILY" \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator='type=nvidia-tesla-t4,count=2' \
        --machine-type="$INSTANCE_TYPE" \
        --boot-disk-size=100GB \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --metadata="install-nvidia-driver=True,startup-script=${STARTUP_SCRIPT}"

# Run this only after the notebook has finished: the startup script runs
# asynchronously once the instance boots, after the create command returns.
gcloud --quiet compute instances delete "$INSTANCE_NAME" --zone "$ZONE"

The above commands do the following:

Create a Compute Engine instance using TensorFlow Deep Learning VM and 2 NVIDIA Tesla T4 GPUs
Install the latest NVIDIA GPU drivers
Execute the notebook using Papermill
Upload the resulting notebook (with all cells computed) to the Cloud Storage bucket, in this case gs://my-bucket/
Terminate the Compute Engine instance

And there you have it! You’ll no longer pay for resources you don’t use: after execution completes, your notebook, with populated cells, is uploaded to the specified Cloud Storage bucket and the instance is terminated. You can read more about it in the Cloud Storage documentation.

Note: In case you are not using a Deep Learning VM, and you want to install Papermill library with Cloud Storage support, you only need to run:

Note: Papermill version 0.18.2 supports Cloud Storage.
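One possible way to set this up outside a Deep Learning VM is sketched below. It assumes that your Papermill release exposes Cloud Storage support through the gcs extra (older releases may bundle it differently, so check the Papermill documentation), and the helper names are hypothetical. The version comparison relies on sort -V from GNU coreutils.

```shell
# Assumption: recent Papermill releases pull in the Cloud Storage
# dependencies through the "gcs" extra.
install_papermill() {
  python3 -m pip install --upgrade "papermill[gcs]"
}

# Print the installed Papermill version, or nothing if it is not installed.
installed_papermill_version() {
  python3 -m pip show papermill 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (commented out so that nothing is installed by accident):
# current="$(installed_papermill_version)"
# version_ge "${current:-0}" "0.18.2" || install_papermill
```

The check against 0.18.2 mirrors the version mentioned in the note above; anything older is upgraded.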

And here is an even simpler set of bash commands:

Execute a notebook using GPU resources

Execute a notebook using CPU resources

The Deep Learning VM instance requires several permissions: read and write access to Cloud Storage, and the ability to delete instances on Compute Engine. That is why our original command defines the scope “https://www.googleapis.com/auth/cloud-platform”.

Your submission process will look like this:

Note: Verify that you have enough CPU or GPU resources available by checking your quota in the zone where your instance will be deployed.
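As a rough sketch of how you might check this from the command line: quotas are reported per region, so the snippet below first derives the region from the zone and then prints that region's quota entries. The helper names are hypothetical, and show_region_quotas assumes an installed and authenticated gcloud.

```shell
ZONE="us-central1-b"

# Derive the region by stripping the trailing "-<letter>" suffix of the zone.
region_from_zone() {
  printf '%s' "${1%-*}"
}

# Print the quota entries (metric, limit, usage) for the zone's region.
show_region_quotas() {
  gcloud compute regions describe "$(region_from_zone "$1")" \
    --format="yaml(quotas)"
}

# Example usage (requires gcloud credentials):
# show_region_quotas "$ZONE"
```

Look for the CPU and GPU entries in the output and compare their usage against the limit before submitting the notebook.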

Executing a Jupyter notebook

Let’s look into the following code:

# Compute Engine Instance parameters
export IMAGE_FAMILY="tf-latest-cu100" 
export ZONE="us-central1-b"
export INSTANCE_NAME="notebook-executor"
export INSTANCE_TYPE="n1-standard-8"
# Notebook parameters
export INPUT_NOTEBOOK_PATH="gs://my-bucket/input.ipynb"
export OUTPUT_NOTEBOOK_PATH="gs://my-bucket/output.ipynb"
export PARAMETERS_FILE="params.yaml" # Optional
export PARAMETERS="-p batch_size 128 -p epochs 40"  # Optional
export STARTUP_SCRIPT="https://raw.githubusercontent.com/GoogleCloudPlatform/ml-on-gcp/master/dlvm/tools/scripts/notebook_executor.sh"

gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator='type=nvidia-tesla-t4,count=2' \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=100GB \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --metadata="input_notebook_path=${INPUT_NOTEBOOK_PATH},output_notebook_path=${OUTPUT_NOTEBOOK_PATH},parameters_file=${PARAMETERS_FILE},startup-script-url=${STARTUP_SCRIPT}"

This command is the standard way to create a Deep Learning VM. But keep in mind, you’ll need to pick the VM that includes the core dependencies you need to execute your notebook. Do not try to use a TensorFlow image if your notebook needs PyTorch or vice versa.

Note: if you do not see a dependency that is required for your notebook and you think should be in the image, please let us know on the forum (or with a comment to this article).

The secret sauce here consists of the following two things:

Papermill library

Startup shell script

Papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks.

Papermill lets you:

Parameterize notebooks via command line arguments or a parameter file in YAML format
Execute and collect metrics across the notebooks
Summarize collections of notebooks

In our case, we are just using its ability to execute notebooks and pass parameters if needed.
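For illustration, the sketch below writes a small YAML parameters file and prints two papermill invocations that pass the same values, one through the file (-f) and one through -p arguments. The file location and the parameter names batch_size and epochs are example values; Papermill injects them after the notebook cell tagged parameters.

```shell
# Hypothetical working directory and file name, for illustration only.
WORKDIR="$(mktemp -d)"
PARAMETERS_FILE="$WORKDIR/params.yaml"

# One "name: value" pair per notebook parameter.
cat > "$PARAMETERS_FILE" <<'EOF'
batch_size: 128
epochs: 40
EOF

# Print two invocations that pass the same values, without running them.
# $1 = input notebook, $2 = output notebook, $3 = parameters file.
print_papermill_commands() {
  echo "papermill $1 $2 -f $3"
  echo "papermill $1 $2 -p batch_size 128 -p epochs 40"
}

print_papermill_commands "gs://my-bucket/input.ipynb" \
  "gs://my-bucket/output.ipynb" "$PARAMETERS_FILE"
```

A parameters file is easier to version and review, while -p arguments are handy for quick one-off overrides.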

Behind the scenes

Let’s start with the startup shell script parameters:

INPUT_NOTEBOOK_PATH: The input notebook, located in a Cloud Storage bucket.
Example: gs://my-bucket/input.ipynb
OUTPUT_NOTEBOOK_PATH: The output notebook, located in a Cloud Storage bucket.
Example: gs://my-bucket/output.ipynb
PARAMETERS_FILE: A YAML file from which notebook parameter values are read.
Example: gs://my-bucket/params.yaml
PARAMETERS: Parameters passed as -p key value pairs for notebook execution.
Example: -p batch_size 128 -p epochs 40

Papermill can execute a notebook with parameters in two ways: (1) through the Python API and (2) through the command line interface. This sample script supports two ways of passing parameters to the Jupyter notebook (a YAML parameters file and -p key value pairs), although Papermill supports other formats, so please consult Papermill’s documentation.

The above script performs the following steps:

Creates a Compute Engine instance using the TensorFlow Deep Learning VM and 2 NVIDIA Tesla T4 GPUs
Installs NVIDIA GPU drivers
Executes the notebook using Papermill
Uploads the resulting notebook (with all cells computed) to the Cloud Storage bucket, in this case gs://my-bucket/
Note: Papermill saves the notebook after each cell executes, which could generate “429 Too Many Requests” errors; these are handled by the library itself.
Terminates the Compute Engine instance
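The steps above can be sketched as a small startup script. This is not the actual notebook_executor.sh, only a simplified illustration under assumptions: it reads the custom metadata keys used in the create command from the Compute Engine metadata server, runs Papermill, and then deletes its own instance. The function names and the RUN_EXECUTOR guard are hypothetical.

```shell
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance"

# Build the metadata server URL for a custom instance attribute.
attribute_url() {
  printf '%s/attributes/%s' "$METADATA_URL" "$1"
}

# Fetch a value from the metadata server; the Metadata-Flavor header is required.
fetch() {
  curl -sf -H "Metadata-Flavor: Google" "$1"
}

# The zone is returned as projects/<number>/zones/<zone>; keep the last part.
zone_from_path() {
  printf '%s' "${1##*/}"
}

main() {
  input="$(fetch "$(attribute_url input_notebook_path)")"
  output="$(fetch "$(attribute_url output_notebook_path)")"
  params_file="$(fetch "$(attribute_url parameters_file)")"  # may be empty

  # Execute the notebook; Papermill writes the result to the output path.
  if [ -n "$params_file" ]; then
    papermill "$input" "$output" -f "$params_file"
  else
    papermill "$input" "$output"
  fi

  # Delete this instance whether or not the notebook succeeded.
  name="$(fetch "$METADATA_URL/name")"
  zone="$(zone_from_path "$(fetch "$METADATA_URL/zone")")"
  gcloud --quiet compute instances delete "$name" --zone "$zone"
}

# Only run when explicitly requested, so the helpers can be inspected safely.
if [ "${RUN_EXECUTOR:-0}" = "1" ]; then
  main
fi
```

Deleting the instance from inside the startup script is what makes the process hands-off: the deletion happens only after Papermill returns, not right after the instance is created.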

Conclusion

By using the Deep Learning VM images, you can automate your notebook training, so that you no longer need to pay extra or manually manage your Cloud infrastructure. Take advantage of all the pre-installed ML software and nteract’s Papermill project to help you solve your ML problems more quickly! Papermill will help you automate the execution of your Jupyter notebooks, and in combination with Cloud Storage and Deep Learning VM images you can now set up this process in a very simple and cost-efficient way.