Introducing Deep Learning Containers: Consistent and portable environments

It’s easy to underestimate how much time it takes to get a machine learning project up and running. All too often, these projects require you to manage the compatibility and complexities of an ever-evolving software stack, which can be frustrating and time-consuming, and can keep you from what you really want to do: iterating on and refining your model. To help you bypass this setup and quickly get started with your project, we’re introducing Deep Learning Containers in beta today.

Deep Learning Containers are pre-packaged, performance-optimized, and compatibility-tested, so you can get started immediately. Productionizing your workflow requires not only developing the code or artifacts you want to deploy, but also maintaining a consistent execution environment to guarantee reproducibility and correctness. If your development strategy involves a combination of local prototyping and multiple cloud tools, it can often be frustrating to ensure that all the necessary dependencies are packaged correctly and available to every runtime. Deep Learning Containers address this challenge by providing a consistent environment for testing and deploying your application across GCP products and services, like Cloud AI Platform Notebooks and Google Kubernetes Engine (GKE), making it easy to scale in the cloud or move between cloud and on-premises environments. In addition, we provide hardware-optimized versions of TensorFlow, whether you’re training on NVIDIA GPUs or deploying on Intel CPUs.

In this blog post, we’ll cover some common scenarios when working with Deep Learning Containers, including how to select a container, develop locally, and create derivative containers for use in Cloud AI Platform Notebooks.

Choose a container and develop locally
All Deep Learning Containers have a preconfigured Jupyter environment, so each can be pulled and used directly as a prototyping space. First, make sure you have the gcloud tool installed and configured. Then, determine the container that you would like to use. All containers are hosted under gcr.io/deeplearning-platform-release, and can be listed with the command:

  gcloud container images list --repository="gcr.io/deeplearning-platform-release"

Each container provides a Python 3 environment consistent with the corresponding Deep Learning VM, including the selected data science framework, conda, the NVIDIA stack for GPU images (CUDA, cuDNN, NCCL), and a host of other supporting packages and tools. Our initial release consists of containers for TensorFlow 1.13, TensorFlow 2.0, PyTorch, and R, and we are working to reach parity with all Deep Learning VM types.

With the exception of the base containers, the container names will be in the format <framework>-<cpu/gpu>.<framework version>. Let’s say you’d like to prototype on CPU-only TensorFlow. The following command will start the TensorFlow Deep Learning Container in detached mode, bind the running Jupyter server to port 8080 on the local machine, and mount /path/to/local/dir to /home in the container.

  docker run -d -p 8080:8080 -v /path/to/local/dir:/home \
  gcr.io/deeplearning-platform-release/tf-cpu.1-13

Then, the running JupyterLab instance can be accessed at localhost:8080. Make sure to develop in /home, since that is the directory mounted from your local machine. Files written anywhere else live only inside the container and are lost once the container is removed.
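
The naming pattern described above is easy to mistype, so it can help to assemble image names with a small shell function. The sketch below is only an illustration: the function name dlc_image and its argument order are our own assumptions, not part of any Google tooling. Only the registry path and the <framework>-<cpu/gpu>.<framework version> pattern come from this post.

```shell
# Hypothetical helper (not part of gcloud or docker): build a Deep Learning
# Container image name from its parts.
# Usage: dlc_image <framework> <cpu|gpu> [framework-version]
dlc_image() {
  local framework="$1" device="$2" version="${3:-}"
  case "$device" in
    cpu|gpu) ;;
    *) echo "device must be 'cpu' or 'gpu', got '$device'" >&2; return 1 ;;
  esac
  local name="${framework}-${device}"
  # Unversioned names such as pytorch-gpu also appear in this post,
  # so the version suffix is treated as optional.
  if [ -n "$version" ]; then
    name="${name}.${version}"
  fi
  printf 'gcr.io/deeplearning-platform-release/%s\n' "$name"
}
```

With this in place, the earlier command could be written as docker run -d -p 8080:8080 -v /path/to/local/dir:/home "$(dlc_image tf cpu 1-13)".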

If you would like to use the GPU-enabled containers, you will need a CUDA 10-compatible GPU, the associated driver, and nvidia-docker installed. Then, you can run a similar command:

  docker run --runtime=nvidia -d -p 8080:8080 -v /path/to/local/dir:/home \
  gcr.io/deeplearning-platform-release/tf-gpu.1-13
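
If you script this choice, a rough way to decide between the CPU and GPU images is to check whether the NVIDIA driver tools are present on the machine. The sketch below assumes that finding nvidia-smi on the PATH is a good enough signal. It does not verify the CUDA version or that nvidia-docker is installed, and the function name dlc_device is hypothetical.

```shell
# Hypothetical helper: print "gpu" when nvidia-smi is on the PATH, else "cpu".
# This is a heuristic only; it does not check the CUDA version or nvidia-docker.
dlc_device() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo gpu
  else
    echo cpu
  fi
}
```

For example, gcr.io/deeplearning-platform-release/tf-$(dlc_device).1-13 resolves to one of the two TensorFlow 1.13 images used above. Remember that the GPU image also needs the --runtime=nvidia flag.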

Create derivative containers and deploy to Cloud AI Platform Notebooks and GKE
At some point, you’ll likely need a beefier machine than what your local machine has to offer, but you may have local data and packages that need to be installed in the environment. Deep Learning Containers can be extended to include your local files, and these custom containers can then be deployed to a Cloud AI Platform Notebooks instance or to GKE.

For example, imagine that you have a local Python package called mypackage that you are using as part of your PyTorch workflow. Create a Dockerfile in the directory above mypackage as follows.

Dockerfile

FROM gcr.io/deeplearning-platform-release/pytorch-gpu
COPY mypackage /mypackage
RUN pip install /mypackage

This simple Dockerfile copies in the package files and installs the package into the default environment. You can add additional RUN pip/conda commands, but you should not modify CMD or ENTRYPOINT, as these are already configured for AI Platform Notebooks. Build and upload this container to Google Container Registry.

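# Resolve the active project and build a timestamped image URI.
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=pytorch_custom_container
export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG

# Build the image from the Dockerfile in the current directory.
docker build -f Dockerfile -t "$IMAGE_URI" ./

# Let Docker authenticate to Container Registry through gcloud, then push.
gcloud auth configure-docker
docker push "$IMAGE_URI"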
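
One thing to watch for in the commands above: if no default project is configured, the gcloud lookup may come back empty, and the image URI then silently turns into something like gcr.io//pytorch_custom_container:<tag>. A small guard can catch this before you build. The function below is a sketch with a name of our own choosing (make_image_uri), not an existing tool.

```shell
# Hypothetical helper: assemble a Container Registry image URI and fail early
# if any part is empty (for example, when no gcloud project is configured).
# Usage: make_image_uri <project-id> <repo-name> <tag>
make_image_uri() {
  local project="$1" repo="$2" tag="$3"
  if [ -z "$project" ] || [ -z "$repo" ] || [ -z "$tag" ]; then
    echo "project, repository name, and tag must all be non-empty" >&2
    return 1
  fi
  printf 'gcr.io/%s/%s:%s\n' "$project" "$repo" "$tag"
}
```

You could then set the URI with IMAGE_URI="$(make_image_uri "$PROJECT_ID" "$IMAGE_REPO_NAME" "$IMAGE_TAG")" || exit 1 so that a missing project stops the script instead of producing a malformed tag.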

Then, create an AI Platform Notebooks instance using the gcloud CLI (custom container UI support coming soon). Feel free to modify the instance type and accelerator fields to suit your workload needs.

# Compute Engine Instance parameters
export IMAGE_FAMILY="common-container"
export ZONE="us-central1-b"
export INSTANCE_NAME="custom-container-notebook"
export INSTANCE_TYPE="n1-standard-8"
export ACCELERATOR="type=nvidia-tesla-t4,count=2"

gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project="deeplearning-platform-release" \
        --maintenance-policy=TERMINATE \
        --accelerator=$ACCELERATOR \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=100GB \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --metadata="install-nvidia-driver=True,proxy-mode=project_editors,container=$IMAGE_URI"

The image will take a few minutes to set up. If the container was loaded correctly, there will be a link to access JupyterLab written to the proxy-url metadata field, and the instance will appear as ready in the AI Platform > Notebooks UI on Cloud Console. You can also query the link directly by describing the instance metadata.

  gcloud compute instances describe "${INSTANCE_NAME}" \
  --zone="${ZONE}" \
  --format='value[](metadata.items.proxy-url)'

Accessing this link will take you to your JupyterLab instance. 
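
Because the proxy-url field stays empty until the container has finished loading, a script that needs the link has to poll for it. The following is a minimal sketch of such a polling loop. The function name retry_until_output, the attempt count, and the delay are all assumptions for illustration; it simply reruns whatever command you pass until that command prints something.

```shell
# Hypothetical helper: rerun a command until it prints non-empty output.
# Usage: retry_until_output <max_attempts> <delay_seconds> <command> [args...]
retry_until_output() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1 out
  while [ "$i" -le "$attempts" ]; do
    # Ignore failures of the command itself; only its output matters here.
    out="$("$@" 2>/dev/null)" || true
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    # Wait between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not run here): poll every 10 seconds, up to 30 times.
# retry_until_output 30 10 gcloud compute instances describe "${INSTANCE_NAME}" \
#   --format='value[](metadata.items.proxy-url)'
```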

Please note: only data saved to /home will be persisted across reboots. By default, the container VM mounts /home on the VM to /home on the container, so make sure you create new notebooks in /home; otherwise, that work will be lost if the instance shuts down.

Deploying Deep Learning Containers on GKE with NVIDIA GPUs
You can also use GKE to develop with Deep Learning Containers. After setting up your GKE cluster with GPUs following the user guide, you just need to specify the container image in your Kubernetes pod spec. The following spec creates a pod that runs the tf-gpu container with one GPU and an attached GCE persistent disk:

pod.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dlc-persistent-volume-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: dlc-tf
spec:
  containers:
  - name: dlc-tf
    image: gcr.io/deeplearning-platform-release/tf-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
      - name: dlc-volume
        mountPath: /home
  volumes:
  - name: dlc-volume
    persistentVolumeClaim:
      claimName: dlc-persistent-volume-claim

Deploy and connect to your instance with the following commands:

kubectl apply -f ./pod.yaml
kubectl port-forward pods/dlc-tf 8080:8080

After the pod is fully deployed, your running JupyterLab instance can be accessed at localhost:8080.
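
Port forwarding fails if the pod is not running yet, so in a script it can be useful to wait for the pod to become Ready first. The sketch below wraps kubectl wait and kubectl port-forward in one function. The function name dlc_connect, the five-minute timeout, and the KUBECTL override (which lets you dry-run the sequence with KUBECTL="echo kubectl") are our own assumptions rather than anything required by GKE.

```shell
# Hypothetical helper: wait for a pod to be Ready, then forward a local port.
# Usage: dlc_connect [pod-name] [port]
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
dlc_connect() {
  local pod="${1:-dlc-tf}" port="${2:-8080}"
  local kubectl="${KUBECTL:-kubectl}"
  # Block until the pod reports Ready, giving up after five minutes.
  # $kubectl is left unquoted on purpose so "echo kubectl" splits into words.
  $kubectl wait --for=condition=Ready "pod/${pod}" --timeout=300s || return 1
  # Forward the local port to the same port inside the pod.
  $kubectl port-forward "pods/${pod}" "${port}:${port}"
}
```

Calling dlc_connect with no arguments targets the dlc-tf pod on port 8080, matching the spec above.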

Getting Started 
If you’re not already a Google Cloud customer, you can sign up today for $300 of credit in our free tier. Then, try out our quick start guides and documentation for more details on getting started with your project.