Running instances with GPU accelerators

This page describes how to use NVIDIA graphics processing unit (GPU) hardware accelerators on Container-Optimized OS virtual machine (VM) instances.

Overview

On Compute Engine, you can create Container-Optimized OS VM instances equipped with NVIDIA Tesla K80, P100, P4, V100, and T4 GPUs. GPUs provide compute power to drive deep-learning tasks such as image recognition and natural language processing, as well as other compute-intensive tasks such as video transcoding and image processing.

Google provides a seamless experience for running GPU workloads within Docker containers on Container-Optimized OS VM instances, so that you also benefit from other Container-Optimized OS features such as security and reliability.

Requirements

Running GPUs on Container-Optimized OS VM instances has the following requirements:

  • Container-Optimized OS version: To run GPUs on Container-Optimized OS VM instances, the Container-Optimized OS release milestone must be an LTS milestone, and the milestone number must be 85 or higher.
  • GPU quota: You must have Compute Engine GPU quota in your desired zone before you can create Container-Optimized OS VM instances with GPUs. To ensure that you have enough GPU quota in your project, see Quotas in the Google Cloud Console.

    If you require additional GPU quota, you must request GPU quota in the Cloud Console. If you have an established billing account, your project automatically receives GPU quota after you submit the quota request.
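
As a quick check, you can inspect a region's current quotas with the gcloud CLI and filter for GPU entries; the region name below is a placeholder to replace with your own:

```shell
# List the quotas for a region and keep only GPU-related entries.
# Replace us-central1 with the region you plan to use.
gcloud compute regions describe us-central1 \
  --format="flattened(quotas)" | grep -i gpu
```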

  • NVIDIA GPU drivers: You must install NVIDIA GPU drivers yourself on your Container-Optimized OS VM instances. The Installing NVIDIA GPU device drivers section explains how to install the drivers on Container-Optimized OS VM instances.

Getting started: Running GPUs on Container-Optimized OS

The following sections explain how to run GPUs on Container-Optimized OS VM instances.

First, you need a Container-Optimized OS VM instance with GPUs. You can either create a Container-Optimized OS VM instance with a GPU or add GPUs on an existing Container-Optimized OS VM instance. When you create VM instances, remember to choose images or image families from the cos-cloud image project.
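
For illustration, a Container-Optimized OS VM instance with a single GPU might be created with a command like the following sketch; the instance name, zone, machine type, accelerator type, and image family are placeholders to adjust for your project:

```shell
# Create a Container-Optimized OS VM with one NVIDIA T4 GPU attached.
# GPU instances must use --maintenance-policy=TERMINATE because they
# cannot live-migrate during host maintenance events.
gcloud compute instances create my-gpu-instance \
  --project=project-id \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=cos-85-lts \
  --image-project=cos-cloud
```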

To check all GPUs attached to your current Container-Optimized OS VM instances, run the following command:

gcloud compute instances describe instance-name \
  --project=project-id \
  --zone=zone \
  --format="value(guestAccelerators)"

Replace the following:

  • instance-name: The name of the instance.
  • project-id: Your project ID.
  • zone: The zone for the instance.

Installing NVIDIA GPU device drivers

After you create an instance with one or more GPUs, your system requires device drivers so that your applications can access the devices. This guide shows how to install the NVIDIA proprietary drivers on Container-Optimized OS VM instances.

Container-Optimized OS provides a built-in utility, cos-extensions, that simplifies the NVIDIA driver installation process. By running the utility, users accept the NVIDIA license agreement.

Identifying GPU driver version

Each Container-Optimized OS image version has a default supported NVIDIA GPU driver version. See the release notes of the major Container-Optimized OS LTS milestones for the default supported versions.

You may also check all the supported GPU driver versions by running the following command on your Container-Optimized OS VM instance:

cos-extensions list

Installing drivers through shell commands

After you connect to your Container-Optimized OS VM instances, you can run the following command manually to install drivers:

cos-extensions install gpu
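
If you need a driver version other than the default, cos-extensions accepts a version argument. This is a sketch under the assumption that your image's cos-extensions build supports the -version flag; the version string is a placeholder that should come from the output of cos-extensions list:

```shell
# Install a specific supported GPU driver version
# (placeholder version shown; pick one from `cos-extensions list`).
sudo cos-extensions install gpu -- -version=450.51.06
```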

Installing drivers through startup scripts

You can also install GPU drivers through startup scripts. You can provide the startup script when you create VM instances or apply the script to running VM instances and then reboot the VMs. This allows you to install drivers without connecting to the VMs. It also makes sure the GPU drivers are configured on every VM reboot.

The following is an example startup script to install drivers:

#! /bin/bash

cos-extensions install gpu
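
As a sketch, assuming the script above is saved locally as install-gpu.sh, you can attach it at instance creation time with the startup-script metadata key:

```shell
# Pass the startup script when creating the instance. Accelerator and
# other GPU-related flags are omitted here for brevity.
gcloud compute instances create my-gpu-instance \
  --image-family=cos-85-lts \
  --image-project=cos-cloud \
  --metadata-from-file=startup-script=install-gpu.sh
```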

Installing drivers through cloud-init

Cloud-init is similar to startup scripts but more powerful. The following example shows how to install the GPU driver through cloud-init:

#cloud-config

runcmd:
  - cos-extensions install gpu
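
On Container-Optimized OS, cloud-init configuration is supplied through the user-data metadata key. Assuming the configuration above is saved locally as cloud-init.yaml, it might be attached like this:

```shell
# Attach the cloud-init config at creation time via the user-data key.
# Accelerator and other GPU-related flags are omitted for brevity.
gcloud compute instances create my-gpu-instance \
  --image-family=cos-85-lts \
  --image-project=cos-cloud \
  --metadata-from-file=user-data=cloud-init.yaml
```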

Using cloud-init allows you to specify the dependencies so that your GPU applications will only run after the driver has been installed. See the End-to-end: Running a GPU application on Container-Optimized OS section for more details.

For more information about how to use cloud-init on Container-Optimized OS VM instances, see the creating and configuring instances page.

Verifying installation

You can run the following commands on your Container-Optimized OS VM instances to manually verify the GPU driver installation. The output shows GPU device information, such as the device state and the driver version.

# Make the driver installation path executable by remounting it.
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
/var/lib/nvidia/bin/nvidia-smi

Configuring Docker containers to consume GPUs

After the GPU drivers are installed, you can configure Docker containers to consume GPUs. The following example shows you how to run a simple CUDA application in a Docker container that consumes /dev/nvidia0:

docker run \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  gcr.io/google_containers/cuda-vector-add:v0.1
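
To confirm that a container can see the GPU at all, one option is to run nvidia-smi from inside a generic container using the driver binaries mounted from the host. This is a minimal sketch; the debian:buster base image is just an example:

```shell
# Run nvidia-smi inside a container using the host's driver files.
docker run --rm \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --env LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
  debian:buster \
  /usr/local/nvidia/bin/nvidia-smi
```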

You can run your Docker containers through cloud-init to specify the dependency between driver installation and your Docker containers. See the End-to-end: Running a GPU application on Container-Optimized OS section for more details.

End-to-end: Running a GPU application on Container-Optimized OS

The following end-to-end example shows you how to use cloud-init to configure Container-Optimized OS VM instances that provision a GPU application container myapp:latest after the GPU driver has been installed:

#cloud-config

users:
- name: myuser
  uid: 2000

write_files:
  - path: /etc/systemd/system/install-gpu.service
    permissions: 0644
    owner: root
    content: |
      [Unit]
      Description=Install GPU drivers
      Wants=gcr-online.target docker.socket
      After=gcr-online.target docker.socket

      [Service]
      User=root
      Type=oneshot
      ExecStart=cos-extensions install gpu
      StandardOutput=journal+console
      StandardError=journal+console
  - path: /etc/systemd/system/myapp.service
    permissions: 0644
    owner: root
    content: |
      [Unit]
      Description=Run a myapp GPU application container
      Requires=install-gpu.service
      After=install-gpu.service

      [Service]
      User=root
      Type=oneshot
      RemainAfterExit=true
      ExecStart=/usr/bin/docker run --rm -u 2000 --name=myapp --device /dev/nvidia0:/dev/nvidia0 myapp:latest
      StandardOutput=journal+console
      StandardError=journal+console

runcmd:
  - systemctl daemon-reload
  - systemctl start install-gpu.service
  - systemctl start myapp.service

About the CUDA libraries

The NVIDIA device drivers you install on your Container-Optimized OS VM instances include the CUDA libraries.

The preceding example also shows you how to mount CUDA libraries and debug utilities into Docker containers at /usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.

CUDA applications running in Docker containers that are consuming NVIDIA GPUs need to dynamically discover CUDA libraries. This requires including /usr/local/nvidia/lib64 in the LD_LIBRARY_PATH environment variable.
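
For example, if your container image does not already set the variable, you can pass it on the docker run command line. This sketch reuses the mount paths and device flags from the earlier example, with myapp:latest as the application image:

```shell
# Make the mounted CUDA libraries discoverable inside the container.
docker run \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --env LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  myapp:latest
```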

Use Ubuntu-based CUDA Docker base images for CUDA applications on Container-Optimized OS, where LD_LIBRARY_PATH is already set appropriately.

Security

Just like other kernel modules on Container-Optimized OS, GPU drivers are cryptographically signed and verified by keys that are built into the Container-Optimized OS kernel. Unlike some other distros, Container-Optimized OS does not allow users to enroll their Machine Owner Key (MOK) and use the keys to sign custom kernel modules. This is to ensure the integrity of the Container-Optimized OS kernel and reduce the attack surface.

Restrictions

COS version restrictions

Only COS LTS release milestone 85 and later support the cos-extensions utility mentioned in the Installing NVIDIA GPU device drivers section. For earlier COS release milestones, use the cos-gpu-installer open source tool to manually install GPU drivers.

VM instances restrictions

VM instances with GPUs have specific restrictions that make them behave differently than other instance types. For more information, see the Compute Engine GPU restrictions page.

Quota and availability

GPUs are available in specific regions and zones. When you request GPU quota, consider the regions in which you intend to run your Container-Optimized OS VM instances.

For a complete list of applicable regions and zones, see GPUs on Compute Engine. You can also see GPUs available in your zone using the gcloud command-line tool.

gcloud compute accelerator-types list
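
For example, to narrow the list to a single zone (the zone name is a placeholder):

```shell
# List only the accelerator types offered in one zone.
gcloud compute accelerator-types list --filter="zone:us-central1-a"
```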

Pricing

For GPU pricing information, see the pricing table on the Google Cloud GPU page.

Supportability

Each Container-Optimized OS release version has at least one supported NVIDIA GPU driver version. The Container-Optimized OS team qualifies the supported GPU drivers against the Container-Optimized OS version before release to make sure they are compatible. New versions of the NVIDIA GPU drivers may be made available from time to time. Some GPU driver versions will not be qualified for Container-Optimized OS, and the qualification timeline is not guaranteed.

When the Container-Optimized OS team releases a new version on a release milestone, the team tries to support the latest GPU driver version on the corresponding driver branch. This is to fix CVEs discovered in GPU drivers as soon as possible.

If a Container-Optimized OS customer identifies an issue that's related to the NVIDIA GPU drivers, the customer must work directly with NVIDIA for support. If the issue is not driver specific, then users can open a request with Cloud Support.

What's next