This page describes how to use NVIDIA graphics processing unit (GPU) hardware accelerators on Container-Optimized OS virtual machine (VM) instances.
Overview
On Compute Engine, you can create Container-Optimized OS VM instances equipped with NVIDIA Tesla K80, P100, P4, V100, and T4 GPUs. GPUs provide compute power to drive deep-learning tasks such as image recognition and natural language processing, as well as other compute-intensive tasks such as video transcoding and image processing.
You can run GPU workloads in Docker containers on Container-Optimized OS VM instances, so those workloads also benefit from other Container-Optimized OS features such as security and reliability.
Requirements
Running GPUs on Container-Optimized OS VM instances has the following requirements:
- Container-Optimized OS version: To run GPUs on Container-Optimized OS VM instances, the Container-Optimized OS release milestone must be an LTS milestone and the milestone number must be 85 or higher.
- GPU quota: You must have Compute Engine GPU quota in your desired zone before you can create Container-Optimized OS VM instances with GPUs. To ensure that you have enough GPU quota in your project, see Quotas in the Google Cloud Console.
  If you require additional GPU quota, you must request GPU quota in the Cloud Console. If you have an established billing account, your project automatically receives GPU quota after you submit the quota request.
- NVIDIA GPU drivers: You must install NVIDIA GPU drivers yourself on your Container-Optimized OS VM instances. The Installing NVIDIA GPU device drivers section explains how to install the drivers.
Getting started: Running GPUs on Container-Optimized OS
The following sections explain how to run GPUs on Container-Optimized OS VM instances.
First, you need a Container-Optimized OS VM instance with GPUs. You can either create a Container-Optimized OS VM instance with a GPU or add GPUs to an existing Container-Optimized OS VM instance.
When you create VM instances, remember to choose images or image families from the cos-cloud image project.
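As a rough sketch of the first option, the following shell function wraps `gcloud compute instances create` with the flags a GPU instance typically needs. The function name, the argument values, and the default `cos-85-lts` image family are illustrative assumptions, not values taken from this page. `--maintenance-policy=TERMINATE` is included because instances with GPUs cannot live migrate during host maintenance.

```shell
# Sketch: create a Container-Optimized OS VM with one GPU attached.
# The function name and all argument values are hypothetical placeholders.
create_cos_gpu_vm() {
  instance_name="$1"               # for example: my-gpu-vm
  zone="$2"                        # a zone that offers the GPU type
  gpu_type="$3"                    # for example: nvidia-tesla-t4
  image_family="${4:-cos-85-lts}"  # assumed LTS image family name

  gcloud compute instances create "$instance_name" \
    --zone="$zone" \
    --image-project=cos-cloud \
    --image-family="$image_family" \
    --accelerator="type=${gpu_type},count=1" \
    --maintenance-policy=TERMINATE
}
```

For example, `create_cos_gpu_vm my-gpu-vm us-central1-a nvidia-tesla-t4` would request one T4 GPU, assuming that zone offers the GPU type and the project has quota there.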
To check all GPUs attached to your current Container-Optimized OS VM instances, run the following command:
gcloud compute instances describe instance-name \
--project=project-id \
--zone zone \
--format="value(guestAccelerators)"
Replace the following:
- instance-name: The name of the instance.
- project-id: Your project ID.
- zone: The zone for the instance.
Installing NVIDIA GPU device drivers
After you create an instance with one or more GPUs, your system requires device drivers so that your applications can access the device. This guide shows how to install NVIDIA proprietary drivers on Container-Optimized OS VM instances.
Container-Optimized OS provides a built-in utility, cos-extensions, to simplify the NVIDIA driver installation process. By running the utility, users agree to accept the NVIDIA license agreement.
Identifying GPU driver version
Each Container-Optimized OS image version has a default supported NVIDIA GPU driver version. The following table shows the default supported version for major Container-Optimized OS LTS milestones.
OS Version | Default Driver Version |
---|---|
COS 85 LTS | 450.51.06 |
You may also check all the supported GPU driver versions by running the following command on your Container-Optimized OS VM instance:
cos-extensions list
Installing drivers through shell commands
After you connect to your Container-Optimized OS VM instances, you can run the following command manually to install drivers:
cos-extensions install gpu
Installing drivers through startup scripts
You can also install GPU drivers through startup scripts. You can provide the startup script when you create VM instances or apply the script to running VM instances and then reboot the VMs. This allows you to install drivers without connecting to the VMs. It also makes sure the GPU drivers are configured on every VM reboot.
The following is an example startup script to install drivers:
#! /bin/bash
cos-extensions install gpu
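To apply a startup script like the one above to a VM that already exists, one option is to set the `startup-script` metadata key from a local file. The sketch below assumes the script is saved locally. The function name and arguments are hypothetical.

```shell
# Sketch: attach a local startup script file to an existing VM.
# The script runs at boot, so the VM must be rebooted afterwards.
set_gpu_startup_script() {
  instance_name="$1"   # name of the existing VM (placeholder)
  zone="$2"            # zone of the existing VM (placeholder)
  script_file="$3"     # local path to the startup script

  gcloud compute instances add-metadata "$instance_name" \
    --zone="$zone" \
    --metadata-from-file="startup-script=${script_file}"
}
```

The same `--metadata-from-file` flag can be passed to `gcloud compute instances create` to provide the script at creation time.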
Installing drivers through cloud-init
Cloud-init is similar to startup scripts but more powerful. The following example shows how to install the GPU driver through cloud-init:
#cloud-config
runcmd:
- cos-extensions install gpu
Using cloud-init allows you to specify the dependencies so that your GPU applications will only run after the driver has been installed. See the End-to-end: Running a GPU application on Container-Optimized OS section for more details.
For more information about how to use cloud-init on Container-Optimized OS VM instances, see the creating and configuring instances page.
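As an illustration, a cloud-config file can be passed to a new instance through the `user-data` metadata key. The sketch below first checks that the file starts with `#cloud-config`, because cloud-init relies on that header to recognize the format. The function name, arguments, and the `cos-85-lts` image family are assumptions made for this example.

```shell
# Sketch: create a GPU VM that receives a cloud-config file as user-data.
# The function name and all argument values are hypothetical placeholders.
create_vm_with_cloud_init() {
  instance_name="$1"
  zone="$2"
  gpu_type="$3"
  cloud_config_file="$4"

  # cloud-init treats the file as cloud-config only if the first line says so.
  first_line="$(head -n 1 "$cloud_config_file")"
  if [ "$first_line" != "#cloud-config" ]; then
    echo "error: ${cloud_config_file} must start with #cloud-config" >&2
    return 1
  fi

  gcloud compute instances create "$instance_name" \
    --zone="$zone" \
    --image-project=cos-cloud \
    --image-family=cos-85-lts \
    --accelerator="type=${gpu_type},count=1" \
    --maintenance-policy=TERMINATE \
    --metadata-from-file="user-data=${cloud_config_file}"
}
```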
Verifying installation
You can run the following commands on your Container-Optimized OS VM instances to manually verify GPU driver installation. The output shows GPU device information, such as device state and the driver version.
# Make the driver installation path executable by remounting it.
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia
/var/lib/nvidia/bin/nvidia-smi
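When the drivers are installed by a startup script or cloud-init, installation might still be in progress when you connect. A small polling helper along these lines can wait for the `nvidia-smi` binary to appear before you verify. The function name, attempt count, and delay are arbitrary choices for this sketch.

```shell
# Sketch: poll until the nvidia-smi binary exists, or give up.
# Only existence is checked; the remount steps above are still needed
# before the binary can be executed.
wait_for_nvidia_smi() {
  nvidia_smi_path="${1:-/var/lib/nvidia/bin/nvidia-smi}"
  max_attempts="${2:-30}"    # arbitrary default
  delay_seconds="${3:-10}"   # arbitrary default

  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if [ -f "$nvidia_smi_path" ]; then
      return 0
    fi
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
  return 1
}
```

The function returns 0 once the file exists and 1 if it never appears, so it can gate the verification commands shown above.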
Configuring Docker containers to consume GPUs
After the GPU drivers are installed, you can configure Docker containers to consume GPUs. The following example shows you how to run a simple CUDA application in a Docker container that consumes /dev/nvidia0:
docker run \
--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
--device /dev/nvidiactl:/dev/nvidiactl \
gcr.io/google_containers/cuda-vector-add:v0.1
You can run your Docker containers through cloud-init to specify the dependency between driver installation and your Docker containers. See the End-to-end: Running a GPU application on Container-Optimized OS section for more details.
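The docker run example above exposes a single GPU. On an instance with several GPUs, the devices appear as /dev/nvidia0, /dev/nvidia1, and so on, and each one needs its own `--device` flag. The following helper is a sketch that prints one flag per GPU device node. The function name and the directory parameter (present so the function can be tried against a test directory) are assumptions of this example.

```shell
# Sketch: print one "--device" flag per NVIDIA GPU device node.
# dev_dir is only the directory that is scanned; the printed flags
# always refer to /dev paths.
gpu_device_flags() {
  dev_dir="${1:-/dev}"
  for node in "$dev_dir"/nvidia[0-9]*; do
    # Skip the unexpanded pattern when no GPU device nodes exist.
    [ -e "$node" ] || continue
    name="$(basename "$node")"
    printf '%s %s:%s\n' "--device" "/dev/${name}" "/dev/${name}"
  done
}
```

The output can be spliced into the command line with an unquoted substitution, for example `docker run $(gpu_device_flags) --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl ...`, followed by the volume flags and image shown above.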
End-to-end: Running a GPU application on Container-Optimized OS
The following end-to-end example shows you how to use cloud-init to configure Container-Optimized OS VM instances that provision a GPU application container myapp:latest after the GPU driver has been installed:
#cloud-config

users:
- name: myuser
  uid: 2000

write_files:
  - path: /etc/systemd/system/install-gpu.service
    permissions: 0644
    owner: root
    content: |
      [Unit]
      Description=Install GPU drivers
      Wants=gcr-online.target docker.socket
      After=gcr-online.target docker.socket

      [Service]
      User=root
      Type=oneshot
      ExecStart=cos-extensions install gpu
      StandardOutput=journal+console
      StandardError=journal+console
  - path: /etc/systemd/system/myapp.service
    permissions: 0644
    owner: root
    content: |
      [Unit]
      Description=Run a myapp GPU application container
      Requires=install-gpu.service
      After=install-gpu.service

      [Service]
      User=root
      Type=oneshot
      RemainAfterExit=true
      ExecStart=/usr/bin/docker run --rm -u 2000 --name=myapp --device /dev/nvidia0:/dev/nvidia0 myapp:latest
      StandardOutput=journal+console
      StandardError=journal+console

runcmd:
  - systemctl daemon-reload
  - systemctl start install-gpu.service
  - systemctl start myapp.service
About the CUDA libraries
The NVIDIA device drivers you install on your Container-Optimized OS VM instances include the CUDA libraries.
The example in the Configuring Docker containers to consume GPUs section shows how to mount CUDA libraries and debug utilities into Docker containers at /usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.
CUDA applications running in Docker containers that are consuming NVIDIA GPUs need to dynamically discover CUDA libraries. This requires including /usr/local/nvidia/lib64 in the LD_LIBRARY_PATH environment variable.
Use Ubuntu-based CUDA Docker base images for CUDA applications on Container-Optimized OS, where LD_LIBRARY_PATH is already set appropriately.
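If you use a base image that does not set LD_LIBRARY_PATH for you, the container's entrypoint script can add the directory itself. The following is a minimal sketch of one way to do that without duplicating an existing entry. The function name is hypothetical.

```shell
# Sketch: return a library path value that includes the NVIDIA library
# directory, prepending it only when it is not already present.
with_nvidia_lib_path() {
  current_path="$1"
  nvidia_lib_dir="/usr/local/nvidia/lib64"

  case ":${current_path}:" in
    *":${nvidia_lib_dir}:"*) printf '%s\n' "$current_path" ;;
    "::") printf '%s\n' "$nvidia_lib_dir" ;;
    *) printf '%s\n' "${nvidia_lib_dir}:${current_path}" ;;
  esac
}
```

An entrypoint script could then run `export LD_LIBRARY_PATH="$(with_nvidia_lib_path "${LD_LIBRARY_PATH:-}")"` before starting the CUDA application.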
Security
Just like other kernel modules on Container-Optimized OS, GPU drivers are cryptographically signed and verified by keys that are built into the Container-Optimized OS kernel. Unlike some other distributions, Container-Optimized OS does not allow users to enroll their own Machine Owner Key (MOK) and use it to sign custom kernel modules. This restriction protects the integrity of the Container-Optimized OS kernel and reduces the attack surface.
Restrictions
COS version restrictions
Only COS LTS release milestone 85 and later support the cos-extensions utility mentioned in the Installing NVIDIA GPU device drivers section. For earlier COS release milestones, use the cos-gpu-installer open source tool to manually install GPU drivers.
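A script that has to work across milestones could pick the installer from the milestone number. The sketch below assumes the milestone is exposed as a numeric VERSION_ID line in /etc/os-release, which is an assumption of this example and not something this page states. It checks only the number, not whether the milestone is an LTS release.

```shell
# Sketch: report which installer applies, based on the release milestone.
# Assumes a line such as VERSION_ID=85 in the os-release file.
gpu_installer_for_milestone() {
  os_release_file="${1:-/etc/os-release}"
  milestone="$(sed -n 's/^VERSION_ID="\{0,1\}\([0-9][0-9]*\).*/\1/p' \
    "$os_release_file" | head -n 1)"

  if [ -z "$milestone" ]; then
    echo "unknown"
    return 1
  fi
  if [ "$milestone" -ge 85 ]; then
    echo "cos-extensions"
  else
    echo "cos-gpu-installer"
  fi
}
```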
VM instances restrictions
VM instances with GPUs have specific restrictions that make them behave differently than other instance types. For more information, see the Compute Engine GPU restrictions page.
Quota and availability
GPUs are available in specific regions and zones. When you request GPU quota, consider the regions in which you intend to run your Container-Optimized OS VM instances.
For a complete list of applicable regions and zones, see GPUs on Compute Engine.
You can also see GPUs available in your zone using the gcloud command-line tool:
gcloud compute accelerator-types list
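The command above lists accelerator types for every zone. To narrow the output to a single zone, one approach is a `--filter` expression on the zone field, sketched here as a small wrapper. The function name and the zone value are placeholders.

```shell
# Sketch: list the GPU types offered in one zone.
# The function name is hypothetical; pass the zone you plan to use.
list_gpu_types_in_zone() {
  zone="$1"   # for example: us-central1-a
  gcloud compute accelerator-types list --filter="zone:${zone}"
}
```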
Pricing
For GPU pricing information, see the pricing table on the Google Cloud GPU page.
Supportability
Each Container-Optimized OS release version has at least one supported NVIDIA GPU driver version. The Container-Optimized OS team qualifies the supported GPU drivers against the Container-Optimized OS version before release to make sure they are compatible. New versions of the NVIDIA GPU drivers may be made available from time to time. Some GPU driver versions will not be qualified for COS, and the qualification timeline is not guaranteed.
When the Container-Optimized OS team releases a new version on a release milestone, we try to support the latest GPU driver version on the corresponding driver branch, so that CVEs discovered in GPU drivers are fixed as soon as possible.
If a Container-Optimized OS customer identifies an issue that's related to the NVIDIA GPU drivers, the customer must work directly with NVIDIA for support. If the issue is not driver specific, then users can open a request with Cloud Support.
What's next
- Learn more about running containers on a Container-Optimized OS VM instance.
- Learn more about GPUs on Compute Engine.
- Learn more about requesting GPU quota.