Adding or removing GPUs

Compute Engine provides graphics processing units (GPUs) that you can add to your virtual machine instances. You can use these GPUs to accelerate specific workloads on your instances such as machine learning and data processing.

For more information about what you can do with GPUs and what types of GPU hardware are available, read GPUs on Compute Engine.

Creating an instance with a GPU

Before you create an instance with a GPU, select which boot disk image you want to use for the instance, and ensure that the appropriate GPU driver is installed.

If you are using GPUs for machine learning, you can use a Deep Learning VM image for your instance. The Deep Learning VM images have GPU drivers pre-installed, and include packages such as TensorFlow and PyTorch. You can also use the Deep Learning VM images for general GPU workloads. For information about the images available, and the packages installed on the images, see the Deep Learning VM documentation.

You can also use any public image or custom image, but some images might require a unique driver or install process that is not covered in this guide. You must identify what drivers are appropriate for your images.

For steps to install drivers, see installing GPU drivers.

When you create an instance with one or more GPUs, you must set the instance to terminate on host maintenance. Instances with GPUs cannot live migrate because they are assigned to specific hardware devices. See GPU restrictions for details.

Create an instance with one or more GPUs using the Google Cloud Platform Console, the gcloud command-line tool, or the API.

Console

  1. Go to the VM instances page.


  2. Click Create instance.
  3. Select a zone where GPUs are available. See the list of available zones with GPUs.
  4. In the Machine configuration section, select the machine type that you want to use for this instance. Alternatively, you can specify custom machine type settings if desired.
  5. In the Machine configuration section, click CPU platform and GPU to see advanced machine type options and available GPUs.
  6. Click GPUs to see the list of available GPUs.
  7. Specify the GPU type and the number of GPUs that you need.
  8. If necessary, adjust the machine type to accommodate your desired GPU settings. If you leave these settings as they are, the instance uses the predefined machine type that you specified before opening the machine type customization screen.
  9. To configure your boot disk, in the Boot disk section, click Change.
  10. In the OS images tab, choose an image.
  11. Click Select to confirm your boot disk options.
  12. Configure any other instance settings that you require. For example, you can change the Preemptibility settings to configure your instance as a preemptible instance. This reduces the cost of your instance and the attached GPUs. Read GPUs on preemptible instances to learn more.
  13. At the bottom of the page, click Create to create the instance.

gcloud

Use the regions describe command to ensure that you have sufficient GPU quota in the region where you want to create instances with GPUs.

gcloud compute regions describe [REGION]

where [REGION] is the region where you want to check for GPU quota.
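The describe output lists every quota for the region; you can filter out just the GPU quotas from the YAML with awk. The sketch below runs against a canned sample of the output, so the metric names and limits shown are illustrative, not real quota values:

```shell
# Sketch: filter GPU quotas out of `gcloud compute regions describe` output.
# The sample below is illustrative; real metric names and limits vary by region.
sample_output='quotas:
- limit: 24.0
  metric: CPUS
  usage: 2.0
- limit: 4.0
  metric: NVIDIA_K80_GPUS
  usage: 1.0
- limit: 1.0
  metric: NVIDIA_P100_GPUS
  usage: 0.0'

# In practice, pipe the real command output instead of the sample:
#   gcloud compute regions describe us-east1 | awk '...'
gpu_quotas=$(echo "$sample_output" | awk '
  /limit:/  { limit = $NF }
  /metric:/ { metric = $NF }
  /usage:/  { if (metric ~ /GPUS/) printf "%s: %s of %s used\n", metric, $NF, limit }
')
echo "$gpu_quotas"
```

If a GPU metric is at its limit, request a quota increase before creating instances in that region.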

Start an instance with the latest image from an image family:

gcloud compute instances create [INSTANCE_NAME] \
    --machine-type [MACHINE_TYPE] --zone [ZONE] \
    --accelerator type=[ACCELERATOR_TYPE],count=[ACCELERATOR_COUNT] \
    --image-family [IMAGE_FAMILY] --image-project [IMAGE_PROJECT] \
    --maintenance-policy TERMINATE --restart-on-failure \
    [--preemptible]

where:

  • [INSTANCE_NAME] is the name for the new instance.
  • [MACHINE_TYPE] is the machine type that you selected for the instance. See GPUs on Compute Engine to see what machine types are available based on your desired GPU count.
  • [ZONE] is the zone for this instance.
  • [IMAGE_FAMILY] is one of the available image families.
  • [ACCELERATOR_COUNT] is the number of GPUs that you want to add to your instance. See GPUs on Compute Engine for a list of GPU limits based on the machine type of your instance.
  • [ACCELERATOR_TYPE] is the GPU model that you want to use. Use one of the following values:

    • NVIDIA® Tesla® T4: nvidia-tesla-t4
    • NVIDIA® Tesla® T4 Virtual Workstation with NVIDIA® GRID®: nvidia-tesla-t4-vws
    • NVIDIA® Tesla® P4: nvidia-tesla-p4
    • NVIDIA® Tesla® P4 Virtual Workstation with NVIDIA® GRID®: nvidia-tesla-p4-vws
    • NVIDIA® Tesla® P100: nvidia-tesla-p100
    • NVIDIA® Tesla® P100 Virtual Workstation with NVIDIA® GRID®: nvidia-tesla-p100-vws
    • NVIDIA® Tesla® V100: nvidia-tesla-v100
    • NVIDIA® Tesla® K80: nvidia-tesla-k80

    See GPUs on Compute Engine for a list of available GPU models.

  • [IMAGE_PROJECT] is the image project that the image family belongs to.

  • --preemptible is an optional flag that configures your instance as a preemptible instance. This reduces the cost of your instance and the attached GPUs. Read GPUs on preemptible instances to learn more.

For example, you can use the following gcloud command to start an Ubuntu 16.04 instance with 1 NVIDIA Tesla K80 GPU and 2 vCPUs in the us-east1-d zone.

gcloud compute instances create gpu-instance-1 \
    --machine-type n1-standard-2 --zone us-east1-d \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud \
    --maintenance-policy TERMINATE --restart-on-failure

This example command starts the instance, but you must still install CUDA and the GPU driver on the instance before applications can use the GPU.

API

Identify the GPU type that you want to add to your instance. Submit a GET request to list the GPU types that are available to your project in a specific zone.

GET https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/acceleratorTypes

where:

  • [PROJECT_ID] is your project ID.
  • [ZONE] is the zone where you want to list the available GPU types.

In the API, create a POST request to create a new instance. Include the acceleratorType parameter to specify which GPU type you want to use, and include the acceleratorCount parameter to specify how many GPUs you want to add. Also set the onHostMaintenance parameter to TERMINATE.

POST https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances?key={YOUR_API_KEY}
{
  "machineType": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/machineTypes/[MACHINE_TYPE]",
  "disks":
  [
    {
      "type": "PERSISTENT",
      "initializeParams":
      {
        "diskSizeGb": "[DISK_SIZE]",
        "sourceImage": "https://www.googleapis.com/compute/v1/projects/[IMAGE_PROJECT]/global/images/family/[IMAGE_FAMILY]"
      },
      "boot": true
    }
  ],
  "name": "[INSTANCE_NAME]",
  "networkInterfaces":
  [
    {
      "network": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/networks/[NETWORK]"
    }
  ],
  "guestAccelerators":
  [
    {
      "acceleratorCount": [ACCELERATOR_COUNT],
      "acceleratorType": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/acceleratorTypes/[ACCELERATOR_TYPE]"
    }
  ],
  "scheduling":
  {
    "onHostMaintenance": "terminate",
    "automaticRestart": true,
    ["preemptible": true]
  }
}

where:

  • [INSTANCE_NAME] is the name of the instance.
  • [PROJECT_ID] is your project ID.
  • [ZONE] is the zone for this instance.
  • [MACHINE_TYPE] is the machine type that you selected for the instance. See GPUs on Compute Engine to see what machine types are available based on your desired GPU count.
  • [IMAGE_PROJECT] is the image project that the image belongs to.
  • [IMAGE_FAMILY] is a boot disk image for your instance. Specify an image family from the list of available public images.
  • [DISK_SIZE] is the size of your boot disk in GB.
  • [NETWORK] is the VPC network that you want to use for this instance. Specify default to use your default network.
  • [ACCELERATOR_COUNT] is the number of GPUs that you want to add to your instance. See GPUs on Compute Engine for a list of GPU limits based on the machine type of your instance.
  • [ACCELERATOR_TYPE] is the GPU model that you want to use. See GPUs on Compute Engine for a list of available GPU models.
  • "preemptible": true is an optional parameter that configures your instance as a preemptible instance. This reduces the cost of your instance and the attached GPUs. Read GPUs on preemptible instances to learn more.
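Note that the bracketed `["preemptible": true]` line in the body above is documentation shorthand for an optional field, not literal JSON. As a sanity check, you can write out a fully substituted body (the project, zone, disk size, and instance names below are illustrative examples) and confirm that it parses before sending the request:

```shell
# Sketch: a fully substituted request body with illustrative example values.
# The square-bracket notation in the reference above marks an optional field;
# the actual JSON must omit the brackets.
cat > request.json <<'EOF'
{
  "machineType": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-east1-d/machineTypes/n1-highmem-2",
  "disks": [
    {
      "type": "PERSISTENT",
      "initializeParams": {
        "diskSizeGb": "120",
        "sourceImage": "https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/family/ubuntu-1604-lts"
      },
      "boot": true
    }
  ],
  "name": "gpu-instance-1",
  "networkInterfaces": [
    {
      "network": "https://www.googleapis.com/compute/v1/projects/my-project/global/networks/default"
    }
  ],
  "guestAccelerators": [
    {
      "acceleratorCount": 1,
      "acceleratorType": "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-east1-d/acceleratorTypes/nvidia-tesla-k80"
    }
  ],
  "scheduling": {
    "onHostMaintenance": "terminate",
    "automaticRestart": true
  }
}
EOF

# Confirm the body parses as valid JSON before POSTing it.
python3 -m json.tool request.json > /dev/null && echo "request.json is valid JSON"
```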

Install the GPU driver on your instance so that your system can use the device.

Adding or removing GPUs on existing instances

You can add or detach GPUs on your existing instances, but you must first stop the instance and change its host maintenance setting so that it terminates rather than live-migrating. Instances with GPUs cannot live migrate because they are assigned to specific hardware devices. See GPU restrictions for details.

Also be aware that you must install GPU drivers on the instance after you add a GPU. The boot disk image that you used to create the instance determines what drivers you need. You must identify the drivers that are appropriate for the operating system on your instance's boot disk. Read installing GPU drivers for details.

You can add or remove GPUs from an instance using the Google Cloud Platform Console or the API.

Console

You can add or remove GPUs from your instance by stopping the instance and editing your instance's configuration.

  1. Verify that all of your critical applications are stopped on the instance. You must stop the instance before you can add a GPU.

  2. Go to the VM instances page to see your list of instances.


  3. On the list of instances, click the name of the instance where you want to add GPUs. The instance details page opens.

  4. At the top of the instance details page, click Stop to stop the instance.

  5. After the instance stops running, click Edit to change the instance properties.

  6. If the instance has a shared-core machine type, you must change the machine type to have one or more vCPUs. You cannot add accelerators to instances with shared-core machine types.

  7. In the Machine configuration section, click CPU platform and GPU to see advanced machine type options and available GPUs.

  8. Click GPUs to see the list of available GPUs.

  9. Select the number of GPUs and the GPU model that you want to add to your instance. Alternatively, you can set the number of GPUs to None to remove existing GPUs from the instance.

  10. If you added GPUs to an instance, set the host maintenance setting to Terminate. If you removed GPUs from the instance, you can optionally set the host maintenance setting back to Migrate VM instance.

  11. At the bottom of the instance details page, click Save to apply your changes.

  12. After the instance settings are saved, click Start at the top of the instance details page to start the instance again.

API

You can add or remove GPUs from your instance by stopping the instance and changing your instance's configuration through the API.

  1. Verify that all of your critical applications are stopped on the instance, and then create a POST request to stop the instance so that it can move to a host system where GPUs are available.

    POST https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances/[INSTANCE_NAME]/stop
    

    where:

    • [PROJECT_ID] is your project ID.
    • [INSTANCE_NAME] is the name of the instance where you want to add GPUs.
    • [ZONE] is the zone where the instance is located.
  2. Identify the GPU type that you want to add to your instance. Submit a GET request to list the GPU types that are available to your project in a specific zone.

    GET https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/acceleratorTypes
    

    where:

    • [PROJECT_ID] is your project ID.
    • [ZONE] is the zone where you want to list the available GPU types.
  3. If the instance has a shared-core machine type, you must change the machine type to have one or more vCPUs. You cannot add accelerators to instances with shared-core machine types.

  4. After the instance stops, create a POST request to add GPUs to or remove GPUs from your instance.

    POST https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances/[INSTANCE_NAME]/setMachineResources
    
    {
     "guestAccelerators": [
      {
        "acceleratorCount": [ACCELERATOR_COUNT],
        "acceleratorType": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/acceleratorTypes/[ACCELERATOR_TYPE]"
      }
     ]
    }
    

    where:

    • [INSTANCE_NAME] is the name of the instance.
    • [PROJECT_ID] is your project ID.
    • [ZONE] is the zone for this instance.
    • [ACCELERATOR_COUNT] is the number of GPUs that you want on your instance. See GPUs on Compute Engine for a list of GPU limits based on the machine type of your instance.
    • [ACCELERATOR_TYPE] is the GPU model that you want to use. See GPUs on Compute Engine for a list of available GPU models.
  5. Create a POST request to set the scheduling options for the instance. If you are adding GPUs to an instance, you must specify "onHostMaintenance": "TERMINATE". Optionally, if you are removing GPUs from an instance, you can specify "onHostMaintenance": "MIGRATE".

    POST https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances/[INSTANCE_NAME]/setScheduling
    
    {
     "onHostMaintenance": "[MAINTENANCE_TYPE]",
     "automaticRestart": true
    }
    

    where:

    • [PROJECT_ID] is your project ID.
    • [INSTANCE_NAME] is the name of the instance where you want to add GPUs.
    • [ZONE] is the zone where the instance is located.
    • [MAINTENANCE_TYPE] is the action you want your instance to take when host maintenance is necessary. Specify TERMINATE if you are adding GPUs to your instance. Alternatively, specify MIGRATE if you have removed all of the GPUs from your instance and want the instance to resume live migration on host maintenance events.
  6. Start the instance.

    POST https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances/[INSTANCE_NAME]/start
    

    where:

    • [PROJECT_ID] is your project ID.
    • [INSTANCE_NAME] is the name of the instance where you want to add GPUs.
    • [ZONE] is the zone where the instance is located.
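The sequence above (stop, change accelerators, set scheduling, start) can be sketched as a small dry run that assembles the four request URLs. The project, zone, and instance values below are illustrative; authentication (for example, an OAuth token from `gcloud auth print-access-token`) and the actual HTTP calls are omitted:

```shell
# Sketch: the four API calls from the steps above, assembled as request URLs.
# This dry run only prints the requests; sending them requires an OAuth
# access token in an Authorization header.
project="my-project"        # illustrative example values
zone="us-east1-d"
instance="gpu-instance-1"
base="https://www.googleapis.com/compute/v1/projects/$project/zones/$zone/instances/$instance"

requests="POST $base/stop
POST $base/setMachineResources
POST $base/setScheduling
POST $base/start"

echo "$requests"
```

Each call returns an operation that you should wait on before issuing the next call, because the instance must be fully stopped before its accelerators can change.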

Next install the GPU driver on your instance so that your system can use the device.

Creating groups of GPU instances using instance templates

You can use instance templates to create managed instance groups with GPUs added to each instance. Managed instance groups use the template to create multiple identical instances. You can scale the number of instances in the group to match your workload.

Because each instance that the group creates must have the CUDA toolkit and NVIDIA driver installed to use GPUs, first create a custom image that already has the driver installed by following the instructions in GPU driver installation steps.

For steps to create an instance template, see Creating instance templates.

If you create the instance template using the Console, customize the machine type, and select the type and number of GPUs that you want to add to the instance template.

If you are using the gcloud command-line tool, include the --accelerators and --maintenance-policy TERMINATE flags.

The following example creates an instance template with 2 vCPUs, a 250GB boot disk based on your image (with drivers installed) and an NVIDIA Tesla K80 GPU.

gcloud beta compute instance-templates create gpu-template \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family <MY_IMAGE_WITH_DRIVERS> \
    --maintenance-policy TERMINATE --restart-on-failure

After you create the template, use the template to create an instance group. Every time you add an instance to the group, it starts that instance using the settings in the instance template.

If you are creating a regional managed instance group, be sure to select zones that specifically support the GPU model that you want. For a list of GPU models and available zones, see GPUs on Compute Engine. The following example creates a regional managed instance group across two zones that support the nvidia-tesla-k80 model.

gcloud beta compute instance-groups managed create example-rmig \
    --template gpu-template --base-instance-name example-instances \
    --size 30 --zones us-east1-c,us-east1-d

Note: If you are choosing specific zones, use the gcloud beta component because the zone selection feature is currently in Beta.

To learn more about managing and scaling groups of instances, read Creating Groups of Managed Instances.

Installing GPU drivers

After you create an instance with one or more GPUs, your system requires device drivers so that your applications can access the device. This guide shows the ways to install NVIDIA proprietary drivers on instances with public images.

Each version of CUDA requires a minimum GPU driver version; later driver versions also work. To check the minimum driver version required for your version of CUDA, see CUDA Toolkit and Compatible Driver Versions.

NVIDIA GPUs running on Compute Engine must use the following driver versions:

  • Linux instances:

    • NVIDIA 410.79 driver or greater
  • Windows Server instances:

    • NVIDIA 426.00 driver or greater

For most driver installs, you can obtain these drivers by installing the NVIDIA CUDA Toolkit.

Use the following steps to install CUDA and the associated drivers for NVIDIA® GPUs. Review your application needs to determine the driver version that works best. If the software you are using requires a specific version of CUDA, modify the commands to download the version of CUDA that you need.

For information about support for CUDA, and for steps to modify your CUDA installation, see the CUDA Toolkit Documentation.

You can use this process to manually install drivers on instances with most public images. For custom images, you might need to modify the process to function in your unique environment.

To ensure a successful installation, your operating system must have the latest package updates.

CentOS/RHEL

  1. Install the latest kernel package. If needed, this command also reboots the system.

    sudo yum clean all
    sudo yum install -y kernel | grep -q 'already installed' || sudo reboot
    
  2. If the system rebooted in the previous step, reconnect to the instance.

  3. Install kernel headers and development packages.

    sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
    
  4. Select a driver repository for the CUDA Toolkit and add it to your instance.

    • CentOS/RHEL 8

      sudo yum install http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-repo-rhel8-10.1.243-1.x86_64.rpm
      
    • CentOS/RHEL 7

      sudo yum install http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.0.130-1.x86_64.rpm
      
    • CentOS/RHEL 6

      sudo yum install http://developer.download.nvidia.com/compute/cuda/repos/rhel6/x86_64/cuda-repo-rhel6-10.0.130-1.x86_64.rpm
      
  5. Install the epel-release repository. This repository includes the DKMS packages, which are required to install NVIDIA drivers on CentOS.

    • CentOS 6/7/8 and RHEL 6/7

      sudo yum install epel-release
      
    • RHEL 8 only

      sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
      
  6. Clean the Yum cache:

    sudo yum clean all
    
  7. Install CUDA. This package includes the NVIDIA driver.

    sudo yum install cuda
    

SLES

  1. Connect to the instance where you want to install the driver.

  2. Install the latest kernel package. If needed, this command also reboots the system.

    sudo zypper refresh
    sudo zypper up -y kernel-default | grep -q 'already installed' || sudo reboot
    
  3. If the system rebooted in the previous step, reconnect to the instance.

  4. Select a driver repository for the CUDA Toolkit and add it to your instance.

    • SLES 15

      sudo rpm --import https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/7fa2af80.pub
      sudo rpm -i https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/cuda-repo-sles15-10.0.130-1.x86_64.rpm
      
    • SLES 12 with Service Pack 4

      sudo rpm --import https://developer.download.nvidia.com/compute/cuda/repos/sles124/x86_64/7fa2af80.pub
      sudo rpm -i https://developer.download.nvidia.com/compute/cuda/repos/sles124/x86_64/cuda-repo-sles124-10.1.243-1.x86_64.rpm
      
  5. Refresh Zypper.

    sudo zypper refresh
    
  6. Install CUDA, which includes the NVIDIA driver.

    sudo zypper install cuda
    

Ubuntu

  1. Connect to the instance where you want to install the driver.

  2. Select a driver repository for the CUDA Toolkit and add it to your instance.

    • Ubuntu 18.04 LTS

      curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
      sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
      sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
      
    • Ubuntu 16.04 LTS

      curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
      sudo dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
      sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
      
  3. Update the package lists.

    sudo apt-get update
    
  4. Install CUDA, which includes the NVIDIA driver.

    sudo apt-get install cuda
    

Windows Server

  1. Connect to the instance where you want to install the driver.

  2. Download an .exe installer file to your instance that includes the R426 branch: NVIDIA 426.00 driver or greater.

    For example, in Windows Server 2019 you can open a PowerShell terminal as an administrator and use the wget command (an alias for Invoke-WebRequest) to download the driver installer that you need.

    PS C:\> wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/network_installers/cuda_10.1.243_win10_network.exe -O cuda_10.1.243_win10_network.exe
  3. Run the .exe installer. For example, you can open a PowerShell terminal as an administrator and run the following command.

    PS C:\> .\cuda_10.1.243_win10_network.exe
    

Verifying the GPU driver install

After completing the driver installation steps, verify that the driver installed and initialized properly.

Linux

Connect to the Linux instance and use the nvidia-smi command to verify that the driver is running properly.

nvidia-smi

Wed Jan  2 19:51:51 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8     7W /  75W |     62MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Windows Server

Connect to the Windows Server instance and use the nvidia-smi.exe tool to verify that the driver is running properly.

PS C:\> & 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'

Mon Aug 26 18:09:03 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 426.00      Driver Version: 426.00       CUDA Version: 10.1      |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            TCC  | 00000000:00:04.0 Off |                    0 |
| N/A   27C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Installing GRID® drivers for virtual workstations

For a full list of NVIDIA drivers that you can use on Compute Engine, see the contents of the NVIDIA drivers Cloud Storage bucket.

Linux

  1. Download the GRID driver, using the following command:

    curl -O https://storage.googleapis.com/nvidia-drivers-us-public/GRID/GRID7.1/NVIDIA-Linux-x86_64-410.92-grid.run
    
  2. Use the following command to start the installer:

    sudo bash NVIDIA-Linux-x86_64-410.92-grid.run
    
  3. During the installation, choose the following options:

    • If you are prompted to install 32-bit binaries, select Yes.
    • If you are prompted to modify the x.org file, select No.

Windows Server

  1. Depending on your version of Windows Server, download the appropriate NVIDIA GRID driver from the NVIDIA drivers Cloud Storage bucket.

  2. Run the installer, and choose the Express installation.

  3. After the installation is complete, restart the VM. When you restart, you are disconnected from your session.

  4. Reconnect to your instance using RDP or a PCoIP client.

Verifying that the GRID driver is installed

Linux

Run the following command:

nvidia-smi

The output of the command looks similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.92                Driver Version: 410.92                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Windows Server

  1. Connect to your Windows instance using RDP or a PCoIP client.

  2. Right-click the desktop, and select NVIDIA Control Panel.

  3. In the NVIDIA Control Panel, from the Help menu, select System Information. The information shows the GPU that the VM is using, and the driver version.

Monitoring and optimizing GPU performance

Monitoring GPU performance

To make better use of your resources, you can track the GPU usage rates of your instances. When you know the GPU usage rates, you can perform tasks such as setting up managed instance groups that autoscale resources based on workload needs.

To review GPU metrics using Stackdriver, complete the following steps:

  1. On each VM instance, set up the GPU metrics reporting script. This script performs the following tasks:

    • Installs the GPU metrics reporting agent. This agent runs at intervals on the instance to collect GPU data, and sends this data to Stackdriver.
    • Creates a custom/gpu_utilization metrics field in Stackdriver. This field stores GPU-specific data that you can analyze in Stackdriver.
  2. View logs in Stackdriver.

Setting up the GPU metrics reporting script

  1. On each of your VM instances, check that the GPU driver is installed and that Python and pip are available; the metrics agent depends on both.

  2. On each of your VM instances, install the GPU metrics agent. To install the metrics agent, complete the following steps:

    1. Download the GPU metrics reporting scripts.

      git clone https://github.com/GoogleCloudPlatform/tensorflow-inference-tensorrt5-t4-gpu.git
      
    2. Switch to the metrics_reporting folder.

      cd tensorflow-inference-tensorrt5-t4-gpu/metrics_reporting
      
    3. Set up the installation environment for the metrics agent.

      pip install -r ./requirements.txt
      
    4. Move the metric reporting script to your root directory.

      sudo cp report_gpu_metrics.py /root/
      
    5. Enable the GPU metrics agent.

      sudo tee /lib/systemd/system/gpu_utilization_agent.service > /dev/null <<-EOH
      [Unit]
      Description=GPU Utilization Metric Agent
      [Service]
      Type=simple
      PIDFile=/run/gpu_agent.pid
      ExecStart=/bin/bash --login -c '/usr/bin/python /root/report_gpu_metrics.py'
      User=root
      Group=root
      WorkingDirectory=/
      Restart=always
      [Install]
      WantedBy=multi-user.target
      EOH
      
    6. Reload the system daemon.

      sudo systemctl daemon-reload
      
    7. Enable the GPU monitoring service.

      sudo systemctl --no-reload --now enable /lib/systemd/system/gpu_utilization_agent.service
      

Reviewing metrics in Stackdriver

  1. Go to the Stackdriver Metrics Explorer page

  2. Search for gpu_utilization.

  3. Verify that the custom/gpu_utilization metric appears and that it reports GPU utilization data from your instances.

  4. (Optional) Set up autoscaling using managed instance groups. To get started, you can review the Setting up a multi-zone cluster section of the TensorFlow inference workload tutorial.

Optimizing GPU performance

You can optimize the performance on instances with NVIDIA® Tesla® K80 GPUs by disabling autoboost. To disable autoboost, run the following command:

sudo nvidia-smi --auto-boost-default=DISABLED


Handling host maintenance events

GPU instances cannot be live migrated. GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per month, but can occur more frequently when necessary.

To minimize disruptions to your workloads during a maintenance event, you can monitor the maintenance schedule for your instance and prepare your workloads to transition through the system restart.

To receive advance notice of host maintenance events, monitor the /computeMetadata/v1/instance/maintenance-event metadata value. If the request to the metadata server returns NONE, the instance is not scheduled to terminate. For example, run the following command from within an instance:

curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"

NONE

If the metadata server returns TERMINATE_ON_HOST_MAINTENANCE, then your instance is scheduled for termination. Compute Engine gives GPU instances a one hour termination notice, while normal instances receive only a 60 second notice. Configure your application to transition through the maintenance event. For example, you might use one of the following techniques:

  • Configure your app to temporarily move work in progress to a Cloud Storage bucket, then retrieve that data after the instance restarts.

  • Write data to a secondary persistent disk. When the instance automatically restarts, the persistent disk can be reattached and your app can resume work.

You can also receive notification of changes in this metadata value without polling. For examples of how to receive advance notice of host maintenance events without polling, read getting live migration notices.
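The checkpointing techniques above can be sketched as a small decision helper. `handle_maintenance` is a hypothetical function shown for illustration, and the checkpoint step stands in for your application's own logic:

```shell
# Sketch: react to the maintenance-event metadata value. On a real instance,
# you would obtain the value with:
#   curl -s http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event \
#        -H "Metadata-Flavor: Google"
# handle_maintenance is a hypothetical helper shown for illustration.
handle_maintenance() {
  case "$1" in
    NONE)
      echo "no maintenance scheduled"
      ;;
    TERMINATE_ON_HOST_MAINTENANCE)
      # Checkpoint work in progress here, for example by copying state
      # to a Cloud Storage bucket, before the instance terminates.
      echo "termination pending: checkpointing work"
      ;;
    *)
      echo "unexpected maintenance state: $1"
      ;;
  esac
}

handle_maintenance "NONE"
handle_maintenance "TERMINATE_ON_HOST_MAINTENANCE"
```

In a polling loop, you would call the helper on each metadata response; because GPU instances get a one hour notice, even an interval of a minute leaves ample time to checkpoint.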
