Using GPUs for Training Models in the Cloud

Graphics Processing Units (GPUs) can significantly accelerate training for many deep learning models. Models for tasks such as image classification, video analysis, and natural language processing rely heavily on matrix multiplication and other compute-intensive operations that map well onto a GPU's massively parallel architecture, which is well suited to embarrassingly parallel workloads.

Training a deep learning model that involves intensive compute tasks on extremely large datasets can take days to run on a single processor. However, if you design your program to offload those tasks to one or more GPUs, you can reduce training time to hours instead of days.

For general information about accelerated computing with GPUs, see NVIDIA's Accelerated Computing page. For details about using GPUs with TensorFlow, see Using GPUs in the TensorFlow documentation.

Requesting GPU-enabled machines

To use GPUs in the cloud, configure your job to access GPU-enabled machines:

  • Set the scale tier to CUSTOM.
  • Configure each task (master, worker, or parameter server) to use one of the GPU-enabled machine types below, based on the number of GPUs and the type of accelerator required for your task:

    • standard_gpu: A single NVIDIA Tesla K80 GPU
    • complex_model_m_gpu: Four NVIDIA Tesla K80 GPUs
    • complex_model_l_gpu: Eight NVIDIA Tesla K80 GPUs
    • standard_p100: A single NVIDIA Tesla P100 GPU (Alpha)
    • complex_model_m_p100: Four NVIDIA Tesla P100 GPUs (Alpha)

    For more information, see the documentation on specifying machine types for the custom scale tier.

Alternatively, if you are learning how to use Cloud ML Engine or experimenting with GPU-enabled machines, you can set the scale tier to BASIC_GPU to get a single worker instance with a single NVIDIA Tesla K80 GPU.
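As a sketch, a job configuration file for the custom scale tier might look like the following. The worker and parameter-server counts here are illustrative values, not recommendations:

```yaml
trainingInput:
  scaleTier: CUSTOM
  # Master uses a single NVIDIA Tesla K80 GPU.
  masterType: standard_gpu
  # Each worker uses four NVIDIA Tesla K80 GPUs.
  workerType: complex_model_m_gpu
  workerCount: 2
  # Parameter servers do not need GPUs; a CPU-only machine type suffices.
  parameterServerType: large_model
  parameterServerCount: 1
  # The job must run in a region that supports GPUs.
  region: us-east1
```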

In addition, you need to run your job in a region that supports GPUs. The following regions currently provide access to GPUs:

  • us-east1
  • us-central1
  • asia-east1
  • europe-west1

Assigning ops to GPUs

To make use of the GPUs on a machine, make the appropriate changes to your TensorFlow trainer application:

  • High-level Estimator API: No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs.

  • Core TensorFlow API: You must assign ops to run on GPU-enabled machines. This process is the same as using GPUs with TensorFlow locally. You can use tf.train.replica_device_setter to assign ops to devices.
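As an illustrative sketch of the Core TensorFlow approach (written against the TensorFlow 1.x-style API that this guide assumes; the cluster addresses are placeholders, not real hosts):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # Build a graph, 1.x style.

# Hypothetical cluster specification: the "ps" job runs on CPU-only
# machines and the "worker" job runs on GPU-enabled machines.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# replica_device_setter places variables on the parameter servers and
# other ops on the worker automatically.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.get_variable("weights", shape=[784, 10])
    # Pin a compute-heavy op explicitly to the machine's first GPU.
    with tf.device("/gpu:0"):
        logits = tf.matmul(tf.zeros([1, 784]), weights)
```

Explicit `tf.device("/gpu:...")` blocks are only needed for ops you want pinned to a particular GPU; everything else is placed by the device setter.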

When you assign a GPU-enabled machine to a Cloud ML Engine process, that process has exclusive access to that machine's GPUs; you can't share the GPUs of a single machine in your cluster among multiple processes. The process corresponds to the distributed TensorFlow task in your cluster specification. The distributed TensorFlow documentation describes cluster specifications and tasks.

GPU device strings

A standard_gpu machine's single GPU is identified as "/gpu:0". Machines with multiple GPUs use identifiers "/gpu:0" through "/gpu:n-1", where n is the number of GPUs on the machine. For example, complex_model_m_gpu machines have four GPUs identified as "/gpu:0" through "/gpu:3".
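For illustration only, a small helper (the function name and mapping are hypothetical; the GPU counts come from the machine types listed above) can enumerate the device strings for a given machine type:

```python
# GPU counts for the GPU-enabled machine types described above.
GPU_COUNTS = {
    "standard_gpu": 1,
    "complex_model_m_gpu": 4,
    "complex_model_l_gpu": 8,
    "standard_p100": 1,
    "complex_model_m_p100": 4,
}

def gpu_device_strings(machine_type):
    """Return the TensorFlow device strings for a GPU-enabled machine type."""
    count = GPU_COUNTS[machine_type]
    return ["/gpu:%d" % i for i in range(count)]
```

For example, gpu_device_strings("complex_model_m_gpu") yields "/gpu:0" through "/gpu:3".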
