Using GPUs for training models in the cloud

Graphics Processing Units (GPUs) can significantly accelerate the training process for many deep learning models. Training models for tasks like image classification, video analysis, and natural language processing involves compute-intensive matrix multiplication and other operations that can take advantage of a GPU's massively parallel architecture.

Training a deep learning model that involves intensive compute tasks on extremely large datasets can take days to run on a single processor. However, if you design your program to offload those tasks to one or more GPUs, you can reduce training time to hours instead of days.

Before you begin

AI Platform Training lets you run your TensorFlow training application on a GPU-enabled machine. To make sure your application uses the available GPUs, read the TensorFlow guide to using GPUs and the section of this document about adjusting training code to utilize GPUs.

You can also use GPUs with machine learning frameworks other than TensorFlow, if you use a custom container for training.

Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.

Requesting GPU-enabled machines

To use GPUs in the cloud, configure your training job to access GPU-enabled machines in one of the following ways:

  • Use the BASIC_GPU scale tier.
  • Use Compute Engine machine types and attach GPUs.
  • Use GPU-enabled legacy machine types.

Basic GPU-enabled machine

If you are learning how to use AI Platform Training or experimenting with GPU-enabled machines, you can set the scale tier to BASIC_GPU to get a single worker instance with one GPU.
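As a small illustration, a job body that requests this tier could be built in Python as shown below. The trainingInput and scaleTier field names match the config.yaml examples later in this document; the helper name and the job ID are made up for the example, and a real job also needs the package, module, and region settings described in the section on submitting the training job.

```python
import json


def basic_gpu_job(job_id):
    """Return a minimal job body that requests the BASIC_GPU scale tier.

    job_id is a hypothetical placeholder chosen by the caller.
    """
    return {
        "jobId": job_id,
        "trainingInput": {"scaleTier": "BASIC_GPU"},
    }


print(json.dumps(basic_gpu_job("my_gpu_experiment"), indent=2))
```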

Compute Engine machine types with GPU attachments

If you configure your training job with Compute Engine machine types, you can attach a custom number of GPUs to accelerate your job:

  • Set the scale tier to CUSTOM.
  • Configure your master worker and any other task types (worker, parameter server, or evaluator) that are part of your job to use valid Compute Engine machine types.
  • Add an acceleratorConfig field, specifying the type and number of GPUs you want, to masterConfig, workerConfig, parameterServerConfig, or evaluatorConfig, depending on which virtual machine (VM) instances you would like to accelerate. You can use the following GPU types:
    • NVIDIA_TESLA_A100
    • NVIDIA_TESLA_P4
    • NVIDIA_TESLA_P100
    • NVIDIA_TESLA_T4
    • NVIDIA_TESLA_V100

To create a valid acceleratorConfig, you must account for several restrictions:

  1. You can only use certain numbers of GPUs in your configuration. For example, you can attach 2 or 4 NVIDIA Tesla T4s, but not 3. To see what counts are valid for each type of GPU, see the compatibility table below.

  2. You must make sure each of your GPU configurations provides sufficient virtual CPUs and memory to the machine type you attach it to. For example, if you use n1-standard-32 for your workers, then each worker has 32 virtual CPUs and 120 GB of memory. Since each NVIDIA Tesla V100 can provide up to 12 virtual CPUs and 76 GB of memory, you must attach at least 4 to each n1-standard-32 worker to support its requirements. (2 GPUs provide insufficient resources, and you cannot specify 3 GPUs.)

    Review the list of machine types for AI Platform Training and the comparison of GPUs for compute workloads to determine these compatibilities, or reference the compatibility table below.

    Note the following additional limitation on GPU resources for AI Platform Training in particular cases:

    • A configuration with 4 NVIDIA Tesla P100 GPUs only supports up to 64 virtual CPUs and up to 208 GB of memory in all regions and zones.
  3. You must submit your training job to a region that supports your GPU configuration. Read about region support below.
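The V100 arithmetic in restriction 2 can be checked with a short sketch. The per-GPU limits (12 virtual CPUs, 76 GB) and the n1-standard-32 size (32 virtual CPUs, 120 GB) come from the text above; the function name is illustrative, and the proportional model is a simplification, since real limits are not always linear in the GPU count (see the P100 note above).

```python
def min_gpu_count(machine_vcpus, machine_mem_gb,
                  vcpus_per_gpu, mem_gb_per_gpu, valid_counts):
    """Return the smallest valid GPU count whose combined limits cover
    the machine's virtual CPUs and memory, or None if no count does."""
    for count in sorted(valid_counts):
        if (count * vcpus_per_gpu >= machine_vcpus
                and count * mem_gb_per_gpu >= machine_mem_gb):
            return count
    return None


# n1-standard-32 (32 vCPUs, 120 GB) with V100s (up to 12 vCPUs, 76 GB each).
# 2 GPUs give only 24 vCPUs, 3 is not a valid count, so the answer is 4.
print(min_gpu_count(32, 120, 12, 76, valid_counts=[1, 2, 4, 8]))  # 4
```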

The following table provides a quick reference of how many of each type of accelerator you can attach to each Compute Engine machine type:

Valid numbers of GPUs for each machine type (a dash marks a combination with no valid count)

Machine type     NVIDIA A100  NVIDIA Tesla K80  NVIDIA Tesla P4  NVIDIA Tesla P100  NVIDIA Tesla T4  NVIDIA Tesla V100
n1-standard-4    -            1, 2, 4, 8        1, 2, 4          1, 2, 4            1, 2, 4          1, 2, 4, 8
n1-standard-8    -            1, 2, 4, 8        1, 2, 4          1, 2, 4            1, 2, 4          1, 2, 4, 8
n1-standard-16   -            2, 4, 8           1, 2, 4          1, 2, 4            1, 2, 4          2, 4, 8
n1-standard-32   -            4, 8              2, 4             2, 4               2, 4             4, 8
n1-standard-64   -            -                 4                -                  4                8
n1-standard-96   -            -                 4                -                  4                8
n1-highmem-2     -            1, 2, 4, 8        1, 2, 4          1, 2, 4            1, 2, 4          1, 2, 4, 8
n1-highmem-4     -            1, 2, 4, 8        1, 2, 4          1, 2, 4            1, 2, 4          1, 2, 4, 8
n1-highmem-8     -            1, 2, 4, 8        1, 2, 4          1, 2, 4            1, 2, 4          1, 2, 4, 8
n1-highmem-16    -            2, 4, 8           1, 2, 4          1, 2, 4            1, 2, 4          2, 4, 8
n1-highmem-32    -            4, 8              2, 4             2, 4               2, 4             4, 8
n1-highmem-64    -            -                 4                -                  4                8
n1-highmem-96    -            -                 4                -                  4                8
n1-highcpu-16    -            2, 4, 8           1, 2, 4          1, 2, 4            1, 2, 4          2, 4, 8
n1-highcpu-32    -            4, 8              2, 4             2, 4               2, 4             4, 8
n1-highcpu-64    -            8                 4                4                  4                8
n1-highcpu-96    -            -                 4                -                  4                8
a2-highgpu-1g    1            -                 -                -                  -                -
a2-highgpu-2g    2            -                 -                -                  -                -
a2-highgpu-4g    4            -                 -                -                  -                -
a2-highgpu-8g    8            -                 -                -                  -                -
a2-megagpu-16g   16           -                 -                -                  -                -
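
One way to catch an invalid acceleratorConfig before submitting a job is to look the combination up in a copy of this table. The sketch below holds only a few rows as an excerpt; the dictionary and function names are illustrative and not part of any AI Platform Training API.

```python
# Valid GPU counts for a few (machine type, GPU type) pairs, copied from
# the table above. Extend the excerpt with the rows you actually use.
VALID_GPU_COUNTS = {
    ("n1-standard-8", "NVIDIA_TESLA_T4"): {1, 2, 4},
    ("n1-standard-32", "NVIDIA_TESLA_T4"): {2, 4},
    ("n1-standard-32", "NVIDIA_TESLA_V100"): {4, 8},
    ("n1-highcpu-16", "NVIDIA_TESLA_T4"): {1, 2, 4},
}


def is_valid_accelerator(machine_type, gpu_type, count):
    """Return True if the excerpt allows `count` GPUs of `gpu_type`
    on `machine_type`. Pairs missing from the excerpt return False."""
    return count in VALID_GPU_COUNTS.get((machine_type, gpu_type), set())


print(is_valid_accelerator("n1-standard-32", "NVIDIA_TESLA_T4", 4))  # True
print(is_valid_accelerator("n1-standard-32", "NVIDIA_TESLA_T4", 3))  # False
```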

For an example of submitting a job that uses Compute Engine machine types with GPUs attached, see the section on submitting the training job later in this document.

Machine types with GPUs included

Alternatively, instead of using an acceleratorConfig, you can select a legacy machine type that has GPUs included:

  • Set the scale tier to CUSTOM.
  • Configure your master worker and any other task types (worker, parameter server, or evaluator) that you would like to accelerate to use one of the following GPU-enabled machine types, based on the number of GPUs and the type of accelerator required for your task:
    • standard_gpu: A single GPU
    • complex_model_m_gpu: Four GPUs
    • complex_model_l_gpu: Eight GPUs
    • standard_p100: A single NVIDIA Tesla P100 GPU
    • complex_model_m_p100: Four NVIDIA Tesla P100 GPUs
    • standard_v100: A single NVIDIA Tesla V100 GPU
    • large_model_v100: A single NVIDIA Tesla V100 GPU
    • complex_model_m_v100: Four NVIDIA Tesla V100 GPUs
    • complex_model_l_v100: Eight NVIDIA Tesla V100 GPUs
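
If you choose a legacy machine type programmatically, the list above can be encoded as a lookup table. This is only a sketch: the dictionary and helper names are made up, the short model labels "P100" and "V100" are shorthand for this example, and None marks the entries for which the list does not name a GPU model.

```python
# (GPU model, GPU count) for each legacy machine type listed above.
LEGACY_GPU_MACHINES = {
    "standard_gpu": (None, 1),
    "complex_model_m_gpu": (None, 4),
    "complex_model_l_gpu": (None, 8),
    "standard_p100": ("P100", 1),
    "complex_model_m_p100": ("P100", 4),
    "standard_v100": ("V100", 1),
    "large_model_v100": ("V100", 1),
    "complex_model_m_v100": ("V100", 4),
    "complex_model_l_v100": ("V100", 8),
}


def machines_with(gpu_model, count):
    """Return the legacy machine types that have exactly `count` GPUs
    of `gpu_model`, sorted by name."""
    return sorted(name for name, spec in LEGACY_GPU_MACHINES.items()
                  if spec == (gpu_model, count))


print(machines_with("V100", 1))  # ['large_model_v100', 'standard_v100']
```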

For an example of submitting a job with GPU-enabled machine types using the gcloud command, see the section on submitting the training job later in this document.

See more information about machine types for AI Platform Training.

Regions that support GPUs

You must run your job in a region that supports GPUs. The following regions currently provide access to GPUs:

  • us-west1
  • us-west2
  • us-central1
  • us-east1
  • us-east4
  • northamerica-northeast1
  • southamerica-east1
  • europe-west1
  • europe-west2
  • europe-west4
  • asia-south1
  • asia-southeast1
  • asia-east1
  • asia-northeast1
  • asia-northeast3
  • australia-southeast1

In addition, some of these regions only provide access to certain types of GPUs. To fully understand the available regions for AI Platform Training services, including model training and online/batch prediction, read the guide to regions.
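
A submission script can reject an unsupported region early. The sketch below copies the region list above as a snapshot, so it goes stale when the list changes, and passing the check does not mean every GPU type is offered in that region. The names are illustrative.

```python
# Regions listed above as providing access to GPUs (snapshot of this page).
GPU_REGIONS = frozenset({
    "us-west1", "us-west2", "us-central1", "us-east1", "us-east4",
    "northamerica-northeast1", "southamerica-east1",
    "europe-west1", "europe-west2", "europe-west4",
    "asia-south1", "asia-southeast1", "asia-east1",
    "asia-northeast1", "asia-northeast3", "australia-southeast1",
})


def check_gpu_region(region):
    """Return `region` unchanged if it is in the snapshot, else raise ValueError."""
    if region not in GPU_REGIONS:
        raise ValueError(f"{region!r} is not in the list of GPU regions")
    return region


print(check_gpu_region("us-central1"))
```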

If your training job uses multiple types of GPUs, they must all be available in a single zone in your region. For example, you cannot run a job in us-central1 with a master worker using NVIDIA Tesla T4 GPUs, parameter servers using NVIDIA Tesla K80 GPUs, and workers using NVIDIA Tesla P100 GPUs. While all of these GPUs are available for training jobs in us-central1, no single zone in that region provides all three types of GPU. To learn more about the zone availability of GPUs, see the comparison of GPUs for compute workloads.
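
The single-zone rule amounts to a subset test: some zone must offer every GPU type the job requests. The sketch below uses an invented availability map purely to show the check; the zone names and their GPU sets are hypothetical, so take real availability from the comparison of GPUs for compute workloads.

```python
# HYPOTHETICAL zone-to-GPU availability, invented for illustration only.
EXAMPLE_ZONE_GPUS = {
    "example-zone-a": {"NVIDIA_TESLA_T4", "NVIDIA_TESLA_P100"},
    "example-zone-b": {"NVIDIA_TESLA_T4", "NVIDIA_TESLA_K80"},
    "example-zone-c": {"NVIDIA_TESLA_K80", "NVIDIA_TESLA_P100"},
}


def zones_supporting(requested_gpus, zone_gpus):
    """Return the zones (sorted) that offer every requested GPU type."""
    requested = set(requested_gpus)
    return sorted(zone for zone, gpus in zone_gpus.items()
                  if requested <= gpus)


# Every type is available somewhere, but no single zone offers all three.
all_three = ["NVIDIA_TESLA_T4", "NVIDIA_TESLA_K80", "NVIDIA_TESLA_P100"]
print(zones_supporting(all_three, EXAMPLE_ZONE_GPUS))  # []
```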

Submitting the training job

You can submit your training job using the gcloud ai-platform jobs submit training command.

  1. Define a config.yaml file that describes the GPU options you want. The structure of the YAML file represents the Job resource. Below are two examples of config.yaml files.

    The first example shows a configuration file for a training job that uses Compute Engine machine types, some of which have GPUs attached:

    trainingInput:
      scaleTier: CUSTOM
      # Configure a master worker with 4 T4 GPUs
      masterType: n1-highcpu-16
      masterConfig:
        acceleratorConfig:
          count: 4
          type: NVIDIA_TESLA_T4
      # Configure 9 workers, each with 4 T4 GPUs
      workerCount: 9
      workerType: n1-highcpu-16
      workerConfig:
        acceleratorConfig:
          count: 4
          type: NVIDIA_TESLA_T4
      # Configure 3 parameter servers with no GPUs
      parameterServerCount: 3
      parameterServerType: n1-highmem-8
    

    The next example shows a configuration file for a job similar to the one above. However, this configuration uses legacy machine types that include GPUs instead of attaching GPUs with an acceleratorConfig:

    trainingInput:
      scaleTier: CUSTOM
      # Configure a master worker with 4 GPUs
      masterType: complex_model_m_gpu
      # Configure 9 workers, each with 4 GPUs
      workerCount: 9
      workerType: complex_model_m_gpu
      # Configure 3 parameter servers with no GPUs
      parameterServerCount: 3
      parameterServerType: large_model
    
  2. Use the gcloud command to submit the job, including a --config argument pointing to your config.yaml file. The following example assumes you've set up environment variables, indicated by a $ sign followed by capital letters, for the values of some arguments:

    gcloud ai-platform jobs submit training $JOB_NAME \
            --package-path $APP_PACKAGE_PATH \
            --module-name $MAIN_APP_MODULE \
            --job-dir $JOB_DIR \
            --region us-central1 \
            --config config.yaml \
            -- \
            --user_arg_1 value_1 \
             ...
            --user_arg_n value_n
    

Alternatively, you may specify cluster configuration details with command-line flags, rather than in a configuration file. Learn more about how to use these flags.

The following example shows how to submit a job with the same configuration as the first example (using Compute Engine machine types with GPUs attached), but it does so without using a config.yaml file:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $APP_PACKAGE_PATH \
        --module-name $MAIN_APP_MODULE \
        --job-dir $JOB_DIR \
        --region us-central1 \
        --scale-tier custom \
        --master-machine-type n1-highcpu-16 \
        --master-accelerator count=4,type=nvidia-tesla-t4 \
        --worker-count 9 \
        --worker-machine-type n1-highcpu-16 \
        --worker-accelerator count=4,type=nvidia-tesla-t4 \
        --parameter-server-count 3 \
        --parameter-server-machine-type n1-highmem-8 \
        -- \
        --user_arg_1 value_1 \
         ...
        --user_arg_n value_n
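
If you launch jobs from a script, you can assemble the same command as an argument list. The sketch below mirrors the config.yaml form of the command shown earlier; the function name and the sample values (job name, bucket, module) are hypothetical. To run the command, you could pass the list to subprocess.run, which requires an installed and authenticated gcloud.

```python
import shlex


def build_submit_command(job_name, package_path, module_name, job_dir,
                         region, config_path, user_args=()):
    """Build the argument list for `gcloud ai-platform jobs submit training`."""
    command = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--package-path", package_path,
        "--module-name", module_name,
        "--job-dir", job_dir,
        "--region", region,
        "--config", config_path,
    ]
    if user_args:
        # The bare "--" separates gcloud flags from the application's flags.
        command += ["--", *user_args]
    return command


cmd = build_submit_command(
    "my_job", "trainer/", "trainer.task", "gs://my-bucket/my_job",
    "us-central1", "config.yaml", user_args=["--user_arg_1", "value_1"])
print(shlex.join(cmd))
```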

Notes:

  • If you specify an option both in your configuration file (config.yaml) and as a command-line flag, the value on the command line overrides the value in the configuration file.
  • The empty -- flag marks the end of the gcloud-specific flags and the start of the USER_ARGS that you want to pass to your application.
  • Flags specific to AI Platform Training, such as --module-name, --runtime-version, and --job-dir, must come before the empty -- flag. The AI Platform Training service interprets these flags.
  • The --job-dir flag, if specified, must come before the empty -- flag, because AI Platform Training uses the --job-dir to validate the path.
  • Your application must handle the --job-dir flag too, if specified. Even though the flag comes before the empty --, the --job-dir is also passed to your application as a command-line flag.
  • You can define as many USER_ARGS as you need. AI Platform Training passes --user_first_arg, --user_second_arg, and so on, through to your application.
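
On the application side, the notes above mean that your entry point has to accept --job-dir alongside its own flags. A minimal sketch with argparse follows; --user_arg_1 stands in for whatever flags your application defines, and parse_known_args is used here so that flags the parser does not recognize are collected rather than causing an error.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that AI Platform Training passes to the application."""
    parser = argparse.ArgumentParser()
    # Passed through by the service even though it precedes the empty "--".
    parser.add_argument("--job-dir", default=None,
                        help="Output location for checkpoints and exports")
    # Placeholder for one of your own USER_ARGS.
    parser.add_argument("--user_arg_1", default=None)
    # Returns (parsed namespace, list of unrecognized arguments).
    return parser.parse_known_args(argv)


args, unknown = parse_args(
    ["--job-dir", "gs://my-bucket/my_job", "--user_arg_1", "value_1"])
print(args.job_dir, args.user_arg_1, unknown)
```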

For more details of the job submission options, see the guide to starting a training job.

Adjusting training code to utilize GPUs

If you use Keras or Estimators for your TensorFlow training job and want to train using a single VM with one GPU, then you do not need to customize your code for the GPU.

If your training cluster contains multiple GPUs, use the tf.distribute.Strategy API in your training code.
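
As a rough illustration for a single VM with several GPUs, one commonly used strategy is tf.distribute.MirroredStrategy. This document does not say which strategy to choose, so treat the choice below as an assumption. The sketch also assumes TensorFlow 2.x with Keras, and the model is a tiny placeholder.

```python
import tensorflow as tf


def build_model():
    """Build a tiny placeholder Keras model; swap in your own architecture."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


# MirroredStrategy replicates the model on each GPU visible on this VM.
# On a machine without GPUs it runs with a single CPU replica.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Create the model and optimizer inside the scope so their variables
# are mirrored across replicas.
with strategy.scope():
    model = build_model()
```

Calling model.fit afterwards works as usual; each batch is split across the replicas.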

To customize how TensorFlow assigns specific operations to GPUs, read the TensorFlow guide to using GPUs. In this case, it might also be helpful to learn how AI Platform Training sets the TF_CONFIG environment variable on each VM.
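
TF_CONFIG is a JSON string that describes the cluster and the current VM's task. A minimal sketch of reading it is shown below; the host names in the sample value are invented, and the helper name is illustrative.

```python
import json
import os


def read_tf_config(environ=os.environ):
    """Parse TF_CONFIG and return (cluster, task_type, task_index).

    Returns ({}, None, None) when the variable is not set.
    """
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return cluster, task.get("type"), task.get("index")


# Hypothetical value, shaped like what a worker VM might see.
example_env = {"TF_CONFIG": json.dumps({
    "cluster": {"master": ["host0:2222"], "worker": ["host1:2222"]},
    "task": {"type": "worker", "index": 0},
})}
print(read_tf_config(example_env))
```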

GPU device strings

A standard_gpu machine's single GPU is identified as "/gpu:0". Machines with multiple GPUs use identifiers starting with "/gpu:0", then "/gpu:1", and so on. For example, complex_model_m_gpu machines have four GPUs identified as "/gpu:0" through "/gpu:3".
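
The naming pattern can be generated with a one-line helper, sketched below. The function name is illustrative; in TensorFlow code you would pass one of these strings to tf.device to place operations on that GPU.

```python
def gpu_device_strings(num_gpus):
    """Return the device strings for a machine with `num_gpus` GPUs."""
    return [f"/gpu:{index}" for index in range(num_gpus)]


# A complex_model_m_gpu machine has four GPUs.
print(gpu_device_strings(4))  # ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
```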

Python packages on GPU-enabled machines

GPU-enabled machines come pre-installed with tensorflow-gpu, the TensorFlow Python package with GPU support. See the runtime version list for a list of all pre-installed packages.

Maintenance events

GPU-enabled VMs that run AI Platform Training jobs are occasionally subject to Compute Engine host maintenance. The VMs are configured to automatically restart after such maintenance events, but you may have to do some extra work to ensure that your job is resilient to these shutdowns. Configure your training application to regularly save model checkpoints (usually along the Cloud Storage path you specify through the --job-dir argument to gcloud ai-platform jobs submit training) and to restore the most recent checkpoint in the case that a checkpoint already exists.

TensorFlow Estimators implement this functionality for you, as long as you specify a model_dir. Estimators regularly save checkpoints to the model_dir and attempt to load from the latest checkpoint, so you do not have to worry about maintenance events on your GPU workers.

If you are training with Keras, use the ModelCheckpoint callback to regularly save training progress. If you are using tf.distribute.Strategy with Keras, your VMs use checkpoints to automatically recover from restarts. Otherwise, add logic to your training code to check for the existence of a recent checkpoint and restore from the checkpoint if it exists.
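
The check-and-restore logic can be sketched without any framework. The example below assumes a local directory and a made-up naming scheme (ckpt-<step>); a Cloud Storage path would need a file API that understands gs:// URLs, and with TensorFlow you would more likely rely on tf.train.latest_checkpoint.

```python
import os
import re

# Hypothetical naming scheme for this sketch: files named "ckpt-<step>".
CHECKPOINT_PATTERN = re.compile(r"^ckpt-(\d+)$")


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered checkpoint, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path


def starting_step(checkpoint_dir):
    """Return 0 if no checkpoint exists, else the step after the latest one."""
    path = latest_checkpoint(checkpoint_dir)
    if path is None:
        return 0
    step = int(CHECKPOINT_PATTERN.match(os.path.basename(path)).group(1))
    return step + 1
```

A training loop would call starting_step once at startup, load the model state from latest_checkpoint if it is not None, and continue from the returned step.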

For more advanced cases, read the TensorFlow guide to checkpoints.

What's next