Graphics Processing Units (GPUs) can significantly accelerate the training process for many deep learning models. Training models for tasks like image classification, video analysis, and natural language processing involves compute-intensive matrix multiplication and other operations that can take advantage of a GPU's massively parallel architecture.
Training a deep learning model that involves intensive compute tasks on extremely large datasets can take days to run on a single processor. However, if you design your program to offload those tasks to one or more GPUs, you can reduce training time to hours instead of days.
Before you begin
Cloud Machine Learning Engine lets you run any TensorFlow training application on a GPU-enabled machine. Read the TensorFlow guide to using GPUs and the section below on assigning ops to GPUs to ensure your application makes use of available GPUs.
Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.
Requesting GPU-enabled machines
To use GPUs in the cloud, configure your training job to access GPU-enabled machines:
- Set the scale tier to CUSTOM.
- Configure each task (master, worker, or parameter server) to use one of the
GPU-enabled machine types below, based on the number of GPUs and the type of
accelerator required for your task:
  - standard_gpu: A single NVIDIA Tesla K80 GPU
  - complex_model_m_gpu: Four NVIDIA Tesla K80 GPUs
  - complex_model_l_gpu: Eight NVIDIA Tesla K80 GPUs
  - standard_p100: A single NVIDIA Tesla P100 GPU
  - complex_model_m_p100: Four NVIDIA Tesla P100 GPUs
  - standard_v100: A single NVIDIA Tesla V100 GPU (Beta)
  - large_model_v100: A single NVIDIA Tesla V100 GPU (Beta)
  - complex_model_m_v100: Four NVIDIA Tesla V100 GPUs (Beta)
  - complex_model_l_v100: Eight NVIDIA Tesla V100 GPUs (Beta)
The section on submitting the training job, below, shows an example that uses the gcloud command.
Alternatively, if you are learning how to use Cloud ML Engine or
experimenting with GPU-enabled machines, you can set the scale tier to
BASIC_GPU to get a single worker instance with a single NVIDIA Tesla K80 GPU.
See more information about comparing machine types.
In addition, you need to run your job in a region that supports GPUs. The following regions currently provide access to GPUs:
To fully understand the available regions for Cloud ML Engine services, including model training and online/batch prediction, read the guide to regions.
Submitting the training job
You can submit your training job using the gcloud ml-engine jobs submit training command. Define a config.yaml file that describes the GPU options you want. The structure of the YAML file represents the Job resource. For example:
    trainingInput:
      scaleTier: CUSTOM
      masterType: complex_model_m_gpu
      workerType: complex_model_m_gpu
      parameterServerType: large_model
      workerCount: 9
      parameterServerCount: 3
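As a quick sanity check before submitting, it can help to work out how many GPUs a configuration requests. The following is a minimal sketch, not part of any Cloud ML Engine tooling: total_gpus is a hypothetical helper, the GPU counts are copied from the machine-type list earlier in this guide, and any machine type missing from that list (such as large_model) is assumed to have no GPUs.

```python
# GPUs per machine type, copied from the machine-type list in this guide.
# Types not listed here (for example large_model) are assumed to be CPU-only.
GPUS_PER_MACHINE = {
    "standard_gpu": 1,
    "complex_model_m_gpu": 4,
    "complex_model_l_gpu": 8,
    "standard_p100": 1,
    "complex_model_m_p100": 4,
    "standard_v100": 1,
    "large_model_v100": 1,
    "complex_model_m_v100": 4,
    "complex_model_l_v100": 8,
}


def total_gpus(training_input):
    """Count the GPUs requested by a trainingInput-style dict (hypothetical helper)."""
    def gpus(machine_type):
        return GPUS_PER_MACHINE.get(machine_type, 0)

    # Assumption: a job has exactly one master task.
    master = gpus(training_input.get("masterType"))
    workers = training_input.get("workerCount", 0) * gpus(
        training_input.get("workerType"))
    servers = training_input.get("parameterServerCount", 0) * gpus(
        training_input.get("parameterServerType"))
    return master + workers + servers


# The same values as the config.yaml example above.
example = {
    "scaleTier": "CUSTOM",
    "masterType": "complex_model_m_gpu",
    "workerType": "complex_model_m_gpu",
    "parameterServerType": "large_model",
    "workerCount": 9,
    "parameterServerCount": 3,
}

# 4 (master) + 9 * 4 (workers) + 0 (parameter servers) = 40
print(total_gpus(example))
```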
Use the gcloud command to submit the job, including a --config argument pointing to your config.yaml file. The following example assumes you've set up environment variables, indicated by a $ sign followed by capital letters, for the values of some arguments:
    gcloud ml-engine jobs submit training $JOB_NAME \
        --package-path $APP_PACKAGE_PATH \
        --module-name $MAIN_APP_MODULE \
        --job-dir $JOB_DIR \
        --region us-central1 \
        --config config.yaml \
        -- \
        --user_arg_1 value_1 \
        ...
        --user_arg_n value_n
- If you specify an option both in your configuration file (config.yaml) and as a command-line flag, the value on the command line overrides the value in the configuration file.
- The empty -- flag marks the end of the gcloud-specific flags and the start of the USER_ARGS that you want to pass to your application.
- Flags specific to Cloud ML Engine, such as --job-dir, must come before the empty -- flag. The Cloud ML Engine service interprets these flags.
- The --job-dir flag, if specified, must come before the empty -- flag, because Cloud ML Engine uses the --job-dir to validate the path.
- Your application must handle the --job-dir flag too, if specified. Even though the flag comes before the empty -- flag, the --job-dir is also passed to your application as a command-line flag.
- You can define as many USER_ARGS as you need. Cloud ML Engine passes --user_first_arg, --user_second_arg, and so on, through to your application.
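On the application side, the points above mean your entry-point module has to accept --job-dir alongside its own arguments. The following is a minimal sketch using Python's standard argparse module. The user flag names are hypothetical placeholders, and using parse_known_args to tolerate unexpected flags is a design choice for this sketch rather than a requirement of the service.

```python
import argparse


def parse_args(argv=None):
    """Parse --job-dir plus two hypothetical user arguments."""
    parser = argparse.ArgumentParser()
    # Passed through to the application when --job-dir is given to gcloud.
    parser.add_argument("--job-dir", default=None,
                        help="Location for checkpoints and other outputs")
    # Hypothetical application-specific flags, placed after the empty -- flag.
    parser.add_argument("--user_first_arg", default=None)
    parser.add_argument("--user_second_arg", default=None)
    # parse_known_args collects unrecognized flags instead of exiting.
    return parser.parse_known_args(argv)


# Example call with an explicit argument list (illustrative values only).
args, unknown = parse_args([
    "--job-dir", "gs://my-bucket/my-job",
    "--user_first_arg", "value_1",
])
print(args.job_dir)          # argparse maps --job-dir to the job_dir attribute
print(args.user_first_arg)
print(unknown)
```

Passing an explicit list, as in the example call, makes the parser easy to exercise locally. Calling parse_args() with no argument reads the real command line instead.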
For more details of the job submission options, see the guide to starting a training job.
Assigning ops to GPUs
To make use of the GPUs on a machine, make the appropriate changes to your TensorFlow training application:
- High-level Estimator API: No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs.
- Core TensorFlow API: You must assign ops to run on GPU-enabled machines. This process is the same as using GPUs with TensorFlow locally. You can use tf.train.replica_device_setter to assign ops to devices.
When you assign a GPU-enabled machine to a Cloud ML Engine process, that process has exclusive access to that machine's GPUs; you can't share the GPUs of a single machine in your cluster among multiple processes. The process corresponds to the distributed TensorFlow task in your cluster specification. The distributed TensorFlow documentation describes cluster specifications and tasks.
GPU device strings
A standard_gpu machine's single GPU is identified as "/gpu:0". Machines with multiple GPUs use identifiers starting with "/gpu:0", then "/gpu:1", and so on. For example, complex_model_m_gpu machines have four GPUs identified as "/gpu:0" through "/gpu:3".
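The numbering scheme is simple enough to generate in code. The following is a minimal sketch in plain Python, with no TensorFlow required, that builds the device strings for a machine and spreads work across them in round-robin order. Both helper functions and the tower names are hypothetical. In a TensorFlow application, each resulting string would typically be passed to tf.device when building that part of the graph.

```python
def gpu_device_strings(num_gpus):
    """Return the device strings for a machine with num_gpus GPUs."""
    return ["/gpu:%d" % i for i in range(num_gpus)]


def assign_round_robin(items, devices):
    """Pair each item with a device, cycling through the devices in order."""
    return [(item, devices[i % len(devices)]) for i, item in enumerate(items)]


# A machine with four GPUs, such as complex_model_m_gpu.
devices = gpu_device_strings(4)
print(devices)  # ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']

# Hypothetical tower names, purely for illustration.
towers = ["tower_0", "tower_1", "tower_2", "tower_3", "tower_4"]
placement = assign_round_robin(towers, devices)
print(placement[-1])  # ('tower_4', '/gpu:0'): the fifth tower wraps around
```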
Python packages on GPU-enabled machines
Maintenance events
If you use GPUs in your training jobs, be aware that the underlying virtual machines will occasionally be subject to Compute Engine host maintenance. The GPU-enabled virtual machines used in your training jobs are configured to automatically restart after such maintenance events, but you may have to do some extra work to ensure that your job is resilient to these shutdowns. Configure your training application to regularly save model checkpoints (usually along the Cloud Storage path you specify through the --job-dir argument to gcloud ml-engine jobs submit training) and to restore the most recent checkpoint if one already exists.
The TensorFlow Estimator API implements this functionality for you, so if your model is already wrapped in an Estimator, you do not have to worry about maintenance events on your GPU workers.
If it is not feasible for you to wrap your model in a TensorFlow Estimator and you want your GPU-enabled training jobs to be resilient to maintenance events, you must write the checkpoint saving and restoration functionality into your model manually. TensorFlow provides some useful resources for such an implementation in the tf.train module, specifically tf.train.checkpoint_exists and tf.train.latest_checkpoint.
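The save-and-resume pattern itself is independent of TensorFlow. The following is a minimal, framework-agnostic sketch that shows the control flow using only the Python standard library. Everything in it is an assumption made for illustration: the ckpt-N.json naming, the JSON state, the toy weight update, and a local temporary directory standing in for the --job-dir location. In a TensorFlow application, tf.train.latest_checkpoint would play the role of the hypothetical latest_checkpoint helper here, and the model's own save and restore calls would replace the JSON reads and writes.

```python
import json
import os
import re
import tempfile


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered checkpoint, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = re.match(r"ckpt-(\d+)\.json$", name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path


def save_checkpoint(checkpoint_dir, state):
    """Write to a temporary file, then rename, so no partial file is ever read."""
    final_path = os.path.join(checkpoint_dir, "ckpt-%d.json" % state["step"])
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)


def train(checkpoint_dir, total_steps, save_every, stop_after=None):
    """Run (or resume) a toy training loop; stop_after simulates a shutdown."""
    path = latest_checkpoint(checkpoint_dir)
    if path:
        with open(path) as f:
            state = json.load(f)  # resume from the most recent checkpoint
    else:
        state = {"step": 0, "weight": 0.0}  # fresh start
    while state["step"] < total_steps:
        state["step"] += 1
        state["weight"] += 0.5  # stand-in for a real training update
        if state["step"] % save_every == 0:
            save_checkpoint(checkpoint_dir, state)
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulated maintenance event
    return state


job_dir = tempfile.mkdtemp()  # local stand-in for the --job-dir location
train(job_dir, total_steps=10, save_every=2, stop_after=5)  # interrupted at step 5
final = train(job_dir, total_steps=10, save_every=2)        # resumes from step 4
print(final["step"], final["weight"])  # 10 5.0
```

Only the work done since the last checkpoint is lost: the interrupted run stops at step 5, and the second run picks up from the checkpoint written at step 4. A shorter save interval reduces the repeated work at the cost of more frequent writes.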