This document explains how to create a distributed PyTorch training job. When you create a distributed training job, AI Platform Training runs your code on a cluster of virtual machine (VM) instances, also known as nodes, with environment variables that support distributed PyTorch training. This can help your training job scale to handle a large amount of data.
This guide assumes that you are using pre-built PyTorch containers for training, as described in Getting started with PyTorch. Adapting your PyTorch code for distributed training requires minimal changes.
Specifying training cluster structure
For distributed PyTorch training, configure your job to use one master worker node and one or more worker nodes. These roles have the following behaviors:
- Master worker: The VM with rank 0. This node sets up connections between the nodes in the cluster.
- Worker: The remaining nodes in the cluster. Each node does a portion of training, as specified by your training application code.
To learn how to specify the master worker node and worker nodes for your training cluster, read Specifying machine types or scale tiers.
Specifying container images
When you create a training job, specify the image of a Docker container for the master worker to use in the trainingInput.masterConfig.imageUri field, and specify the image of a Docker container for each worker to use in the trainingInput.workerConfig.imageUri field. See the list of pre-built PyTorch containers.
If you use the gcloud ai-platform jobs submit training command to create your training job, you can specify these fields with the --master-image-uri and --worker-image-uri flags.
If you don't specify the trainingInput.workerConfig.imageUri field, its value defaults to the value of trainingInput.masterConfig.imageUri. It often makes sense to use the same pre-built PyTorch container on every node.
Updating your training code
In your training application, add the following code to initialize the training cluster:
import torch

torch.distributed.init_process_group(
    backend='BACKEND',
    init_method='env://'
)
Replace BACKEND with one of the supported distributed training backends described in the following section. The init_method='env://' keyword argument tells PyTorch to use environment variables to initialize communication in the cluster. Learn more in the Environment variables section of this guide.
Additionally, update your training code to use the torch.nn.parallel.DistributedDataParallel class. For example, if you have created a PyTorch module called model in your code, add the following line:

model = torch.nn.parallel.DistributedDataParallel(model)
To learn more about configuring distributed training, read the PyTorch documentation's guide to distributed training.
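Putting these pieces together, the following is a minimal sketch of how an updated training loop might look. The model, dataset, and hyperparameters here are hypothetical placeholders (a simple linear model trained on random data); substitute your own PyTorch module and data loading code.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Initialize the cluster from the environment variables that
# AI Platform Training sets on each node.
torch.distributed.init_process_group(
    backend='gloo',  # use 'nccl' instead for GPU training
    init_method='env://'
)

# Hypothetical model; replace with your own PyTorch module.
model = nn.Linear(10, 1)
model = torch.nn.parallel.DistributedDataParallel(model)

# Hypothetical dataset; DistributedSampler gives each node a different
# shard of the data.
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    sampler.set_epoch(epoch)  # shuffle differently on each epoch
    for features, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), targets)
        loss.backward()
        optimizer.step()

Using torch.utils.data.distributed.DistributedSampler is optional, but it is a common way to make sure each node trains on a different shard of the dataset instead of duplicating work.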
Distributed training backends
AI Platform Training supports the following backends for distributed PyTorch training:
- gloo: recommended for CPU training jobs
- nccl: recommended for GPU training jobs
Read about the differences between backends.
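For example, one common pattern is to choose the backend at runtime based on whether the node has GPUs available. This is an illustrative sketch, not a requirement:

import torch

# Use NCCL when GPUs are available on the node; otherwise fall back to Gloo.
backend = 'nccl' if torch.cuda.is_available() else 'gloo'

torch.distributed.init_process_group(
    backend=backend,
    init_method='env://'
)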
Environment variables
When you create a distributed PyTorch training job, AI Platform Training sets the following environment variables on each node:
- WORLD_SIZE: The total number of nodes in the cluster. This variable has the same value on every node.
- RANK: A unique identifier for each node. On the master worker, this is set to 0. On each worker, it is set to a different value from 1 to WORLD_SIZE - 1.
- MASTER_ADDR: The hostname of the master worker node. This variable has the same value on every node.
- MASTER_PORT: The port that the master worker node communicates on. This variable has the same value on every node.
PyTorch uses these environment variables to initialize the cluster.
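Because init_method='env://' reads these variables automatically, you usually don't need to handle them yourself. They can still be useful in your own code, for example to log cluster information or to restrict one-time work to the master worker. The following sketch assumes it runs on an AI Platform Training node where these variables are set:

import os

# These environment variables are set by AI Platform Training on each node.
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])
master_addr = os.environ['MASTER_ADDR']
master_port = os.environ['MASTER_PORT']

print('Node {} of {}, master worker at {}:{}'.format(
    rank, world_size, master_addr, master_port))

# Perform one-time work, such as writing checkpoints or logs, only on the
# master worker so that every node doesn't write the same output.
is_master = rank == 0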
What's next
- To learn more about training with PyTorch on AI Platform Training, follow the Getting started with PyTorch tutorial.
- To learn more about distributed PyTorch training in general, read the PyTorch documentation's guide to distributed training.