This document explains how to create a distributed PyTorch training job. When you create a distributed training job, AI Platform Training runs your code on a cluster of virtual machine (VM) instances, also known as nodes, with environment variables that support distributed PyTorch training. This can help your training job scale to handle a large amount of data.
This guide assumes that you are using pre-built PyTorch containers for training, as described in Getting started with PyTorch. Adapting your PyTorch code for distributed training requires minimal changes.
Specifying training cluster structure
For distributed PyTorch training, configure your job to use one master worker node and one or more worker nodes. These roles have the following behaviors:
- Master worker: The VM with rank 0. This node sets up connections between the nodes in the cluster.
- Worker: The remaining nodes in the cluster. Each node does a portion of training, as specified by your training application code.
To learn how to specify the master worker node and worker nodes for your training cluster, read Specifying machine types or scale tiers.
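As a sketch of what such a cluster configuration might look like, the following models the relevant fields as a Python dict. The field names (scaleTier, masterType, workerType, workerCount) follow the AI Platform Training TrainingInput API; the machine types and counts here are illustrative, not recommendations:

```python
# Sketch of a TrainingInput configuration for a distributed job with one
# master worker and two workers. Machine type names are illustrative.
training_input = {
    "scaleTier": "CUSTOM",          # required when setting machine types explicitly
    "masterType": "n1-standard-8",  # the rank-0 master worker
    "workerType": "n1-standard-8",  # machine type for the remaining nodes
    "workerCount": 2,               # number of worker nodes besides the master
}

# The total cluster size (WORLD_SIZE) is the master plus all workers.
world_size = 1 + training_input["workerCount"]
print(world_size)  # -> 3
```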
Specifying container images
When you create a training job, specify the image of a Docker container for the master worker to use in the masterConfig.imageUri field, and specify the image of a Docker container for each worker to use in the workerConfig.imageUri field. See the list of pre-built PyTorch containers.
If you use the gcloud ai-platform jobs submit training command to create your training job, you can specify these fields with the --master-image-uri and --worker-image-uri flags. However, if you don't specify the workerConfig.imageUri field, its value defaults to the value of masterConfig.imageUri. It often makes sense to use the same pre-built PyTorch container on every node.
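The defaulting behavior can be sketched in a few lines of Python. The field names follow the AI Platform Training REST API; the container URI and the helper function are hypothetical, for illustration only:

```python
# Sketch: masterConfig.imageUri and workerConfig.imageUri in a job request
# body. The container URI is an illustrative placeholder.
job_body = {
    "masterConfig": {"imageUri": "gcr.io/my-project/my-pytorch-image"},
    "workerConfig": {},  # imageUri deliberately omitted
}

def effective_worker_image(body):
    """Hypothetical helper: workerConfig.imageUri falls back to the master
    worker's image when it is not set explicitly."""
    worker = body.get("workerConfig", {})
    return worker.get("imageUri") or body["masterConfig"]["imageUri"]

print(effective_worker_image(job_body))  # -> gcr.io/my-project/my-pytorch-image
```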
Updating your training code
In your training application, add the following code to initialize the training cluster:
import torch

torch.distributed.init_process_group(
    backend='BACKEND',
    init_method='env://'
)
Replace BACKEND with one of the supported distributed training backends described in the following section. The init_method='env://' argument tells PyTorch to use environment variables to initialize communication in the cluster. Learn more in the Environment variables section of this guide.
Additionally, update your training code to use the
torch.nn.parallel.DistributedDataParallel class. For example, if you have
created a PyTorch module called
model in your code, add the following line:
model = torch.nn.parallel.DistributedDataParallel(model)
To learn more about configuring distributed training, read the PyTorch documentation's guide to distributed training.
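Putting the two snippets above together, a minimal setup sketch might look like the following. The helper name and the model are hypothetical stand-ins; the sketch assumes the environment variables that AI Platform Training sets on each node (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are present, which is what init_method='env://' relies on:

```python
import torch
import torch.nn as nn

def setup_model():
    """Hypothetical helper: initialize the process group from environment
    variables, then wrap a model for distributed training."""
    # 'env://' reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from
    # the environment, which AI Platform Training sets on every node.
    torch.distributed.init_process_group(backend='gloo', init_method='env://')

    model = nn.Linear(10, 1)  # stand-in for your real model
    # DistributedDataParallel averages gradients across all nodes on each
    # backward pass.
    model = nn.parallel.DistributedDataParallel(model)
    return model

# In your training application, call setup_model() once per node, before
# the training loop starts.
```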
Distributed training backends
AI Platform Training supports the following backends for distributed PyTorch training:
- gloo: recommended for CPU training jobs
- nccl: recommended for GPU training jobs
Read about the differences between backends.
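Because the recommendation depends on whether the job trains on GPUs, a common pattern is to pick the backend at runtime. A minimal sketch, with a hypothetical helper name:

```python
def pick_backend(gpu_available: bool) -> str:
    """Hypothetical helper: choose the recommended distributed backend.

    nccl is recommended for GPU training jobs; gloo for CPU jobs.
    """
    return 'nccl' if gpu_available else 'gloo'

# In practice you would typically pass torch.cuda.is_available() here.
print(pick_backend(False))  # -> gloo
```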
Environment variables
When you create a distributed PyTorch training job, AI Platform Training sets the following environment variables on each node:
- WORLD_SIZE: The total number of nodes in the cluster. This variable has the same value on every node.
- RANK: A unique identifier for each node. On the master worker, it is set to 0. On each worker, it is set to a different value from 1 to WORLD_SIZE - 1.
- MASTER_ADDR: The hostname of the master worker node. This variable has the same value on every node.
- MASTER_PORT: The port that the master worker node communicates on. This variable has the same value on every node.
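For debugging, you can inspect these variables directly in your training code. A minimal sketch using only the standard library; the fallback values are only so the snippet also runs on a single local machine, outside a training cluster:

```python
import os

# AI Platform Training sets these on every node; the fallbacks below make
# the snippet runnable locally as well.
world_size = int(os.environ.get('WORLD_SIZE', '1'))
rank = int(os.environ.get('RANK', '0'))
master_addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
master_port = os.environ.get('MASTER_PORT', '29500')

is_master = rank == 0  # the master worker always has rank 0
print(f'node {rank} of {world_size}, master at {master_addr}:{master_port}')
```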
What's next
- To learn more about training with PyTorch on AI Platform Training, follow the Getting started with PyTorch tutorial.
- To learn more about distributed PyTorch training in general, read the PyTorch documentation's guide to distributed training.