Configuring distributed training for PyTorch

This document explains how to create a distributed PyTorch training job. When you create a distributed training job, AI Platform Training runs your code on a cluster of virtual machine (VM) instances, also known as nodes, with environment variables that support distributed PyTorch training. This can help your training job scale to handle a large amount of data.

This guide assumes that you are using pre-built PyTorch containers for training, as described in Getting started with PyTorch. Adapting your PyTorch code for distributed training requires minimal changes.

Specifying training cluster structure

For distributed PyTorch training, configure your job to use one master worker node and one or more worker nodes. These roles have the following behaviors:

  • Master worker: The VM with rank 0. This node sets up connections between the nodes in the cluster.
  • Worker: The remaining nodes in the cluster. Each node does a portion of training, as specified by your training application code.

To learn how to specify the master worker node and worker nodes for your training cluster, read Specifying machine types or scale tiers.
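
For example, if you submit your job with a config.yaml file, a custom cluster with one master worker and three workers might be defined as follows. This is a minimal sketch; the machine types and worker count are placeholder values, and the available options are described in Specifying machine types or scale tiers.

# config.yaml: illustrative cluster definition for a distributed job.
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-4    # machine type for the master worker
  workerType: n1-standard-4    # machine type for each worker
  workerCount: 3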

Specifying container images

When you create a training job, specify the Docker container image for the master worker in the trainingInput.masterConfig.imageUri field and the Docker container image for the workers in the trainingInput.workerConfig.imageUri field. See the list of pre-built PyTorch containers.

If you use the gcloud ai-platform jobs submit training command to create your training job, you can specify these fields with the --master-image-uri and --worker-image-uri flags.
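
For example, a job submission that uses the config.yaml file shown earlier and the same pre-built container image on every node might look like the following. The job name, region, bucket, and package settings are illustrative placeholders, and PYTORCH_IMAGE_URI stands for an image from the list of pre-built PyTorch containers.

gcloud ai-platform jobs submit training my_distributed_pytorch_job \
  --region=us-central1 \
  --config=config.yaml \
  --master-image-uri=PYTORCH_IMAGE_URI \
  --worker-image-uri=PYTORCH_IMAGE_URI \
  --module-name=trainer.task \
  --package-path=trainer/ \
  --job-dir=gs://BUCKET_NAME/pytorch-job-dir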

If you don't specify the trainingInput.workerConfig.imageUri field, its value defaults to the value of trainingInput.masterConfig.imageUri. It often makes sense to use the same pre-built PyTorch container on every node.

Updating your training code

In your training application, add the following code to initialize the training cluster:

import torch

torch.distributed.init_process_group(
    backend='BACKEND',
    init_method='env://'
)

Replace BACKEND with one of the supported distributed training backends described in the following section. The init_method='env://' keyword argument tells PyTorch to use environment variables to initialize communication in the cluster. Learn more in the Environment variables section of this guide.

Additionally, update your training code to use the torch.nn.parallel.DistributedDataParallel class. For example, if you have created a PyTorch module called model in your code, add the following line:

model = torch.nn.parallel.DistributedDataParallel(model)

To learn more about configuring distributed training, read the PyTorch documentation's guide to distributed training.
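
To show how these pieces fit together, here is a minimal end-to-end sketch of a training script. It assumes a CPU cluster using the gloo backend, and the model, dataset, and hyperparameters are placeholders for your own training code rather than anything required by AI Platform Training.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # AI Platform Training sets MASTER_ADDR, MASTER_PORT, RANK, and
    # WORLD_SIZE, so init_method='env://' needs no further arguments.
    torch.distributed.init_process_group(backend='gloo', init_method='env://')

    # Placeholder model; replace with your own module.
    model = nn.Linear(10, 1)

    # Wrap the model so that gradients are synchronized across all nodes.
    model = torch.nn.parallel.DistributedDataParallel(model)

    # Placeholder in-memory dataset; replace with your own data loading.
    dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))

    # DistributedSampler gives each node a distinct shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffles the data differently each epoch
        for features, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), targets)
            loss.backward()
            optimizer.step()


if __name__ == '__main__':
    main()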

Distributed training backends

AI Platform Training supports the following backends for distributed PyTorch training:

  • gloo: recommended for CPU training jobs
  • nccl: recommended for GPU training jobs

Read about the differences between backends.
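
If the same training application might run on either CPU or GPU machines, one option is to choose the backend at runtime. This is a sketch of that pattern, not something AI Platform Training requires:

import torch

# Use NCCL when GPUs are available; otherwise fall back to Gloo.
backend = 'nccl' if torch.cuda.is_available() else 'gloo'

torch.distributed.init_process_group(
    backend=backend,
    init_method='env://'
)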

Environment variables

When you create a distributed PyTorch training job, AI Platform Training sets the following environment variables on each node:

  • WORLD_SIZE: The total number of nodes in the cluster. This variable has the same value on every node.
  • RANK: A unique identifier for each node. On the master worker, this is set to 0. On each worker, it is set to a different value from 1 to WORLD_SIZE - 1.
  • MASTER_ADDR: The hostname of the master worker node. This variable has the same value on every node.
  • MASTER_PORT: The port that the master worker node communicates on. This variable has the same value on every node.

PyTorch uses these environment variables to initialize the cluster.
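
Because init_method='env://' reads these variables for you, most training code never touches them directly. They are still available through os.environ if you need them, for example to run certain steps only on the master worker. The checkpoint logic below is only an illustration:

import os

# AI Platform Training sets these variables on every node.
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])
master_addr = os.environ['MASTER_ADDR']
master_port = os.environ['MASTER_PORT']

print('Node %d of %d; master worker at %s:%s'
      % (rank, world_size, master_addr, master_port))

# A common pattern: only the master worker (rank 0) writes checkpoints.
if rank == 0:
    pass  # for example, save a model checkpoint here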

What's next