Distributed training with containers

This page explains how distributed training is structured when you run training jobs with custom containers on AI Platform Training.

With custom containers, you can do distributed training with any ML framework that supports distribution. Although the terminology used here is based on TensorFlow's distributed model, you can use any other ML framework that has a similar distribution structure. For example, distributed training in MXNet uses a scheduler, workers, and servers. This corresponds to the distributed training structure of AI Platform Training custom containers, which uses a master, workers, and parameter servers.

Structure of the training cluster

When you run a distributed training job with AI Platform Training, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. Following the distributed TensorFlow model, each replica in the training cluster is given a single role, or task, in distributed training:

  • Master worker: Exactly one replica is designated the master worker (also known as the chief worker). This task manages the others and reports status for the job as a whole.

  • Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.

  • Parameter server(s): One or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.

  • Evaluator(s): One or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.
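For example, a training cluster that uses all four roles might be described in a config.yaml file like the following sketch. The machine types and counts are illustrative placeholders, not recommendations:

```yaml
trainingInput:
  scaleTier: CUSTOM
  # Exactly one master worker (its type is required for a CUSTOM scale tier).
  masterType: n1-standard-8
  # Two worker replicas.
  workerType: n1-standard-8
  workerCount: 2
  # One parameter server replica.
  parameterServerType: n1-standard-4
  parameterServerCount: 1
  # One evaluator replica.
  evaluatorType: n1-standard-4
  evaluatorCount: 1
```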

API mapping

The four different roles you can assign to machines in your training cluster correspond to four fields you can specify in TrainingInput, which represents the input parameters for a training job:

  • masterConfig.imageUri represents the container image URI to be run on the master worker.
  • workerConfig.imageUri, parameterServerConfig.imageUri, and evaluatorConfig.imageUri represent the container image URIs to be run on worker(s), parameter server(s), and evaluator(s) respectively. If no value is set for these fields, AI Platform Training uses the value of masterConfig.imageUri.

You can also set each of these fields with its corresponding flag in gcloud ai-platform jobs submit training:

  • For the master worker config, use --master-image-uri.
  • For the worker config, use --worker-image-uri.
  • For the parameter server config, use --parameter-server-image-uri.
  • There is not currently a flag for specifying the container image URI for evaluators. You can specify evaluatorConfig.imageUri in a config.yaml configuration file.
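Putting these flags together, a job submission might look like the following sketch. The job name, region, and image URIs are placeholders; the evaluator image is supplied through config.yaml because it has no dedicated flag:

```shell
# config.yaml (placeholder image URI):
# trainingInput:
#   evaluatorConfig:
#     imageUri: gcr.io/my-project/my-evaluator-image

gcloud ai-platform jobs submit training my_distributed_job \
  --region=us-central1 \
  --master-image-uri=gcr.io/my-project/my-master-image \
  --worker-image-uri=gcr.io/my-project/my-worker-image \
  --parameter-server-image-uri=gcr.io/my-project/my-ps-image \
  --config=config.yaml
```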

See an example of how to submit a distributed training job with custom containers.

Understanding CLUSTER_SPEC

AI Platform Training populates an environment variable, CLUSTER_SPEC, on every replica to describe how the overall cluster is set up. Like TensorFlow's TF_CONFIG, CLUSTER_SPEC describes every replica in the cluster, including its index and role (master worker, worker, parameter server, or evaluator).

When you run distributed training with TensorFlow, TF_CONFIG is parsed to build tf.train.ClusterSpec. Similarly, when you run distributed training with other ML frameworks, you must parse CLUSTER_SPEC to populate any environment variables or settings required by the framework.
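As a minimal sketch of that parsing step, the following reads CLUSTER_SPEC from the environment and extracts the current replica's role and index. The sample value set here is hypothetical, for illustration only; in a real job AI Platform Training sets the variable for you:

```python
import json
import os

# Hypothetical CLUSTER_SPEC value for illustration; on AI Platform Training
# this environment variable is already set on every replica.
os.environ.setdefault("CLUSTER_SPEC", json.dumps({
    "cluster": {
        "master": ["master-replica-0:2222"],
        "worker": ["worker-replica-0:2222", "worker-replica-1:2222"],
        "ps": ["ps-replica-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
}))

cluster_spec = json.loads(os.environ["CLUSTER_SPEC"])

# The role of this replica: "master", "worker", "ps", or "evaluator".
task_type = cluster_spec["task"]["type"]
# The zero-based index of this replica within its role.
task_index = cluster_spec["task"]["index"]

# From here, populate whatever settings your framework needs, for example
# the host lists in cluster_spec["cluster"].
print(task_type, task_index)
```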

The format of CLUSTER_SPEC

The CLUSTER_SPEC environment variable is a JSON string with the following format:

  • "cluster": The cluster description for your custom container. As with TF_CONFIG, this object is formatted as a TensorFlow cluster specification, and can be passed to the constructor of tf.train.ClusterSpec.

  • "task": Describes the task of the particular node on which your code is running. You can use this information to write code for specific workers in a distributed job. This entry is a dictionary with the following keys:

      • "type": The type of task performed by this node. Possible values are master, worker, ps, and evaluator.

      • "index": The zero-based index of the task. Most distributed training jobs have a single master task, one or more parameter servers, and one or more workers.

      • "trial": The identifier of the hyperparameter tuning trial currently running. When you configure hyperparameter tuning for your job, you set a number of trials to train. This value gives you a way to differentiate in your code between the trials that are running. The identifier is a string value containing the trial number, starting at 1.

  • "job": The job parameters you used when you initiated the job. In most cases, you can ignore this entry, as it replicates the data passed to your application through its command-line arguments.
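Assembled from these keys, a CLUSTER_SPEC value might look like the following. The hostnames and ports are illustrative, and "trial" appears only when the job runs hyperparameter tuning:

```json
{
  "cluster": {
    "master": ["master-replica-0:2222"],
    "worker": ["worker-replica-0:2222", "worker-replica-1:2222"],
    "ps": ["ps-replica-0:2222"]
  },
  "task": {
    "type": "worker",
    "index": 0
  },
  "job": "{...}"
}
```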

Comparison to TF_CONFIG

Note that AI Platform Training sets the TF_CONFIG environment variable on each replica of every training job, but sets the CLUSTER_SPEC environment variable only on replicas of custom container training jobs. The two environment variables share some values, but they have different formats.

When you train with custom containers, the master replica is labeled in the TF_CONFIG environment variable with the master task name by default. You can configure it to be labeled with the chief task name instead by setting the trainingInput.useChiefInTfConfig field to true when you create your training job, or by using one or more evaluator replicas in your job. This is especially helpful if your custom container uses TensorFlow 2.
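Because the master replica's label in TF_CONFIG depends on this configuration, a common pattern is to accept both labels when deciding whether the current replica is the chief. A minimal sketch, using a hypothetical TF_CONFIG value for illustration:

```python
import json
import os

# Hypothetical TF_CONFIG value for illustration; by default, a custom
# container job labels the master replica with the "master" task name.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "master": ["master-replica-0:2222"],
        "worker": ["worker-replica-0:2222"],
    },
    "task": {"type": "master", "index": 0},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])

# Accept both labels so the same code works whether or not
# trainingInput.useChiefInTfConfig is set to true for the job.
is_chief = tf_config["task"]["type"] in ("chief", "master")
```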

Besides this configuration option, distributed training with TensorFlow works the same way when you use custom containers as when you use an AI Platform Training runtime version. See more details and examples on how to use TF_CONFIG for distributed training on AI Platform Training.

What's next