This page explains how custom containers support the structure of distributed training on AI Platform.
With custom containers, you can do distributed training with any ML framework that supports distribution. Although the terminology used here is based on TensorFlow's distributed model, you can use any other ML framework that has a similar distribution structure. For example, distributed training in MXNet uses a scheduler, workers and servers. This corresponds with the distributed training structure of AI Platform custom containers, which uses a master, workers and parameter servers.
Structure of the training cluster
If you run a distributed training job with AI Platform, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:
Master: Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole.
Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
Parameter server(s): One or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
The three different roles you can assign to machines in your training cluster
correspond to three fields you can specify in
which represents the input parameters for a training job:
masterConfig.imageUrirepresents the container image URI to be run on master.
parameterServerConfig.imageUrirepresent the container image URI to be run on worker(s) and parameter server(s), respectively. If no value is set for these fields, AI Platform uses the value of
You can also set the values for each of these fields with their corresponding
gcloud beta ml-engine jobs submit training:
- For the master config, use
- For the worker config, use
- For the parameter server config, use
See an example of how to submit a distributed training job with custom containers.
AI Platform populates an environment variable,
on every replica to describe how the overall cluster is set up. Like
CLUSTER_SPEC describes every replica in
the cluster, including its index and role (master, worker, or parameter server).
When you run distributed training with TensorFlow,
TF_CONFIG is parsed to
Similarly, when you run distributed training with other ML frameworks, you must
CLUSTER_SPEC to populate any environment variables or settings required
by the framework.
The format of
CLUSTER_SPEC environment variable is a JSON string with the following
||The cluster description for your custom container. As with `TF_CONFIG`, this object is formatted as a TensorFlow cluster specification, and can be passed to the constructor of tf.train.ClusterSpec.|
||Describes the task of the particular node on which your code is running. You can use this information to write code for specific workers in a distributed job. This entry is a dictionary with the following keys:|
The type of task performed by this node. Possible values are
||The zero-based index of the task. Most distributed training jobs have a single master task, one or more parameter servers, and one or more workers.|
||The identifier of the hyperparameter tuning trial currently running. When you configure hyperparameter tuning for your job, you set a number of trials to train. This value gives you a way to differentiate in your code between trials that are running. The identifier is a string value containing the trial number, starting at 1.|
||The job parameters you used when you initiated the job. In most cases, you can ignore this entry, as it replicates the data passed to your application through its command-line arguments.|
Compatibility with TensorFlow
Distributed training with TensorFlow works the same way with custom containers
on AI Platform as with your training code running on
AI Platform runtime versions. In both cases, AI Platform
TF_CONFIG for you. If you use custom containers to run distributed
training with TensorFlow, AI Platform populates both
CLUSTER_SPEC for you.
If you're using the TensorFlow Estimator API, then TensorFlow parses
to build the cluster spec automatically. If you're using the core API, you must
build the cluster spec from
TF_CONFIG in your training application.
See more details and examples on how to
TF_CONFIG for distributed training on AI Platform.