Distributed training with containers

This page explains how custom containers support the structure of distributed training on AI Platform.

With custom containers, you can do distributed training with any ML framework that supports distribution. Although the terminology used here is based on TensorFlow's distributed model, you can use any other ML framework that has a similar distribution structure. For example, distributed training in MXNet uses a scheduler, workers, and servers. This corresponds to the distributed training structure of AI Platform custom containers, which uses a master, workers, and parameter servers.

Structure of the training cluster

When you run a distributed training job with AI Platform, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify; the config.yaml sketch after the following list shows one way to request them. Your training job running on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is assigned a single role or task in distributed training:

  • Master: Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole.

  • Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.

  • Parameter server(s): One or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
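
For example, you can describe this cluster shape in a config.yaml file that you pass when you submit the job. The following is a minimal sketch; the machine types and replica counts are placeholders, not recommendations:

  trainingInput:
    scaleTier: CUSTOM
    masterType: n1-standard-4            # exactly one master replica
    workerType: n1-standard-4
    workerCount: 2                       # two worker replicas
    parameterServerType: n1-standard-4
    parameterServerCount: 1              # one parameter server replica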

API mapping

The three different roles you can assign to machines in your training cluster correspond to three fields you can specify in TrainingInput, which represents the input parameters for a training job:

  • masterConfig.imageUri represents the container image URI to be run on the master.
  • workerConfig.imageUri and parameterServerConfig.imageUri represent the container image URIs to be run on the worker(s) and parameter server(s), respectively. If no value is set for these fields, AI Platform uses the value of masterConfig.imageUri. The config.yaml sketch after this list shows how to set all three fields.
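
For example, extending the config.yaml sketch above, the image URI fields could be set as follows. The gcr.io paths are placeholders for your own container images:

  trainingInput:
    # ...machine types and counts from the earlier sketch...
    masterConfig:
      imageUri: gcr.io/your-project/training-image   # placeholder URI
    workerConfig:
      imageUri: gcr.io/your-project/training-image   # optional; defaults to masterConfig.imageUri
    parameterServerConfig:
      imageUri: gcr.io/your-project/training-image   # optional; defaults to masterConfig.imageUri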

You can also set the value of each of these fields with its corresponding flag in gcloud beta ai-platform jobs submit training, as the command sketch after this list shows:

  • For the master config, use --master-image-uri.
  • For the worker config, use --worker-image-uri.
  • For the parameter server config, use --parameter-server-image-uri.
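
For example, a submission that combines these flags with the cluster shape defined in a config.yaml file might look like the following sketch. The job name, region, and image URIs are placeholders:

  gcloud beta ai-platform jobs submit training my_distributed_job \
    --region us-central1 \
    --config config.yaml \
    --master-image-uri gcr.io/your-project/training-image \
    --worker-image-uri gcr.io/your-project/training-image \
    --parameter-server-image-uri gcr.io/your-project/training-image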

See an example of how to submit a distributed training job with custom containers.

Understanding CLUSTER_SPEC

AI Platform populates an environment variable, CLUSTER_SPEC, on every replica to describe how the overall cluster is set up. Like TensorFlow's TF_CONFIG, CLUSTER_SPEC describes every replica in the cluster, including its index and role (master, worker, or parameter server).

When you run distributed training with TensorFlow, TF_CONFIG is parsed to build tf.train.ClusterSpec. Similarly, when you run distributed training with other ML frameworks, you must parse CLUSTER_SPEC to populate any environment variables or settings required by the framework.
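
For example, a replica written in Python could read the variable like this. This is a minimal, framework-agnostic sketch; how you hand the values to your framework will differ:

  import json
  import os

  # CLUSTER_SPEC is a JSON string that AI Platform sets on every replica.
  cluster_spec = json.loads(os.environ.get("CLUSTER_SPEC", "{}"))

  # "cluster" maps each role to a list of host:port addresses.
  cluster = cluster_spec.get("cluster", {})
  # "task" identifies the replica this process is running on.
  task = cluster_spec.get("task", {})

  task_type = task.get("type", "master")   # "master", "worker", or "ps"
  task_index = task.get("index", 0)        # zero-based index within that role

  print("Running as %s %d in a cluster with %d worker(s) and %d parameter server(s)"
        % (task_type, task_index,
           len(cluster.get("worker", [])), len(cluster.get("ps", []))))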

The format of CLUSTER_SPEC

The CLUSTER_SPEC environment variable is a JSON string with the following format (an illustrative example follows the list):

  • "cluster": The cluster description for your custom container. As with TF_CONFIG, this object is formatted as a TensorFlow cluster specification and can be passed to the constructor of tf.train.ClusterSpec.

  • "task": Describes the task of the particular node on which your code is running. You can use this information to write code for specific workers in a distributed job. This entry is a dictionary with the following keys:

      • "type": The type of task performed by this node. Possible values are master, worker, and ps.

      • "index": The zero-based index of the task. Most distributed training jobs have a single master task, one or more parameter servers, and one or more workers.

      • "trial": The identifier of the hyperparameter tuning trial currently running. When you configure hyperparameter tuning for your job, you set a number of trials to train. This value gives you a way to differentiate in your code between trials that are running. The identifier is a string value containing the trial number, starting at 1.

  • "job": The job parameters you used when you initiated the job. In most cases, you can ignore this entry, as it replicates the data passed to your application through its command-line arguments.

Compatibility with TensorFlow

Distributed training with TensorFlow works the same way when you use custom containers on AI Platform as when you run your training code on AI Platform runtime versions. In both cases, AI Platform populates TF_CONFIG for you. If you use custom containers to run distributed training with TensorFlow, AI Platform populates both TF_CONFIG and CLUSTER_SPEC for you.

If you're using the TensorFlow Estimator API, then TensorFlow parses TF_CONFIG to build the cluster spec automatically. If you're using the core API, you must build the cluster spec from TF_CONFIG in your training application.
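
In the core API case, the parsing step might look like the following sketch, assuming TensorFlow is installed in your container; error handling is omitted:

  import json
  import os

  import tensorflow as tf

  # TF_CONFIG has the same "cluster"/"task" structure described above for CLUSTER_SPEC.
  tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

  # The "cluster" entry can be passed directly to the tf.train.ClusterSpec constructor.
  cluster_spec = tf.train.ClusterSpec(tf_config.get("cluster", {}))

  task = tf_config.get("task", {"type": "master", "index": 0})
  # task["type"] and task["index"] identify this replica within cluster_spec.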

See more details and examples on how to use TF_CONFIG for distributed training on AI Platform.
