TF_CONFIG and distributed training

When you run a training job, AI Platform Training sets an environment variable called TF_CONFIG on each virtual machine (VM) instance that is part of your job. Your training code, which runs on each VM, can use the TF_CONFIG environment variable to access details about the training job and the role of the VM that it is running on.

TensorFlow uses the TF_CONFIG environment variable to facilitate distributed training, but you likely don't have to access it directly in your training code. This document describes the TF_CONFIG environment variable and its usage in distributed TensorFlow jobs and hyperparameter tuning jobs.

The format of TF_CONFIG

AI Platform Training sets the TF_CONFIG environment variable on every VM of every training job to meet the specifications that TensorFlow requires for distributed training. However, AI Platform Training also sets additional fields in the TF_CONFIG environment variable beyond what TensorFlow requires.

The TF_CONFIG environment variable is a JSON string with the following format:
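The following skeleton, abridged from the example output shown later in this document, illustrates the overall shape of the JSON. The trial field appears only in hyperparameter tuning jobs, and the contents of the job field are omitted here; each field is described below:

{
  "cluster": {
    "chief": ["<host>:<port>"],
    "worker": ["<host>:<port>"],
    "ps": ["<host>:<port>"]
  },
  "task": {
    "type": "worker",
    "index": 0,
    "trial": "1",
    "cloud": "<internal ID>"
  },
  "job": { },
  "environment": "cloud"
}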

TF_CONFIG fields
cluster

The TensorFlow cluster description. A dictionary mapping one or more task names (chief, worker, ps, or master) to lists of network addresses where these tasks are running. For a given training job, this dictionary is the same on every VM.

This is a valid first argument for the tf.train.ClusterSpec constructor; see the sketch after the field descriptions below. Note that this dictionary never contains evaluator as a key, since evaluators are not considered part of the training cluster even if you use them for your job.

Learn about the difference between chief and master in the chief versus master section of this document.

task

The task description of the VM where this environment variable is set. For a given training job, this dictionary is different on every VM. You can use this information to customize what code runs on each VM in a distributed training job. You can also use it to change the behavior of your training code for different trials of a hyperparameter tuning job.

This dictionary includes the following key-value pairs:

task fields
type

The type of task that this VM is performing. This value is set to worker on workers, ps on parameter servers, and evaluator on evaluators. On your job's master worker, the value is set to either chief or master; learn more about the difference between the two in the chief versus master section of this document.

index

The zero-based index of the task. For example, if your training job includes two workers, this value is set to 0 on one of them and 1 on the other.

trial

The ID of the hyperparameter tuning trial currently running on this VM. This field is only set if the current training job is a hyperparameter tuning job.

For hyperparameter tuning jobs, AI Platform Training runs your training code repeatedly in many trials with different hyperparameters each time. This field contains the current trial number, starting at 1 for the first trial.

cloud

An ID used internally by AI Platform Training. You can ignore this field.

The remaining fields are top-level fields of the TF_CONFIG environment variable, alongside cluster and task:

job

The TrainingInput that you provided to create the current training job, represented as a dictionary.

environment

The string cloud.
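Most jobs never need to read these fields directly (see the When to use TF_CONFIG section later in this document). For illustration only, the following minimal sketch shows how training code could build a tf.train.ClusterSpec from the cluster field and branch on the task field; the fallback to an empty dictionary is an assumption of this sketch so that the same script also runs outside AI Platform Training:

import json
import os

import tensorflow as tf

# TF_CONFIG is only set on AI Platform Training VMs; fall back to an
# empty configuration so that the script can also run locally.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))

cluster = tf_config.get('cluster', {})
task = tf_config.get('task', {})

# The cluster dictionary is a valid first argument for
# tf.train.ClusterSpec.
if cluster:
    cluster_spec = tf.train.ClusterSpec(cluster)
    print('Task names in the cluster:', cluster_spec.jobs)

# The task dictionary identifies the role of this particular VM.
task_type = task.get('type')
task_index = task.get('index')

if task_type in ('chief', 'master'):
    print('Running on the master worker')
elif task_type == 'worker':
    print('Running on worker {}'.format(task_index))
elif task_type == 'ps':
    print('Running on parameter server {}'.format(task_index))
elif task_type == 'evaluator':
    print('Running on an evaluator')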

For custom container training jobs, AI Platform Training sets an additional environment variable called CLUSTER_SPEC, which has a similar format to TF_CONFIG but with several important differences. Learn about the CLUSTER_SPEC environment variable.

Example

The following example code prints the TF_CONFIG environment variable to your training logs:

import json
import os

tf_config_str = os.environ.get('TF_CONFIG')
tf_config_dict = json.loads(tf_config_str)

# Convert back to string just for pretty printing
print(json.dumps(tf_config_dict, indent=2))

In a hyperparameter tuning job that runs in runtime version 2.1 or later and uses a master worker, two workers, and a parameter server, this code produces the following log for one of the workers during the first hyperparameter tuning trial. The example output hides the job field for conciseness and replaces some IDs with generic values.

{
  "cluster": {
    "chief": [
      "cmle-training-chief-[ID_STRING_1]-0:2222"
    ],
    "ps": [
      "cmle-training-ps-[ID_STRING_1]-0:2222"
    ],
    "worker": [
      "cmle-training-worker-[ID_STRING_1]-0:2222",
      "cmle-training-worker-[ID_STRING_1]-1:2222"
    ]
  },
  "environment": "cloud",
  "job": {
    ...
  },
  "task": {
    "cloud": "[ID_STRING_2]",
    "index": 0,
    "trial": "1",
    "type": "worker"
  }
}

chief versus master

The master worker VM in AI Platform Training corresponds to the chief task type in TensorFlow. While TensorFlow can appoint a worker task to act as chief, AI Platform Training always explicitly designates a chief.

master is a deprecated task type in TensorFlow. master represented a task that performed a similar role as chief but also acted as an evaluator in some configurations. TensorFlow 2 does not support TF_CONFIG environment variables that contain a master task.

AI Platform Training uses chief in the cluster and task fields of the TF_CONFIG environment variable if any of the following are true:

- Your training job uses runtime version 2.1 or later.
- You set the trainingInput.useChiefInTfConfig field to true when you create your training job.

Otherwise, for compatibility reasons, AI Platform Training uses the deprecated master task type instead of chief.
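If your training code inspects the task type itself and needs to run on runtime versions that use either name, one option (an assumption of this sketch, not a requirement of AI Platform Training) is to treat master the same way as chief:

import json
import os

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task_type = tf_config.get('task', {}).get('type', '')

# Treat the deprecated master task type the same as chief so that the
# same code works whichever name AI Platform Training assigns.
is_chief = task_type in ('chief', 'master')

if is_chief:
    print('This VM is the master worker.')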

When to use TF_CONFIG

As mentioned in a previous section, you likely don't need to interact with the TF_CONFIG environment variable directly in your training code. Only access the TF_CONFIG environment variable if TensorFlow's distribution strategies and AI Platform Training's standard hyperparameter tuning workflow, both described in the next sections, do not work for your job.

Distributed training

AI Platform Training sets the TF_CONFIG environment variable to extend the specifications that TensorFlow requires for distributed training.

To perform distributed training with TensorFlow, use the tf.distribute.Strategy API. In particular, we recommend that you use the Keras API together with the MultiWorkerMirroredStrategy or, if you specify parameter servers for your job, the ParameterServerStrategy. However, note that TensorFlow currently only provides experimental support for these strategies.

These distribution strategies use the TF_CONFIG environment variable to assign roles to each VM in your training job and to facilitate communication between the VMs. You do not need to access the TF_CONFIG environment variable directly in your training code, because TensorFlow handles it for you.
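For example, a minimal multi-worker Keras sketch might look like the following. The model, the synthetic data, and the training settings are placeholders for illustration; on runtime version 2.1 the strategy class lives under tf.distribute.experimental, while newer TensorFlow releases also expose it as tf.distribute.MultiWorkerMirroredStrategy:

import numpy as np
import tensorflow as tf

# The strategy reads TF_CONFIG itself on each VM, so the training code
# does not parse the environment variable directly.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Create and compile the model inside the strategy scope so that its
# variables are mirrored across the workers in the cluster.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Synthetic data stands in for a real input pipeline.
features = np.random.random((256, 10)).astype('float32')
labels = np.random.random((256, 1)).astype('float32')

# Every VM in the job runs the same fit() call; the strategy divides
# the work based on the roles assigned in TF_CONFIG.
model.fit(features, labels, epochs=2, batch_size=32)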

Only parse the TF_CONFIG environment variable directly if you want to customize how the different VMs running your training job behave.

Hyperparameter tuning

When you run a hyperparameter tuning job, AI Platform Training provides different arguments to your training code for each trial. Your training code does not necessarily need to be aware of what trial is currently running. In addition, AI Platform Training provides tools for monitoring the progress of hyperparameter tuning jobs.

If needed, your code can read the current trial number from the trial field of the task field of the TF_CONFIG environment variable.
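For example, the following minimal sketch reads a hyperparameter from a command-line argument and the trial number from TF_CONFIG; the --learning_rate argument is a hypothetical hyperparameter name used only for illustration:

import argparse
import json
import os

# Hyperparameter values arrive as command-line arguments; the argument
# name here is hypothetical and must match the parameterName values
# you configure for your tuning job.
parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.01)
args = parser.parse_args()

# The trial field is a string (for example "1") and is only present in
# hyperparameter tuning jobs.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
trial = tf_config.get('task', {}).get('trial', '')

if trial:
    print('Trial {} running with learning rate {}'.format(
        trial, args.learning_rate))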
