When you run a training job, AI Platform Training sets an environment variable
called TF_CONFIG
on each virtual machine (VM) instance that is part of your
job. Your training code, which runs on each VM, can use the TF_CONFIG
environment variable to access details about the training job and the role of
the VM that it is running on.
TensorFlow uses the TF_CONFIG
environment variable to facilitate distributed
training, but you likely don't have to access it directly in your training code.
This document describes the TF_CONFIG
environment variable and its usage in
distributed TensorFlow jobs and hyperparameter tuning jobs.
The format of TF_CONFIG
AI Platform Training sets the TF_CONFIG
environment variable on every VM of every
training job to meet the specifications that TensorFlow requires for
distributed
training.
However, AI Platform Training also sets additional fields in the TF_CONFIG
environment
variable beyond what TensorFlow requires.
The TF_CONFIG
environment variable is a JSON string with the following format:
TF_CONFIG fields
Field | Description
---|---
cluster | The TensorFlow cluster description: a dictionary mapping one or more task names (for example, chief, worker, ps, or master) to lists of network addresses where those tasks run. For a given training job, this dictionary is the same on every VM, and it is a valid first argument for the tf.train.ClusterSpec constructor. Learn about the difference between chief and master later in this document.
task | The task description of the VM where this environment variable is set. For a given training job, this dictionary is different on every VM. You can use this information to customize what code runs on each VM in a distributed training job. You can also use it to change the behavior of your training code for different trials of a hyperparameter tuning job. This dictionary includes key-value pairs such as type (the type of task that this VM performs, for example chief, worker, or ps), index (the zero-based index of the task within its type), and, for hyperparameter tuning jobs, trial (the number of the current trial, as a string). The example later in this document also shows a cloud field, an identifier that AI Platform Training uses internally.
job | The job parameters that you used when you created the training job, represented as a dictionary.
environment | The string cloud.
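Because the cluster dictionary is a valid first argument for the tf.train.ClusterSpec constructor, training code that needs a cluster specification can build one straight from TF_CONFIG. The following is a minimal sketch; the empty-dictionary fallback is only an assumption for running the snippet outside of a training job:

```python
import json
import os

import tensorflow as tf

# A minimal sketch: build a tf.train.ClusterSpec from the cluster field of
# TF_CONFIG. The '{}' fallback is an assumption so the snippet also runs
# outside of an AI Platform Training job, where TF_CONFIG is not set.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))

cluster_spec = tf.train.ClusterSpec(tf_config.get('cluster', {}))
print(cluster_spec.jobs)  # for example: ['chief', 'ps', 'worker']
```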
For custom container training jobs, AI Platform Training sets an additional environment variable called CLUSTER_SPEC, which has a similar format to TF_CONFIG but with several important differences. Learn about the CLUSTER_SPEC environment variable.
Example
The following example code prints the TF_CONFIG environment variable to your training logs:
import json
import os
tf_config_str = os.environ.get('TF_CONFIG')
tf_config_dict = json.loads(tf_config_str)
# Convert back to string just for pretty printing
print(json.dumps(tf_config_dict, indent=2))
In a hyperparameter tuning job that runs in runtime version 2.1 or later and
uses a master worker, two workers, and a parameter server, this code produces
the following log for one of the workers during the first hyperparameter tuning
trial. The example output hides the job
field for conciseness and replaces
some IDs with generic values.
{
"cluster": {
"chief": [
"cmle-training-chief-[ID_STRING_1]-0:2222"
],
"ps": [
"cmle-training-ps-[ID_STRING_1]-0:2222"
],
"worker": [
"cmle-training-worker-[ID_STRING_1]-0:2222",
"cmle-training-worker-[ID_STRING_1]-1:2222"
]
},
"environment": "cloud",
"job": {
...
},
"task": {
"cloud": "[ID_STRING_2]",
"index": 0,
"trial": "1",
"type": "worker"
}
}
chief versus master
The master worker VM in AI Platform Training corresponds to the chief task type in TensorFlow. While TensorFlow can appoint a worker task to act as chief, AI Platform Training always explicitly designates a chief.
master
is a deprecated task type in TensorFlow. master
represented a task
that performed a similar role as chief
but also acted as an evaluator
in
some configurations. TensorFlow 2 does not support TF_CONFIG
environment
variables that contain a master
task.
AI Platform Training uses chief
in the cluster
and task
fields of the TF_CONFIG
environment variable if any of the following are true:
- You are running a training job that uses runtime version 2.1 or later.
- You have configured your training job to use one or more evaluators. In other words, you have set your job's trainingInput.evaluatorCount to 1 or greater.
- Your job uses a custom container and you have set your job's trainingInput.useChiefInTfConfig to true.
Otherwise, for compatibility reasons, AI Platform Training uses the deprecated master task type instead of chief.
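If your training code needs to know whether it is running on the master worker, one compatible approach is to treat chief and the deprecated master task type the same way. The following is a minimal sketch; the is_chief helper is a hypothetical name, not part of any AI Platform Training or TensorFlow API:

```python
import json
import os

def is_chief():
    """Returns True if this VM is the master worker.

    Hypothetical helper: treats the deprecated master task type the same as
    chief so the same code works across runtime versions.
    """
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    task_type = tf_config.get('task', {}).get('type')
    return task_type in ('chief', 'master')
```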
When to use TF_CONFIG
As mentioned in a previous section, you likely don't need to interact with the
TF_CONFIG
environment variable directly in your training code. Only access the TF_CONFIG
environment variable if TensorFlow's distribution strategies and
AI Platform Training's standard hyperparameter tuning workflow, both described in the
next sections, do not work for your job.
Distributed training
AI Platform Training sets the TF_CONFIG
environment variable to extend the
specifications that TensorFlow requires for distributed
training.
To perform distributed training with TensorFlow, use the
tf.distribute.Strategy
API.
In particular, we recommend that you use the Keras API together with the
MultiWorkerMirroredStrategy
or, if you specify parameter servers for your job, the ParameterServerStrategy.
However, note that TensorFlow currently only provides experimental support for
these strategies.
These distribution strategies use the TF_CONFIG
environment variable to assign
roles to each VM in your training job and to facilitate communication between
the VMs. You do not need to access the TF_CONFIG
environment variable directly
in your training code, because TensorFlow handles it for you.
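For example, a Keras training loop that uses MultiWorkerMirroredStrategy never reads TF_CONFIG itself; the strategy discovers the cluster and this VM's role from the environment variable. The following is a minimal sketch with a placeholder model and dataset, assuming a TensorFlow 2 runtime:

```python
import tensorflow as tf

# The strategy reads TF_CONFIG on its own to discover the cluster and this
# VM's role; the training code never parses the variable directly.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; replace with your own architecture.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Placeholder dataset; replace with your own input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([640, 10]), tf.random.uniform([640, 1]))
).batch(32)

model.fit(dataset, epochs=3)
```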
Only parse the TF_CONFIG
environment variable directly if you want to
customize how the different VMs running your training job behave.
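For example, you might branch on the task field to decide which VM exports the final model or how each worker labels its logs. A minimal sketch with placeholder behaviors:

```python
import json
import os

# A minimal sketch: branch on the task description to customize what each VM
# does. The printed behaviors are illustrative placeholders only.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task = tf_config.get('task', {})
task_type = task.get('type')
task_index = task.get('index')

if task_type in ('chief', 'master'):
    print('Master worker: for example, export the final model from this VM.')
elif task_type == 'worker':
    print('Worker %s: for example, write worker-specific logs here.' % task_index)
elif task_type == 'ps':
    print('Parameter server: TensorFlow handles this role for you.')
```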
Hyperparameter tuning
When you run a hyperparameter tuning job, AI Platform Training provides different arguments to your training code for each trial. Your training code does not necessarily need to be aware of what trial is currently running. In addition, AI Platform Training provides tools for monitoring the progress of hyperparameter tuning jobs.
If needed, your code can read the current trial number from the trial
field
of the task
field of the TF_CONFIG
environment variable.
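For example, a job that writes per-trial output can append the trial number to its output path so that trials don't overwrite each other. A minimal sketch; the Cloud Storage path is a hypothetical placeholder:

```python
import json
import os

# A minimal sketch: append the trial number (when present) to an output path
# so that each hyperparameter tuning trial writes to its own directory.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
trial = tf_config.get('task', {}).get('trial', '')

output_dir = 'gs://your-bucket/model-output'  # hypothetical base path
if trial:
    output_dir = os.path.join(output_dir, trial)
```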
What's next
- Work through a tutorial in the TensorFlow documentation about Multi-worker training with Keras
- Learn about distributed training with custom containers in AI Platform Training.
- Learn how to implement hyperparameter tuning for your training jobs.