Training with TensorFlow 2

Train a machine learning model with TensorFlow 2 on AI Platform Training by using runtime version 2.1 or later. TensorFlow 2 simplifies many APIs from TensorFlow 1. The TensorFlow documentation provides a guide to migrating TensorFlow 1 code to TensorFlow 2.

Running a training job with TensorFlow 2 on AI Platform Training follows the same process as running other custom code training jobs. However, some AI Platform Training features work differently with TensorFlow 2 than with TensorFlow 1. This document summarizes these differences.

Python version support

Runtime versions 2.1 and later support training only with Python 3.7, so you must use Python 3.7 to train with TensorFlow 2.

The Python Software Foundation ended support for Python 2.7 on January 1, 2020. No AI Platform runtime versions released after January 1, 2020 support Python 2.7.

Distributed training

TensorFlow 2 provides an updated API for distributed training. Additionally, AI Platform Training sets the TF_CONFIG environment variable differently in runtime versions 2.1 and later. This section describes both changes.

Distribution strategies

To perform distributed training with multiple virtual machine (VM) instances in TensorFlow 2, use the tf.distribute.Strategy API. In particular, we recommend that you use the Keras API together with the MultiWorkerMirroredStrategy or, if you specify parameter servers for your job, the ParameterServerStrategy. Note, however, that TensorFlow currently provides only experimental support for these strategies.
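As a rough illustration of the Keras-plus-strategy pattern described above, the following sketch trains a tiny model under MultiWorkerMirroredStrategy. The function names, the toy model, the synthetic data, and the batch sizes are all hypothetical. The lookup of the strategy class is an assumption meant to cover TensorFlow releases where the class lives under tf.distribute.experimental as well as releases where it lives directly under tf.distribute. ParameterServerStrategy is not shown because its constructor differs between TensorFlow releases.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Return the batch size to give the dataset when each replica
    should process `per_replica_batch_size` examples per step."""
    if per_replica_batch_size < 1 or num_replicas < 1:
        raise ValueError("batch size and replica count must be positive")
    return per_replica_batch_size * num_replicas


def train(per_replica_batch_size=64, epochs=2):
    """Hypothetical training entry point; call it from your trainer module."""
    # Imported here so the helper above can be used without TensorFlow.
    import tensorflow as tf

    # Assumption: prefer the stable class when this TensorFlow release has
    # one, otherwise fall back to the experimental location.
    strategy_cls = getattr(tf.distribute, "MultiWorkerMirroredStrategy", None)
    if strategy_cls is None:
        strategy_cls = tf.distribute.experimental.MultiWorkerMirroredStrategy

    # Create the strategy before any other TensorFlow work. It reads the
    # TF_CONFIG environment variable to discover the other VMs.
    strategy = strategy_cls()

    batch_size = global_batch_size(
        per_replica_batch_size, strategy.num_replicas_in_sync)

    # Synthetic data, used here only to keep the sketch self-contained.
    features = tf.random.uniform((1024, 10))
    labels = tf.random.uniform((1024, 1))
    dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
               .shuffle(1024)
               .batch(batch_size))

    # Variables must be created inside the strategy scope so that they are
    # mirrored across replicas.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(10,)),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    model.fit(dataset, epochs=epochs)
    return model
```

When TF_CONFIG is not set, for example on a local machine, this strategy runs as a single worker, which makes it possible to try the same code locally before submitting a job.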

TF_CONFIG

TensorFlow expects a TF_CONFIG environment variable to be set on each VM used for training. AI Platform Training automatically sets this environment variable on each VM used in your training job. This lets each VM behave differently depending on its type, and it helps the VMs communicate with each other.

In runtime versions 2.1 and later, AI Platform Training no longer uses the master task type in any TF_CONFIG environment variables. Instead, your training job's master worker is labeled with the chief type in the TF_CONFIG environment variable. Learn more about how AI Platform Training sets the TF_CONFIG environment variable.
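Training code sometimes needs to know which task it is running as, for example so that only one VM exports the final model. The following is a minimal sketch of reading the task type from TF_CONFIG. The helper names are hypothetical, and accepting both chief and master is an assumption intended to let the same code run on runtime versions before and after 2.1.

```python
import json
import os


def read_task(environ=None):
    """Return (task_type, task_index) parsed from TF_CONFIG.

    Returns (None, 0) when TF_CONFIG is unset or empty, as is typical
    when the code runs outside a distributed training job.
    """
    environ = os.environ if environ is None else environ
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    task = tf_config.get("task", {})
    return task.get("type"), int(task.get("index", 0))


def is_chief(task_type):
    """Return True if this task should do chief-only work.

    Assumption: a missing task type means single-VM training, and
    "master" is accepted for runtime versions earlier than 2.1.
    """
    return task_type in (None, "chief", "master")
```

For example, with a hypothetical two-VM cluster:

```python
example_env = {
    "TF_CONFIG": json.dumps({
        # Host names and ports are made up for illustration.
        "cluster": {"chief": ["host-0:2222"], "worker": ["host-1:2222"]},
        "task": {"type": "worker", "index": 0},
    })
}
task_type, task_index = read_task(example_env)
print(task_type, task_index, is_chief(task_type))  # worker 0 False
```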

Accelerators for training

AI Platform Training lets you accelerate your training jobs with GPUs and TPUs.

GPUs

To learn how to use GPUs for training, read the AI Platform Training guide to configuring GPUs and TensorFlow's guide to using GPUs.

If you want to train on a single VM with multiple GPUs, the best practice is to use TensorFlow's MirroredStrategy. If you want to train using multiple VMs with GPUs, the best practice is to use TensorFlow's MultiWorkerMirroredStrategy.
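The recommendation above can be sketched as a small selection helper. The function names are hypothetical. Falling back to TensorFlow's default strategy for a single VM with at most one GPU is an assumption, since the guidance above only covers the multi-GPU and multi-VM cases.

```python
def recommended_strategy_name(num_vms, gpus_per_vm):
    """Map a job topology to the name of a tf.distribute strategy class.

    Returns None when no distribution strategy is needed (assumption:
    a single VM with zero or one GPU).
    """
    if num_vms < 1 or gpus_per_vm < 0:
        raise ValueError("num_vms must be >= 1 and gpus_per_vm must be >= 0")
    if num_vms > 1:
        return "MultiWorkerMirroredStrategy"
    if gpus_per_vm > 1:
        return "MirroredStrategy"
    return None


def create_strategy(num_vms, gpus_per_vm):
    """Instantiate the recommended strategy for the given topology."""
    # Imported here so the helper above can be used without TensorFlow.
    import tensorflow as tf

    name = recommended_strategy_name(num_vms, gpus_per_vm)
    if name is None:
        # The default (no-op) strategy.
        return tf.distribute.get_strategy()
    # Assumption: some releases expose MultiWorkerMirroredStrategy only
    # under tf.distribute.experimental, so check both locations.
    strategy_cls = getattr(tf.distribute, name, None)
    if strategy_cls is None:
        strategy_cls = getattr(tf.distribute.experimental, name)
    return strategy_cls()
```

In a trainer, the result would typically be used as `with create_strategy(num_vms, gpus_per_vm).scope():` around model construction and compilation.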

TPUs

To learn how to use TPUs for training, read the guide to training with TPUs.

Hyperparameter tuning

If you are running a hyperparameter tuning job with TensorFlow 2, you might need to adjust how your training code reports your hyperparameter tuning metric to the AI Platform Training service.

If you are training with an Estimator, you can write your metric to a summary in the same way that you do in TensorFlow 1. If you are training with Keras, we recommend that you use tf.summary.scalar to write a summary.
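For Keras, one way to write the metric with tf.summary.scalar is from a callback at the end of each epoch. The sketch below is illustrative only: the function names, the default metric name val_loss, and the log_dir argument are hypothetical. It assumes that the tag you write matches the metric tag configured for your hyperparameter tuning job, and that the summaries are written to a location the service reads. Check the guide to reporting your hyperparameter metric for both.

```python
def metric_from_logs(logs, metric_name):
    """Extract one metric from the `logs` dict Keras passes to callbacks."""
    if not logs or metric_name not in logs:
        available = sorted(logs) if logs else []
        raise KeyError(
            "metric %r not found in Keras logs; available: %s"
            % (metric_name, available))
    return float(logs[metric_name])


def make_hptuning_callback(log_dir, metric_name="val_loss",
                           metric_tag="val_loss"):
    """Build a Keras callback that writes `metric_name` as a scalar summary.

    `metric_tag` is the summary tag; it is assumed to match the metric tag
    in the tuning job's configuration.
    """
    # Imported here so the helper above can be used without TensorFlow.
    import tensorflow as tf

    writer = tf.summary.create_file_writer(log_dir)

    def on_epoch_end(epoch, logs=None):
        value = metric_from_logs(logs, metric_name)
        with writer.as_default():
            tf.summary.scalar(metric_tag, value, step=epoch)
        # Flush so the value is on disk as soon as the epoch ends.
        writer.flush()

    return tf.keras.callbacks.LambdaCallback(on_epoch_end=on_epoch_end)
```

The callback would then be passed to training, for example `model.fit(dataset, validation_data=val_dataset, callbacks=[make_hptuning_callback(log_dir)])`, where `val_dataset` and `log_dir` are placeholders for your own values.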

Learn more about reporting your hyperparameter metric and see examples of how to do so in TensorFlow 2.

What's next