Training at scale

The following tips apply to large datasets and/or large models.

Single vs. distributed training

If you create a TensorFlow training application or a custom container, you can perform distributed training on AI Platform Training.
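
For example, a distributed TensorFlow job can request a cluster of workers and parameter servers through the same kind of configuration file described under Large models below. A minimal sketch, where the machine types and counts are illustrative assumptions rather than recommendations:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  workerCount: 2
  parameterServerType: standard
  parameterServerCount: 1

Predefined scale tiers such as STANDARD_1 provide a similar distributed layout without a custom configuration file.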

AI Platform Training does not support distributed training for scikit-learn or XGBoost jobs. If your training application uses one of these frameworks, use only the scale tier or custom machine type configurations that correspond to a single worker instance.
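
For example, a scikit-learn or XGBoost job can be submitted with a single-worker tier such as BASIC. A minimal sketch with a hypothetical job name, package paths, and version values:

gcloud ai-platform jobs submit training sklearn_single_worker \
  --job-dir gs://your-bucket/sklearn-job \
  --package-path ./trainer \
  --module-name trainer.task \
  --region us-central1 \
  --runtime-version 2.1 \
  --python-version 3.7 \
  --scale-tier BASIC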

Large datasets

When working with large datasets, downloading the entire dataset to the training worker VM and loading it into pandas may not scale. In these cases, consider using TensorFlow's file_io module to stream-read the data directly from Cloud Storage (this module is preinstalled on the VM):

import os

from io import StringIO

import pandas as pd
from tensorflow.python.lib.io import file_io

# data_dir, iris_data_filename and iris_target_filename point to your data in
# Cloud Storage, for example data_dir = 'gs://your-bucket/data'.

# Access iris data from Cloud Storage
iris_data_filestream = file_io.FileIO(os.path.join(data_dir, iris_data_filename),
                                      mode='r')
iris_data = pd.read_csv(StringIO(iris_data_filestream.read())).values
iris_target_filestream = file_io.FileIO(os.path.join(data_dir,
                                                     iris_target_filename),
                                        mode='r')
iris_target = pd.read_csv(StringIO(iris_target_filestream.read())).values
iris_target = iris_target.reshape((iris_target.size,))


# Your training program goes here
...


# Close all filestreams
iris_data_filestream.close()
iris_target_filestream.close()
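
On newer runtime versions, the public tf.io.gfile module provides the same streaming reads from Cloud Storage. A minimal sketch, assuming TensorFlow 2.x and a hypothetical bucket path:

import os
from io import StringIO

import pandas as pd
import tensorflow as tf

# Hypothetical Cloud Storage location; replace with your own bucket and file.
data_dir = 'gs://your-bucket/data'
iris_data_filename = 'iris_data.csv'

# tf.io.gfile streams the object from Cloud Storage rather than requiring a
# local copy on the worker VM.
with tf.io.gfile.GFile(os.path.join(data_dir, iris_data_filename), mode='r') as f:
    iris_data = pd.read_csv(StringIO(f.read())).values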

Large models

To request training worker VMs with more memory, set scale-tier to CUSTOM and set the masterType in an accompanying configuration file. For more details, refer to the scale tier documentation.

To do this:

  1. Create config.yaml locally with the following contents:

    trainingInput:
      masterType: large_model
    
  2. Submit your job:

    CONFIG=path/to/config.yaml
    
    gcloud ai-platform jobs submit training $JOB_NAME \
      --job-dir $JOB_DIR \
      --package-path $TRAINER_PACKAGE_PATH \
      --module-name $MAIN_TRAINER_MODULE \
      --region us-central1 \
      --runtime-version=$RUNTIME_VERSION \
      --python-version=$PYTHON_VERSION \
      --scale-tier CUSTOM \
      --config $CONFIG
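
The command above assumes the referenced shell variables are already set. Hypothetical values for illustration only (adjust to your own project and bucket):

JOB_NAME=large_model_job_1
JOB_DIR=gs://your-bucket/large-model-job
TRAINER_PACKAGE_PATH=./trainer
MAIN_TRAINER_MODULE=trainer.task
RUNTIME_VERSION=2.1
PYTHON_VERSION=3.7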
    