The following tips apply to large datasets and/or large models.
Single vs. distributed training
If you create a TensorFlow training application or a custom container, you can perform distributed training on AI Platform Training.
If you train with a pre-built PyTorch container, you can perform distributed PyTorch training.
You can only perform distributed training for XGBoost by using the built-in distributed XGBoost algorithm.
AI Platform Training does not support distributed training for scikit-learn. If your training application uses this framework, use only the scale tier or custom machine type configurations that correspond to a single worker instance.
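For example, a scikit-learn job can be kept on a single worker by submitting it with the BASIC scale tier. The following command is a minimal sketch; $JOB_NAME, $JOB_DIR, $TRAINER_PACKAGE_PATH, $MAIN_TRAINER_MODULE, $RUNTIME_VERSION, and $PYTHON_VERSION are placeholders for your own job settings:

gcloud ai-platform jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $TRAINER_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier BASIC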
Large datasets
When dealing with large datasets, it's possible that downloading the entire dataset into the training worker VM and loading it into pandas does not scale. In these cases, consider using TensorFlow's stream-read file_io API (this API is preinstalled on the VM).
import os
from io import StringIO

import pandas as pd
from tensorflow.python.lib.io import file_io

# data_dir, iris_data_filename, and iris_target_filename are assumed to be
# defined elsewhere in the trainer (for example, parsed from command-line
# arguments).

# Access iris data from Cloud Storage
iris_data_filestream = file_io.FileIO(
    os.path.join(data_dir, iris_data_filename), mode='r')
iris_data = pd.read_csv(StringIO(iris_data_filestream.read())).values

iris_target_filestream = file_io.FileIO(
    os.path.join(data_dir, iris_target_filename), mode='r')
iris_target = pd.read_csv(StringIO(iris_target_filestream.read())).values
iris_target = iris_target.reshape((iris_target.size,))

# Your training program goes here
...

# Close all filestreams
iris_data_filestream.close()
iris_target_filestream.close()
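If even a streamed read is too large to hold in memory at once, one option is to read the file in chunks. The following is a minimal sketch under that assumption; gs://your-bucket/large_dataset.csv is a hypothetical path, and the file_io stream is passed to pandas directly, with chunksize controlling how many rows are read per iteration:

import pandas as pd
from tensorflow.python.lib.io import file_io

# Hypothetical Cloud Storage path; replace with your own dataset.
large_data_filestream = file_io.FileIO('gs://your-bucket/large_dataset.csv',
                                       mode='r')

# Read the CSV in fixed-size chunks instead of loading it all at once.
for chunk in pd.read_csv(large_data_filestream, chunksize=100000):
    # Process each chunk here, for example with an estimator that supports
    # incremental training.
    pass

large_data_filestream.close()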
Large models
Training worker VMs with higher memory needs can be requested by setting scale-tier to CUSTOM and setting the masterType via an accompanying config file. For more details, refer to the scale tier documentation.
To do this:
Create config.yaml locally with the following contents:

trainingInput:
  masterType: large_model
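You can also name a specific Compute Engine machine type as the masterType to pick the exact amount of memory you need. The following variant is a sketch and assumes n1-highmem-8 is supported for your runtime version and region (check the scale tier documentation):

trainingInput:
  masterType: n1-highmem-8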
Submit your job:
CONFIG=path/to/config.yaml

gcloud ai-platform jobs submit training $JOB_NAME \
  --job-dir $JOB_DIR \
  --package-path $TRAINER_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  --region us-central1 \
  --runtime-version=$RUNTIME_VERSION \
  --python-version=$PYTHON_VERSION \
  --scale-tier CUSTOM \
  --config $CONFIG
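Once the job is submitted, you can follow its status and logs with standard gcloud commands, for example:

gcloud ai-platform jobs describe $JOB_NAME
gcloud ai-platform jobs stream-logs $JOB_NAME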