Training using the built-in distributed XGBoost algorithm

Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in distributed XGBoost algorithm works, and how to use it.

Overview

The built-in distributed XGBoost algorithm is a wrapper for the XGBoost algorithm that is compatible with AI Platform Training.

Unlike the single replica built-in XGBoost algorithm, this algorithm lets you use multiple virtual machines in parallel to train on large datasets. This algorithm also lets you use GPUs for training, which can speed up the training process.

AI Platform Training runs training using the distributed XGBoost algorithm based on your dataset and the model parameters that you supply. The current implementation is based on XGBoost version 0.81.

Limitations

The following features are not supported for training with the built-in distributed XGBoost algorithm:

  • Preprocessing. This algorithm does not support automatic preprocessing. You must manually prepare training and validation data into separate groups of files that meet the requirements described in the following section about formatting input data.
  • Single replica training. This algorithm is designed to use multiple virtual machines for training. Use the single replica built-in XGBoost algorithm if you want to train using a single virtual machine.

Supported machine types

You can use any AI Platform Training scale tier or valid combination of machine types with this algorithm, as long as your configuration meets the following requirements:

  • Specify a master worker and at least one worker.
  • For best performance, specify the same machine type for the master worker and for workers.
  • Do not specify any parameter servers.
  • Make sure that the total memory of the virtual machines that you specify is at least 20% greater than the total file size of your training data. This allows virtual machines to load all the training data into memory and also use additional memory for training.
  • If you use GPUs, ensure that each virtual machine uses only a single GPU, and use the same type of GPU for the master worker and for workers. Make sure that the machine type that you have specified supports the GPU configuration.
  • Do not use any TPUs.
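
The memory requirement above is simple arithmetic, and a quick check before you submit a job can catch an undersized cluster. The following is a minimal sketch, not part of AI Platform Training: `has_enough_memory` is a hypothetical helper name, and the data size and per-machine memory values are assumptions that you replace with your own figures.

```shell
#!/bin/sh
# Hypothetical helper (not an AI Platform Training command): succeeds when
# the combined memory of all virtual machines is at least 20% greater than
# the total size of the training data.
# Arguments: data size in GB, memory per VM in GB, number of VMs.
has_enough_memory() {
  awk -v data="$1" -v mem="$2" -v vms="$3" \
    'BEGIN { exit !(mem * vms >= data * 1.2) }'
}

# Assumed example: 30 GB of training data on a master worker plus two
# workers with 15 GB of memory each (45 GB in total; 36 GB is required).
if has_enough_memory 30 15 3; then
  echo "memory OK"
else
  echo "add memory or workers"
fi
```

The check counts the master worker as one of the virtual machines, because it also loads a share of the training data.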

Format input data

The built-in distributed XGBoost algorithm works on numerical tabular data. Each row of a dataset represents one instance, and each column of a dataset represents a feature value. The target column represents the value you want to predict.

Prepare CSV files

Your input data must be one or more CSV files with UTF-8 encoding. Each file must meet the following requirements:

  • The CSV files must not have a header row. If your CSV files have header rows labeling each column, remove this first row from each file.
  • The target column must be the first column.
  • For classification training jobs, the target column can contain non-numerical values. All other columns must contain only numerical data.
  • For regression training jobs, normalize your target values so that each value is between 0 and 1. All other columns must contain only numerical data.
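
As an illustration of the first two requirements, the following sketch removes a header row and moves the target column to the front. It is only one possible approach: `prepare_csv` is a hypothetical helper, the sample data is made up, and the script assumes simple CSV content with no quoted fields that contain commas or newlines. It does not normalize regression targets.

```shell
#!/bin/sh
# prepare_csv INPUT OUTPUT TARGET_COLUMN
# Hypothetical helper: drops the header row and moves the 1-based
# TARGET_COLUMN to the first position.
# Assumes no quoted fields containing commas or newlines.
prepare_csv() {
  awk -F',' -v OFS=',' -v t="$3" '
    NR == 1 { next }                 # skip the header row
    {
      line = $t                      # the target value goes first
      for (i = 1; i <= NF; i++)
        if (i != t) line = line OFS $i
      print line
    }
  ' "$1" > "$2"
}

# Made-up example data whose target is in column 3.
workdir="$(mktemp -d)"
printf 'age,hours,label\n39,40,0\n50,13,1\n' > "$workdir/raw.csv"
prepare_csv "$workdir/raw.csv" "$workdir/train-0.csv" 3
cat "$workdir/train-0.csv"
```

For data with quoted fields, a CSV-aware tool is a safer choice than a plain comma split.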

Split data for distributed training

To provide data from multiple CSV files when you submit a training job, use wildcards in the Cloud Storage paths that you specify for the training_data_path and validation_data_path arguments. All the CSV files must use the same column schema, meeting the requirements described in the previous section.

The built-in distributed XGBoost algorithm distributes your training data across virtual machines in one of the following ways:

  • If the number of CSV files is greater than or equal to the number of virtual machines, then the algorithm distributes data by file in round robin order. In other words, the master worker loads the first CSV file, the first worker loads the second CSV file, and so on. This method of assigning files loops so that each virtual machine loads roughly the same number of files.

  • If the number of CSV files is less than the number of virtual machines, then the algorithm distributes data by instance in round robin order. In other words, the master worker loads the first row of each CSV file, the first worker loads the second row of each CSV file, and so on. This method of assigning instances loops so that each virtual machine loads roughly the same number of instances.

If you specify the validation_data_path argument, then the algorithm also loads validation data in one of these ways. The algorithm chooses the method for training data and for validation data independently. For example, if you provide many training data files but only one validation data file, the algorithm might load the training data by file and load the validation data by instance.
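
The following sketch mimics the rule described above so that you can preview how a set of files would be assigned. It is an illustration of the documented behavior, not the algorithm's actual code: `show_assignment` is a hypothetical helper, and replica index 0 stands for the master worker, 1 for the first worker, and so on.

```shell
#!/bin/sh
# show_assignment VM_COUNT FILE...
# Hypothetical helper: prints which distribution mode applies and, for
# by-file mode, the replica index that would load each file.
show_assignment() {
  vms="$1"; shift
  if [ "$#" -ge "$vms" ]; then
    echo "mode: by file"
    i=0
    for f in "$@"; do
      echo "replica $((i % vms)): $f"   # round robin over the replicas
      i=$((i + 1))
    done
  else
    # Fewer files than VMs: rows, not files, are dealt out round robin.
    echo "mode: by instance"
  fi
}

# Four training files on three VMs are distributed by file.
show_assignment 3 train-0.csv train-1.csv train-2.csv train-3.csv
# A single validation file on three VMs is distributed by instance.
show_assignment 3 eval-0.csv
```

In the first call, the fourth file wraps around to replica 0, so the master worker loads two files while each worker loads one.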

For the best performance, split training data into multiple CSV files that meet the following guidelines:

  • Each file is less than 1 GB in size.
  • Each file contains roughly the same number of instances.
  • The number of files is divisible by the total number of virtual machines. For example, if you train with a master and two workers, the number of files is a multiple of 3.
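
One way to follow the last two guidelines is to deal rows out round robin into a file count that is a multiple of the number of virtual machines. The sketch below assumes a header-free input file and uses a hypothetical helper, `split_csv`, with made-up data. It does not check the 1 GB size guideline; increase the file count if the resulting files are too large.

```shell
#!/bin/sh
# split_csv INPUT OUTPUT_DIR FILE_COUNT
# Hypothetical helper: writes rows round robin into
# OUTPUT_DIR/train-0.csv ... train-(FILE_COUNT-1).csv, so the files differ
# by at most one row. INPUT must not contain a header row.
split_csv() {
  awk -v n="$3" -v dir="$2" '
    { print > (dir "/train-" ((NR - 1) % n) ".csv") }
  ' "$1"
}

vm_count=3                                 # master worker plus two workers
files_per_vm=2                             # assumed value; adjust for file size
file_count=$((vm_count * files_per_vm))    # a multiple of the VM count

# Made-up data: 14 rows with the target in the first column.
workdir="$(mktemp -d)"
awk 'BEGIN { for (i = 1; i <= 14; i++) print i % 2 "," i }' > "$workdir/all.csv"

split_csv "$workdir/all.csv" "$workdir" "$file_count"
wc -l "$workdir"/train-*.csv
```

After splitting, upload the files to Cloud Storage and reference them with a wildcard path such as train-*.csv.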

Check Cloud Storage bucket permissions

To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.

Using GPUs

As described in the previous section about machine types, the built-in distributed XGBoost algorithm supports using a single GPU per virtual machine for training.

To take advantage of GPUs, set the tree_method hyperparameter to gpu_exact or gpu_hist when you submit your training job.

Learn more about XGBoost's support for GPUs.

Submit a distributed XGBoost training job

This section explains how to submit a built-in distributed XGBoost training job. Use the Google Cloud console or the Google Cloud CLI to submit your job.

You can find comprehensive descriptions of hyperparameters and other arguments that you can adjust for this algorithm in the reference for the built-in distributed XGBoost algorithm.

The following example assumes you are training a classifier on Census data that you have split into three training data files and three validation data files:

  • train-0.csv
  • train-1.csv
  • train-2.csv
  • eval-0.csv
  • eval-1.csv
  • eval-2.csv

Assume that none of these files have header rows and that you have uploaded them to Cloud Storage. The example creates a training job that uses three virtual machines, each of which uses an NVIDIA Tesla P100 GPU. The job runs in the us-central1 region.

Google Cloud console

  1. Go to the AI Platform Training Jobs page in the Google Cloud console:

    AI Platform Training Jobs page

  2. Click the New training job button. From the options that display below, click Built-in algorithm training.

  3. On the Create a new training job page, open the Select an algorithm drop-down list and select Distributed XGBoost. Click Next.

  4. In the Training data section, select Use multiple files stored in one Cloud Storage directory from the drop-down list. Use the Directory path field to select the Cloud Storage directory that contains your training files. In the Wildcard name field, enter train-*.csv.

  5. In the Validation data (Optional) section, select Use multiple files stored in one Cloud Storage directory from the drop-down list. Use the Directory path field to select the Cloud Storage directory that contains your validation files. In the Wildcard name field, enter eval-*.csv.

  6. In the Training output section, use the Output directory field to select a separate directory in your Cloud Storage bucket to store training output. Click Next.

  7. Customize the Algorithm arguments for your training job or keep the default values. To learn more about the arguments, follow the links in the Google Cloud console and refer to the built-in distributed XGBoost reference. Click Next.

  8. Enter a name of your choice in the Job ID field. From the Region drop-down list, select us-central1.

    In the Scale tier drop-down list, select CUSTOM. In the Custom cluster configuration section, select standard_p100 in the Master type and the Worker type drop-down lists. In the Worker count field, enter 2. Click Done.

  9. On the Jobs page, click the ID of your new job to see its Job Details page. Then click View Logs to see training logs.

gcloud tool

  1. Set environment variables for your job, replacing BUCKET with the name of your Cloud Storage bucket and DATA_DIRECTORY with the path to the directory in your bucket that contains your data:

    # Specify the Docker container for your built-in algorithm selection.
    IMAGE_URI='gcr.io/cloud-ml-algos/xgboost_dist:latest'
    
    # Specify the Cloud Storage wildcard paths to your training and validation data.
    TRAINING_DATA='gs://BUCKET/DATA_DIRECTORY/train-*.csv'
    VALIDATION_DATA='gs://BUCKET/DATA_DIRECTORY/eval-*.csv'
    
    # Variables for constructing descriptive names for JOB_ID and JOB_DIR
    DATASET_NAME='census'
    ALGORITHM='xgboost_dist'
    MODEL_TYPE='classification'
    DATE="$(date '+%Y%m%d_%H%M%S')"
    MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_${MODEL_TYPE}"
    
    # Specify an ID for this job
    JOB_ID="${MODEL_NAME}_${DATE}"
    
    # Specify the directory where you want your training outputs to be stored
    JOB_DIR="gs://BUCKET/algorithm_training/${MODEL_NAME}/${DATE}"
    
  2. Submit the training job using the gcloud ai-platform jobs submit training command:

    gcloud ai-platform jobs submit training $JOB_ID \
      --region=us-central1 \
      --master-image-uri=$IMAGE_URI \
      --job-dir=$JOB_DIR \
      --scale-tier=CUSTOM \
      --master-machine-type=n1-standard-4 \
      --master-accelerator count=1,type=nvidia-tesla-p100 \
      --worker-machine-type=n1-standard-4 \
      --worker-count=2 \
      --worker-accelerator count=1,type=nvidia-tesla-p100 \
      -- \
      --training_data_path=$TRAINING_DATA \
      --validation_data_path=$VALIDATION_DATA \
      --objective=binary:logistic \
      --tree_method=gpu_hist
    
  3. Monitor the status of your training job by viewing logs with gcloud. Refer to gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs.

    gcloud ai-platform jobs describe ${JOB_ID}
    gcloud ai-platform jobs stream-logs ${JOB_ID}
    

What's next