Reference for built-in XGBoost algorithm

This page provides detailed reference information about arguments you submit to AI Platform Training when running a training job using the built-in XGBoost algorithm.

Versioning

Both the standard (single-replica) version and the distributed version of the built-in XGBoost algorithm use XGBoost 0.81.

Data format arguments

The standard (single-replica) and distributed versions of the algorithm accept different arguments for describing the format of your data and defining the high-level functionality of the algorithm.

The following tables describe the arguments accepted by each version of the algorithm:

Single-replica

This is the version of the algorithm available at gcr.io/cloud-ml-algos/boosted_trees:latest. Learn more about using the single-replica version of the algorithm.

The following arguments are used for data formatting and automatic preprocessing:

Arguments Details
preprocess Specify this to enable automatic preprocessing.
Types of automatic preprocessing:
  • Splits the data into training, validation, and test sets according to the validation_split and test_split percentages.
  • Fills in missing values. For numerical columns, the mean is substituted for all missing values. For categorical columns, one-hot encoding or hashing is used to fill in the missing values.
  • Removes rows that have more than 10% of their column values missing.

Default: Unset
Type: Boolean flag. If set to true, enables automatic preprocessing.
training_data_path Cloud Storage path to a CSV file. The CSV file must have the following specifications:
  • Must not contain a header
  • Must only contain categorical or numerical columns
  • First column must be the target column
  • Blank values are treated as missing

Required
Type: String
validation_data_path Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path.

Optional
Type: String
test_data_path Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path and validation_data_path.

Optional
Type: String
job-dir Cloud Storage path where the model, checkpoints, and other training artifacts are written. The following directories are created here:
  • model: This contains the trained model
  • processed_data: This contains the three data files for training, validation, and testing if automatic preprocessing was enabled.
  • artifacts: This contains preprocessing-related artifacts that help you perform client-side preprocessing.

Required
Type: String
PREPROCESSING PARAMETERS
(when preprocess is set)
validation_split Fraction of training data that should be used as validation data.

Default: 0.20
Type: Float
Note: validation_split + test_split <= 0.40
Only specify this if you are not specifying validation_data_path.
test_split Fraction of training data that should be used as test data.

Default: 0.20
Type: Float
Note: validation_split + test_split <= 0.40
Only specify this if you are not specifying test_data_path.
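
The following sketch shows one way these arguments might be passed when submitting a training job with the gcloud CLI; everything after the -- separator is forwarded to the algorithm. The job name, region, scale tier, and Cloud Storage paths are illustrative placeholders, not values from this page:

# Sketch of a single-replica job submission; the job name, region, scale tier,
# and Cloud Storage paths below are illustrative placeholders.
JOB_ID="xgboost_example_$(date +%Y%m%d_%H%M%S)"
gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=gcr.io/cloud-ml-algos/boosted_trees:latest \
  --scale-tier=BASIC \
  --region=us-central1 \
  --job-dir=gs://your-bucket/xgboost_job_dir \
  -- \
  --preprocess \
  --training_data_path=gs://your-bucket/data/train.csv \
  --validation_split=0.2 \
  --test_split=0.2 \
  --objective=binary:logistic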

Distributed

This is the version of the algorithm available at gcr.io/cloud-ml-algos/xgboost_dist:latest. Learn more about using the distributed version of the algorithm.

The following arguments are used for data formatting and determining several other aspects of how the algorithm runs:

Arguments Details
training_data_path Cloud Storage path to one or more CSV files for training. To specify multiple files, use wildcards in this string. The CSV files must have the following specifications:
  • First column must be the target column
  • All other columns must contain only numerical data

Required
Type: String
validation_data_path Cloud Storage path to one or more CSV files for validation. To specify multiple files, use wildcards in this string. The CSV files must have the same format as the files specified by training_data_path.

Optional
Type: String
job-dir Cloud Storage path where the trained model and other training artifacts get created. The following directories are created here:
  • model: This contains the trained model
  • intermediate: This contains intermediate models. See model_saving_period in a following row of this table.
  • evaluation: This contains intermediate metrics calculated against validation data. See eval_log_period in a following row of this table.

Required
Type: String
model_saving_period How often the algorithm saves an intermediate model, measured in iterations of training. For example, if this argument is set to 3, then the algorithm saves an intermediate model to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, then the algorithm does not save intermediate models.

Default: 1
Optional
Type: Integer
eval_log_period How often the algorithm logs metrics calculated against the validation data, measured in iterations of training. For example, if this argument is set to 3, then the algorithm logs evaluation metrics to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, then the algorithm does not save evaluation metrics.

Default: 0
Optional
Type: Integer
silent Whether the algorithm suppresses logs for debugging during the training process. Prints logs if this field is set to 0 and suppresses logs if it is set to 1.

Default: 0
Optional
Options: {0, 1}
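
For illustration, a distributed job can be submitted in the same way, adding a multi-machine configuration. The machine types, worker count, region, and Cloud Storage paths below are placeholder assumptions, not recommendations from this page:

# Sketch of a distributed job submission; machine types, worker count, region,
# and Cloud Storage paths are illustrative placeholders.
JOB_ID="xgboost_dist_example_$(date +%Y%m%d_%H%M%S)"
gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=gcr.io/cloud-ml-algos/xgboost_dist:latest \
  --scale-tier=CUSTOM \
  --master-machine-type=n1-standard-8 \
  --worker-machine-type=n1-standard-8 \
  --worker-count=2 \
  --region=us-central1 \
  --job-dir=gs://your-bucket/xgboost_dist_job_dir \
  -- \
  --training_data_path='gs://your-bucket/data/train-*.csv' \
  --validation_data_path='gs://your-bucket/data/val-*.csv' \
  --model_saving_period=3 \
  --eval_log_period=3 \
  --objective=reg:linear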

Hyperparameters

The single-replica and distributed versions of the built-in XGBoost algorithm both accept the following hyperparameters:

Hyperparameter Details
BASIC PARAMETERS
objective Specify the learning task and the corresponding learning objective. For detailed information, refer to the 'objective' section of the XGBoost learning task parameters.

Required
Type: String
Options: one of {reg:linear, reg:logistic, binary:logistic, binary:logitraw, count:poisson, survival:cox, multi:softmax, multi:softprob, reg:gamma, reg:tweedie}
eval_metric Evaluation metrics to be used for validation data, specified as a comma-separated string when there are multiple values. A default is assigned based on the objective (rmse for regression, error for classification, and mean average precision for ranking).

Default: Default according to objective
Type: String
Options: The full list of possible values is in the 'eval_metric' section of the XGBoost learning task parameters.
booster The type of booster to use: gbtree, gblinear, or dart. gbtree and dart use tree-based models, while gblinear uses linear functions.

Default: gbtree
Type: String
Options: one of {gbtree, gblinear, dart}
num_boost_round Number of boosting iterations

Default: 10
Type: Integer
Options: [1, ∞)
max_depth Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit. Note that a limit is required when grow_policy is set to depthwise.

Default: 6
Type: Integer
Options: [0, ∞)
eta Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.

Default: 0.3
Type: Float
Options: [0, 1]
csv_weight When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights.
This should only be set when input_type='csv'

Default: 0
Type: Integer
Options: {0, 1}
base_score The initial prediction score of all instances (global bias). For a sufficient number of iterations, changing this value will not have much effect.

Default: 0.5
Type: Float
TREE BOOSTER PARAMETERS
gamma Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.

Default: 0
Type: Float
Options: [0, ∞)
min_child_weight Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be.

Default: 0
Type: Float
Options: [0, ∞)
max_delta_step Maximum delta step you allow each leaf output to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when classes are extremely imbalanced. Setting the value within the range 1-10 might help control the update.

Default: 0
Type: Integer
Options: [0, ∞)
subsample Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. This can prevent overfitting. Subsampling will occur once in every boosting iteration.

Default: 1
Type: Float
Options: (0,1]
colsample_bytree Subsample ratio of columns when constructing each tree. Subsampling will occur once in every boosting iteration.

Default: 1
Type: Float
Options: (0,1]
colsample_bylevel Subsample ratio of columns for each split, in each level. Subsampling will occur each time a new split is made. This parameter has no effect when tree_method is set to hist.

Default: 1
Type: Float
Options: (0,1]
lambda L2 regularization term on weights. Increasing this value will make the model more conservative.

Default: 1
Type: Float
alpha L1 regularization term on weights. Increasing this value will make the model more conservative.

Default: 0
Type: Float
tree_method The tree construction algorithm used in XGBoost. For detailed information, please refer to 'tree_method' in parameters for tree booster.

Default: auto
Type: String
Options: one of {auto, exact, approx, hist}
Additional options only for the distributed version of the XGBoost algorithm: one of {gpu_exact, gpu_hist}
sketch_eps This roughly translates into O(1 / sketch_eps) number of bins. Compared to directly selecting the number of bins, this provides a theoretical guarantee of sketch accuracy. Usually you do not have to tune this, but you may consider setting this to a lower number for a more accurate enumeration of split candidates.

Default: 0.03
Type: Float
Options: (0,1)
Note: Only specify this if tree_method='approx'
scale_pos_weight Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances)

Default: 1
Type: Float
updater A comma separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, you can also set this explicitly. For detailed information, please refer to 'updater' in parameters for tree booster.

Default: grow_colmaker,prune
Type: String
Options: comma separated: {grow_colmaker, distcol, grow_histmaker, grow_local_histmaker, grow_skmaker, sync, refresh, prune}
refresh_leaf When this flag is 1, tree leaves as well as tree nodes' stats are updated. When it is 0, only node stats are updated.

Default: 0
Type: Integer
Options: {0, 1}
process_type A type of boosting process to run. For detailed information, please refer to 'process_type' in parameters for tree booster.

Default: default
Type: String
Options: {default, update}
grow_policy Controls the way new nodes are added to the tree.
  • depthwise: split at nodes closest to the root.
  • lossguide: split at nodes with highest loss change.

Default: depthwise
Type: String
Options: one of {depthwise, lossguide}
max_leaves Maximum number of nodes to be added.

Default: 0
Type: Integer
Note: Only specify this if grow_policy='lossguide'
max_bin Maximum number of discrete bins to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time.

Default: 256
Type: Integer
Note: Only specify this if tree_method='hist'
DART BOOSTER PARAMETERS
(arguments for booster='dart')
sample_type Type of sampling algorithm.
  • uniform: dropped trees are selected uniformly.
  • weighted: dropped trees are selected in proportion to weight.

Default: uniform
Type: String
Options: one of {uniform, weighted}
normalize_type Type of normalization algorithm.
  • tree: new trees have the same weight as each of the dropped trees.
    • Weight of new trees is 1 / (k + learning_rate).
    • Dropped trees are scaled by a factor of k / (k + learning_rate).
  • forest: new trees have the same weight as the sum of the dropped trees (forest).
    • Weight of new trees is 1 / (1 + learning_rate).
    • Dropped trees are scaled by a factor of 1 / (1 + learning_rate).

Default: tree
Type: String
Options: one of {tree, forest}
rate_drop Dropout rate (a fraction of previous trees to drop during the dropout).

Default: 0
Type: Float
Options: [0, 1]
one_drop When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).

Default: 0
Type: Integer
Options: {0, 1}
skip_drop Probability of skipping the dropout procedure during a boosting iteration. If a dropout is skipped, new trees are added in the same manner as gbtree. Note that non-zero skip_drop has higher priority than rate_drop or one_drop.

Default: 0
Type: Float
Options: [0, 1]
TWEEDIE REGRESSION
(arguments for objective='reg:tweedie')
tweedie_variance_power Parameter that controls the variance of the Tweedie distribution:
var(y) ~ E(y)^tweedie_variance_power
  • Set closer to 2 to shift towards a gamma distribution.
  • Set closer to 1 to shift towards a Poisson distribution.

Default: 1.5
Type: Float
Options: (1, 2)
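
Hyperparameters are passed to the training job as additional algorithm arguments, alongside the data format arguments described earlier. A brief sketch, with values chosen only for illustration:

# Hyperparameters are appended after the -- separator like any other algorithm
# argument; the specific values here are illustrative placeholders.
JOB_ID="xgboost_hparams_example_$(date +%Y%m%d_%H%M%S)"
gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=gcr.io/cloud-ml-algos/boosted_trees:latest \
  --scale-tier=BASIC \
  --region=us-central1 \
  --job-dir=gs://your-bucket/xgboost_job_dir \
  -- \
  --training_data_path=gs://your-bucket/data/train.csv \
  --objective=binary:logistic \
  --eval_metric=auc,logloss \
  --booster=gbtree \
  --num_boost_round=100 \
  --max_depth=6 \
  --eta=0.1 \
  --subsample=0.8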

Hyperparameter tuning

Hyperparameter tuning tests different hyperparameter configurations when training your model and finds hyperparameter values that are optimal for the goal metric you choose. For each tunable argument, you can specify a range of values to restrict the possibilities that AI Platform Training tries.

Learn more about hyperparameter tuning on AI Platform Training.

Goal metrics

The following metrics can be optimized:

Objective Metric Direction Details
rmse MINIMIZE Root mean square error
logloss MINIMIZE Negative log-likelihood
error MINIMIZE Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
merror MINIMIZE Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
mlogloss MINIMIZE Multiclass logloss
auc MAXIMIZE Area under the curve for ranking evaluation.
map MAXIMIZE Mean average precision

Tunable hyperparameters

When training with the built-in XGBoost algorithm (single-replica or distributed), you can tune the following hyperparameters. Start by tuning parameters with "high tunable value". These have the greatest impact on your goal metric.

Hyperparameters Type Valid values
PARAMETERS WITH HIGH TUNABLE VALUE
(greatest impact on goal metric)
eta DOUBLE [0, 1]
max_depth INTEGER [0, ∞)
num_boost_round INTEGER [1, ∞)
min_child_weight DOUBLE [1, ∞)
lambda DOUBLE (-∞, ∞)
alpha DOUBLE (-∞, ∞)
OTHER PARAMETERS
gamma DOUBLE [0, ∞)
max_delta_step DOUBLE [0, ∞)
subsample DOUBLE [0, 1]
colsample_bytree DOUBLE [0, 1]
colsample_bylevel DOUBLE [0, 1]
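
As a sketch only, the following shows how a tuning job might be configured: the hyperparameter spec (goal, metric tag, trial counts, and parameter ranges) is supplied in a config file passed to gcloud with --config, using a goal metric and tunable hyperparameters from the tables above. The specific ranges and trial counts are illustrative placeholders, not recommendations:

# Write an illustrative hyperparameter spec; the metric, ranges, and trial
# counts are placeholders chosen from the tables above, not recommendations.
cat > hptuning_config.yaml <<'EOF'
trainingInput:
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: rmse
    maxTrials: 20
    maxParallelTrials: 2
    params:
      - parameterName: eta
        type: DOUBLE
        minValue: 0.01
        maxValue: 0.5
      - parameterName: max_depth
        type: INTEGER
        minValue: 3
        maxValue: 10
      - parameterName: num_boost_round
        type: INTEGER
        minValue: 10
        maxValue: 200
EOF

# Submit the tuning job with the config file; job name, region, and Cloud
# Storage paths are illustrative placeholders.
JOB_ID="xgboost_hptuning_example_$(date +%Y%m%d_%H%M%S)"
gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=gcr.io/cloud-ml-algos/boosted_trees:latest \
  --config=hptuning_config.yaml \
  --scale-tier=BASIC \
  --region=us-central1 \
  --job-dir=gs://your-bucket/xgboost_hptuning_dir \
  -- \
  --training_data_path=gs://your-bucket/data/train.csv \
  --objective=reg:linear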