This page provides detailed reference information about arguments you submit to AI Platform Training when running a training job using the built-in XGBoost algorithm.
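The arguments described on this page are passed to the algorithm's container image when you submit a training job. As a rough sketch of where they fit, the following example submits a job with the google-api-python-client library; the project ID, job name, Cloud Storage paths, and argument values are placeholders, not recommendations.

```python
from googleapiclient import discovery

# Placeholders: substitute your own project, bucket, and job name.
project_id = 'my-project'
job_spec = {
    'jobId': 'xgboost_builtin_example',
    'trainingInput': {
        'scaleTier': 'BASIC',
        # Single-replica built-in XGBoost image described on this page.
        'masterConfig': {'imageUri': 'gcr.io/cloud-ml-algos/boosted_trees:latest'},
        'region': 'us-central1',
        # jobDir corresponds to the required job-dir argument described below.
        'jobDir': 'gs://my-bucket/xgboost-job',
        # Algorithm arguments (see the tables below) are passed as flags.
        'args': [
            '--preprocess',
            '--objective=binary:logistic',
            '--training_data_path=gs://my-bucket/data/train.csv',
        ],
    },
}

ml = discovery.build('ml', 'v1')
request = ml.projects().jobs().create(
    parent='projects/{}'.format(project_id), body=job_spec)
response = request.execute()
print(response)
```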
Versioning
Both the standard (single-replica) version and the distributed version of the built-in XGBoost algorithm use XGBoost 0.81.
Data format arguments
The standard (single-replica) version of the algorithm and the distributed version of the algorithm accept different arguments for describing the format of your data and defining the high-level functionality of the algorithm.
The following tables describe the arguments accepted by each version of the algorithm:
Single-replica
This is the version of the algorithm available at gcr.io/cloud-ml-algos/boosted_trees:latest. Learn more about using the single-replica version of the algorithm.
The following arguments are used for data formatting and automatic preprocessing:
Arguments | Details |
---|---|
preprocess | Specify this flag to enable automatic preprocessing of your data. Default: unset. Type: Boolean flag; if set to true, enables automatic preprocessing. |
training_data_path | Cloud Storage path to a CSV file containing your training data. Required. Type: String. |
validation_data_path | Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path. Optional. Type: String. |
test_data_path | Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path and validation_data_path. Optional. Type: String. |
job-dir | Cloud Storage path where the model, checkpoints, and other training artifacts are saved. Required. Type: String. |
PREPROCESSING PARAMETERS (apply only when preprocess is set) | |
validation_split | Fraction of the training data to use as validation data. Specify this only if you are not specifying validation_data_path. Default: 0.20. Type: Float. Note: validation_split + test_split must be <= 0.40. |
test_split | Fraction of the training data to use as test data. Specify this only if you are not specifying test_data_path. Default: 0.20. Type: Float. Note: validation_split + test_split must be <= 0.40. |
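For example, a plausible combination of these arguments for the single-replica version, with automatic preprocessing and automatic validation/test splits, might look like the following sketch; the Cloud Storage paths are placeholders. These strings are passed to the job as training arguments (for example, in the args field of the training input shown earlier).

```python
# Placeholder paths; validation_split and test_split are set because
# preprocess is enabled and no explicit validation/test files are given.
single_replica_args = [
    '--preprocess',
    '--training_data_path=gs://my-bucket/data/train.csv',
    '--validation_split=0.2',
    '--test_split=0.1',   # 0.2 + 0.1 stays within the <= 0.40 limit
    '--job-dir=gs://my-bucket/xgboost-single/',
]
```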
Distributed
This is the version of the algorithm available at gcr.io/cloud-ml-algos/xgboost_dist:latest. Learn more about using the distributed version of the algorithm.
The following arguments are used for data formatting and determining several other aspects of how the algorithm runs:
Arguments | Details |
---|---|
training_data_path | Cloud Storage path to one or more CSV files containing your training data. To specify multiple files, use wildcards in this string. Required. Type: String. |
validation_data_path | Cloud Storage path to one or more CSV files for validation. To specify multiple files, use wildcards in this string. The CSV files must have the same format as training_data_path. Optional. Type: String. |
job-dir | Cloud Storage path where the trained model and other training artifacts are created. Required. Type: String. |
model_saving_period | How often the algorithm saves an intermediate model, measured in training iterations. For example, if this argument is set to 3, the algorithm saves an intermediate model to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, the algorithm does not save intermediate models. Default: 1. Optional. Type: Integer. |
eval_log_period | How often the algorithm logs metrics calculated against the validation data, measured in training iterations. For example, if this argument is set to 3, the algorithm logs evaluation metrics to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, the algorithm does not save evaluation metrics. Default: 0. Optional. Type: Integer. |
silent | Whether the algorithm suppresses debugging logs during training. Prints logs if this field is set to 0 and suppresses logs if it is set to 1. Default: 0. Optional. Options: {0, 1}. |
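As with the single-replica version, these flags are passed as training arguments. A sketch for the distributed version, with placeholder paths and illustrative values, might look like this:

```python
# Wildcards in the data paths select multiple CSV shards.
distributed_args = [
    '--training_data_path=gs://my-bucket/data/train-*.csv',
    '--validation_data_path=gs://my-bucket/data/eval-*.csv',
    '--job-dir=gs://my-bucket/xgboost-dist/',
    '--model_saving_period=5',  # save an intermediate model every 5 iterations
    '--eval_log_period=5',      # log validation metrics every 5 iterations
    '--silent=0',               # keep training logs
]
```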
Hyperparameters
The single-replica and distributed versions of the built-in XGBoost algorithm both accept the following hyperparameters:
Hyperparameter | Details |
---|---|
BASIC PARAMETERS | |
objective | Specify the learning task and the corresponding learning objective. For detailed information, refer to the 'objective' section of the XGBoost learning task parameters. Required. Type: String. Options: one of {reg:linear, reg:logistic, binary:logistic, binary:logitraw, count:poisson, survival:cox, multi:softmax, multi:softprob, reg:gamma, reg:tweedie}. |
eval_metric | Evaluation metrics to be used for validation data. To specify multiple metrics, use a comma-separated string of values. A default is assigned based on the objective: rmse for regression, error for classification, and mean average precision for ranking. Default: depends on the objective. Type: String. Options: the full list of possible values is in the 'eval_metric' section of the XGBoost learning task parameters. |
booster | The type of booster to use: gbtree, gblinear, or dart. gbtree and dart use tree-based models, while gblinear uses linear functions. Default: gbtree. Type: String. Options: one of {gbtree, gblinear, dart}. |
num_boost_round | Number of boosting iterations. Default: 10. Type: Integer. Options: [1, ∞). |
max_depth | Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. Note that a limit is required when grow_policy is set to depthwise. Default: 6. Type: Integer. Options: [0, ∞). |
eta | Step size shrinkage used in updates to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks the feature weights to make the boosting process more conservative. Default: 0.3. Type: Float. Options: [0, 1]. |
csv_weight | When this flag is enabled, XGBoost differentiates the importance of instances for CSV input by taking the second column (the column after labels) in the training data as the instance weights. Only set this when input_type='csv'. Default: 0. Type: Integer. Options: {0, 1}. |
base_score | The initial prediction score of all instances (global bias). For a sufficient number of iterations, changing this value does not have much effect. Default: 0.5. Type: Float. |
TREE BOOSTER PARAMETERS | |
gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. Default: 0. Type: Float. Options: [0, ∞). |
min_child_weight | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be. Default: 0. Type: Float. Options: [0, ∞). |
max_delta_step | Maximum delta step allowed for each leaf output. A value of 0 means there is no constraint. A positive value can help make the update step more conservative. This parameter is usually not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting the value within the range 1-10 might help control the update. Default: 0. Type: Integer. Options: [0, ∞). |
subsample | Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which can prevent overfitting. Subsampling occurs once in every boosting iteration. Default: 1. Type: Float. Options: (0, 1]. |
colsample_bytree | Subsample ratio of columns when constructing each tree. Subsampling occurs once in every boosting iteration. Default: 1. Type: Float. Options: (0, 1]. |
colsample_bylevel | Subsample ratio of columns for each split, in each level. Subsampling occurs each time a new split is made. This parameter has no effect when tree_method is set to hist. Default: 1. Type: Float. Options: (0, 1]. |
lambda | L2 regularization term on weights. Increasing this value makes the model more conservative. Default: 1. Type: Float. |
alpha | L1 regularization term on weights. Increasing this value makes the model more conservative. Default: 0. Type: Float. |
tree_method | The tree construction algorithm used in XGBoost. For detailed information, refer to 'tree_method' in the XGBoost parameters for tree boosters. Default: auto. Type: String. Options: one of {auto, exact, approx, hist}. Additional options only for the distributed version of the algorithm: one of {gpu_exact, gpu_hist}. |
sketch_eps | This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this provides a theoretical guarantee of sketch accuracy. You usually do not have to tune this, but you may consider setting it to a lower number for a more accurate enumeration of split candidates. Default: 0.03. Type: Float. Options: (0, 1). Note: only specify this if tree_method='approx'. |
scale_pos_weight | Controls the balance of positive and negative weights; useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). Default: 1. Type: Float. |
updater | A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters; however, you can also set it explicitly. For detailed information, refer to 'updater' in the XGBoost parameters for tree boosters. Default: grow_colmaker,prune. Type: String. Options: comma-separated values from {grow_colmaker, distcol, grow_histmaker, grow_local_histmaker, grow_skmaker, sync, refresh, prune}. |
refresh_leaf | When this flag is 1, tree leaves as well as tree node stats are updated. When it is 0, only node stats are updated. Default: 0. Type: Integer. Options: {0, 1}. |
process_type | The type of boosting process to run. For detailed information, refer to 'process_type' in the XGBoost parameters for tree boosters. Default: default. Type: String. Options: {default, update}. |
grow_policy | Controls the way new nodes are added to the tree. Default: depthwise. Type: String. Options: one of {depthwise, lossguide}. |
max_leaves | Maximum number of nodes to be added. Default: 0. Type: Integer. Note: only specify this if grow_policy='lossguide'. |
max_bin | Maximum number of discrete bins used to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time. Default: 256. Type: Integer. Note: only specify this if tree_method='hist'. |
DART BOOSTER PARAMETERS (arguments for booster='dart') | |
sample_type | Type of sampling algorithm. Default: uniform. Type: String. Options: one of {uniform, weighted}. |
normalize_type | Type of normalization algorithm. Default: tree. Type: String. Options: one of {tree, forest}. |
rate_drop | Dropout rate (the fraction of previous trees to drop during the dropout). Default: 0. Type: Float. Options: [0, 1]. |
one_drop | When this flag is enabled, at least one tree is always dropped during the dropout (allows the Binomial-plus-one or epsilon-dropout technique from the original DART paper). Default: 0. Type: Integer. Options: {0, 1}. |
skip_drop | Probability of skipping the dropout procedure during a boosting iteration. If a dropout is skipped, new trees are added in the same manner as gbtree. Note that a non-zero skip_drop has higher priority than rate_drop or one_drop. Default: 0. Type: Float. Options: [0, 1]. |
TWEEDIE REGRESSION (arguments for objective='reg:tweedie') | |
tweedie_variance_power | Parameter that controls the variance of the Tweedie distribution: var(y) ~ E(y)^tweedie_variance_power. Default: 1.5. Type: Float. Options: (1, 2). |
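Hyperparameters are passed to the job the same way as the data format arguments. As an illustration only (the values below are not tuning recommendations), a binary classification job might add flags like these:

```python
# Illustrative hyperparameter flags for a binary classification objective.
hyperparameter_args = [
    '--objective=binary:logistic',
    '--eval_metric=auc,logloss',   # comma-separated list of metrics
    '--booster=gbtree',
    '--num_boost_round=100',
    '--max_depth=6',
    '--eta=0.1',
    '--subsample=0.8',
    '--colsample_bytree=0.8',
]
```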
Hyperparameter tuning
Hyperparameter tuning tests different hyperparameter configurations when training your model, and finds the hyperparameter values that are optimal for the goal metric you choose. For each tunable argument, you can specify a range of values to restrict and focus the combinations that AI Platform Training tries.
Learn more about hyperparameter tuning on AI Platform Training.
Goal metrics
The following metrics can be optimized:
Objective metric | Direction | Details |
---|---|---|
rmse | MINIMIZE | Root mean square error |
logloss | MINIMIZE | Negative log-likelihood |
error | MINIMIZE | Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances. |
merror | MINIMIZE | Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases). |
mlogloss | MINIMIZE | Multiclass logloss |
auc | MAXIMIZE | Area under the curve for ranking evaluation |
map | MAXIMIZE | Mean average precision |
Tunable hyperparameters
When training with the built-in XGBoost algorithm (single-replica or distributed), you can tune the following hyperparameters. Start by tuning parameters with "high tunable value". These have the greatest impact on your goal metric.
Hyperparameters | Type | Valid values |
---|---|---|
PARAMETERS WITH HIGH TUNABLE VALUE (greatest impact on the goal metric) | | |
eta | DOUBLE | [0, 1] |
max_depth | INTEGER | [0, ∞) |
num_boost_round | INTEGER | [1, ∞) |
min_child_weight | DOUBLE | [1, ∞) |
lambda | DOUBLE | (-∞, ∞) |
alpha | DOUBLE | (-∞, ∞) |
OTHER PARAMETERS | | |
gamma | DOUBLE | [0, ∞) |
max_delta_step | DOUBLE | [0, ∞) |
subsample | DOUBLE | [0, 1] |
colsample_bytree | DOUBLE | [0, 1] |
colsample_bylevel | DOUBLE | [0, 1] |
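To tune these hyperparameters, you add a hyperparameter tuning spec to the job's training input. The following is a minimal sketch using the field names of the AI Platform Training API's hyperparameter spec; the goal metric, ranges, and trial counts are placeholders chosen only to match the tables above, not recommendations.

```python
# Sketch of a hyperparameter tuning spec; values are placeholders.
tuning_spec = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'auc',   # one of the goal metrics listed above
    'maxTrials': 20,
    'maxParallelTrials': 2,
    'params': [
        {'parameterName': 'eta', 'type': 'DOUBLE',
         'minValue': 0.01, 'maxValue': 0.3, 'scaleType': 'UNIT_LOG_SCALE'},
        {'parameterName': 'max_depth', 'type': 'INTEGER',
         'minValue': 3, 'maxValue': 10},
        {'parameterName': 'num_boost_round', 'type': 'INTEGER',
         'minValue': 10, 'maxValue': 200},
    ],
}

# Attached to the job spec from the earlier example, this would look like:
# job_spec['trainingInput']['hyperparameters'] = tuning_spec
```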