This page provides detailed reference information about the arguments you submit to AI Platform Training when running a training job that uses the built-in XGBoost algorithm.
Versioning
The standard (single-replica) version of the built-in XGBoost algorithm uses XGBoost 0.80.
The distributed version of the algorithm uses XGBoost 0.81.
Data format arguments
The standard (single-replica) version of the algorithm and the distributed version accept different arguments for describing the format of your data and defining the high-level behavior of the algorithm.
The following tables describe the arguments accepted by each version of the algorithm:
Single-replica
This is the version of the algorithm available at gcr.io/cloud-ml-algos/boosted_trees:latest. Learn more about using the single-replica version of the algorithm.
The following arguments are used for data formatting and automatic preprocessing:
preprocess
    Specify this to enable automatic preprocessing. Types of automatic preprocessing:
    Default: unset. Type: Boolean flag; if set to true, enables automatic preprocessing.

training_data_path
    Cloud Storage path to a CSV file. The CSV file must have the following specifications:
    Required. Type: String.

validation_data_path
    Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path.
    Optional. Type: String.

test_data_path
    Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path and validation_data_path.
    Optional. Type: String.

job-dir
    Cloud Storage path where the model, checkpoints, and other training artifacts reside. The following directories are created here:
    Required. Type: String.

PREPROCESSING PARAMETERS (apply only when preprocess is set)

validation_split
    Fraction of the training data to use as validation data. Specify this only if you are not specifying validation_data_path.
    Default: 0.20. Type: Float. Note: validation_split + test_split <= 0.40.

test_split
    Fraction of the training data to use as test data. Specify this only if you are not specifying test_data_path.
    Default: 0.20. Type: Float. Note: validation_split + test_split <= 0.40.
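As a sketch of how these arguments come together, the following Python snippet assembles a job request body for the AI Platform Training API. The bucket, file paths, job ID, and region are placeholders, and the exact request structure should be confirmed against the AI Platform Training jobs API reference; this is an illustration, not a definitive submission recipe.

```python
# Sketch: assembling a training job request for the single-replica
# built-in XGBoost algorithm. All gs:// paths and IDs are placeholders.
BUCKET = "gs://your-bucket"  # placeholder bucket name

# Algorithm arguments from the table above.
job_args = [
    "--preprocess",
    "--objective=binary:logistic",
    f"--training_data_path={BUCKET}/data/train.csv",
    f"--validation_data_path={BUCKET}/data/val.csv",
    f"--job-dir={BUCKET}/jobdir",
]

# Request body in the shape expected by projects.jobs.create.
job_spec = {
    "jobId": "xgboost_example_001",
    "trainingInput": {
        "scaleTier": "BASIC",
        "region": "us-central1",
        "masterConfig": {
            # Container image for the single-replica algorithm.
            "imageUri": "gcr.io/cloud-ml-algos/boosted_trees:latest",
        },
        "args": job_args,
    },
}
```

You would pass `job_spec` as the request body when creating the job (for example, with the Google API client library or an equivalent gcloud command).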
Distributed
This is the version of the algorithm available at gcr.io/cloud-ml-algos/xgboost_dist:latest. Learn more about using the distributed version of the algorithm.
The following arguments are used for data formatting and for determining several other aspects of how the algorithm runs:
training_data_path
    Cloud Storage path to one or more CSV files for training. To specify multiple files, use wildcards in this string. The CSV files must have the following specifications:
    Required. Type: String.

validation_data_path
    Cloud Storage path to one or more CSV files for validation. To specify multiple files, use wildcards in this string. The CSV files must have the same format as training_data_path.
    Optional. Type: String.

job-dir
    Cloud Storage path where the trained model and other training artifacts are created. The following directories are created here:
    Required. Type: String.

model_saving_period
    How often the algorithm saves an intermediate model, measured in training iterations. For example, if this argument is set to 3, the algorithm saves an intermediate model to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, the algorithm does not save intermediate models.
    Optional. Default: 1. Type: Integer.

eval_log_period
    How often the algorithm logs metrics calculated against the validation data, measured in training iterations. For example, if this argument is set to 3, the algorithm logs evaluation metrics to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, the algorithm does not save evaluation metrics.
    Optional. Default: 0. Type: Integer.

silent
    Whether the algorithm suppresses logs during training. Set to 0 to print logs, or 1 to suppress them.
    Optional. Default: 0. Options: {0, 1}.
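A minimal sketch of an argument list for the distributed version, using a wildcard to match multiple training shards. The paths and values are placeholders chosen for illustration:

```python
# Sketch: arguments for the distributed built-in XGBoost algorithm.
# The gs:// paths are placeholders; the wildcard matches multiple shards.
distributed_args = [
    "--training_data_path=gs://your-bucket/data/train-*.csv",
    "--validation_data_path=gs://your-bucket/data/val.csv",
    "--job-dir=gs://your-bucket/jobdir",
    "--model_saving_period=5",  # save an intermediate model every 5 iterations
    "--eval_log_period=5",      # log validation metrics every 5 iterations
    "--silent=0",               # 0 prints training logs, 1 suppresses them
]
```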
Hyperparameters
The single-replica and distributed versions of the built-in XGBoost algorithm both accept the following hyperparameters:
BASIC PARAMETERS

objective
    Specifies the learning task and the corresponding learning objective. For detailed information, refer to the 'objective' section of XGBoost's learning task parameters.
    Required. Type: String. Options: one of {reg:linear, reg:logistic, binary:logistic, binary:logitraw, count:poisson, survival:cox, multi:softmax, multi:softprob, reg:gamma, reg:tweedie}.

eval_metric
    Evaluation metrics for validation data, as a comma-separated string of one or more values. A default is assigned based on the objective: rmse for regression, error for classification, and mean average precision for ranking.
    Default: according to objective. Type: String. Options: the full list of possible values is in the 'eval_metric' section of XGBoost's learning task parameters.

booster
    The type of booster to use: gbtree, gblinear, or dart. gbtree and dart use tree-based models, while gblinear uses linear functions.
    Default: gbtree. Type: String. Options: one of {gbtree, gblinear, dart}.

num_boost_round
    Number of boosting iterations.
    Default: 10. Type: Integer. Options: [1, ∞).

max_depth
    Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. Note that a limit is required when grow_policy is set to depthwise.
    Default: 6. Type: Integer. Options: [0, ∞).

eta
    Step size shrinkage used in updates to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks the feature weights to make the boosting process more conservative.
    Default: 0.3. Type: Float. Options: [0, 1].

csv_weight
    When this flag is enabled, XGBoost differentiates the importance of instances for CSV input by taking the second column (the column after the labels) in the training data as the instance weights. Set this only when input_type='csv'.
    Default: 0. Type: Integer. Options: {0, 1}.

base_score
    The initial prediction score of all instances (global bias). For a sufficient number of iterations, changing this value has little effect.
    Default: 0.5. Type: Float.
TREE BOOSTER PARAMETERS

gamma
    Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm.
    Default: 0. Type: Float. Options: [0, ∞).

min_child_weight
    Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node whose sum of instance weight is less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm.
    Default: 0. Type: Float. Options: [0, ∞).

max_delta_step
    Maximum delta step allowed for each leaf output. If the value is set to 0, there is no constraint; if it is set to a positive value, it can help make the update step more conservative. This parameter is usually not needed, but it can help in logistic regression when the classes are extremely imbalanced. Setting the value within the range 1-10 can help control the update.
    Default: 0. Type: Integer. Options: [0, ∞).

subsample
    Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which can prevent overfitting. Subsampling occurs once in every boosting iteration.
    Default: 1. Type: Float. Options: (0, 1].

colsample_bytree
    Subsample ratio of columns when constructing each tree. Subsampling occurs once in every boosting iteration.
    Default: 1. Type: Float. Options: (0, 1].

colsample_bylevel
    Subsample ratio of columns for each split, in each level. Subsampling occurs each time a new split is made. This parameter has no effect when tree_method is set to hist.
    Default: 1. Type: Float. Options: (0, 1].

lambda
    L2 regularization term on weights. Increasing this value makes the model more conservative.
    Default: 1. Type: Float.

alpha
    L1 regularization term on weights. Increasing this value makes the model more conservative.
    Default: 0. Type: Float.

tree_method
    The tree construction algorithm used in XGBoost. For detailed information, refer to 'tree_method' in XGBoost's parameters for the tree booster.
    Default: auto. Type: String. Options: one of {auto, exact, approx, hist}. Additional options, only for the distributed version of the algorithm: one of {gpu_exact, gpu_hist}.

sketch_eps
    This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this provides a theoretical guarantee of sketch accuracy. You usually do not have to tune this, but you might consider setting it to a lower number for a more accurate enumeration of split candidates.
    Default: 0.03. Type: Float. Options: (0, 1). Note: specify this only if tree_method='approx'.

scale_pos_weight
    Controls the balance of positive and negative weights; useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).
    Default: 1. Type: Float.

updater
    A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically based on other parameters, but you can also set it explicitly. For detailed information, refer to 'updater' in XGBoost's parameters for the tree booster.
    Default: grow_colmaker,prune. Type: String. Options: comma-separated subset of {grow_colmaker, distcol, grow_histmaker, grow_local_histmaker, grow_skmaker, sync, refresh, prune}.

refresh_leaf
    When this flag is 1, both tree leaves and tree node stats are updated. When it is 0, only node stats are updated.
    Default: 0. Type: Integer. Options: {0, 1}.

process_type
    The type of boosting process to run. For detailed information, refer to 'process_type' in XGBoost's parameters for the tree booster.
    Default: default. Type: String. Options: {default, update}.

grow_policy
    Controls the way new nodes are added to the tree.
    Default: depthwise. Type: String. Options: one of {depthwise, lossguide}.

max_leaves
    Maximum number of nodes to be added.
    Default: 0. Type: Integer. Note: specify this only if grow_policy='lossguide'.

max_bin
    Maximum number of discrete bins used to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time.
    Default: 256. Type: Integer. Note: specify this only if tree_method='hist'.
DART BOOSTER PARAMETERS (apply only when booster='dart')

sample_type
    Type of sampling algorithm.
    Default: uniform. Type: String. Options: one of {uniform, weighted}.

normalize_type
    Type of normalization algorithm.
    Default: tree. Type: String. Options: one of {tree, forest}.

rate_drop
    Dropout rate (the fraction of previous trees to drop during the dropout).
    Default: 0. Type: Float. Options: [0, 1].

one_drop
    When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).
    Default: 0. Type: Integer. Options: {0, 1}.

skip_drop
    Probability of skipping the dropout procedure during a boosting iteration. If a dropout is skipped, new trees are added in the same manner as gbtree. Note that a nonzero skip_drop takes priority over rate_drop and one_drop.
    Default: 0. Type: Float. Options: [0, 1].
TWEEDIE REGRESSION PARAMETERS (apply only when objective='reg:tweedie')

tweedie_variance_power
    Controls the variance of the Tweedie distribution: var(y) ~ E(y)^tweedie_variance_power.
    Default: 1.5. Type: Float. Options: (1, 2).
Hyperparameter tuning
Hyperparameter tuning tests different hyperparameter configurations when training your model, searching for the hyperparameter values that are optimal for the goal metric you choose. For each tunable argument, you can specify a range of values to restrict and focus the search that AI Platform Training performs.
Learn more about hyperparameter tuning on AI Platform Training.
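As an illustration, a tuning configuration can be expressed as a hyperparameters section of the job's trainingInput. The field names follow the AI Platform Training HyperparameterSpec; the metric, trial counts, and ranges here are placeholder examples, not recommendations:

```python
# Sketch: a HyperparameterSpec-shaped dict tuning two high-value
# parameters. All numeric ranges and trial counts are illustrative.
hyperparameter_spec = {
    "goal": "MINIMIZE",
    "hyperparameterMetricTag": "rmse",  # the goal metric to optimize
    "maxTrials": 20,
    "maxParallelTrials": 2,
    "params": [
        {"parameterName": "eta", "type": "DOUBLE",
         "minValue": 0.01, "maxValue": 0.5},
        {"parameterName": "max_depth", "type": "INTEGER",
         "minValue": 3, "maxValue": 10},
    ],
}
```

This dict would be attached as trainingInput.hyperparameters in the job request body.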
Goal metrics
The following metrics can be optimized:
rmse
    MINIMIZE. Root mean square error.

logloss
    MINIMIZE. Negative log-likelihood.

error
    MINIMIZE. Binary classification error rate, calculated as #(wrong cases) / #(all cases). The evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.

merror
    MINIMIZE. Multiclass classification error rate, calculated as #(wrong cases) / #(all cases).

mlogloss
    MINIMIZE. Multiclass log loss.

auc
    MAXIMIZE. Area under the curve for ranking evaluation.

map
    MAXIMIZE. Mean average precision.
Tunable hyperparameters
When training with the built-in XGBoost algorithm (single-replica or distributed), you can tune the following hyperparameters. Start by tuning the parameters with high tunable value; these have the greatest impact on your goal metric.
Hyperparameter          Type      Valid values

PARAMETERS WITH HIGH TUNABLE VALUE (greatest impact on goal metric)

eta                     DOUBLE    [0, 1]
max_depth               INTEGER   [0, ∞)
num_boost_round         INTEGER   [1, ∞)
min_child_weight        DOUBLE    [1, ∞)
lambda                  DOUBLE    (-∞, ∞)
alpha                   DOUBLE    (-∞, ∞)

OTHER PARAMETERS

gamma                   DOUBLE    [0, ∞)
max_delta_step          DOUBLE    [0, ∞)
subsample               DOUBLE    [0, 1]
colsample_bytree        DOUBLE    [0, 1]
colsample_bylevel       DOUBLE    [0, 1]