This page provides detailed reference information about arguments you submit to AI Platform Training when running a training job using the built-in XGBoost algorithm.
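The arguments described on this page are passed to the algorithm's container image when you submit a training job. As a rough sketch of where they fit, the following example submits a job with the google-api-python-client library; the project ID, job name, Cloud Storage paths, and argument values are placeholders, not recommendations.

```python
from googleapiclient import discovery

# Placeholders: substitute your own project, bucket, and job name.
project_id = 'my-project'
job_spec = {
    'jobId': 'xgboost_builtin_example',
    'trainingInput': {
        'scaleTier': 'BASIC',
        # Single-replica built-in XGBoost image described on this page.
        'masterConfig': {'imageUri': 'gcr.io/cloud-ml-algos/boosted_trees:latest'},
        'region': 'us-central1',
        # jobDir corresponds to the required job-dir argument described below.
        'jobDir': 'gs://my-bucket/xgboost-job',
        # Algorithm arguments (see the tables below) are passed as flags.
        'args': [
            '--preprocess',
            '--objective=binary:logistic',
            '--training_data_path=gs://my-bucket/data/train.csv',
        ],
    },
}

ml = discovery.build('ml', 'v1')
request = ml.projects().jobs().create(
    parent='projects/{}'.format(project_id), body=job_spec)
response = request.execute()
print(response)
```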
Versioning
Both the standard (single-replica) version and the distributed version of the built-in XGBoost algorithm use XGBoost 0.81.
Data format arguments
The standard (single-replica) version of the algorithm and the distributed version of the algorithm accept different arguments for describing the format of your data and defining the high-level functionality of the algorithm.
The following tables describe the arguments accepted by each version of the algorithm:
Single-replica
This is the version of the algorithm available at gcr.io/cloud-ml-algos/boosted_trees:latest. Learn more about using the single-replica version of the algorithm.
The following arguments are used for data formatting and automatic preprocessing:
Arguments | Details |
---|---|
preprocess | Specify this flag to enable automatic preprocessing of your data. Default: unset. Type: Boolean flag; if set to true, enables automatic preprocessing. |
training_data_path | Cloud Storage path to a CSV file containing your training data. Required. Type: String. |
validation_data_path | Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path. Optional. Type: String. |
test_data_path | Cloud Storage path to a CSV file. The CSV file must have the same format as training_data_path and validation_data_path. Optional. Type: String. |
job-dir | Cloud Storage path where the model, checkpoints, and other training artifacts are saved. Required. Type: String. |
PREPROCESSING PARAMETERS (apply only when preprocess is set) | |
validation_split | Fraction of the training data to use as validation data. Specify this only if you are not specifying validation_data_path. Default: 0.20. Type: Float. Note: validation_split + test_split must be <= 0.40. |
test_split | Fraction of the training data to use as test data. Specify this only if you are not specifying test_data_path. Default: 0.20. Type: Float. Note: validation_split + test_split must be <= 0.40. |
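For example, a plausible combination of these arguments for the single-replica version, with automatic preprocessing and automatic validation/test splits, might look like the following sketch; the Cloud Storage paths are placeholders. These strings are passed to the job as training arguments (for example, in the args field of the training input shown earlier).

```python
# Placeholder paths; validation_split and test_split are set because
# preprocess is enabled and no explicit validation/test files are given.
single_replica_args = [
    '--preprocess',
    '--training_data_path=gs://my-bucket/data/train.csv',
    '--validation_split=0.2',
    '--test_split=0.1',   # 0.2 + 0.1 stays within the <= 0.40 limit
    '--job-dir=gs://my-bucket/xgboost-single/',
]
```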
Distributed
This is the version of the algorithm available at gcr.io/cloud-ml-algos/xgboost_dist:latest. Learn more about using the distributed version of the algorithm.
The following arguments are used for data formatting and determining several other aspects of how the algorithm runs:
Arguments | Details |
---|---|
training_data_path | Cloud Storage path to one or more CSV files containing your training data. To specify multiple files, use wildcards in this string. Required. Type: String. |
validation_data_path | Cloud Storage path to one or more CSV files for validation. To specify multiple files, use wildcards in this string. The CSV files must have the same format as training_data_path. Optional. Type: String. |
job-dir | Cloud Storage path where the trained model and other training artifacts are created. Required. Type: String. |
model_saving_period | How often the algorithm saves an intermediate model, measured in training iterations. For example, if this argument is set to 3, the algorithm saves an intermediate model to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, the algorithm does not save intermediate models. Default: 1. Optional. Type: Integer. |
eval_log_period | How often the algorithm logs metrics calculated against the validation data, measured in training iterations. For example, if this argument is set to 3, the algorithm logs evaluation metrics to a file in job-dir after every 3 iterations of training. If this field is set to an integer less than or equal to 0, the algorithm does not save evaluation metrics. Default: 0. Optional. Type: Integer. |
silent | Whether the algorithm suppresses debugging logs during training. Prints logs if this field is set to 0 and suppresses logs if it is set to 1. Default: 0. Optional. Options: {0, 1}. |
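As with the single-replica version, these flags are passed as training arguments. A sketch for the distributed version, with placeholder paths and illustrative values, might look like this:

```python
# Wildcards in the data paths select multiple CSV shards.
distributed_args = [
    '--training_data_path=gs://my-bucket/data/train-*.csv',
    '--validation_data_path=gs://my-bucket/data/eval-*.csv',
    '--job-dir=gs://my-bucket/xgboost-dist/',
    '--model_saving_period=5',  # save an intermediate model every 5 iterations
    '--eval_log_period=5',      # log validation metrics every 5 iterations
    '--silent=0',               # keep training logs
]
```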
Hyperparameters
The single-replica and distributed versions of the built-in XGBoost algorithm both accept the following hyperparameters:
Hyperparameter | Details |
---|---|
BASIC PARAMETERS | |
objective | Specify the learning task and the corresponding learning objective. For detailed information, refer to the 'objective' section of the XGBoost learning task parameters. Required. Type: String. Options: one of {reg:linear, reg:logistic, binary:logistic, binary:logitraw, count:poisson, survival:cox, multi:softmax, multi:softprob, reg:gamma, reg:tweedie}. |
eval_metric | Evaluation metrics to be used for validation data. To specify multiple metrics, use a comma-separated string of values. A default is assigned based on the objective: rmse for regression, error for classification, and mean average precision for ranking. Default: depends on the objective. Type: String. Options: the full list of possible values is in the 'eval_metric' section of the XGBoost learning task parameters. |
booster | The type of booster to use: gbtree, gblinear, or dart. gbtree and dart use tree-based models, while gblinear uses linear functions. Default: gbtree. Type: String. Options: one of {gbtree, gblinear, dart}. |
num_boost_round | Number of boosting iterations. Default: 10. Type: Integer. Options: [1, ∞). |
max_depth | Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. Note that a limit is required when grow_policy is set to depthwise. Default: 6. Type: Integer. Options: [0, ∞). |
eta | Step size shrinkage used in updates to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and eta shrinks the feature weights to make the boosting process more conservative. Default: 0.3. Type: Float. Options: [0, 1]. |
csv_weight | When this flag is enabled, XGBoost differentiates the importance of instances for CSV input by taking the second column (the column after labels) in the training data as the instance weights. Only set this when input_type='csv'. Default: 0. Type: Integer. Options: {0, 1}. |
base_score | The initial prediction score of all instances (global bias). For a sufficient number of iterations, changing this value does not have much effect. Default: 0.5. Type: Float. |
TREE BOOSTER PARAMETERS | |
gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. Default: 0. Type: Float. Options: [0, ∞). |
min_child_weight | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be. Default: 0. Type: Float. Options: [0, ∞). |
max_delta_step | Maximum delta step allowed for each leaf output. A value of 0 means there is no constraint. A positive value can help make the update step more conservative. This parameter is usually not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting the value within the range 1-10 might help control the update. Default: 0. Type: Integer. Options: [0, ∞). |
subsample | Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which can prevent overfitting. Subsampling occurs once in every boosting iteration. Default: 1. Type: Float. Options: (0, 1]. |
colsample_bytree | Subsample ratio of columns when constructing each tree. Subsampling occurs once in every boosting iteration. Default: 1. Type: Float. Options: (0, 1]. |
colsample_bylevel | Subsample ratio of columns for each split, in each level. Subsampling occurs each time a new split is made. This parameter has no effect when tree_method is set to hist. Default: 1. Type: Float. Options: (0, 1]. |
lambda | L2 regularization term on weights. Increasing this value makes the model more conservative. Default: 1. Type: Float. |
alpha | L1 regularization term on weights. Increasing this value makes the model more conservative. Default: 0. Type: Float. |
tree_method | The tree construction algorithm used in XGBoost. For detailed information, refer to 'tree_method' in the XGBoost parameters for tree boosters. Default: auto. Type: String. Options: one of {auto, exact, approx, hist}. Additional options only for the distributed version of the algorithm: one of {gpu_exact, gpu_hist}. |
sketch_eps | This roughly translates into O(1 / sketch_eps) bins. Compared to directly selecting the number of bins, this provides a theoretical guarantee of sketch accuracy. You usually do not have to tune this, but you may consider setting it to a lower number for a more accurate enumeration of split candidates. Default: 0.03. Type: Float. Options: (0, 1). Note: only specify this if tree_method='approx'. |
scale_pos_weight | Controls the balance of positive and negative weights; useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). Default: 1. Type: Float. |
updater | A comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters; however, you can also set it explicitly. For detailed information, refer to 'updater' in the XGBoost parameters for tree boosters. Default: grow_colmaker,prune. Type: String. Options: comma-separated values from {grow_colmaker, distcol, grow_histmaker, grow_local_histmaker, grow_skmaker, sync, refresh, prune}. |
refresh_leaf | When this flag is 1, tree leaves as well as tree node stats are updated. When it is 0, only node stats are updated. Default: 0. Type: Integer. Options: {0, 1}. |
process_type | The type of boosting process to run. For detailed information, refer to 'process_type' in the XGBoost parameters for tree boosters. Default: default. Type: String. Options: {default, update}. |
grow_policy | Controls the way new nodes are added to the tree. Default: depthwise. Type: String. Options: one of {depthwise, lossguide}. |
max_leaves | Maximum number of nodes to be added. Default: 0. Type: Integer. Note: only specify this if grow_policy='lossguide'. |
max_bin | Maximum number of discrete bins used to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time. Default: 256. Type: Integer. Note: only specify this if tree_method='hist'. |
DART BOOSTER PARAMETERS (arguments for booster='dart') | |
sample_type | Type of sampling algorithm. Default: uniform. Type: String. Options: one of {uniform, weighted}. |
normalize_type | Type of normalization algorithm. Default: tree. Type: String. Options: one of {tree, forest}. |
rate_drop | Dropout rate (the fraction of previous trees to drop during the dropout). Default: 0. Type: Float. Options: [0, 1]. |
one_drop | When this flag is enabled, at least one tree is always dropped during the dropout (allows the Binomial-plus-one or epsilon-dropout technique from the original DART paper). Default: 0. Type: Integer. Options: {0, 1}. |
skip_drop | Probability of skipping the dropout procedure during a boosting iteration. If a dropout is skipped, new trees are added in the same manner as gbtree. Note that a non-zero skip_drop has higher priority than rate_drop or one_drop. Default: 0. Type: Float. Options: [0, 1]. |
TWEEDIE REGRESSION (arguments for objective='reg:tweedie') | |
tweedie_variance_power | Parameter that controls the variance of the Tweedie distribution: var(y) ~ E(y)^tweedie_variance_power. Default: 1.5. Type: Float. Options: (1, 2). |
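Hyperparameters are passed to the job the same way as the data format arguments. As an illustration only (the values below are not tuning recommendations), a binary classification job might add flags like these:

```python
# Illustrative hyperparameter flags for a binary classification objective.
hyperparameter_args = [
    '--objective=binary:logistic',
    '--eval_metric=auc,logloss',   # comma-separated list of metrics
    '--booster=gbtree',
    '--num_boost_round=100',
    '--max_depth=6',
    '--eta=0.1',
    '--subsample=0.8',
    '--colsample_bytree=0.8',
]
```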
Hyperparameter tuning
Hyperparameter tuning tests different hyperparameter configurations when training your model, and finds the hyperparameter values that are optimal for the goal metric you choose. For each tunable argument, you can specify a range of values to restrict and focus the combinations that AI Platform Training tries.
Learn more about hyperparameter tuning on AI Platform Training.
Goal metrics
The following metrics can be optimized:
Objective metric | Direction | Details |
---|---|---|
rmse | MINIMIZE | Root mean square error |
logloss | MINIMIZE | Negative log-likelihood |
error | MINIMIZE | Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances. |
merror | MINIMIZE | Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases). |
mlogloss | MINIMIZE | Multiclass logloss |
auc | MAXIMIZE | Area under the curve for ranking evaluation |
map | MAXIMIZE | Mean average precision |
Tunable hyperparameters
When training with the built-in XGBoost algorithm (single-replica or distributed), you can tune the following hyperparameters. Start by tuning parameters with "high tunable value". These have the greatest impact on your goal metric.
Hyperparameters | Type | Valid values |
---|---|---|
PARAMETERS WITH HIGH TUNABLE VALUE (greatest impact on the goal metric) | | |
eta | DOUBLE | [0, 1] |
max_depth | INTEGER | [0, ∞) |
num_boost_round | INTEGER | [1, ∞) |
min_child_weight | DOUBLE | [1, ∞) |
lambda | DOUBLE | (-∞, ∞) |
alpha | DOUBLE | (-∞, ∞) |
OTHER PARAMETERS | | |
gamma | DOUBLE | [0, ∞) |
max_delta_step | DOUBLE | [0, ∞) |
subsample | DOUBLE | [0, 1] |
colsample_bytree | DOUBLE | [0, 1] |
colsample_bylevel | DOUBLE | [0, 1] |
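To tune these hyperparameters, you add a hyperparameter tuning spec to the job's training input. The following is a minimal sketch using the field names of the AI Platform Training API's hyperparameter spec; the goal metric, ranges, and trial counts are placeholders chosen only to match the tables above, not recommendations.

```python
# Sketch of a hyperparameter tuning spec; values are placeholders.
tuning_spec = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'auc',   # one of the goal metrics listed above
    'maxTrials': 20,
    'maxParallelTrials': 2,
    'params': [
        {'parameterName': 'eta', 'type': 'DOUBLE',
         'minValue': 0.01, 'maxValue': 0.3, 'scaleType': 'UNIT_LOG_SCALE'},
        {'parameterName': 'max_depth', 'type': 'INTEGER',
         'minValue': 3, 'maxValue': 10},
        {'parameterName': 'num_boost_round', 'type': 'INTEGER',
         'minValue': 10, 'maxValue': 200},
    ],
}

# Attached to the job spec from the earlier example, this would look like:
# job_spec['trainingInput']['hyperparameters'] = tuning_spec
```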