Training using the built-in XGBoost algorithm

Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in XGBoost algorithm works, and how to use it.


The built-in XGBoost algorithm is a wrapper for the XGBoost algorithm that is compatible with AI Platform Training.

This document describes a version of the algorithm that runs on a single virtual machine replica. There is also a distributed version of this algorithm that uses multiple virtual machines for training and requires slightly different usage. This algorithm has two phases:

  1. Preprocessing: AI Platform Training processes your mix of categorical and numerical data into an all numerical dataset in order to prepare it for training with XGBoost.
  2. Training: AI Platform Training runs training using the XGBoost algorithm based on your dataset and the model parameters you supplied. The current implementation is based on XGBoost version 0.81.


The following features are not supported for training with the single-replica version of the built-in XGBoost algorithm:

Supported machine types

The following AI Platform Training scale tiers and machine types are supported:

Format input data

XGBoost works on numerical tabular data. Each row of a dataset represents one instance, and each column of a dataset represents a feature value. The target column represents the value you want to predict.

Prepare CSV file

Your input data must be a CSV file with UTF-8 encoding. If your training data consists only of categorical and numerical values, you can use the preprocessing module to convert categorical data to numerical data. Otherwise, you can run training without automatic preprocessing enabled.

You must prepare your input CSV file to meet the following requirements:

  • Remove the header row. The header row contains the labels for each column. If you leave it in, it is submitted with the rest of the data instances as part of the training data.
  • Ensure that the target column is the first column. The target column contains the value that you are trying to predict. For a classification algorithm, all values in the target column are a class or category. For a regression algorithm, all values in the target column are a numerical value.
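
The two requirements above can be sketched in Python as follows. This is only an illustration: the function name prepare_csv and the column names in the example are hypothetical, and the snippet assumes the source file does have a header row.

```python
import csv
import io


def prepare_csv(src, dst, target_name):
    """Copy CSV rows from src to dst, dropping the header and moving the target first.

    src and dst are text file objects; target_name is the header label
    of the target column (a hypothetical name in the example below).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)  # consumed here and never written to dst
    target_idx = header.index(target_name)
    for row in reader:
        # Target value first, remaining columns in their original order.
        writer.writerow([row[target_idx]] + row[:target_idx] + row[target_idx + 1:])


# In-memory example; with real files, open them with encoding="utf-8" and newline="".
source = io.StringIO("age,city,label\n34,Paris,1\n51,Lima,0\n")
output = io.StringIO()
prepare_csv(source, output, "label")
prepared = output.getvalue()
```

Here prepared contains the two data rows with the label value moved to the first position and no header line.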

Handle integer values

Columns of integer values are interpreted as categorical columns by default, if there are few enough unique values. For example, if a column in your dataset includes integer values such as {101, 102, 103}, AI Platform Training interprets these values as categories, such as {'high', 'medium', 'low'}.

To avoid this incorrect analysis, make sure to convert integers to floats when you intend the data to be numerical: {101.0, 102.0, 103.0}. To ensure that integers are interpreted as categorical, append a string before or after each value: {code_101, code_102, code_103}.
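
A minimal Python sketch of both conversions, applied before the values are written to the CSV file. The helper names as_numerical and as_categorical are hypothetical; the code_ prefix is the one used in the example above.

```python
def as_numerical(values):
    """Render integers as floats so that they are read as numerical data."""
    return [float(v) for v in values]


def as_categorical(values, prefix="code_"):
    """Prepend a string to each integer so that it is read as a category."""
    return [f"{prefix}{v}" for v in values]


codes = [101, 102, 103]
numerical = as_numerical(codes)      # written to CSV as 101.0, 102.0, 103.0
categorical = as_categorical(codes)  # written to CSV as code_101, code_102, code_103
```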

Normalize target values for regression

For regression training jobs, make sure to normalize your target values so that each value is between 0 and 1.
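
One common way to do this is min-max scaling. This page does not prescribe a particular method, so treat the following Python snippet as a sketch under that assumption; the function name and the sample values are hypothetical.

```python
def min_max_normalize(targets):
    """Scale target values linearly so that the smallest maps to 0 and the largest to 1."""
    lo, hi = min(targets), max(targets)
    if hi == lo:
        raise ValueError("all target values are identical; nothing to scale")
    return [(t - lo) / (hi - lo) for t in targets]


prices = [50.0, 100.0, 150.0]
scaled = min_max_normalize(prices)
```

If you scale targets this way, keep the minimum and maximum so that predictions can be mapped back to the original range.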

Submit an XGBoost training job

This section explains how to submit a built-in XGBoost training job.

You can find brief explanations of each hyperparameter within the Google Cloud Console, and a more comprehensive explanation in the reference for the built-in XGBoost algorithm.


  1. Go to the AI Platform Training Jobs page in the Google Cloud Console:

    AI Platform Training Jobs page

  2. Click the New training job button. From the options that display below, click Built-in algorithm training.

  3. On the Create a new training job page, select Built-in XGBoost and click Next.

  4. To learn more about all the available parameters, follow the links in the Google Cloud Console and refer to the built-in XGBoost reference for more details.


  1. Set environment variables for your job, filling in [VALUES-IN-BRACKETS] with your own values:

    # Specify the name of the Cloud Storage bucket where you want your
    # training outputs to be stored, and the Docker container for
    # your built-in algorithm selection.
    # Specify the Cloud Storage path to your training input data.
    DATE="$(date '+%Y%m%d_%H%M%S')"
  2. Submit the training job using gcloud ai-platform jobs submit training:

    gcloud ai-platform jobs submit training $JOB_ID \
       --master-image-uri=$IMAGE_URI --scale-tier=BASIC --job-dir=$JOB_DIR \
       -- \
       --preprocess --objective=binary:logistic \
  3. Monitor the status of your training job by viewing logs with gcloud. Refer to gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs.

    gcloud ai-platform jobs describe ${JOB_ID}
    gcloud ai-platform jobs stream-logs ${JOB_ID}

How preprocessing works

Automatic preprocessing works for categorical and numerical data. The preprocessing routine first analyzes and then transforms your data.


First, AI Platform Training automatically detects the data type of each column, identifies how each column should be treated, and computes some statistics of the data in the column. This information is captured in the metadata.json file.

AI Platform Training analyzes the type of the target column to identify whether the given dataset is for regression or classification. If this analysis conflicts with your selection for the objective, it results in an error. Be explicit about how the target column should be treated by formatting your data clearly in ambiguous cases.

  • Type: The column can be either numerical or categorical.

  • Treatment: AI Platform Training identifies how to treat each column as follows:

    • If the column includes a single value in all the rows, it is treated as a constant.
    • If the column is categorical, and includes unique values in all the rows, it is treated as a row_identifier.
    • If the column is numerical with float values, or if it's numerical with integer values and it contains many unique values, then the column is treated as numerical.
    • If the column is numerical with integer values, and it contains few enough unique values, then the column is treated as a categorical column where the integer values are the identity or the vocabulary.
      • A column is considered to have few unique values if the number of unique values in the column is less than 20% of the number of rows in the input dataset.
    • If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets equals the square root of the number of unique values in the column.
      • A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset.
    • If the column is categorical, and the number of unique values is less than or equal to the square root of the number of rows in the dataset, then the column is treated as a normal categorical column with vocabulary.
  • Statistics: AI Platform Training computes the following statistics, based on the identified column type and treatment, to be used for transforming the column in a later stage.

    • If the column is numeric, the mean and variance values are computed.
    • If the column is categorical, and the treatment is identity or vocabulary, the distinct values are extracted from the column.
    • If the column is categorical, and the treatment is hashing, the number of hash buckets is computed with respect to the cardinality of the column.
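
The treatment rules listed above can be read as a decision procedure. The following Python sketch is an interpretation of those rules, not the service's actual implementation. It assumes that values are already parsed (str for categorical, int or float for numerical), that the rules are checked in the order shown, and that the hash bucket count is rounded up, which this page does not specify. The function names are hypothetical.

```python
import math


def choose_treatment(values):
    """Pick a treatment for one column, following the rules described above."""
    n_rows = len(values)
    n_unique = len(set(values))
    if n_unique == 1:
        return "constant"
    if all(isinstance(v, str) for v in values):  # categorical column
        if n_unique == n_rows:
            return "row_identifier"
        if n_unique > math.sqrt(n_rows):  # high cardinality
            return "hashing"
        return "vocabulary"
    # Integer column with few unique values: fewer than 20% of the row count.
    if all(isinstance(v, int) for v in values) and n_unique < 0.2 * n_rows:
        return "identity"
    return "numerical"


def hash_bucket_count(values):
    """Square root of the number of unique values; rounding up is an assumption."""
    return math.ceil(math.sqrt(len(set(values))))


few_ints = [i % 3 for i in range(100)]          # 3 unique values in 100 rows
many_ints = list(range(100))                    # 100 unique values in 100 rows
wide_cats = [f"c{i % 50}" for i in range(100)]  # 50 unique categories in 100 rows
treatments = [choose_treatment(c) for c in (few_ints, many_ints, wide_cats)]
buckets = hash_bucket_count(wide_cats)
```

With 100 rows, the square root threshold is 10, so a categorical column with 50 unique values is hashed, while a column with 4 unique values would get a vocabulary.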


After the initial analysis of the dataset is complete, AI Platform Training transforms your data based on the types, treatments and statistics applied to your dataset. AI Platform Training does transformations in the following order:

  1. Splits the training dataset into validation and test datasets if you specify the amount of training data to use in each (as a percentage).
  2. Removes any rows that have more than 10% of features missing.
  3. Fills in missing values. The mean is used for numerical columns, and zeroes are used for categorical columns. See an example below.

  4. For each categorical column with vocabulary and identity treatment, AI Platform Training does one-hot encoding on the column values. See an example below.

  5. For each categorical column with hashing treatment, AI Platform Training uses scikit-learn's FeatureHasher to do feature hashing. The number of features counted earlier determines the number of hash buckets.

  6. Removes each column designated with a row_identifier or constant treatment.
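
Steps 2 to 4 above can be sketched for a single row in Python. This is a simplified illustration and not the service's code: it skips dataset splitting, hashing, and column removal, represents missing values as None, and uses a hypothetical column description in which each entry is either ("numerical", mean) or ("categorical", vocabulary). The vocabulary entries other than 'fruits' and all means other than 0.54 are made-up values.

```python
def transform_row(row, columns, max_missing_fraction=0.10):
    """Return the transformed row, or None if the row is removed."""
    missing = sum(value is None for value in row)
    if missing / len(row) > max_missing_fraction:
        return None  # more than 10% of the values are missing
    out = []
    for value, (kind, stat) in zip(row, columns):
        if kind == "numerical":
            # stat is the column mean, used only when the value is missing.
            out.append(stat if value is None else value)
        else:
            # One-hot encode against the vocabulary; a missing value stays all zeros.
            out.extend(1 if value == entry else 0 for entry in stat)
    return out


# A 10-value row layout: seven numerical columns, one categorical, two numerical.
columns = (
    [("numerical", 0.0)] * 7
    + [("categorical", ["fruits", "vegetables", "grains"])]
    + [("numerical", 0.1), ("numerical", 0.54)]
)
head = [3, 0.45, 1, 1, 1, 1, 1]

complete = transform_row(head + ["fruits", 0, 1], columns)
removed = transform_row(head + ["fruits", None, None], columns)
filled_number = transform_row(head + ["fruits", 0, None], columns)
filled_category = transform_row(head + [None, 0, 1], columns)
```

A row with one missing value out of ten is kept, because exactly 10% missing does not exceed the threshold; a row with two missing values is removed.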

Example transformations

Rows with more than 10% of their values missing are removed. In the following examples, assume the row has 10 values. Each example row is truncated for simplicity.

Row issue: Example row with no missing values
  • Original values: [3, 0.45, ..., 'fruits', 0, 1]
  • Transformed values: [3, 0.45, ..., 1, 0, 0, 0, 1]
  • Explanation: The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding.

Row issue: Too many missing values
  • Original values: [3, 0.45, ..., 'fruits', __, __]
  • Transformed values: Row is removed
  • Explanation: More than 10% of values in the row are missing.

Row issue: Missing numerical value
  • Original values: [3, 0.45, ..., 'fruits', 0, __]
  • Transformed values: [3, 0.45, ..., 1, 0, 0, 0, 0.54]
  • Explanation: The mean value for the column replaces the missing numerical value. In this example, the mean is 0.54. The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding.

Row issue: Missing categorical value
  • Original values: [3, 0.45, ..., __, 0, 1]
  • Transformed values: [3, 0.45, ..., 0, 0, 0, 0, 1]
  • Explanation: The missing categorical value is transformed to the values "0, 0, 0" in one-hot encoding.

After automatic preprocessing is complete, AI Platform Training uploads your processed dataset back to your Cloud Storage bucket at the directory you specified in the job request.

What's next