Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in XGBoost algorithm works, and how to use it.
Overview
The built-in XGBoost algorithm is a wrapper for the XGBoost algorithm that packages it to run on AI Platform Training.
This document describes a version of the algorithm that runs on a single virtual machine replica. There is also a distributed version of this algorithm that uses multiple virtual machines for training and requires slightly different usage. This algorithm has two phases:
- Preprocessing: AI Platform Training processes your mix of categorical and numerical data into an all numerical dataset in order to prepare it for training with XGBoost.
- Training: AI Platform Training runs training using the XGBoost algorithm based on your dataset and the model parameters you supplied. The current implementation is based on XGBoost's 0.81 version.
Limitations
The following features are not supported for training with the single-replica version of the built-in XGBoost algorithm:
- Training with GPUs. To train with GPUs, use the built-in distributed XGBoost algorithm.
- Distributed training. To run a distributed training job, use the built-in distributed XGBoost algorithm.
Supported machine types
The following AI Platform Training scale tiers and machine types are supported:
- BASIC scale tier
- CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training
- CUSTOM scale tier with any of the following legacy machine types:
  - standard
  - large_model
  - complex_model_s
  - complex_model_m
  - complex_model_l
Format input data
XGBoost works on numerical tabular data. Each row of a dataset represents one instance, and each column of a dataset represents a feature value. The target column represents the value you want to predict.
Prepare CSV file
Your input data must be a CSV file with UTF-8 encoding. If your training data only consists of categorical and numerical values, then you can use our preprocessing module to convert categorical data to numerical data. Otherwise, you can run training without automatic preprocessing enabled.
You must prepare your input CSV file to meet the following requirements:
- Remove the header row. The header row contains the labels for each column. Remove the header row in order to avoid submitting it with the rest of the data instances as part of the training data.
- Ensure that the target column is the first column. The target column contains the value that you are trying to predict. For a classification algorithm, all values in the target column are a class or category. For a regression algorithm, all values in the target column are a numerical value.
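The two requirements above can be sketched in pandas as follows. The dataset and column names here are hypothetical, used only to illustrate moving the target column to the front and writing the CSV without a header row:

```python
import pandas as pd

# Hypothetical raw dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp", "Private"],
    "income_bracket": ["<=50K", "<=50K", ">50K"],  # the target
})

# Move the target column to the front, as required.
target = "income_bracket"
df = df[[target] + [c for c in df.columns if c != target]]

# Write the CSV without the header row (and without the pandas index).
df.to_csv("training_data.csv", header=False, index=False)
```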
Handle integer values
The meaning of integer values can be ambiguous, which makes columns of integer values problematic in automatic preprocessing. AI Platform Training automatically determines how to handle integer values. By default:
- If every integer value is unique, the column is treated as instance keys.
- If there are only a few unique integer values, the column is treated as categorical.
- Otherwise, the values in the column are converted to float and treated as numerical.
To override these default determinations:
- If the data should be treated as numerical, convert all integer values in the column to floating point, for example {101.0, 102.0, 103.0}.
- If the data should be treated as categorical, prepend a non-numeric prefix to all integer values in the column, for example {code_101, code_102, code_103}.
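Both overrides are simple column transformations. A minimal sketch in pandas, with hypothetical column names:

```python
import pandas as pd

# Hypothetical integer columns whose intended treatment is ambiguous.
df = pd.DataFrame({"zip_code": [101, 102, 103], "quantity": [5, 7, 5]})

# Force numerical treatment: convert the integers to floating point.
df["quantity"] = df["quantity"].astype(float)

# Force categorical treatment: prepend a non-numeric prefix.
df["zip_code"] = "code_" + df["zip_code"].astype(str)
```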
Normalize target values for regression
For regression training jobs, make sure to normalize your target values so that each value is between 0 and 1.
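One common way to do this is min-max scaling, sketched here on hypothetical target values:

```python
import pandas as pd

# Hypothetical regression targets on an arbitrary scale.
targets = pd.Series([12.0, 48.0, 30.0, 6.0])

# Min-max scaling maps every value into the [0, 1] range.
lo, hi = targets.min(), targets.max()
normalized = (targets - lo) / (hi - lo)
```

Remember to apply the inverse transformation to your model's predictions to recover values on the original scale.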
Check Cloud Storage bucket permissions
To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.
Submit an XGBoost training job
This section explains how to submit a built-in XGBoost training job.
You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in XGBoost algorithm.
Console
1. Go to the AI Platform Training Jobs page in the Google Cloud console.
2. Click the New training job button. From the options that display below, click Built-in algorithm training.
3. On the Create a new training job page, select Built-in XGBoost and click Next.
4. To learn more about all the available parameters, follow the links in the Google Cloud console and refer to the built-in XGBoost reference for more details.
gcloud
Set environment variables for your job, filling in `[VALUES-IN-BRACKETS]` with your own values:

```shell
# Specify the name of the Cloud Storage bucket where you want your
# training outputs to be stored, and the Docker container for
# your built-in algorithm selection.
BUCKET_NAME='[YOUR-BUCKET-NAME]'
IMAGE_URI='gcr.io/cloud-ml-algos/boosted_trees:latest'

# Specify the Cloud Storage path to your training input data.
TRAINING_DATA='gs://[YOUR_BUCKET_NAME]/[YOUR_FILE_NAME].csv'

DATASET_NAME='census'
ALGORITHM='xgboost'
MODEL_TYPE='classification'

DATE=$(date '+%Y%m%d_%H%M%S')
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_${MODEL_TYPE}"
JOB_ID="${MODEL_NAME}_${DATE}"
JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"
```
Submit the training job using `gcloud ai-platform jobs submit training`:

```shell
gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=$IMAGE_URI --scale-tier=BASIC --job-dir=$JOB_DIR \
  -- \
  --preprocess --objective=binary:logistic \
  --training_data_path=$TRAINING_DATA
```
Monitor the status of your training job by viewing logs with `gcloud`. Refer to `gcloud ai-platform jobs describe` and `gcloud ai-platform jobs stream-logs`:

```shell
gcloud ai-platform jobs describe ${JOB_ID}
gcloud ai-platform jobs stream-logs ${JOB_ID}
```
How preprocessing works
Automatic preprocessing works for categorical and numerical data. The preprocessing routine first analyzes and then transforms your data.
Analysis
First, AI Platform Training automatically detects the data type of each column, identifies how each column should be treated, and computes some statistics of the data in the column. This information is captured in the `metadata.json` file.

AI Platform Training analyzes the type of the target column to identify whether the given dataset is for regression or classification. If this analysis conflicts with your selection for the `objective`, it results in an error. In ambiguous cases, be explicit about how the target column should be treated by formatting your data clearly.
Type: The column can be either numerical or categorical.
Treatment: AI Platform Training identifies how to treat each column as follows:
- If the column includes a single value in all the rows, it is treated as a constant.
- If the column is categorical, and includes unique values in all the rows, it is treated as a row_identifier.
- If the column is numerical with float values, or if it's numerical with integer values and it contains many unique values, then the column is treated as numerical.
- If the column is numerical with integer values, and it contains few enough unique values, then the column is treated as a categorical column where the integer values are the identity or the vocabulary.
- A column is considered to have few unique values if the number of unique values in the column is less than 20% of the number of rows in the input dataset.
- If the column is categorical with high cardinality, then the column is treated with hashing, where the number of hash buckets is equal to the square root of the number of unique values in the column.
- A categorical column is considered to have high cardinality if the number of unique values is greater than the square root of the number of rows in the dataset.
- If the column is categorical, and the number of unique values is less than or equal to the square root of the number of rows in the dataset, then the column is treated as a normal categorical column with vocabulary.
Statistics: AI Platform Training computes the following statistics, based on the identified column type and treatment, to be used for transforming the column in a later stage.
- If the column is numeric, the mean and variance values are computed.
- If the column is categorical, and the treatment is identity or vocabulary, the distinct values are extracted from the column.
- If the column is categorical, and the treatment is hashing, the number of hash buckets is computed with respect to the cardinality of the column.
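The treatment rules above can be sketched roughly as follows. This is an illustration of the documented heuristics only, not AI Platform Training's actual implementation, and the returned labels are hypothetical names:

```python
import math
import pandas as pd

def decide_treatment(col: pd.Series, n_rows: int) -> str:
    """Illustrative sketch of the documented column-treatment rules."""
    n_unique = col.nunique()
    if n_unique == 1:
        return "constant"
    if pd.api.types.is_numeric_dtype(col):
        # Integer columns with few unique values (< 20% of the number of
        # rows) are treated as categorical; other numeric columns are
        # treated as numerical.
        if pd.api.types.is_integer_dtype(col) and n_unique < 0.2 * n_rows:
            return "categorical (identity)"
        return "numerical"
    if n_unique == n_rows:
        return "row_identifier"
    # High cardinality: more unique values than sqrt(number of rows).
    if n_unique > math.sqrt(n_rows):
        return "categorical (hashing)"
    return "categorical (vocabulary)"
```

For example, under these rules a column of 100 distinct string IDs would be tagged as a row identifier, while a string column with only 5 distinct values in 100 rows would get a vocabulary.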
Transformation
After the initial analysis of the dataset is complete, AI Platform Training transforms your data based on the types, treatments, and statistics applied to your dataset. AI Platform Training performs transformations in the following order:
- Splits the training dataset into validation and test datasets if you specify the amount of training data to use in each (as a percentage).
- Removes any rows that have more than 10% of features missing.
- Fills in missing values. The mean is used for numerical columns, and zeros are used for categorical columns. See an example below.
- For each categorical column with vocabulary or identity treatment, AI Platform Training one-hot encodes the column values. See an example below.
- For each categorical column with hashing treatment, AI Platform Training uses scikit-learn's FeatureHasher to perform feature hashing. The number of unique values counted earlier determines the number of hash buckets.
- Removes each column designated with a row_identifier or constant treatment.
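A rough sketch of the two categorical encodings. The one-hot step uses pandas for illustration; the FeatureHasher usage mirrors what is described above, with the bucket count taken as the square root of the column's cardinality per the analysis rules:

```python
import math
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Vocabulary treatment: one-hot encode a low-cardinality column.
fruits = pd.Series(["apple", "banana", "apple"])
one_hot = pd.get_dummies(fruits)  # one 0/1 column per distinct value

# Hashing treatment: hash a high-cardinality column into buckets, with
# the bucket count equal to the square root of the column's number of
# unique values.
cities = pd.Series([f"city_{i}" for i in range(100)])
n_buckets = int(math.sqrt(cities.nunique()))
hasher = FeatureHasher(n_features=n_buckets, input_type="string")
hashed = hasher.transform([[value] for value in cities])
```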
Example transformations
Rows with more than 10% of their values missing are removed. In the following examples, assume each row has 10 values. Each example row is truncated for simplicity.
| Row issue | Original values | Transformed values | Explanation |
|---|---|---|---|
| Example row with no missing values | [3, 0.45, ..., 'fruits', 0, 1] | [3, 0.45, ..., 1, 0, 0, 0, 1] | The string 'fruits' is transformed to the values "1, 0, 0" in one-hot encoding. |
| Too many missing values | [3, 0.45, ..., 'fruits', __, __] | Row is removed | More than 10% of values in the row are missing. |
| Missing numerical value | [3, 0.45, ..., 'fruits', 0, __] | [3, 0.45, ..., 1, 0, 0, 0, 0.54] | The missing numerical value is replaced with the column's mean (0.54 in this example). |
| Missing categorical value | [3, 0.45, ..., __, 0, 1] | [3, 0.45, ..., 0, 0, 0, 0, 1] | The missing categorical value is replaced with zeros in the one-hot encoding. |
After automatic preprocessing is complete, AI Platform Training uploads your processed dataset back to your Cloud Storage bucket at the directory you specified in the job request.
What's next
- Learn more about XGBoost.
- Refer to the built-in XGBoost reference to learn about all the different parameters.