With built-in algorithms on AI Platform, you can submit your training data, select an algorithm, and let AI Platform Training handle the preprocessing and training for you, without writing any code for a training application.
Overview
In this tutorial, you train an XGBoost model without writing any code. You submit the Census Income Data Set to AI Platform for preprocessing and training, and then you deploy the model on AI Platform to get predictions. The resulting model predicts the probability that an individual's yearly income is greater than $50,000.
Before you begin
To complete this tutorial on the command line, use either Cloud Shell or any environment where the Cloud SDK is installed.
Complete the following steps to set up a GCP account, enable the required APIs, and install and activate the Cloud SDK:
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
- Enable the AI Platform ("Cloud Machine Learning Engine") and Compute Engine APIs.
- Install and initialize the Cloud SDK.
Setup
To use built-in algorithms, you must remove the header row from your CSV file and move the target values to the first column. We have modified the original Census dataset for use with this tutorial and hosted it in a public Cloud Storage bucket: gs://cloud-samples-data/ai-platform/census/algorithms/data/.
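If you ever need to prepare your own dataset the same way, standard shell tools are enough. The following is a minimal sketch, assuming a hypothetical raw_census.csv whose first row is a header and whose last column holds the target value:

# Drop the header row, then move the last column (the target) to the front.
# raw_census.csv is a hypothetical input file; adjust for your own data.
tail -n +2 raw_census.csv | \
  awk -F', *' '{ printf "%s", $NF; for (i = 1; i < NF; i++) printf ", %s", $i; print "" }' \
  > train.csv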
Console
Before you start your training job, you need to copy the data from our public Cloud Storage bucket to your Cloud Storage bucket.
Copy the sample data to your Cloud Storage bucket
First, download the training and testing data from our public Cloud Storage bucket.
Navigate to our public Cloud Storage bucket, then download both test.csv and train.csv:

- Click the file name.
- From the Object details page, click Download. These files download to your local environment as ai-platform_census_algorithms_data_test.csv and ai-platform_census_algorithms_data_train.csv, respectively.
Next, upload the training and testing data to your Cloud Storage bucket.
Navigate to the Browser page for your Cloud Storage bucket. Select your project from the Select a project drop-down list, or open it in a new tab.
Click the name of the bucket you want to use, or create a new bucket if you do not have one. (If you create a new bucket, make sure it is a regional bucket, and select the same region where you're running the AI Platform training job.)
(Optional) Click Create folder to create a folder for the files you upload. Enter a name for the folder (for example, "data") and click Create. Then, navigate to the new folder by clicking the folder name.
Click Upload files to upload both the training and testing files, ai-platform_census_algorithms_data_train.csv and ai-platform_census_algorithms_data_test.csv, to your bucket.
Now that the data is copied to your bucket, you can start a training job by selecting the type of algorithm you want to use.
Select your algorithm
Go to the AI Platform Jobs page in the Google Cloud Console.
Click the New training job button. From the options that display below, click Built-in model training. The Create a new training job page displays.
The training job creation is divided into four steps. The first step is Training algorithm. Select Built-in XGBoost and click Next.
gcloud
Set up environment variables for your project ID, your Cloud Storage bucket, the Cloud Storage path to the training data, and your algorithm selection.
AI Platform built-in algorithms are in Docker containers hosted in Container Registry.
PROJECT_ID="[YOUR-PROJECT-ID]"
BUCKET_NAME="[YOUR-BUCKET-NAME]"
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Copy the training data into your Cloud Storage bucket, and set the path
# to your copy of the training data.
TRAINING_DATA_SOURCE="gs://cloud-samples-data/ai-platform/census/algorithms/data/train.csv"
TRAINING_DATA_PATH="gs://$BUCKET_NAME/algorithms-demo/data/train.csv"
gsutil cp $TRAINING_DATA_SOURCE $TRAINING_DATA_PATH
# Specify the Docker container URI specific to the algorithm.
IMAGE_URI="gcr.io/cloud-ml-algos/boosted_trees:latest"
Submit a training job
To submit a job, you must specify some basic training arguments and some basic arguments related to the XGBoost algorithm.
General arguments for the training job:
Training job arguments

Argument | Description
---|---
job-id | Unique ID for your training job. You can use this to find logs for the status of your training job after you submit it.
job-dir | Cloud Storage path where AI Platform saves training files after completing a successful training job.
scale-tier | Specifies machine types for training. Use BASIC to select a configuration of just one machine.
master-image-uri | Container Registry URI used to specify which Docker container to use for the training job. Use the container for the built-in XGBoost algorithm defined earlier as IMAGE_URI.
region | Specify the available region in which to run your training job. For this tutorial, you can use the region us-central1.
Arguments specific to the built-in XGBoost algorithm:
Algorithm arguments

Argument | Description
---|---
preprocess | Boolean argument indicating whether AI Platform should preprocess the data.
objective | Indicates the learning task and its corresponding learning objective. In this example, binary:logistic.
training_data_path | Cloud Storage location of the training data, which must be a CSV file.
For a detailed list of all other XGBoost algorithm flags, refer to the built-in XGBoost reference.
Console
Leave Enable automatic data preprocessing checked.
For Training data path, click Browse. In the right panel, click the name of the bucket where you uploaded the training data, and navigate to your ai-platform_census_algorithms_data_train.csv file.

Leave the fields for Validation data and Test data at their default settings.
In Output directory, enter the path to your Cloud Storage bucket where you want AI Platform to store the outputs from your training job. You can fill in your Cloud Storage bucket path directly, or click the Browse button to select it.
To keep things organized, create a new directory within your Cloud Storage bucket for this training job. You can do this within the Browse pane.
Click Next.
For Objective, select "binary:logistic", which indicates a binary learning task and an objective of logistic regression.
For Model type, select Classification.
Leave all other fields at their default settings, and click Next.
On the Job settings page:
- Enter a unique Job ID (such as "xgboost_example").
- Enter an available region (such as "us-central1").
- Select "BASIC" for the scale tier.
Click Done to submit the training job.
gcloud
Set up all the arguments for the training job and the algorithm before using gcloud to submit the job:

DATASET_NAME="census"
ALGORITHM="xgboost"
MODEL_TYPE="classification"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_${MODEL_TYPE}"

# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"

# Make sure you have access to this Cloud Storage bucket.
JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
  --master-image-uri=$IMAGE_URI --scale-tier=BASIC --job-dir=$JOB_DIR \
  -- \
  --preprocess --objective=binary:logistic \
  --training_data_path=$TRAINING_DATA_PATH
After the job is submitted successfully, you can view the logs using the following gcloud commands:

gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Understand your job directory
After the successful completion of a training job, AI Platform creates a trained model in your Cloud Storage bucket, along with some other artifacts. You can find the following directory structure within your JOB_DIR:
- model/
- model.pkl
- deployment_config.yaml
- artifacts/
- instance_generator.py
- metadata.json
- processed_data/
- training.csv
- validation.csv
- test.csv
Confirm that the directory structure in your JOB_DIR matches:
gsutil ls -a $JOB_DIR/*
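If you want to inspect the trained model itself, you can copy model.pkl to your local environment and unpickle it. This is an optional sketch; it assumes your local Python environment has XGBoost and scikit-learn versions compatible with the training runtime, otherwise unpickling may fail:

# Optional: copy the trained model locally and confirm it loads.
gsutil cp $JOB_DIR/model/model.pkl .
python -c "
import pickle

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Print the model's class to verify the artifact deserialized correctly.
print(type(model))
"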
Deploy the trained model
AI Platform organizes your trained models using model and version resources. An AI Platform model is a container for the versions of your machine learning model.
To deploy a model, you create a model resource in AI Platform, create a version of that model, then use the model and version to request online predictions.
For more information on how to deploy models to AI Platform, see how to deploy a scikit-learn or XGBoost model.
Console
On the Jobs page, you can find a list of all your training jobs. Click the name of the training job you just submitted ("xgboost_example" or the job name you used).
On the Job details page, you can view the general progress of your job, or click View logs for a more detailed view of its progress.
When the job is successful, the Deploy model button appears at the top. Click Deploy model.
Select "Deploy as new model", and enter a model name, such as "xgboost_model". Next, click Confirm.
On the Create version page, enter a version name, such as "v1", and leave all other fields at their default settings. Click Save.
gcloud
The training process with the built-in XGBoost algorithm produces a file, deployment_config.yaml, that makes it easier to deploy your model on AI Platform for predictions.

Copy the file to your local directory and view its contents:

gsutil cp $JOB_DIR/model/deployment_config.yaml .
cat deployment_config.yaml
Your deployment_config.yaml file should appear similar to the following:

deploymentUri: gs://BUCKET_NAME/algorithms_training/census_xgboost_classification/20190227060114/model
framework: XGBOOST
labels:
  job_id: census_xgboost_classification_20190227060114
  error_percentage: '14'
runtimeVersion: '1.12'
Create the model and version in AI Platform:
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_${MODEL_TYPE}"
gcloud ai-platform models create $MODEL_NAME --regions $REGION

# Create a version using the deployment config file generated during training.
VERSION_NAME="v_${DATE}"
gcloud ai-platform versions create $VERSION_NAME \
  --model $MODEL_NAME \
  --config deployment_config.yaml
The version takes a few minutes to create.
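To check on its progress, you can describe the version; the state field changes to READY when deployment finishes:

gcloud ai-platform versions describe $VERSION_NAME \
  --model $MODEL_NAME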
Get online predictions
When requesting predictions, you need to make sure that your input data is formatted the same way as the training data. Before training, AI Platform preprocesses your data by transforming it into the corpus shown in metadata.json.

You can use instance_generator.py to apply the same preprocessing transformations to your input instances that AI Platform applies to your training data. This file reads the mapping information stored in the metadata.json file. You can also use the function transform_string_instance in the module to transform your raw string to a format that the model accepts.
Download the training artifact files, and review metadata.json:

gsutil cp $JOB_DIR/artifacts/* .

# Let's look at the metadata.json file
head metadata.json
Use instance_generator.py to prepare the prediction input for one data instance:

# ground truth is >50K
RAW_DATA_POINT="44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States"

# Now let's create a JSON prediction request
python instance_generator.py --raw_data_string="${RAW_DATA_POINT}" > sample_input.json

# Let's look at the prediction request file.
cat sample_input.json
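If you prefer to build the request programmatically, you can call transform_string_instance directly instead of going through the command line. The snippet below is only a rough sketch: the exact signature and return format of transform_string_instance are assumptions, so inspect the downloaded instance_generator.py before relying on it:

# Hypothetical sketch: transform_string_instance's signature and return
# value are assumed here; check instance_generator.py to confirm them.
python -c "
import json
import instance_generator

raw = '44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States'
print(json.dumps(instance_generator.transform_string_instance(raw)))
"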
Send the prediction request:
gcloud ai-platform predict \
  --model $MODEL_NAME \
  --version $VERSION_NAME \
  --json-instances sample_input.json
The resulting prediction should be a number over 0.5, which indicates that the individual most likely earns a salary greater than $50,000.
About the data
The Census Income Data Set that this sample uses for training is hosted by the UC Irvine Machine Learning Repository.
Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
What's next
- Learn more about using the built-in XGBoost algorithm.