BigQuery ML supports supervised learning with the logistic regression model type. You can use the binary logistic regression model type to predict whether a value falls into one of two categories, or you can use the multiclass logistic regression model type to predict whether a value falls into one of multiple categories. These are known as classification problems, because they attempt to classify data into two or more categories.
In this tutorial, you use a binary logistic regression model in BigQuery ML to predict the income range of respondents in the US Census Dataset. This dataset contains the demographic and income information of US residents from 2000 and 2010. The data includes employment status, job category, level of education, and income data.
Objectives
In this tutorial you will perform the following tasks:
- Create a logistic regression model.
- Evaluate the logistic regression model.
- Make predictions using the logistic regression model.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Introduction
A common problem in machine learning is to classify data into one of two types, known as labels. For example, a retailer may want to predict whether a given customer will purchase a new product, based on other information about that customer. In that case, the two labels might be "will buy" and "won't buy." You can construct your dataset such that one column represents the label. The data you can use to train such a binary logistic regression model includes the customer's location, their previous purchases, their reported preferences, and so on.
In this tutorial, you use BigQuery ML to create a binary logistic regression model that predicts whether a US Census respondent's income falls into one of two ranges based on the respondent's demographic attributes.
Creating your logistic regression model consists of the following steps:
- Step one: Create a BigQuery dataset to store your model.
- Step two: Examine your data and identify which columns to use as training data for your logistic regression model.
- Step three: Select your training data by running a query against the census_adult_income table. This query identifies the relevant features and stores them in a view for later queries to use as input data.
- Step four: Create a logistic regression model using the CREATE MODEL statement.
- Step five: Use the ML.EVALUATE function to provide statistics about model performance.
- Step six: Use the ML.PREDICT function to predict the income bracket for a given set of census participants.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter census.
- For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.
- Leave the remaining default settings as they are, and click Create dataset.
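If you prefer to create the dataset with SQL instead of the console, you can run the equivalent DDL statement in the query editor. A minimal sketch, assuming the same census dataset ID and US location chosen above:

-- Create the census dataset in the US multi-region.
CREATE SCHEMA IF NOT EXISTS `census`
OPTIONS (
  location = 'US'
);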
Step two: Examine your data
The next step is to examine the dataset and identify which columns to use as training data for your logistic regression model. You can use a GoogleSQL query to return rows from the dataset.
The following query returns 100 rows from the US Census Dataset:
SELECT
  age,
  workclass,
  marital_status,
  education_num,
  occupation,
  hours_per_week,
  income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
LIMIT
  100;
Run the query
To run the query that returns rows from your dataset:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area:
SELECT
  age,
  workclass,
  marital_status,
  education_num,
  occupation,
  hours_per_week,
  income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
LIMIT
  100;
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
The query results show that the income_bracket column in the census_adult_income table has only one of two values: <=50K or >50K. The results also show that the columns education and education_num in the census_adult_income table express the same data in different formats. The functional_weight column is the number of individuals that the census organizations believe a particular row represents; the values of this column appear unrelated to the value of income_bracket for a particular row.
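If you want to confirm that income_bracket really takes only two values before you commit to a binary model, a quick check against the public table:

-- List every distinct label value in the source data.
SELECT DISTINCT
  income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`;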
Step three: Select your training data
Next, you select the data used to train your logistic regression model. In this tutorial, you predict census respondent income based on the following attributes:
- Age
- Type of work performed
- Marital status
- Level of education
- Occupation
- Hours worked per week
The following query creates a view that compiles your training data. This view is included in your CREATE MODEL statement later in this tutorial.
CREATE OR REPLACE VIEW
  `census.input_view` AS
SELECT
  age,
  workclass,
  marital_status,
  education_num,
  occupation,
  hours_per_week,
  income_bracket,
  CASE
    WHEN MOD(functional_weight, 10) < 8 THEN 'training'
    WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
    WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
  END AS dataframe
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
Query details
This query extracts data on census respondents, including education_num, which represents the respondent's level of education, and workclass, which represents the type of work the respondent performs. The query excludes several columns that duplicate data: for example, the columns education and education_num in the census_adult_income table express the same data in different formats, so this query excludes the education column. The dataframe column uses the excluded functional_weight column to label 80% of the data source for training, and reserves the remaining data for evaluation and prediction. The query creates a view containing these columns, so that you can use them to perform training and prediction later.
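Because MOD(functional_weight, 10) buckets each row by the last digit of functional_weight, roughly 80% of rows are labeled 'training' and about 10% each 'evaluation' and 'prediction'. To verify the split after you create the view, a quick sanity-check query:

-- Count how many rows fall into each split label.
SELECT
  dataframe,
  COUNT(*) AS row_count
FROM
  `census.input_view`
GROUP BY
  dataframe;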
Run the query
To run the query that compiles the training data for your model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area:
CREATE OR REPLACE VIEW
  `census.input_view` AS
SELECT
  age,
  workclass,
  marital_status,
  education_num,
  occupation,
  hours_per_week,
  income_bracket,
  CASE
    WHEN MOD(functional_weight, 10) < 8 THEN 'training'
    WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
    WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
  END AS dataframe
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
Click Run.
In the navigation panel, in the Resources section, click your project name. Your view should appear beneath it.
Click your view. The schema for the view appears in the Schema tab under the query editor.
Step four: Create a logistic regression model
Now that you have examined your training data, the next step is to create a logistic regression model using the data.
You can create and train a logistic regression model using the CREATE MODEL statement with the option 'LOGISTIC_REG'. The following query uses a CREATE MODEL statement to train a new binary logistic regression model on the view from the previous query.
CREATE OR REPLACE MODEL
  `census.census_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  data_split_method = 'NO_SPLIT',
  input_label_cols = ['income_bracket'],
  max_iterations = 15
) AS
SELECT
  * EXCEPT(dataframe)
FROM
  `census.input_view`
WHERE
  dataframe = 'training'
Query details
The CREATE MODEL statement trains a model using the training data in the SELECT statement.

The OPTIONS clause specifies the model type and training options. Here, the LOGISTIC_REG option specifies a logistic regression model type. It is not necessary to specify whether the model is binary or multiclass: BigQuery ML determines which to train based on the number of unique values in the label column. The data_split_method = 'NO_SPLIT' option uses the whole input dataset for training, because the view's dataframe column already reserves separate rows for evaluation and prediction. The max_iterations = 15 option limits training to at most 15 iterations.
The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is income_bracket, so the model will learn which of the two values of income_bracket is most likely based on the other values present in each row.
The auto_class_weights = TRUE option balances the class labels in the training data. By default, the training data is unweighted. If the labels in the training data are imbalanced, the model may learn to favor the most common label. In this case, most of the respondents in the dataset are in the lower income bracket, which may lead to a model that predicts the lower income bracket too heavily. Class weights balance the class labels by weighting each class in inverse proportion to its frequency.
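To see how imbalanced the labels actually are, you can count the respondents in each bracket. A quick look at the raw distribution:

-- Count respondents per income bracket in the source data.
SELECT
  income_bracket,
  COUNT(*) AS respondent_count
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY
  income_bracket;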
The SELECT statement queries the view from Step three. This view contains only the columns containing feature data for training the model. The WHERE clause filters the rows in input_view so that only those rows belonging to the training dataframe are included in the training data.
Run the CREATE MODEL query
To run the query that creates your logistic regression model, complete the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area:
CREATE OR REPLACE MODEL
  `census.census_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  data_split_method = 'NO_SPLIT',
  input_label_cols = ['income_bracket'],
  max_iterations = 15
) AS
SELECT
  * EXCEPT(dataframe)
FROM
  `census.input_view`
WHERE
  dataframe = 'training'
Click Run.
In the navigation panel, in the Resources section, expand [PROJECT_ID] > census and then click census_model.
Click the Schema tab. The model schema lists the attributes that BigQuery ML used to perform logistic regression. The schema should look like the following:
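If you also want to see how training progressed, BigQuery ML exposes per-iteration statistics, such as the training loss, through the ML.TRAINING_INFO function. A minimal sketch:

-- Return one row of statistics per training iteration.
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `census.census_model`)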
Step five: Use the ML.EVALUATE function to evaluate your model
After creating your model, evaluate the performance of the model using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data.
The query to evaluate the model is as follows:
SELECT
  *
FROM
  ML.EVALUATE(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'
    )
  )
Query details
The ML.EVALUATE function takes the model trained in Step four and evaluation data returned by a SELECT subquery. The function returns a single row of statistics about the model. This query uses data from input_view as evaluation data. The WHERE clause filters the input data so that the subquery contains only rows in the evaluation dataframe.
Run the ML.EVALUATE query
To run the ML.EVALUATE query that evaluates the model, follow these steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area:
SELECT
  *
FROM
  ML.EVALUATE(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'
    )
  )
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Because you performed a logistic regression, the results include the following columns:
precision
recall
accuracy
f1_score
log_loss
roc_auc
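These metrics are computed at the default classification threshold of 0.5. For binary classification models, ML.EVALUATE also accepts an optional threshold argument, so you can see how precision and recall trade off at a different cutoff. A sketch with an illustrative threshold of 0.55:

-- Evaluate at a non-default classification threshold.
SELECT
  precision,
  recall,
  accuracy
FROM
  ML.EVALUATE(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'),
    STRUCT(0.55 AS threshold))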
You can also call ML.EVALUATE without providing input data. In that case, ML.EVALUATE retrieves the evaluation metrics calculated during training, which use the automatically reserved evaluation dataset. However, in this CREATE MODEL query, NO_SPLIT is specified for the data_split_method training option, so the whole input dataset is used for both training and evaluation. Calling ML.EVALUATE without input data therefore retrieves the evaluation metrics on the training dataset. This evaluation is less effective than an evaluation run on a dataset that was kept separate from the model training data.
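Such a call simply omits the input data argument:

-- Retrieve the metrics computed during training.
SELECT
  *
FROM
  ML.EVALUATE(MODEL `census.census_model`)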
You can also use the Google Cloud console to view the evaluation metrics calculated during training. The results should look like the following:
Step six: Use the ML.PREDICT function to predict income bracket
To identify the income bracket to which a particular respondent belongs, use the ML.PREDICT function. The following query predicts the income bracket of every respondent in the prediction dataframe.
SELECT
  *
FROM
  ML.PREDICT(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'prediction'
    )
  )
Query details
The ML.PREDICT function predicts results using your model and the data from input_view, filtered to include only rows in the 'prediction' dataframe. The top-most SELECT statement retrieves the output of the ML.PREDICT function.
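Because the top-most SELECT controls which output columns you keep, you can narrow the result to just the predicted and actual labels for a quick side-by-side comparison. A minimal sketch:

-- Compare the predicted label with the actual label.
SELECT
  predicted_income_bracket,
  income_bracket
FROM
  ML.PREDICT(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'prediction'))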
Run the ML.PREDICT query
To run the ML.PREDICT query, follow these steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
SELECT
  *
FROM
  ML.PREDICT(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'prediction'
    )
  )
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
predicted_income_bracket is the predicted value of income_bracket.
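You can also score a single hypothetical respondent by passing literal feature values instead of the view. A sketch; the values below are illustrative, and the category spellings are assumed to match those in the census_adult_income table:

SELECT
  predicted_income_bracket
FROM
  ML.PREDICT(MODEL `census.census_model`,
    (
    SELECT
      -- Illustrative literal values; category spellings assumed from the source table.
      42 AS age,
      'Private' AS workclass,
      'Never-married' AS marital_status,
      13 AS education_num,
      'Sales' AS occupation,
      40 AS hours_per_week))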
Step seven: Explain prediction results with explainable AI methods
To understand why your model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function. ML.EXPLAIN_PREDICT is an extended version of ML.PREDICT: it not only outputs prediction results, but also outputs additional columns that explain those results. In practice, you only need to run ML.EXPLAIN_PREDICT and can skip running ML.PREDICT. For a more in-depth explanation of Shapley values and the explainable AI offerings in BigQuery ML, see the BigQuery ML explainable AI overview.
The query used to generate explanations is as follows:
#standardSQL
SELECT
  *
FROM
  ML.EXPLAIN_PREDICT(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'),
    STRUCT(3 AS top_k_features))
Run the ML.EXPLAIN_PREDICT query
To run the ML.EXPLAIN_PREDICT query that explains the model, follow these steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor.
#standardSQL
SELECT
  *
FROM
  ML.EXPLAIN_PREDICT(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'),
    STRUCT(3 AS top_k_features))
Click Run.
When the query is complete, click the Results tab below the query editor. The results should look like the following:
For logistic regression models, Shapley values are used to generate feature attribution values for each feature in the model. ML.EXPLAIN_PREDICT outputs the top three feature attributions per row of the table provided, because top_k_features was set to 3 in the query. These attributions are sorted by the absolute value of the attribution in descending order. In row 1 of this example, the feature hours_per_week contributed the most to the overall prediction, but in row 2, occupation contributed the most. For detailed explanations of the output columns of the ML.EXPLAIN_PREDICT query, see the ML.EXPLAIN_PREDICT syntax documentation.
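The per-row attributions are returned in a repeated top_feature_attributions column (described in that syntax documentation), so you can flatten them if you prefer one attribution per output row. A sketch, assuming that column name:

-- Flatten the repeated attributions into one row per feature.
SELECT
  predicted_income_bracket,
  f.feature,
  f.attribution
FROM
  ML.EXPLAIN_PREDICT(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'),
    STRUCT(3 AS top_k_features)),
  UNNEST(top_feature_attributions) AS f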
Step eight (optional): Globally explain your model
To know which features are the most important for determining the income bracket in general, you can use the ML.GLOBAL_EXPLAIN function. In order to use ML.GLOBAL_EXPLAIN, the model must be retrained with the option ENABLE_GLOBAL_EXPLAIN=TRUE. Rerun the training query with this option:
CREATE OR REPLACE MODEL
  `census.census_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  enable_global_explain = TRUE,
  input_label_cols = ['income_bracket']
) AS
SELECT
  * EXCEPT(dataframe)
FROM
  `census.input_view`
WHERE
  dataframe = 'training'
Access global explanations through ML.GLOBAL_EXPLAIN
The query used to generate global explanations is as follows:
#standardSQL
SELECT
  *
FROM
  ML.GLOBAL_EXPLAIN(MODEL `census.census_model`)
Run the ML.GLOBAL_EXPLAIN query
To run the ML.GLOBAL_EXPLAIN query, follow these steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT
  *
FROM
  ML.GLOBAL_EXPLAIN(MODEL `census.census_model`)
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
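The output contains one row per feature with its global attribution. If you want the most influential features listed first, you can sort the output explicitly. A sketch, assuming feature and attribution output columns:

-- Rank features by global importance.
SELECT
  feature,
  attribution
FROM
  ML.GLOBAL_EXPLAIN(MODEL `census.census_model`)
ORDER BY
  attribution DESC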
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the census dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset and the model.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (census) and then click Delete.
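If you prefer SQL, you can remove the dataset and everything in it with a single DDL statement instead; CASCADE deletes the contained tables, views, and models along with the dataset:

-- Delete the census dataset and all of its contents.
DROP SCHEMA IF EXISTS `census` CASCADE;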
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- For information on creating models, see the CREATE MODEL syntax page.