Build and use a classification model on census data


BigQuery ML supports supervised learning with the logistic regression model type. You can use the binary logistic regression model type to predict whether a value falls into one of two categories, or you can use the multiclass logistic regression model type to predict whether a value falls into one of multiple categories. These are known as classification problems, because they attempt to classify data into two or more categories.

In this tutorial, you use a binary logistic regression model in BigQuery ML to predict the income range of respondents in the US Census Dataset. This dataset contains the demographic and income information of US residents from 2000 and 2010. The data includes employment status, job category, level of education, and income data.

Objectives

In this tutorial, you perform the following tasks:

  • Create a logistic regression model.
  • Evaluate the logistic regression model.
  • Make predictions using the logistic regression model.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.

    Enable the API

Introduction

A common problem in machine learning is to classify data into one of two types, known as labels. For example, a retailer may want to predict whether a given customer will purchase a new product, based on other information about that customer. In that case, the two labels might be "will buy" and "won't buy." You can construct your dataset such that one column represents the label. The data you can use to train such a binary logistic regression model include the customer's location, their previous purchases, the customer's reported preferences, and so on.

In this tutorial, you use BigQuery ML to create a binary logistic regression model that predicts whether a US Census respondent's income falls into one of two ranges based on the respondent's demographic attributes.

Creating your logistic regression model consists of the following steps:

  • Step one: Create a dataset to store your model.
    The first step is to create a BigQuery dataset to store your model.
  • Step two: Examine your data.
    In this step, examine the dataset and identify which columns to use as training data for your logistic regression model.
  • Step three: Select your training data.
    The next step is to prepare the data you use to train your binary logistic regression model by running a query against the census_adult_income table. This step identifies the relevant features and stores them in a view for later queries to use as input data.
  • Step four: Create a logistic regression model.
    In this step, use the CREATE MODEL statement to create your logistic regression model.
  • Step five: Use the ML.EVALUATE function to evaluate your model.
    Then, use the ML.EVALUATE function to provide statistics about model performance.
  • Step six: Use the ML.PREDICT function to predict a participant's income.
    Finally, you use the ML.PREDICT function to predict the income bracket for a given set of census participants.

Step one: Create your dataset

Create a BigQuery dataset to store your ML model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.


  4. On the Create dataset page, do the following:

    • For Dataset ID, enter census.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

      Create dataset page.

Step two: Examine your data

The next step is to examine the dataset and identify which columns to use as training data for your logistic regression model. You can use a GoogleSQL query to return rows from the dataset.

The following query returns 100 rows from the US Census Dataset:

SELECT
  age,
  workclass,
  marital_status,
  education_num,
  occupation,
  hours_per_week,
  income_bracket
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
LIMIT
  100;

Run the query

To run the query that returns rows from your dataset:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area:

    SELECT
      age,
      workclass,
      marital_status,
      education_num,
      occupation,
      hours_per_week,
      income_bracket
    FROM
      `bigquery-public-data.ml_datasets.census_adult_income`
    LIMIT
      100;
    
  3. Click Run.

  4. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    Census Data

    The query results show that the income_bracket column in the census_adult_income table has only one of two values: <=50K or >50K. It also shows that the columns education and education_num in the census_adult_income table express the same data in different formats. The functional_weight column is the number of individuals that the Census Organizations believes a particular row represents; the values of this column appear unrelated to the value of income_bracket for a particular row.

Step three: Select your training data

Next, you select the data used to train your logistic regression model. In this tutorial, you predict census respondent income based on the following attributes:

  • Age
  • Type of work performed
  • Marital status
  • Level of education
  • Occupation
  • Hours worked per week

The following query creates a view that compiles your training data. This view is included in your CREATE MODEL statement later in this tutorial.

CREATE OR REPLACE VIEW
  `census.input_view` AS
SELECT
  age,
  workclass,
  marital_status,
  education_num,
  occupation,
  hours_per_week,
  income_bracket,
  CASE
    WHEN MOD(functional_weight, 10) < 8 THEN 'training'
    WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
    WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
  END AS dataframe
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`

Query details

This query extracts data on census respondents, including education_num, which represents the respondent's level of education, and workclass, which represents the type of work the respondent performs. This query excludes several categories that duplicate data: for example, the columns education and education_num in the census_adult_income table express the same data in different formats, so this query excludes the education column. The dataframe column uses the excluded functional_weight column to label 80% of the data source for training, and reserves the remaining data for evaluation and prediction. The query creates a view containing these columns, so that you can use them to perform training and prediction later.
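As a sketch of the split logic above: the CASE expression routes each row by the last digit of functional_weight, so roughly 80% of rows land in 'training'. The Python below is illustrative only; the uniform range of weights is a stand-in, since real functional_weight values aren't uniformly distributed, which makes the real split only approximately 80/10/10.

```python
# Mirror the CASE expression from the view: MOD(functional_weight, 10)
# sends last digits 0-7 to 'training', 8 to 'evaluation', 9 to 'prediction'.
def assign_dataframe(functional_weight):
    last_digit = functional_weight % 10
    if last_digit < 8:
        return "training"
    elif last_digit == 8:
        return "evaluation"
    else:
        return "prediction"

# Stand-in for functional_weight values (real values aren't uniform).
labels = [assign_dataframe(w) for w in range(1000)]
print(labels.count("training") / len(labels))    # 0.8
print(labels.count("evaluation") / len(labels))  # 0.1
print(labels.count("prediction") / len(labels))  # 0.1
```

Because the split is keyed on a column value rather than randomness, rerunning the view produces the same partition every time.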

Run the query

To run the query that compiles the training data for your model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area:

    CREATE OR REPLACE VIEW
      `census.input_view` AS
    SELECT
      age,
      workclass,
      marital_status,
      education_num,
      occupation,
      hours_per_week,
      income_bracket,
      CASE
        WHEN MOD(functional_weight, 10) < 8 THEN 'training'
        WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
        WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
      END AS dataframe
    FROM
      `bigquery-public-data.ml_datasets.census_adult_income`
    
  3. Click Run.

  4. In the navigation panel, in the Resources section, click your project name. Your view should appear beneath it.

  5. Click your view. The schema for the view appears in the Schema tab under the query editor.

    Query results

Step four: Create a logistic regression model

Now that you have examined your training data, the next step is to create a logistic regression model using the data.

You can create and train a logistic regression model using the CREATE MODEL statement with the option 'LOGISTIC_REG'. The following query uses a CREATE MODEL statement to train a new binary logistic regression model on the view from the previous query.

CREATE OR REPLACE MODEL
  `census.census_model`
OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    data_split_method='NO_SPLIT',
    input_label_cols=['income_bracket'],
    max_iterations=15
  ) AS
SELECT
  * EXCEPT(dataframe)
FROM
  `census.input_view`
WHERE
  dataframe = 'training'

Query details

The CREATE MODEL statement trains a model using the training data in the SELECT statement.

The OPTIONS clause specifies the model type and training options. Here, the LOGISTIC_REG option specifies a logistic regression model type. You don't need to specify whether the model is binary or multiclass: BigQuery ML determines which to train based on the number of unique values in the label column.

The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is income_bracket, so the model will learn which of the two values of income_bracket is most likely based on the other values present in each row.

The 'auto_class_weights=TRUE' option balances the class labels in the training data. By default, the training data is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily. In this case, most of the respondents in the dataset are in the lower income bracket. This may lead to a model that predicts the lower income bracket too heavily. Class weights balance the class labels by calculating the weights for each class in inverse proportion to the frequency of that class.
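Conceptually, inverse-proportional class weighting looks like the following sketch. The label counts and the exact weighting formula here are illustrative assumptions; BigQuery ML's internal normalization may differ.

```python
from collections import Counter

# Toy label distribution (not drawn from the census table): the lower
# bracket dominates 75/25, so unweighted training would favor it.
labels = ["<=50K"] * 75 + [">50K"] * 25

counts = Counter(labels)
total = len(labels)
n_classes = len(counts)

# A common inverse-frequency formula: total / (n_classes * class_count).
# Rarer classes receive larger weights.
weights = {cls: total / (n_classes * c) for cls, c in counts.items()}
print(weights)  # the minority class '>50K' gets weight 2.0
```

With these weights, each misclassified '>50K' row costs three times as much during training as a misclassified '<=50K' row, counteracting the imbalance.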

The SELECT statement queries the view from Step three. This view contains only the columns containing feature data for training the model. The WHERE clause filters the rows in input_view so that only those rows belonging to the training dataframe are included in the training data.

Run the CREATE MODEL query

To run the query that creates your logistic regression model, complete the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area:

    CREATE OR REPLACE MODEL
      `census.census_model`
    OPTIONS
      ( model_type='LOGISTIC_REG',
        auto_class_weights=TRUE,
        data_split_method='NO_SPLIT',
        input_label_cols=['income_bracket'],
        max_iterations=15) AS
    SELECT
      * EXCEPT(dataframe)
    FROM
      `census.input_view`
    WHERE
      dataframe = 'training'

  3. Click Run.

  4. In the navigation panel, in the Resources section, expand [PROJECT_ID] > census and then click census_model.

  5. Click the Schema tab. The model schema lists the attributes that BigQuery ML used to perform logistic regression. The schema should look like the following:

    Model schema info

Step five: Use the ML.EVALUATE function to evaluate your model

After creating your model, evaluate the performance of the model using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data.

The query to evaluate the model is as follows:

SELECT
  *
FROM
  ML.EVALUATE (MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'
    )
  )

Query details

The ML.EVALUATE function takes the model trained in Step 4 and evaluation data returned by a SELECT subquery. The function returns a single row of statistics about the model. This query uses data from input_view as evaluation data. The WHERE clause filters the input data so that the subquery contains only rows in the evaluation dataframe.

Run the ML.EVALUATE query

To run the ML.EVALUATE query that evaluates the model, follow these steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area:

    SELECT
      *
    FROM
      ML.EVALUATE (MODEL `census.census_model`,
        (
        SELECT
          *
        FROM
          `census.input_view`
        WHERE
          dataframe = 'evaluation'
        )
      )
    
  3. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings

  4. Click Run.

  5. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.EVALUATE output

    Because you performed a logistic regression, the results include the following columns:

    • precision
    • recall
    • accuracy
    • f1_score
    • log_loss
    • roc_auc

You can also call ML.EVALUATE without providing input data. In that case, ML.EVALUATE retrieves the evaluation metrics calculated during training, which use the automatically reserved evaluation dataset. However, because this tutorial's CREATE MODEL query specifies NO_SPLIT for the data_split_method training option, the whole input dataset is used for both training and evaluation. Calling ML.EVALUATE without input data therefore returns metrics computed on the training data, which is less informative than an evaluation run on data that was kept separate from the training data.

You can also use Google Cloud console to view the evaluation metrics calculated during the training. The results should look like the following:

ML.EVALUATE output
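As a reference for how these metrics are defined, the following pure-Python sketch computes precision, recall, accuracy, f1_score, and log_loss for toy labels and probabilities. The values are made up, not output from the census model, and roc_auc is omitted for brevity.

```python
import math

# Toy ground-truth labels (1 = '>50K') and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.6, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # threshold at 0.5

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
f1_score = 2 * precision * recall / (precision + recall)
# Average negative log-likelihood of the true labels.
log_loss = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)
print(precision, recall, accuracy, f1_score, round(log_loss, 3))
```

Note that precision, recall, accuracy, and f1_score depend on the 0.5 threshold, while log_loss (and roc_auc) are computed from the raw probabilities.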

Step six: Use the ML.PREDICT function to predict income bracket

To identify the income bracket to which a particular respondent belongs, use the ML.PREDICT function. The following query predicts the income bracket of every respondent in the prediction dataframe.

SELECT
  *
FROM
  ML.PREDICT (MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'prediction'
     )
  )

Query details

The ML.PREDICT function predicts results using your model and the data from input_view, filtered to include only rows in the 'prediction' dataframe. The top-most SELECT statement retrieves the output of the ML.PREDICT function.

Run the ML.PREDICT query

To run the ML.PREDICT query, follow these steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    SELECT
      *
    FROM
      ML.PREDICT (MODEL `census.census_model`,
        (
        SELECT
          *
        FROM
          `census.input_view`
        WHERE
          dataframe = 'prediction'
        )
      )

  3. Click Run.

  4. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.PREDICT results

    predicted_income_bracket is the predicted value of income_bracket.
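Under the hood, a binary logistic regression model produces predicted_income_bracket by passing a weighted sum of the features through a sigmoid and thresholding the resulting probability. The sketch below is a hypothetical illustration; the weights, bias, and feature subset are invented, not the coefficients BigQuery ML learned.

```python
import math

# Hypothetical learned coefficients for a few numeric features.
weights = {"age": 0.02, "education_num": 0.3, "hours_per_week": 0.04}
bias = -6.0

def predict(features):
    # Linear score, then sigmoid -> P(income_bracket = '>50K').
    score = bias + sum(weights[k] * v for k, v in features.items())
    prob = 1.0 / (1.0 + math.exp(-score))
    label = ">50K" if prob >= 0.5 else "<=50K"
    return label, prob

label, prob = predict({"age": 45, "education_num": 13, "hours_per_week": 50})
print(label, round(prob, 3))
```

The real model also one-hot encodes categorical features such as workclass and occupation before applying its coefficients, so the true feature vector is much wider than this sketch.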

Step seven: Explain prediction results with explainable AI methods

To understand why your model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function.

ML.EXPLAIN_PREDICT is an extended version of ML.PREDICT: in addition to prediction results, it outputs additional columns that explain those results. In practice, you can therefore run ML.EXPLAIN_PREDICT instead of ML.PREDICT. For a more in-depth explanation of Shapley values and the explainable AI offerings in BigQuery ML, see the BigQuery ML explainable AI overview.

The query used to generate explanations is as follows:

#standardSQL
SELECT
  *
FROM
  ML.EXPLAIN_PREDICT(MODEL `census.census_model`,
    (
    SELECT
      *
    FROM
      `census.input_view`
    WHERE
      dataframe = 'evaluation'),
    STRUCT(3 as top_k_features))

Run the ML.EXPLAIN_PREDICT query

To run the ML.EXPLAIN_PREDICT query that explains the model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor.

    #standardSQL
    SELECT
    *
    FROM
    ML.EXPLAIN_PREDICT(MODEL `census.census_model`,
      (
      SELECT
        *
      FROM
        `census.input_view`
      WHERE
        dataframe = 'evaluation'),
      STRUCT(3 as top_k_features))
    
  3. Click Run.

  4. When the query is complete, click the Results tab below the query editor. The results should look like the following:

    ML.EXPLAIN_PREDICT output

For logistic regression models, Shapley values are used to generate feature attribution values for each feature in the model. Because top_k_features was set to 3 in the query, ML.EXPLAIN_PREDICT outputs the top 3 feature attributions per row, sorted by the absolute value of the attribution in descending order. In row 1 of this example, the feature hours_per_week contributed the most to the overall prediction, while in row 2, occupation contributed the most. For detailed explanations of the output columns of ML.EXPLAIN_PREDICT, see the ML.EXPLAIN_PREDICT syntax documentation.
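The top_k_features selection can be sketched as follows: keep the k attributions with the largest absolute value, sorted in descending order of magnitude. The attribution values below are invented for illustration.

```python
# Hypothetical per-row feature attributions (positive values push the
# prediction toward '>50K', negative values push toward '<=50K').
attributions = {
    "hours_per_week": 1.4,
    "occupation": -0.9,
    "age": 0.3,
    "education_num": -0.2,
    "marital_status": 0.1,
}

def top_k(attrs, k):
    # Sort by magnitude, descending, and keep the first k entries.
    return sorted(attrs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

print(top_k(attributions, 3))
# [('hours_per_week', 1.4), ('occupation', -0.9), ('age', 0.3)]
```

Sorting by absolute value means a strongly negative attribution like occupation's ranks above a weakly positive one, which is why sign and magnitude are reported separately.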

Step eight (optional): Globally explain your model

To see which features are most important for determining the income bracket in general, you can use the ML.GLOBAL_EXPLAIN function. To use ML.GLOBAL_EXPLAIN, you must retrain the model with the ENABLE_GLOBAL_EXPLAIN=TRUE option. Rerun the training query with this option:

CREATE OR REPLACE MODEL `census.census_model`
OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    enable_global_explain=TRUE,
    input_label_cols=['income_bracket']
  ) AS
SELECT
  * EXCEPT(dataframe)
FROM
  `census.input_view`
WHERE
  dataframe = 'training'

Access global explanations through ML.GLOBAL_EXPLAIN

The query used to generate global explanations is as follows:

#standardSQL
SELECT
  *
FROM
  ML.GLOBAL_EXPLAIN(MODEL `census.census_model`)

Run the ML.GLOBAL_EXPLAIN query

To run the ML.GLOBAL_EXPLAIN query:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
      *
    FROM
      ML.GLOBAL_EXPLAIN(MODEL `census.census_model`)

  3. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings

  4. Click Run.

  5. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.GLOBAL_EXPLAIN output
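A common way to aggregate per-row attributions into the global importances that ML.GLOBAL_EXPLAIN reports is to average the absolute attribution of each feature across rows, as in this sketch. The per-row values are invented, and BigQuery ML's exact aggregation may differ.

```python
# Hypothetical per-row attributions for two predictions.
rows = [
    {"age": 0.4, "education_num": -1.0, "hours_per_week": 0.2},
    {"age": -0.2, "education_num": 0.6, "hours_per_week": 0.8},
]

# Global importance: mean absolute attribution per feature.
# Taking absolute values first keeps opposite-signed contributions
# from canceling out across rows.
features = rows[0].keys()
global_importance = {
    f: sum(abs(r[f]) for r in rows) / len(rows) for f in features
}
print(global_importance)
```

In this toy example education_num comes out as the most important feature overall, even though its attributions point in different directions for the two rows.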

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the census dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset and the model.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (census) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next