Build and use a classification model on census data


In this tutorial, you use a binary logistic regression model in BigQuery ML to predict the income range of individuals based on their demographic data. A binary logistic regression model predicts whether a value falls into one of two categories, in this case whether an individual's annual income falls above or below $50,000.

This tutorial uses the bigquery-public-data.ml_datasets.census_adult_income dataset. This dataset contains the demographic and income information of US residents from 2000 and 2010.

Objectives

In this tutorial you will perform the following tasks:

  • Create a logistic regression model.
  • Evaluate the model.
  • Make predictions by using the model.
  • Explain the results produced by the model.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery API.

    Enable the API

Required permissions

To create the model using BigQuery ML, you need the following IAM permissions:

  • bigquery.jobs.create
  • bigquery.models.create
  • bigquery.models.getData
  • bigquery.models.updateData
  • bigquery.models.updateMetadata

To run inference, you need the following permissions:

  • bigquery.models.getData on the model
  • bigquery.jobs.create

Introduction

A common task in machine learning is to classify data into one of two types, known as labels. For example, a retailer might want to predict whether a given customer will purchase a new product, based on other information about that customer. In that case, the two labels might be will buy and won't buy. The retailer can construct a dataset such that one column represents both labels, and also contains customer information such as the customer's location, their previous purchases, and their reported preferences. The retailer can then use a binary logistic regression model that uses this customer information to predict which label best represents each customer.

In this tutorial, you create a binary logistic regression model that predicts whether a US Census respondent's income falls into one of two ranges based on the respondent's demographic attributes.

Create a dataset

Create a BigQuery dataset to store your model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

    Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter census.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

Examine the data

Examine the dataset and identify which columns to use as training data for the logistic regression model. Run a GoogleSQL query to return 100 rows from the census_adult_income table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    SELECT
      age,
      workclass,
      marital_status,
      education_num,
      occupation,
      hours_per_week,
      income_bracket,
      functional_weight
    FROM
      `bigquery-public-data.ml_datasets.census_adult_income`
    LIMIT
      100;
    
  3. The results should look similar to the following:

    Census Data

The query results show that the income_bracket column in the census_adult_income table has only one of two values: <=50K or >50K. The functional_weight column is the number of individuals that the census organization believes a particular row represents. The values of this column appear unrelated to the value of income_bracket for a particular row.

Prepare the sample data

In this tutorial, you predict census respondent income based on the following attributes:

  • Age
  • Type of work performed
  • Marital status
  • Level of education
  • Occupation
  • Hours worked per week

To do this, you create a view to contain the data to use to train and evaluate the model, and also to make predictions. This view is used by the CREATE MODEL statement later in this tutorial.

The query that creates the view extracts data on census respondents, including education_num, which represents the respondent's level of education, and workclass, which represents the type of work the respondent performs. This query excludes several columns that duplicate data: for example, the columns education and education_num in the census_adult_income table express the same data in different formats, so this query excludes the education column. A new dataframe column is created that uses the census_adult_income table's functional_weight column to label 80% of the data source for training the model, and reserves the remaining data for evaluation and prediction.

Run the query that prepares the sample data:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    CREATE OR REPLACE VIEW
      `census.input_data` AS
    SELECT
      age,
      workclass,
      marital_status,
      education_num,
      occupation,
      hours_per_week,
      income_bracket,
      CASE
        WHEN MOD(functional_weight, 10) < 8 THEN 'training'
        WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
        WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
      END AS dataframe
    FROM
      `bigquery-public-data.ml_datasets.census_adult_income`
    
  3. In the Explorer pane, expand the census dataset and locate the input_data view.

  4. Click the view name to open the information pane. The view schema appears in the Schema tab.

    Query results

Create a logistic regression model

You create a logistic regression model by using the CREATE MODEL statement and specifying LOGISTIC_REG for the model type. Part of creating the model includes training the model on the training data you labeled in the previous section.

The following are useful things to know about the CREATE MODEL statement:

  • The input_label_cols option specifies which column in the SELECT statement to use as the label column. Here, the label column is income_bracket, so the model learns which of the two values of income_bracket is most likely for a given row based on the other values present in that row.

  • It is not necessary to specify whether a logistic regression model is binary or multiclass. BigQuery can determine which type of model to train based on the number of unique values in the label column.

  • The auto_class_weights option is set to TRUE in order to balance the class labels in the training data. By default, the training data is unweighted. If the labels in the training data are imbalanced, the model may learn to predict the most popular class of labels more heavily. In this case, most of the respondents in the dataset are in the lower income bracket. This may lead to a model that predicts the lower income bracket too heavily. Class weights balance the class labels by calculating the weights for each class in inverse proportion to the frequency of that class.

  • The SELECT statement queries the input_data view that contains the training data. The WHERE clause filters the rows in input_data so that only those rows labeled as training data are used to train the model.

Run the query that creates your logistic regression model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    CREATE OR REPLACE MODEL
      `census.census_model`
    OPTIONS
      ( model_type='LOGISTIC_REG',
        auto_class_weights=TRUE,
        data_split_method='NO_SPLIT',
        input_label_cols=['income_bracket'],
        max_iterations=15) AS
    SELECT * EXCEPT(dataframe)
    FROM
      `census.input_data`
    WHERE
      dataframe = 'training'
    
  3. In the Explorer pane, expand the census dataset and then the Models folder.

  4. Click the census_model model to open the information pane.

  5. Click the Schema tab. The model schema lists the attributes that BigQuery ML used to perform logistic regression. The schema should look similar to the following:

    Cluster schema info

Use the ML.EVALUATE function to evaluate the model

After creating the model, evaluate the model's performance by using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values generated by the model against the actual data.

For input, the ML.EVALUATE function takes the trained model and the rows from theinput_data view that have evaluation as the dataframe column value. The function returns a single row of statistics about the model.

Run the ML.EVALUATE query:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    SELECT
      *
    FROM
      ML.EVALUATE (MODEL `census.census_model`,
        (
        SELECT
          *
        FROM
          `census.input_data`
        WHERE
          dataframe = 'evaluation'
        )
      )
    
  3. The results should look similar to the following:

    ML.EVALUATE output

You can also look at the model's information pane in the Google Cloud console to view the evaluation metrics calculated during the training:

ML.EVALUATE output

Use the ML.PREDICT function to predict income bracket

To identify the income bracket to which a particular respondent belongs, use the ML.PREDICT function.

For input, the ML.PREDICT function takes the trained model and the rows from theinput_data view that have prediction as the dataframe column value.

Run the ML.PREDICT query:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    SELECT
      *
    FROM
      ML.PREDICT (MODEL `census.census_model`,
        (
        SELECT
          *
        FROM
          `census.input_data`
        WHERE
          dataframe = 'prediction'
        )
      )
    
  3. The results should look similar to the following:

    ML.PREDICT results

    predicted_income_bracket is the predicted value of income_bracket.

Explain the prediction results

To understand why the model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function.

ML.EXPLAIN_PREDICT is an extended version of the ML.PREDICT function. ML.EXPLAIN_PREDICT not only outputs prediction results, but also outputs additional columns to explain the prediction results. In practice, you can run ML.EXPLAIN_PREDICT instead of ML.PREDICT. For more information, see BigQuery ML explainable AI overview.

Run the ML.EXPLAIN_PREDICT query:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query:

    SELECT
    *
    FROM
    ML.EXPLAIN_PREDICT(MODEL `census.census_model`,
      (
      SELECT
        *
      FROM
        `census.input_data`
      WHERE
        dataframe = 'evaluation'),
      STRUCT(3 as top_k_features))
    
  3. The results should look similar to the following:

    ML.EXPLAIN_PREDICT output

For logistic regression models, Shapley values are used to generate feature attribution values for each feature in the model. ML.EXPLAIN_PREDICT outputs the top three feature attributions per row of the input_data view because top_k_features was set to 3 in the query. These attributions are sorted by the absolute value of the attribution in descending order. In row 1 of this example, the feature hours_per_week contributed the most to the overall prediction, but in row 2, occupation contributed the most to the overall prediction.

Globally explain the model

To know which features are generally the most important to determine the income bracket, you can use the ML.GLOBAL_EXPLAIN function. In order to use ML.GLOBAL_EXPLAIN, you must retrain the model with the ENABLE_GLOBAL_EXPLAIN option set to TRUE.

Retrain and get global explanations for the model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the following query to retrain the model:

    CREATE OR REPLACE MODEL `census.census_model`
    OPTIONS
      ( model_type='LOGISTIC_REG',
        auto_class_weights=TRUE,
        enable_global_explain=TRUE,
        input_label_cols=['income_bracket']
      ) AS
    SELECT * EXCEPT(dataframe)
    FROM
      `census.input_data`
    WHERE
      dataframe = 'training'
    
  3. In the query editor, run the following query to get global explanations:

    SELECT
      *
    FROM
      ML.GLOBAL_EXPLAIN(MODEL `census.census_model`)
    
  4. The results should look similar to the following:

    ML.GLOBAL_EXPLAIN output

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the census dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset and the model.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (census) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next