The CREATE MODEL statement for contribution analysis models

This document describes the CREATE MODEL statement for creating contribution analysis models in BigQuery. You can use contribution analysis models to generate insights about changes to key metrics in your multi-dimensional data.

After you have created a contribution analysis model, you can use the ML.GET_INSIGHTS function to retrieve the metric information calculated by the model.

For more information about supported SQL statements and functions for this model, see Contribution analysis user journey.

`CREATE MODEL` syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} 
`project_id.dataset.model_name`
OPTIONS(
  MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
  CONTRIBUTION_METRIC = 'contribution_metric',
  IS_TEST_COL = 'is_test_col'
  [, DIMENSION_ID_COLS = dimension_column_array]
  [, MIN_APRIORI_SUPPORT = min_apriori_support]
  [, TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = top_k_insights_by_apriori_support]
  [, PRUNING_METHOD = {'NO_PRUNING', 'PRUNE_REDUNDANT_INSIGHTS'}]
)
AS query_statement;

`CREATE MODEL`

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

`CREATE MODEL IF NOT EXISTS`

Creates and trains a new model only if the model doesn't exist in the specified dataset.

`CREATE OR REPLACE MODEL`

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

`model_name`

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

Contain up to 1,024 characters
Contain letters (upper or lower case), numbers, and underscores

model_name is case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

`MODEL_TYPE`

Syntax

MODEL_TYPE = 'CONTRIBUTION_ANALYSIS'

Description

Specify the model type. This option is required.

`CONTRIBUTION_METRIC`

Syntax

CONTRIBUTION_METRIC = 'contribution_metric'

Description

Provides the expression to use to calculate the metric you are analyzing.

To calculate a summable metric, the expression must be in the form SUM(metric_column_name), where metric_column_name is a numeric data type.

To calculate a summable ratio metric, the expression must be in the form SUM(numerator_metric_column_name)/SUM(denominator_metric_column_name), where numerator_metric_column_name and denominator_metric_column_name are numeric data types.

To calculate a summable by category metric, the expression must be in the form SUM(metric_sum_column_name)/COUNT(DISTINCT categorical_column_name). The summed column must be a numeric data type. The categorical column must have type BOOL, DATE, DATETIME, TIME, TIMESTAMP, STRING, or INT64.

The expression is case insensitive.

You can't use any additional numerical computations in the contribution metric expression. For example, neither SUM(AVG(metric_col)) nor AVG(SUM(round(metric_col_numerator))/(SUM(metric_col_denominator)) is valid. You can perform additional computations in the query_statement if necessary.

The values in the metrics columns that you use in the CONTRIBUTION_METRIC option must be non-negative, unless you specify 0 for the MIN_APRIORI_SUPPORT value.

Arguments

A STRING value. There is no default value.

`IS_TEST_COL`

Syntax

IS_TEST_COL = 'is_test_col'

Description

Provides the name of the column to use to determine whether a given row is test data or control data. The column that you specify must have a BOOL data type.

Arguments

A STRING value. There is no default value.

`DIMENSION_ID_COLS`

Syntax

DIMENSION_ID_COLS = dimension_column_array

Description

Provides the names of the columns to use as dimensions when summarizing the metric specified in the CONTRIBUTION_METRIC option. The dimension columns that you specify must have an INT64, BOOL, or STRING data type.

Any rows in the dimension columns that contain NULL values are removed. If you have NULL values in your data, you can preprocess the data to fill in the NULL values by using the ML.IMPUTER function, or by using the IFNULL expression to replace NULL values with a custom value.

If you want to use a numerical column that has a data type other than INT64 as a dimension, you can use any function that outputs an INT64, BOOL, or STRING value in the query_statement to transform continuous values. For example, you could use one of the numerical preprocessing functions, such as ML.BUCKETIZE.

Using fewer dimensions helps reduce the model creation runtime.

Arguments

An ARRAY<STRING> value. For example, ['dimension_column_1','dimension_column_2','dimension_column_3']. The array can have 13 or fewer array elements when the MIN_APRIORI_SUPPORT option value is 0, and 50 or fewer array elements when the MIN_APRIORI_SUPPORT option value is greater than 0.

If you omit the dimension_id_cols option, then all columns in the query statement that are not specified in the contribution_metric or is_test_col options are used as dimension_id_cols.

`MIN_APRIORI_SUPPORT`

Syntax

MIN_APRIORI_SUPPORT = min_apriori_support

Description

Provide the minimum apriori support threshold for including segments in the model output. All segments whose apriori_support value is less than the MIN_APRIORI_SUPPORT value that you specify are excluded. The MIN_APRIORI_SUPPORT option can't be used together with TOP_K_INSIGHTS_BY_APRIORI_SUPPORT.

Arguments

A FLOAT64 value in the range [0, 1]. The default value is 0.1.

`TOP_K_INSIGHTS_BY_APRIORI_SUPPORT`

Syntax

TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = top_k_insights_by_apriori_support

Description

Provide the maximum number of insights to output with the highest apriori support value. The model automatically retrieves the insights with the highest apriori support values and prunes insights with low apriori support, decreasing query runtime compared to using no apriori support thresholding. The TOP_K_INSIGHTS_BY_APRIORI_SUPPORT option can't be used together with MIN_APRIORI_SUPPORT.

Arguments

An INT64 value between 1 and 1,000,000. There is no default value. If no TOP_K_INSIGHTS_BY_APRIORI_SUPPORT is specified, a MIN_APRIORI_SUPPORT threshold of 0.1 is applied.

`PRUNING_METHOD`

Syntax

PRUNING_METHOD = {'NO_PRUNING' | 'PRUNE_REDUNDANT_INSIGHTS'}

Description

The pruning method used to filter insights generated by the model.

Arguments

NO_PRUNING: All insights are returned, respecting any apriori support thresholds specified by MIN_APRIORI_SUPPORT or TOP_K_INSIGHTS_BY_APRIORI_SUPPORT.
PRUNE_REDUNDANT_INSIGHTS: Redundant insights are pruned from the model. Two rows are considered to be redundant insights if the following conditions are met:
- Their metric values are equal.
- The dimension IDs and corresponding values of one row are a subset of the dimension IDs and corresponding values of the other. In this case, the row with more dimension IDs (the more descriptive row) is kept.
The all insight row is never pruned.

The default value is NO_PRUNING.

`query_statement`

The GoogleSQL query that contains the test and control data to analyze. The table of input data that you specify in the query must contain exactly the columns that you reference in the CONTRIBUTION_METRIC, DIMENSION_ID_COLS, and IS_TEST_COL options.

Choose test and control data

Contribution analysis models require test and control data in a single table as input. The control set is used as a baseline to compare against the test set, which contains the data points of interest. The following are examples of test and control sets:

Periods of time: Compare two different months of revenue data.
Geographic regions: Compare retention rate across different countries.
Product types: Compare sales per user across different product types.
Campaign or promotion: Compare website engagement across different ad campaigns.

To set up the input data, you can create tables of test and control data separately and take the union of the two tables. The final input table must contain your dimension columns, numeric metric columns, and a categorical column if you're using a summable by category metric. For best results, we recommend having roughly equal numbers of test and control rows when using the summable and summable ratio metrics to avoid biased results. If the number of test and control rows are unbalanced, then the summable by category metric might reduce bias by normalizing the numeric metric against categorical values, allowing for easier comparisons between contributors.

Use a summable metric

A summable metric contribution analysis model summarizes the values of a metric column that you specify, and then determines a total for each segment of the data. For example, you might summarize by revenue, or by number of items sold.

An example input dataset for a summable metric contribution analysis model might look similar to the following table:

product	store	city	revenue	is_test
shoe_1	store_2	Mountain View	100	true
shoe_1	store_3	Sunnyvale	200	true
shoe_2	store_1	San Francisco	85	true
shoe_1	store_2	Mountain View	100	false
shoe_1	store_3	Sunnyvale	150	false
shoe_2	store_1	San Francisco	175	false

In this example, the numerical revenue column provides the metric to analyze. The product, store, and city columns are the dimensions to analyze as contributors to total revenue. The is_test column value indicates whether the row belongs to the test set or the control set. Your CREATE MODEL statement would include the following:

CONTRIBUTION_METRIC = 'SUM(revenue)',
DIMENSION_ID_COLS = ['product', 'store', 'city'],
IS_TEST_COL = 'is_test',

Use a summable ratio metric

A summable ratio metric contribution analysis model summarizes the values of two numeric columns that you specify, and determines the ratio between them for each segment of the data. For example, you might analyze the ratio between cost and clicks values to determine the cost per click in an ad campaign.

An example input dataset for a summable ratio metric contribution analysis model might look similar to the following table:

browser	device	cost	clicks	is_test
Chrome	iPad	100.00	30	true
Firefox	Pixel	100.00	10	true
Edge	Pixel	250.00	40	true
Chrome	iPad	100.00	40	false
Firefox	Pixel	50.00	5	false

In this case, the numerical cost and clicks columns provide the metrics to analyze. The browser and device columns are the dimensions to analyze as contributors to the cost and click values. The is_test column value indicates whether the row belongs to the test set or the control set. Your CREATE MODEL statement would include the following:

CONTRIBUTION_METRIC = 'SUM(cost)/SUM(clicks)',
DIMENSION_ID_COLS = ['browser', 'device'],
IS_TEST_COL = 'is_test',

Use a summable by category metric

A summable by category metric contribution analysis model sums the value of a numeric column and divides it by the number of distinct values from a categorical column, defined as a column of one of the following types:

BOOL
DATE
DATETIME
TIME
TIMESTAMP
STRING
INT64.

For example, if sales is your numeric column and month is your categorical column, then the metric would analyze the contributions of each segment to monthly sales.

An example input dataset for the summable by category metric contribution analysis model might look similar to the following table:

item_id	category	sales	month	is_test
1	shirts	150	December	true
2	jackets	100	January	true
3	hats	50	April	true
1	shirts	125	February	false
4	hats	10	February	false
2	jackets	75	November	false

In this case, the numerical sales and categorical month columns provide the metrics to analyze. The item_id and category columns are the dimensions to analyze as contributors to the sales per month values. The is_test column value indicates whether the row belongs to the test set or the control set. Your CREATE MODEL statement would include the following:

CONTRIBUTION_METRIC = 'SUM(sales)/COUNT(DISTINCT month)',
DIMENSION_ID_COLS = ['item_id', 'category'],
IS_TEST_COL = 'is_test',

Use an apriori support threshold

You can optimize a contribution analysis model by specifying an apriori support threshold. The model uses the threshold value that you specify to prune the search space and include only larger segments. Apriori support is measured by comparing the size of each segment relative to the rest of the population, and dropping the segments where this ratio is less than the model's apriori support threshold value. This pruning lets you analyze only the segments of data that are large enough to be of interest, and also reduces model creation time. There are two ways that you can use an apriori support threshold in a contribution analysis model:

Tune your own apriori support value with MIN_APRIORI_SUPPORT.
Provide the maximum number of insights to retrieve using TOP_K_INSIGHTS_BY_APRIORI_SUPPORT, which automatically fetches the k insights with the highest apriori support values.

Examples

The following examples show how to create contribution analysis models.

Example 1

The following example creates a contribution analysis model that uses a summable metric:

CREATE (OR REPLACE) MODEL `myproject.mydataset.ca_model`
  OPTIONS(
    MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
    CONTRIBUTION_METRIC = 'SUM(house_price)',
    DIMENSION_ID_COLS = ['city', 'floor_space'],
    IS_TEST_COL = 'dataset_type'
) AS
SELECT * FROM mydataset.house_sales WHERE state = 'WA';

Example 2

The following example creates a contribution analysis model that uses a summable ratio metric:

CREATE (OR REPLACE) MODEL `myproject.mydataset.ca_model`
  OPTIONS(
    MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
    CONTRIBUTION_METRIC = 'SUM(concessions_sold)/SUM(attendee_count)',
    DIMENSION_ID_COLS = ['venue_ID','event_date','concession_type'],
    IS_TEST_COL = 'test_col',
    MIN_APRIORI_SUPPORT = .08
) AS
SELECT concessions_sold, attendee_count, venue_ID,
  event_date, concession_type, test_col
FROM mydataset.regional_sales;

Example 3

The following example creates a contribution analysis model that uses a summable by category metric:

CREATE (OR REPLACE) MODEL `myproject.mydataset.ca_model`
  OPTIONS(
    MODEL_TYPE = 'CONTRIBUTION_ANALYSIS',
    CONTRIBUTION_METRIC = 'SUM(concessions_sold)/COUNT(DISTINCT sales_date)',
    DIMENSION_ID_COLS = ['venue_ID','event_date', 'attendee_count', 'concession_type'],
    IS_TEST_COL = 'test_col',
    TOP_K_INSIGHTS_BY_APRIORI_SUPPORT = 25,
    PRUNING_METHOD = 'PRUNE_REDUNDANT_INSIGHTS'
) AS
SELECT concessions_sold, attendee_count, venue_ID,
  event_date, concession_type, test_col, sales_date
FROM mydataset.regional_sales;

What's next

Learn how to call the ML.GET_INSIGHTS function to get insights from your contribution analysis model.
Try a tutorial on how to Get data insights from a contribution analysis model using a summable metric.

The CREATE MODEL statement for contribution analysis models

CREATE MODEL syntax

CREATE MODEL

CREATE MODEL IF NOT EXISTS

CREATE OR REPLACE MODEL

model_name

MODEL_TYPE

CONTRIBUTION_METRIC

IS_TEST_COL

DIMENSION_ID_COLS

MIN_APRIORI_SUPPORT

TOP_K_INSIGHTS_BY_APRIORI_SUPPORT

PRUNING_METHOD

query_statement

Choose test and control data

Use a summable metric

Use a summable ratio metric

Use a summable by category metric

Use an apriori support threshold

Examples

What's next

`CREATE MODEL` syntax

`CREATE MODEL`

`CREATE MODEL IF NOT EXISTS`

`CREATE OR REPLACE MODEL`

`model_name`

`MODEL_TYPE`

`CONTRIBUTION_METRIC`

`IS_TEST_COL`

`DIMENSION_ID_COLS`

`MIN_APRIORI_SUPPORT`

`TOP_K_INSIGHTS_BY_APRIORI_SUPPORT`

`PRUNING_METHOD`

`query_statement`