Build a smart API to predict customer propensity to purchase by using Apigee, BigQuery ML, and Spanner

Last reviewed 2023-06-20 UTC

This document describes how you build the architecture described in Smart API to predict customer propensity to purchase by using Apigee, BigQuery ML, and Spanner. In that architecture, you use these products to deploy an API that can predict how likely a customer is to make a purchase.

These instructions are intended for API developers and data specialists who want to generate more revenue through omni-channel and ecommerce platforms by providing a more personalized experience to users. It assumes that you're familiar with Apigee, BigQuery ML, Spanner, the Google Cloud CLI, and Apache Maven.


The following diagram shows the architecture and process used in this solution:

Architecture of an API to predict customer propensity to purchase.

For details, see Smart API to predict customer propensity to purchase by using Apigee, BigQuery ML, and Spanner.


  • Create a customer purchase propensity dataset in BigQuery.
  • Import product catalog data to the Spanner database.
  • Import and deploy an Apigee API proxy.
  • Integrate customer purchase propensity data from BigQuery with the product catalog and pricing information from the Spanner database.
  • Create an aggregated view of product recommendations.


In this document, you use the following billable components of Google Cloud:

  • BigQuery and BigQuery ML Flex Slots
  • Spanner
  • Apigee

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the AI Platform Training & Prediction, BigQuery, BigQuery Reservation, BigQuery Storage, Cloud Spanner, Cloud Storage, Cloud Storage API, Dataflow, Google Cloud, Cloud Storage JSON, Service Management, Service Usage APIs.

    Enable the APIs

  4. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  5. Grant roles to your Google Account. Run the following command once for each of the following IAM roles:

    • roles/apigee.admin
    • roles/bigquery.user
    • roles/bigquery.dataViewer
    • roles/spanner.admin
    • roles/spanner.databaseAdmin
    • roles/resourcemanager.projectIamAdmin
    • roles/serviceusage.serviceUsageConsumer

    gcloud projects add-iam-policy-binding PROJECT_ID --member="user:EMAIL_ADDRESS" --role=ROLE
    • Replace PROJECT_ID with your project ID.
    • Replace EMAIL_ADDRESS with your email address.
    • Replace ROLE with each individual role.
  6. If you don't already have an Apigee account, set up Apigee to provision an Apigee instance. Provisioning can take up to an hour.
  7. Configure your Apigee organization to allow external access.

If you complete this procedure and decide not to keep the deployment, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.

Create an ecommerce recommendation dataset using BigQuery ML

In this section, you use standard SQL queries in BigQuery ML to create an ML model, train it on customer data in BigQuery, and then deploy it.

The customer data that you use is from the Google Analytics Sample dataset, which is hosted publicly on BigQuery. This dataset provides 12 months (August 2016 to August 2017) of obfuscated Analytics 360 data from the Google Merchandise Store, a real ecommerce store that sells Google-branded merchandise.

You don't have to export your data or build a model training and deployment pipeline. BigQuery autoscales to handle any compute resources that you need.

The ML model that you create for this solution uses matrix factorization, a common and effective method of creating a recommendation system based on user preference data.

Process the sample data

When you use matrix factorization, you evaluate explicit or implicit user feedback to determine customer preferences. To use explicit feedback, the dataset must contain data about user product preferences, like star ratings between 1 and 5. If there isn't explicit feedback available, you must use other behavioral metrics to infer customer preferences, like looking at the total time a user spends on a product detail page. This solution uses session duration data to infer customer preferences.

To train the matrix factorization model, you need a table with columns that identify the customer, the item being rated, and the implicit rating. In this section, you create such a table with the userid, itemId, and session_duration columns. The session_duration column contains the duration of the user's session on the product page of the given item.

To create the table using data from the Analytics Sample dataset, do the following:

  1. In the Google Cloud Marketplace, go to the Analytics Sample page.

    Go to Analytics Sample

  2. Click View Dataset. The BigQuery SQL workspace page opens with the Analytics Sample dataset selected.

  3. In the Explorer section, next to your project, click View actions, and then click Create dataset.

  4. In the Create dataset dialog that appears, do the following:

    1. In the Dataset ID field, enter bqml.
    2. In the Data location list, select us (multiple regions in United States).
    3. Click Create dataset.
  5. Click Go to dataset, and then click Compose New Query.

  6. In the Query Editor, create a table that contains the training data by running the following SQL statement:

    CREATE OR REPLACE TABLE bqml.aggregate_web_stats AS (
        durations AS (
          --calculate pageview durations
                 CAST(visitNumber AS STRING),'-',
                 CAST(hitNumber AS STRING) ) AS visitorId_session_hit,
            LEAD(time, 1) OVER (
              PARTITION BY CONCAT(fullVisitorId,'-',CAST(visitNumber AS STRING))
              ORDER BY
              time ASC ) - time AS pageview_duration
            UNNEST(hits) AS hit
        prodview_durations AS (
          --filter for product detail pages only
            CONCAT(fullVisitorId,'-',CAST(visitNumber AS STRING)) AS userId,
            productSKU AS itemId,
             1) AS pageview_duration,
            `bigquery-public-data.google_analytics_sample.ga_sessions_2017*` t,
            UNNEST(hits) AS hits,
            UNNEST(hits.product) AS hits_product
            durations dur
                   CAST(visitNumber AS STRING),'-',
                   CAST(hitNumber AS STRING)) = dur.visitorId_session_hit
          eCommerceAction.action_type = "2"
        aggregate_web_stats AS(
          --sum pageview durations by userId, itemId
            SUM(pageview_duration) AS session_duration
          GROUP BY
            itemId )

    The bqml.aggregate_web_stats table is created and populated with training data.

  7. To see a sample of the data, in the Query Editor run the following SQL statement:


The output shows a table that has a row for each user ID that includes the item ID that the user viewed, and the duration of their session. The output is similar to the following:

Row userId itemId session_duration
1 6703373209489429228-1 GGOEAXXX0808 19523
2 868731560082458910-1 GGOEAXXX0808 8312
3 4805474958539278422-1 GGOEAXXX0808 62892
4 8353360074725418910-3 GGOEAXXX0808 4883
5 8253742246691621007-2 GGOEAXXX0808 10
6 7211254729136975568-1 GGOEAXXX0808 96090
7 66777488032155805-1 GGOEAXXX0808 3893
8 0804312527321649470-1 GGOEAXXX0808 7539
9 2965429397557124425-1 GGOEAXXX0808 21776
10 8459600677575627508-1 GGOEAXXX0808 6265

Purchase flex slots

If you use on-demand pricing for BigQuery, in order to train a matrix factorization model, you must purchase flex slots and then create reservations and assignments for them. If you use flat-rate pricing with BigQuery, you can skip this section and go to Create, train, and deploy the model.

To purchase flex slots, you must have an IAM role that includes the bigquery.reservations.create permission. This permission is granted to the project owner, and it's included in the BigQuery Admin (roles/bigquery.admin) and BigQuery Resource Admin (roles/bigquery.resourceAdmin) IAM roles.

  1. In the Google Cloud console, go to the BigQuery page:

    Go to BigQuery

  2. Click Capacity management, and then click Reservations.

  3. If you're redirected to the BigQuery Reservation API page to enable the API, click Enable. Otherwise, proceed to the next step.

  4. In the Reservations tab, click Buy Slots.

  5. In the Buy Slots page, do the following:

    1. In the Commitment duration list, select Flex.
    2. In the Location list, select us (multiple regions in United States)
    3. In the Number of slots list, select 500.
    4. Click Next.
    5. In the Purchase confirmation field, type CONFIRM and then click Purchase.

  6. Click View Slot Commitments.

    Wait up to 20 minutes for the capacity to be provisioned. After the capacity is provisioned, the slot commitment Status column shows .

  7. Click Create Reservation, and then set the following options:

    1. In Reservation name, enter model.
    2. In the Location list, select us (multiple regions in United States)
    3. In Number of slots, enter 500.
    4. Click Save. This returns you to the Reservations page.
  8. Next to the model reservation, in the Actions column, select Create Assignment.

  9. In Select an organization, folder, or project, click Browse.

  10. Enter the name of the project that you used for this solution, or select it from the list.

  11. Click Select, and then click Create.

Create, train, and deploy the model

To create, train, and deploy the matrix factorization model, do the following:

  1. In the Google Cloud console BigQuery page, click Compose new query.
  2. Run the CREATE MODEL SQL statement:

    CREATE OR REPLACE MODEL bqml.retail_recommender`
      SELECT * FROM bqml.aggregate_web_stats;

After training completes, the trained model is deployed as the bqml.retail_recommender model.

Use the trained model to make predictions

In this section, to get predictions from the deployed bqml.retail_recommender model, you use the ML.RECOMMEND SQL function.

  1. In the Google Cloud console BigQuery page, compose a query and get predictions that represent the top 5 recommendations for a specified userId:

    DECLARE MY_USERID STRING DEFAULT "0824461277962362623-1";
      ML.RECOMMEND(MODEL `bqml.retail_recommender`,
      (SELECT MY_USERID as userID)
    ORDER BY predicted_session_duration_confidence DESC
    LIMIT 5;

    The output displays a row for the predicted session duration confidence (higher is better), the associated user ID, and the item ID that the user viewed. The output is similar to the following:

    Row predicted_session_duration_confidence userId itemId
    1 29011.10454702254 0824461277962362623-1 GGOEGAAX0574
    2 28212.99840462358 0824461277962362623-1 GGOEGDHQ015399
    3 28126.79442866013 0824461277962362623-1 GGOEGETR014599
    4 27303.60852083874 0824461277962362623-1 GGOEGAAX0338
    5 25692.370609851147 0824461277962362623-1 GGOEGEFR024199
  2. To get the top 5 predictions for all users, run the following SQL statement. The statement generates many rows, so the output is written to a table and then the first ten records are retrieved so that you can see an example of the data.

    -- Create output table of top 5 predictions
    CREATE OR REPLACE TABLE bqml.prod_recommendations AS (
    WITH predictions AS (
                    ORDER BY
                      predicted_session_duration_confidence DESC
                    LIMIT 5) as recommended
        FROM ML.RECOMMEND(MODEL bqml.retail_recommender)
        GROUP BY userId
      predictions p,
    -- Show table

    The output shows multiple user IDs, the item ID that the user viewed, and the predicted session duration confidence. The output is similar to the following:

    Row userId itemId predicted_session_duration_confidence
    1 000170187170673177-6 GGOEGDHQ015399 15931.156936770309
    2 000170187170673177-6 GGOEGAAX0574 20178.608474922632
    3 000170187170673177-6 GGOEGAAX0338 20247.337545389437
    4 000170187170673177-6 GGOEGETR014599 15524.355852692066
    5 000170187170673177-6 GGOEGEFR024199 16443.307099088597
    6 000338059556124978-1 GGOEGAAX0338 18143.067737280064
    7 000338059556124978-1 GGOEGAAX0279 16531.718889063464
    8 000338059556124978-1 GGOEGAAX0574 20916.672241880347
    9 000338059556124978-1 GGOEGETR014599 16155.674211782945
    10 000338059556124978-1 GGOEGEFR024199 18417.17554202264

Set up Spanner data

In the following sections, you use the gcloud CLI and Maven. You run commands for both tools from Cloud Shell. No installation is required to use the tools.

  1. In Cloud Shell, clone the product recommendations GitHut repository that contains the product-recommendations API proxy bundle and scripts for setting up the data in a Spanner database:

    git clone
    cd product-recommendations-v1
  2. Create the datareader service account and assign IAM roles to it. The service account is used to access the data in BigQuery and the Spanner database from the API proxy.

    gcloud iam service-accounts create datareader --display-name="Data reader for BigQuery and Spanner Demo"
    gcloud iam service-accounts list | grep datareader
    gcloud iam service-accounts create datareader --display-name="Data reader for Apigee, BigQuery, and Spanner Demo"
    gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:$SA" --role="roles/spanner.databaseUser" --quiet
    gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:$SA" --role="roles/spanner.databaseReader" --quiet
    gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:$SA" --role="roles/bigquery.dataViewer" --quiet
    gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:$SA" --role="roles/bigquery.user" --quiet
  3. Set environment variables:

    # For Apigee
    export ORG=$PROJECT_ID
    export ENV=eval
    export SA=datareader@$
    # For Cloud Spanner
    export SPANNER_INSTANCE=product-catalog
    export SPANNER_DATABASE=product-catalog-v1
    export REGION=regional-us-east1

    Replace the following:

    • APIGEE_PROJECT: the name of your Apigee project.
    • API_DOMAIN_NAME: the hostname configured in the Admin > Environments > Groups page in the Apigee UI.
  4. In the Google Cloud console BigQuery page, select the prod_recommendations table, and then click the Preview tab to view the results.

    1. Copy any userId value.
    2. In Cloud Shell, set an environment variable:


      Replace USER_ID with the userId value that you copied in the preceding step.

  5. In Cloud Shell, display the ordered product recommendation results for the specified CUSTOMER_USERID value:

    bq query --nouse_legacy_sql \
        "SELECT * FROM \`$PROJECT_ID.bqml.prod_recommendations\` AS A where A.userid = \"$CUSTOMER_USERID\"" \
        ORDER BY A.predicted_session_duration_confidence DESC

    The output shows an individual user ID, the item ID that the user viewed, and the predicted session duration confidence. The output is similar to the following:

    |        userId         |     itemId     |      predicted_session_duration_confidence |
    | 6929470170340317899-1 | GGOEGAAX0037   |                          40161.10446942589 |
    | 6929470170340317899-1 | GGOEYDHJ056099 |                          27642.28480729123 |
    | 6929470170340317899-1 | GGOEGAAX0351   |                         27204.111219270915 |
    | 6929470170340317899-1 | GGOEGDWC020199 |                         25863.861349754334 |
    | 6929470170340317899-1 | GGOEGAAX0318   |                         24585.509088154067 |

Create a Spanner database and import product catalog data

  1. In Cloud Shell, create a Spanner instance in the specified region, create the product catalog database, and import the data:


    The script uses the CUSTOMER_USERID environment variable and it displays the entries that were created.

    The Spanner product catalog only contains the items that were used in the BigQuery training step for a specific user. Therefore, if you change the CUSTOMER_USERID environment variable after you create the product catalog data in the Spanner database, you must rerun the shell script to repopulate the data.

  2. Verify the data in the Spanner database:

    gcloud spanner databases execute-sql $SPANNER_DATABASE --sql='SELECT * FROM products'

    The output shows the product IDs and the descriptive information from the Spanner product catalog, including price and image path. The output is similar to the following:

    productid       name                description                price  discount  image
    GGOEGAAX0037    Aviator Sunglasses  The ultimate sunglasses    42.42  0         products_Images/sunglasses.jpg
    GGOEGAAX0318    Bamboo glass jar    Bamboo glass jar           19.99  0         products_Images/bamboo-glass-jar.jpg
    GGOEGAAX0351    Loafers             Most comfortable loafers   38.99  0         products_Images/loafers.jpg
    GGOEGDWC020199  Hair dryer          Hottest hair dryer         84.99  0         products_Images/hairdryer.jpg
    GGOEYDHJ056099  Coffee Mug          Best Coffee Mug            4.2    0         products_Images/mug.jpg

Deploy Apigee proxy

In this section, you run a Maven command to create the following resources:

  • A proxy named product-recommendations-v1
  • An API product named product-recommendations-v1-$ENV
  • An app developer user named
  • An app named product-recommendations-v1-app-$ENV

To create the resources, Maven uses the pom.xml file from the GitHub repository. The file contains the installation instructions and steps.

The profile section of the pom.xml file contains values for, apigee.env, api.northbound.domain, gcp.projectid,, and api.userid. These values vary by project and they are set by using the command line. The following example shows the section of the pom.xml file that contains the values:


You set these values earlier when you set up the Spanner data.

To deploy the proxy, do the following:

  • In Cloud Shell, install the proxy and its associated artifacts and then test the API:

    mvn -P eval clean install -Dbearer=$(gcloud auth print-access-token) \
        -DapigeeOrg=$ORG \
        -DapigeeEnv=$ENV \
        -DenvGroupHostname=$ENVGROUP_HOSTNAME \
        -DgcpProjectId=$PROJECT_ID \
        -DgoogleTokenEmail=$SA \

    The output shows the execution of the installation steps and the results of the integration test API calls. There's one call to the /openapi endpoint and another to the /productsendpoint. The test results verify that the API proxy was installed, deployed, and is operational. The output also displays the app credentials, which you can use for subsequent API test calls.

Test the recommendations API

  1. In Cloud Shell, set an environment variable for the API Key of the app by making a curl call to the Apigee API:

    APIKEY=$(curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \$ORG/developers/$ENV \
        | jq -r .credentials[0].consumerKey)

    Make a note of the APIKEY value, you need this information if you optionally create an AppSheet app later.

  2. To get the results for the CUSTOMER_USERID value that you specified when you installed the API proxy, make a test call:

    curl -s https://$ENVGROUP_HOSTNAME/v1/recommendations/products \
    -H "x-apikey:$APIKEY" | jq

The API that's defined by the OpenAPI Specification (OAS) lets the request specify the following headers:

  • x-apikey: The app consumer key from the security scheme.
  • x-userid: The user identifier making the request. If not provided, this value defaults to the CUSTOMER_USERID value that's configured in the proxy.
  • cache-control: The amount of time to cache the response. This header lets you cache the response for 300 seconds or override by specifying no-cache.

To change the CUSTOMER_USERID value or to control caching, you can set the header values as shown in the following example call:

curl -s "https://$ENVGROUP_HOSTNAME/v1/recommendations/products" \
-H "x-apikey:$APIKEY" \
-H "x-userid:$CUSTOMER_USERID" \
-H "Cache-Control:no-cache" | jq

The response to the example call is similar to the following:

  "products": [
      "productid": "GGOEGAAX0037",
      "name": "Aviator Sunglasses",
      "description": "The ultimate sunglasses",
      "price": "42.42",
      "image": "products_Images/sunglasses.jpg"
      "productid": "GGOEYDHJ056099",
      "name": "Coffee Mug",
      "description": "Best Coffee Mug",
      "price": "4.2",
      "image": "products_Images/mug.jpg"
      "productid": "GGOEGAAX0351",
      "name": "Loafers",
      "description": "Most comfortable loafers",
      "price": "38.99",
      "image": "products_Images/loafers.jpg"
      "productid": "GGOEGDWC020199",
      "name": "Hairdryer",
      "description": "Hotest hairdryer",
      "price": "84.99",
      "image": "products_Images/hairdryer.jpg"
      "productid": "GGOEGAAX0318",
      "name": "Bamboo glass jar",
      "description": "Bamboo glass jar",
      "price": "19.99",
      "image": "products_Images/bamboo-glass-jar.jpg"

Apigee policies

The Apigee policies that are listed in the following sections are used in the API proxy bundle.


Protect the backend BigQuery service from traffic surges.
Throttle the API requests based on the consumer application and developer. Throttle limits are configured in the API product.
ResponseCache - uri
Reduce the requests to the BigQuery data warehouse by caching the response from the service.


AssignMessage - create query request
Set the HTTP request with an SQL query for fetching the product recommendation list from the BigQuery dataset.
JavaScript and ExtractVariable policies
Format the data from the BigQuery service and create a more user-friendly response message.
LookupCache - Spanner session
Look up the Spanner database session ID from the Apigee cache.
ServiceCallout - Spanner session
Make a request to Spanner service and create a database session if the Apigee cache doesn't have a session ID or the ID is expired.
PopulateCache - Spanner session
Populate the Spanner session ID to the Apigee cache.
ServiceCallout - product catalog lookup
Fetch the product details from the Spanner product catalog database.
JavaScript - format product data
Create an API response message that conforms to the OAS by formatting the data from the product catalog database.


AssignMessage - OAS JSON
Set the OAS JSON response for the API.

Optional: Create an AppSheet app using Apigee as a data source

To show product recommendations to ecommerce website end users and to business users, you can create an AppSheet app as shown in this section.

Create an AppSheet account

Create an AppSheet account using your email address.

Create a data source

AppSheet uses the API proxy as the data source for your new app. To create a data source, do the following:

  1. Sign in to AppSheet.
  2. In the My Account page, click Sources > New Data Source.
  3. In the Add a new data source dialog, enter the name of the proxy product-recommendations-v1, and then click Apigee.
  4. In the Add Apigee API connection information dialog, set the following options:

    1. Select Manual.
    2. In the Apigee API Key field, enter the API key for your app, which you used to test the proxy. If you don't have the API key, get it in Cloud Shell by running echo $APIKEY.
    3. In the Apigee API Base Path field, enter the following:


      Replace ENVGROUP_HOSTNAME with the hostname configured in the Apigee UI for Admin > Environments > Groups.

    4. In the API Resource Paths field, enter the path suffix products.

    5. Click Test.

    6. After the tests complete successfully, click Authorize Access.

After you authorize access, the AppSheet console shows a new tile for product-recommendations-v1.

Create the app

  1. Go to the AppSheet products-template app page.
  2. Click Copy and Customize. The Clone your App dialog is displayed.
  3. In the App name field, enter the name of the proxy, product-recommendations-v1, and then click Copy app. Wait a few moments for your app to be created.
  4. In the Welcome to your app page, click Customize app. By default, the app uses a sample data source in a Google Sheet.
  5. Change the data source to match the Apigee data source that you created earlier:

    1. Click + New Table, and then click product recommendations v1.
    2. In the Choose a Sheet/Table dialog, select products.
    3. In the Create a new table dialog, click Read-Only, and then click Add This Table.

    The app is displayed with a products 2 tab and a different data view. The new data view has different values for the description and price of each item. Notice that the order of the items is not the same as the order of the predictions.

  6. Change the order in which items are returned from the data source by removing the AppSheet default sorting:

    1. In the side navigation menu, select UX.
    2. In the Primary Views section, select products 2.
    3. In the View Options section, next to Sort by, delete the entry for name, Ascending. Observe that the order displayed in AppSheet is now the same as the result of the API call, with the last item from the response at the bottom.
  7. Save the app.

You can optionally delete the original Product table and UX, and rename the "products 2" table and view to "Product Recommendations".

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this procedure, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

To avoid recurring costs, delete the BigQuery Flex Slot reservations, the BigQuery dataset, and AI Platform model.

Delete the BigQuery Flex Slot reservations

  1. In the Google Cloud console BigQuery page, select Capacity management, and then click the Reservations tab.
  2. Open the model entry.
  3. Next to the reservation, click View actions, and then click Delete.
  4. Next to the model entry, click View actions, and then click Delete.
  5. In Cloud Shell, delete the deployment, the proxy, and its associated artifacts: