Use BigQuery ML to make recommendations from movie ratings


This tutorial introduces data analysts to the matrix factorization model in BigQuery ML. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.

In this tutorial, you learn how to create a model from explicit feedback using the movielens1m dataset to make recommendations given a movie ID and user id

The movielens dataset contains the ratings from a scale of 1 to 5 that users gave to movies, along with metadata of the movie such as genre.

Objectives

In this tutorial, you use:

  • BigQuery ML to create an explicit recommendations model using the CREATE MODEL statement
  • The ML.EVALUATE function to evaluate the ML models
  • The ML.WEIGHTS function to inspect the latent factor weights generated during training.
  • The ML.RECOMMEND function to produce recommendations for a user.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  5. Make sure that billing is enabled for your Google Cloud project.

  6. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to

    Enable the BigQuery API.

    Enable the API

Step one: Create your dataset

Create a BigQuery dataset to store your ML model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

    Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

      Create dataset page.

Step two: Load the Movielens dataset into BigQuery

The following are the steps to load the 1m movielens dataset into BigQuery using the BigQuery command-line tools. A dataset called movielens will be created and the relevant movielens tables will be stored in it.

curl -O 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
unzip ml-1m.zip
bq mk --dataset movielens
sed 's/::/,/g' ml-1m/ratings.dat > ratings.csv
bq load --source_format=CSV movielens.movielens_1m ratings.csv \
  user_id:INT64,item_id:INT64,rating:FLOAT64,timestamp:TIMESTAMP

Because the movie titles contain colons, commas and pipes, we need to use a different delimiter. To load the movie titles, a slightly different variant of the last two commands should be used.

sed 's/::/@/g' ml-1m/movies.dat > movie_titles.csv
bq load --source_format=CSV --field_delimiter=@ \
 movielens.movie_titles movie_titles.csv \
 movie_id:INT64,movie_title:STRING,genre:STRING

Step three: Create your explicit recommendations model

Next, you create an explicit recommendations model using the movielens sample table that was loaded in the previous step. The following GoogleSQL query is used to create the model that will be used to predict a rating for every user-item pair.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.my_explicit_mf_model`
OPTIONS
  (model_type='matrix_factorization',
   user_col='user_id',
   item_col='item_id',
   l2_reg=9.83,
   num_factors=34) AS
SELECT
  user_id,
  item_id,
  rating
FROM `movielens.movielens_1m`

In addition to creating the model, running the CREATE MODEL command trains the model you create.

Query details

The CREATE MODEL clause is used to create and train the model named bqml_tutorial.my_explicit_mf_model.

The OPTIONS(model_type='matrix_factorization', user_col='user_id', ...) clause indicates that you are creating a matrix factorization model. By default, this will create an explicit matrix factorization model unless feedback_type='IMPLICIT' is specified. An example of how to create an implicit matrix factorization model will be explained in Using BigQuery ML to make recommendations for implicit feedback.

This query's SELECT statement uses the following columns to generate recommendations.

  • user_id—The user ID (INT64).
  • item_id—The movie ID (INT64).
  • rating—The explicit rating from 1 to 5 that the user_id gave the item_id (FLOAT64).

The FROM clause—movielens.movielens_1m — indicates that you are querying the movielens_1m table in the movielens dataset. This dataset is in your bigquery project if the instructions in step two were followed.

Run the CREATE MODEL query

To run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.my_explicit_mf_model`
OPTIONS
  (model_type='matrix_factorization',
   user_col='user_id',
   item_col='item_id',
   l2_reg=9.83,
   num_factors=34) AS
SELECT
  user_id,
  item_id,
  rating
FROM `movielens.movielens_1m`
  1. Click Run.

    The query takes about 10 minutes to complete, after which your model (my_explicit_mf_model) appears in the navigation panel of the Google Cloud console. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.

Step four (optional): Get training statistics

To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the Google Cloud console. In this tutorial, you use the Google Cloud console.

A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.

To see the model training statistics that were generated when you ran the CREATE MODEL query:

  1. In the Google Cloud console navigation panel, in the Resources section, expand [PROJECT_ID] > bqml_tutorial and then click my_explicit_mf_model.

  2. Click the Training tab, and then click Table. The results should look like the following:

    ML.TRAINING_INFO output.

    The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Since you performed matrix factorization, this column is the mean squared error. By default, matrix factorization models will not split the data, so the Evaluation Data Loss column will not be present unless a holdout dataset is specified because splitting the data has a chance of losing all the ratings for a user or an item. As a result, the model will not have latent factor information about missing users or items.

    For more details on the ML.TRAINING_INFO function, see the BigQuery ML syntax reference.

Step five: Evaluate your model

After creating your model, you evaluate the performance of the recommender using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted ratings against the actual ratings.

The query used to evaluate the model is as follows:

#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.my_explicit_mf_model`,
    (
    SELECT
      user_id,
      item_id,
      rating
     FROM
      `movielens.movielens_1m`))

Query details

The top-most SELECT statement retrieves the columns from your model.

The FROM clause uses the ML.EVALUATE function against your model: bqml_tutorial.my_explicit_mf_model.

This query's nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query.

You can also call ML.EVALUATE without providing the input data. It will use the evaluation metrics calculated during training:

#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.my_explicit_mf_model`)

Run the ML.EVALUATE query

To run the ML.EVALUATE query that evaluates the model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
    *
    FROM
    ML.EVALUATE(MODEL `bqml_tutorial.my_explicit_mf_model`,
      (
      SELECT
    user_id,
    item_id,
    rating
       FROM
    `movielens.movielens_1m`))
    
  3. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings.

  4. Click Run.

  5. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.EVALUATE output.

    Because you performed an explicit matrix factorization, the results include the following columns:

    • mean_absolute_error
    • mean_squared_error
    • mean_squared_log_error
    • median_absolute_error
    • r2_score
    • explained_variance

An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that determines if the linear regression predictions approximate the actual data. 0 indicates that the model explains none of the variability of the response data around the mean. 1 indicates that the model explains all the variability of the response data around the mean.

Step six: Use your model to predict ratings and make recommendations

Find all the item ratings for a set of users

ML.RECOMMEND does not need to take any additional arguments besides the model, but can take in an optional table. If the input table only has one column that matches the name of the input user or input item column, then all of the predicted item ratings for each user will be outputed and vice versa. Note that if all of the users or all of the items are in the input table, it will output the same results as passing no optional argument to ML.RECOMMEND.

The following is an example of a query to fetch all of the predicted movie ratings for 5 users:

#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`,
    (
    SELECT
      user_id
    FROM
      `movielens.movielens_1m`
    LIMIT 5))

Query details

The top-most SELECT statement retrieves the user, item, and predicted_rating column. This last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for the model is predicted_<rating_column_name>. For explicit matrix factorization models, predicted_rating is the estimated value of rating.

The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_explicit_mf_model.

This query's nested SELECT statement selects just the user_id column from the original table used for training.

The LIMIT clause—LIMIT 5—will randomly filter out 5 user_ids to send to ML.RECOMMEND.

Find the ratings for all user-item pairs

Now that you have evaluated your model, the next step is to use it to predict a rating. You use your model to predict the ratings of every user-item combination in the following query:

#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`)

Query details

The top-most SELECT statement retrieves the user, item, and predicted_rating column. This last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for the model is predicted_<rating_column_name>. For explicit matrix factorization models, predicted_rating is the estimated value of rating.

The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_explicit_mf_model.

One way of saving the result to table is:

#standardSQL
CREATE OR REPLACE TABLE `bqml_tutorial.recommend_1m`
OPTIONS() AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`)

If a Query Exceeded Resource Limits error occurs for ML.RECOMMEND, retry with a higher billing tier. In the BigQuery command-line tool, this can be set using the flag --maximum_billing_tier.

Generate recommendations

Using the previous recommendations query, we can order by the predicted rating and output the top predicted items for each user. The following query joins the item_ids with the movie_ids found in the movielens.movie_titles table uploaded earlier and outputs the top 5 recommended movies per user.

#standardSQL
SELECT
  user_id,
  ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
ORDER BY predicted_rating DESC LIMIT 5)
FROM (
SELECT
  user_id,
  item_id,
  predicted_rating,
  movie_title,
  genre
FROM
  `bqml_tutorial.recommend_1m`
JOIN
  `movielens.movie_titles`
ON
  item_id = movie_id)
GROUP BY
  user_id

Query details

The inner SELECT statement performs an inner join on item_id from the recommendation results table and movie_id from the movielens.movie_titles table. movielens.movie_titles not only maps movie_id to a movie name, but it also includes the genres of the movie as listed by IMDB.

The top level SELECT statement aggregates the results from the nested SELECT statement by using GROUPS BY user_id to aggregate the movie_title, genre, and predicted_rating in descending order and only keeps the top 5 movies.

Run the ML.RECOMMEND query

To run the ML.RECOMMEND query that outputs the top 5 recommended movies per user:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE TABLE `bqml_tutorial.recommend_1m`
    OPTIONS() AS
    SELECT
    *
    FROM
    ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`)
    
  3. Click Run.

    When the query finishes running, (bqml_tutorial.recommend_1m) will appear in the navigation panel. Because the query uses a CREATE TABLE statement to create a table, you do not see the query results.

  4. Compose another new query. Enter the following GoogleSQL query in the Query editor text area once the previous query has finished running.

    #standardSQL
    SELECT
     user_id,
     ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
    ORDER BY predicted_rating DESC LIMIT 5)
    FROM (
    SELECT
     user_id,
     item_id,
     predicted_rating,
     movie_title,
     genre
    FROM
     `bqml_tutorial.recommend_1m`
    JOIN
     `movielens.movie_titles`
    ON
     item_id = movie_id)
    GROUP BY
     user_id
    
  5. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings.

  6. Click Run.

  7. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.RECOMMEND output.

Since we had additional metadata information about each movie_id beyond an INT64, we can see info like genre about the top 5 recommended movies for each user. If you don't have a movietitles equivalent table for your training data, the results might not be as human interpretable with just number ids or hashes.

Top genres per factor

If you are curious as to what genre each latent factor might correlate to, you can run the following query:

#standardSQL
SELECT
  factor,
  ARRAY_AGG(STRUCT(feature, genre,
      weight)
  ORDER BY
    weight DESC
  LIMIT
    10) AS weights
FROM (
  SELECT
    * EXCEPT(factor_weights)
  FROM (
    SELECT
      *
    FROM (
      SELECT
        factor_weights,
        CAST(feature AS INT64) as feature
      FROM
        ML.WEIGHTS(model `bqml_tutorial.my_explicit_mf_model`)
      WHERE
        processed_input= 'item_id')
    JOIN
      `movielens.movie_titles`
    ON
      feature = movie_id) weights
  CROSS JOIN
    UNNEST(weights.factor_weights)
  ORDER BY
    feature,
    weight DESC)
GROUP BY
  factor

Query details

The inner most SELECT statement gets the item_id or movie factor weights array and then joins it against the movielens.movie_titles table to get the genre of each item id.

The result of which is then CROSS JOINed with each factor_weights array whose result is then ORDER BY feature, weight DESC.

Finally, the top level SELECT statement aggregates the results from its inner statement by factor and creates an array for each factor ordered by the weight of each genre.

Run the query

To run the above query that outputs the top 10 movie genres per factor:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

#standardSQL
SELECT
  factor,
  ARRAY_AGG(STRUCT(feature, genre,
      weight)
  ORDER BY
    weight DESC
  LIMIT
    10) AS weights
FROM (
  SELECT
    * EXCEPT(factor_weights)
  FROM (
    SELECT
      *
    FROM (
      SELECT
        factor_weights,
        CAST(feature AS INT64) as feature
      FROM
        ML.WEIGHTS(model `bqml_tutorial.my_explicit_mf_model`)
      WHERE
        processed_input= 'item_id')
    JOIN
      `movielens.movie_titles`
    ON
      feature = movie_id) weights
  CROSS JOIN
    UNNEST(weights.factor_weights)
  ORDER BY
    feature,
    weight DESC)
GROUP BY
  factor
  1. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings.

  2. Click Run.

  3. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    Weights analysis.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next