This tutorial teaches you how to create a matrix factorization model and train it on the customer movie ratings in the movielens1m dataset. You then use the matrix factorization model to generate movie recommendations for users.
Using customer-provided ratings to train the model is called training with explicit feedback. Matrix factorization models are trained using the Alternating Least Squares algorithm when you use explicit feedback as training data.
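To make the idea of alternating least squares more concrete, the following is a minimal, illustrative sketch on a tiny made-up ratings set. It is not how BigQuery ML implements training: it assumes a single latent factor per user and item (so each update is a scalar ridge-regression solution), uses no bias terms, and every name and value in it (`als_rank1`, `l2_reg`, `sample_ratings`) is hypothetical.

```python
# Toy alternating least squares (ALS) with one latent factor per user and item.
# All names and data are hypothetical; this only illustrates the alternation.

def als_rank1(ratings, l2_reg=0.1, iterations=10):
    """ratings: list of (user_id, item_id, rating) tuples.

    Returns (user_factors, item_factors, losses), where losses holds the
    mean squared error on the training ratings after each iteration.
    """
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    losses = []
    for _ in range(iterations):
        # Hold item factors fixed; each user factor has a closed-form solution.
        for u in users:
            num = sum(r * item_f[i] for uu, i, r in ratings if uu == u)
            den = sum(item_f[i] ** 2 for uu, i, _ in ratings if uu == u)
            user_f[u] = num / (den + l2_reg)
        # Hold user factors fixed; solve for each item factor the same way.
        for i in items:
            num = sum(r * user_f[u] for u, ii, r in ratings if ii == i)
            den = sum(user_f[u] ** 2 for u, ii, _ in ratings if ii == i)
            item_f[i] = num / (den + l2_reg)
        sse = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
        losses.append(sse / len(ratings))
    return user_f, item_f, losses


# Made-up explicit feedback: (user_id, item_id, rating).
sample_ratings = [
    (1, 10, 5.0), (1, 20, 3.0),
    (2, 10, 4.0), (2, 30, 1.0),
    (3, 20, 2.0), (3, 30, 1.0),
]
user_f, item_f, losses = als_rank1(sample_ratings)

# Estimated rating for a user-item pair that has no observed rating.
predicted = user_f[1] * item_f[30]
```

The per-iteration values in `losses` play the same role as the training loss that the tutorial inspects later, although the numbers here come from toy data.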
Objectives
This tutorial guides you through completing the following tasks:
- Creating a matrix factorization model by using the CREATE MODEL statement.
- Evaluating the model by using the ML.EVALUATE function.
- Generating movie recommendations for users by using the model with the ML.RECOMMEND function.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.
Required permissions
- To create the dataset, you need the bigquery.datasets.create IAM permission.
- To create the connection resource, you need the following permissions:
  - bigquery.connections.create
  - bigquery.connections.get
- To create the model, you need the following permissions:
  - bigquery.jobs.create
  - bigquery.models.create
  - bigquery.models.getData
  - bigquery.models.updateData
  - bigquery.connections.delegate
- To run inference, you need the following permissions:
  - bigquery.models.getData
  - bigquery.jobs.create
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Create a dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter bqml_tutorial.
- For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.
- Leave the remaining default settings as they are, and click Create dataset.
Upload the Movielens data
Upload the movielens1m data into BigQuery by using the bq command-line tool.
Follow these steps to upload the movielens1m data:
Open Cloud Shell.
Upload the ratings data into the ratings table. On the command line, paste in the following commands and press Enter:
# Download and extract the MovieLens 1M archive.
curl -O 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
unzip ml-1m.zip
# Replace the "::" field separator with commas, then load the CSV file.
sed 's/::/,/g' ml-1m/ratings.dat > ratings.csv
bq load --source_format=CSV bqml_tutorial.ratings ratings.csv \
  user_id:INT64,item_id:INT64,rating:FLOAT64,timestamp:TIMESTAMP
Upload the movie data into the movies table. On the command line, paste in the following commands and press Enter:
# Replace the "::" field separator with "@", then load with that delimiter.
sed 's/::/@/g' ml-1m/movies.dat > movie_titles.csv
bq load --source_format=CSV --field_delimiter=@ \
  bqml_tutorial.movies movie_titles.csv \
  movie_id:INT64,movie_title:STRING,genre:STRING
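The two sed commands use different replacement delimiters. A likely reason is that movie titles can contain commas (for example, "Kid, The (1921)"), which would split a title across two CSV fields. The following sketch reproduces the substitution on two sample lines to show the effect. The sample lines follow the `::`-separated layout of the source files, but the IDs and rating values are placeholders, and the helper names are hypothetical.

```python
import csv
import io


def convert_line(line, delimiter):
    """Mimic sed 's/::/<delimiter>/g' on a single line."""
    return line.replace("::", delimiter)


def parse(line, delimiter):
    """Split one delimited line into fields the way a CSV loader would."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter))


# Placeholder lines in the user::item::rating::timestamp and
# movie::title::genre layouts.
rating_line = "1::1193::5::978300760"
movie_line = "1::Kid, The (1921)::Action"

rating_fields = parse(convert_line(rating_line, ","), ",")
movie_fields_at = parse(convert_line(movie_line, "@"), "@")
movie_fields_comma = parse(convert_line(movie_line, ","), ",")

print(rating_fields)       # four fields, one per column in the ratings schema
print(movie_fields_at)     # three fields, the title stays intact
print(movie_fields_comma)  # four fields, the title is split at its comma
```

With a comma delimiter, the movie line no longer matches the three-column schema passed to bq load, which is why the `@` delimiter is the safer choice for that file.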
Create the model
Create a matrix factorization model and train it on the data in the ratings
table. The model is trained to predict a rating for every user-item pair,
based on the customer-provided movie ratings.
The following CREATE MODEL statement uses these columns to generate recommendations:
- user_id: The user ID.
- item_id: The movie ID.
- rating: The explicit rating from 1 to 5 that the user gave the item.
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE OR REPLACE MODEL `bqml_tutorial.mf_explicit`
  OPTIONS (
    MODEL_TYPE = 'matrix_factorization',
    FEEDBACK_TYPE = 'explicit',
    USER_COL = 'user_id',
    ITEM_COL = 'item_id',
    L2_REG = 9.83,
    NUM_FACTORS = 34)
AS
SELECT
  user_id,
  item_id,
  rating
FROM `bqml_tutorial.ratings`;
The query takes about 10 minutes to complete, after which the mf_explicit model appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.
Get training statistics
Optionally, you can view the model's training statistics in the Google Cloud console.
A machine learning algorithm builds a model by creating many iterations of the model using different parameters, and then selecting the version of the model that minimizes loss. This process is called empirical risk minimization. The model's training statistics let you see the loss associated with each iteration of the model.
Follow these steps to view the model's training statistics:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project, expand the bqml_tutorial dataset, and then expand the Models folder.
Click the mf_explicit model and then click the Training tab.
In the View as section, click Table. The results should look similar to the following:
+-----------+--------------------+--------------------+
| Iteration | Training Data Loss | Duration (seconds) |
+-----------+--------------------+--------------------+
| 11        | 0.3943             | 42.59              |
+-----------+--------------------+--------------------+
| 10        | 0.3979             | 27.37              |
+-----------+--------------------+--------------------+
| 9         | 0.4038             | 40.79              |
+-----------+--------------------+--------------------+
| ...       | ...                | ...                |
+-----------+--------------------+--------------------+
The Training Data Loss column represents the loss metric calculated after the model is trained. Because this is a matrix factorization model, this column shows the mean squared error.
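As a rough illustration of the metric in the Training Data Loss column, mean squared error averages the squared differences between the actual and predicted ratings. The sketch below uses made-up ratings and a hypothetical helper name. It shows the standard definition, not BigQuery ML's internal code.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up ratings for four user-item pairs.
actual_ratings = [5.0, 3.0, 4.0, 1.0]
predicted_ratings = [4.5, 3.5, 4.0, 2.0]

mse = mean_squared_error(actual_ratings, predicted_ratings)
print(mse)  # squared errors 0.25, 0.25, 0.0, 1.0 average to 0.375
```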
You can also use the ML.TRAINING_INFO function to see model training statistics.
Evaluate the model
Evaluate the performance of the model by using the ML.EVALUATE
function.
The ML.EVALUATE
function evaluates the predicted movie ratings returned by the
model against the actual user movie ratings from the training data.
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `bqml_tutorial.mf_explicit`,
    (
      SELECT
        user_id,
        item_id,
        rating
      FROM `bqml_tutorial.ratings`
    ));
The results should look similar to the following:
+---------------------+---------------------+------------------------+-----------------------+--------------------+--------------------+
| mean_absolute_error | mean_squared_error  | mean_squared_log_error | median_absolute_error | r2_score           | explained_variance |
+---------------------+---------------------+------------------------+-----------------------+--------------------+--------------------+
| 0.48494444327829156 | 0.39433706592870565 | 0.025437895793637522   | 0.39017059802629905   | 0.6840033369412044 | 0.6840033369412264 |
+---------------------+---------------------+------------------------+-----------------------+--------------------+--------------------+
An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that indicates how closely the model's predictions approximate the actual data. A value of 0 indicates that the model explains none of the variability of the response data around the mean. A value of 1 indicates that the model explains all the variability of the response data around the mean.
For more information about the ML.EVALUATE function output, see Matrix factorization models.
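The two reference points for the R2 score can be checked with a small sketch. It assumes the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The data and the helper name are made up for illustration.

```python
def r2_score(actual, predicted):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up ratings with a mean of 3.0.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]

# Predicting every rating exactly explains all of the variability.
perfect = r2_score(actual, actual)

# Always predicting the mean explains none of the variability.
mean_only = r2_score(actual, [3.0] * len(actual))

print(perfect, mean_only)  # 1.0 0.0
```

By this definition, the r2_score of about 0.68 in the sample results means that the model accounts for roughly two thirds of the variance in the ratings it was evaluated on.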
You can also call ML.EVALUATE without providing the input data. In that case, the function uses the evaluation metrics calculated during training.
Get the predicted ratings for a subset of user-item pairs
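Use the ML.RECOMMEND function to get the predicted rating for each movie for five users. The subquery selects user IDs from five rows of the ratings table, so the results cover at most five distinct users.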
Follow these steps to get predicted ratings:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT
  *
FROM
  ML.RECOMMEND(
    MODEL `bqml_tutorial.mf_explicit`,
    (
      SELECT user_id
      FROM `bqml_tutorial.ratings`
      LIMIT 5
    ));
The results should look similar to the following:
+--------------------+---------+---------+
| predicted_rating   | user_id | item_id |
+--------------------+---------+---------+
| 4.2125303962491873 | 4       | 3169    |
+--------------------+---------+---------+
| 4.8068920531981263 | 4       | 3739    |
+--------------------+---------+---------+
| 3.8742203494732403 | 4       | 3574    |
+--------------------+---------+---------+
| ...                | ...     | ...     |
+--------------------+---------+---------+
Generate recommendations
Use the predicted ratings to generate the top five recommended movies for each user.
Follow these steps to generate recommendations:
In the Google Cloud console, go to the BigQuery page.
Write the predicted ratings to a table. In the query editor, paste in the following query and click Run:
CREATE OR REPLACE TABLE `bqml_tutorial.recommend`
AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.mf_explicit`);
Join the predicted ratings with the movie information, and select the top five results per user. In the query editor, paste in the following query and click Run:
SELECT
  user_id,
  ARRAY_AGG(
    STRUCT(movie_title, genre, predicted_rating)
    ORDER BY predicted_rating DESC
    LIMIT 5)
FROM
  (
    SELECT
      user_id,
      item_id,
      predicted_rating,
      movie_title,
      genre
    FROM `bqml_tutorial.recommend`
    JOIN `bqml_tutorial.movies`
      ON item_id = movie_id
  )
GROUP BY user_id;
The results should look similar to the following:
+---------+-------------------------------------+------------------------+--------------------+
| user_id | f0_movie_title                      | f0_genre               | predicted_rating   |
+---------+-------------------------------------+------------------------+--------------------+
| 4597    | Song of Freedom (1936)              | Drama                  | 6.8495752907364009 |
|         | I Went Down (1997)                  | Action/Comedy/Crime    | 6.7203235758772877 |
|         | Men With Guns (1997)                | Action/Drama           | 6.399407352232001  |
|         | Kid, The (1921)                     | Action                 | 6.1952890198126731 |
|         | Hype! (1996)                        | Documentary            | 6.1895766097451475 |
+---------+-------------------------------------+------------------------+--------------------+
| 5349    | Fandango (1985)                     | Comedy                 | 9.944574012151549  |
|         | Breakfast of Champions (1999)       | Comedy                 | 9.55661860430112   |
|         | Funny Bones (1995)                  | Comedy                 | 9.52778917835076   |
|         | Paradise Road (1997)                | Drama/War              | 9.1643621767929133 |
|         | Surviving Picasso (1996)            | Drama                  | 8.807353289233772  |
+---------+-------------------------------------+------------------------+--------------------+
| ...     | ...                                 | ...                    | ...                |
+---------+-------------------------------------+------------------------+--------------------+
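The join-then-rank logic of the preceding query can be sketched outside BigQuery as follows. The sketch assumes the predictions and movie titles are already in memory, and every name and value in it is hypothetical. It mirrors the inner join on the movie ID and the per-user ORDER BY predicted_rating DESC with a LIMIT.

```python
from collections import defaultdict


def top_n_per_user(predictions, titles, n=5):
    """Return each user's n highest-rated movies.

    predictions: iterable of (user_id, item_id, predicted_rating) tuples.
    titles: dict that maps movie_id to movie title.
    """
    by_user = defaultdict(list)
    for user_id, item_id, rating in predictions:
        # Inner join: skip predictions whose item has no matching movie row.
        if item_id in titles:
            by_user[user_id].append((titles[item_id], rating))
    return {
        user_id: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for user_id, rows in by_user.items()
    }


# Made-up predictions and titles; item 99 has no title and is dropped.
sample_predictions = [
    (1, 10, 4.2), (1, 20, 4.8), (1, 30, 3.9),
    (2, 10, 2.5), (2, 99, 5.0),
]
sample_titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}

top_two = top_n_per_user(sample_predictions, sample_titles, n=2)
print(top_two)
```

Like the query, this ranks only by predicted rating. It does not exclude movies that a user has already rated.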
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, its tables and model, and all the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Try creating a matrix factorization model based on implicit feedback.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about machine learning, see the Machine learning crash course.