This tutorial introduces data analysts to the matrix factorization model in BigQuery ML. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.
In this tutorial, you learn how to create a model from explicit feedback
using the movielens1m
dataset
to make recommendations given a movie ID and user id
The movielens dataset contains the ratings from a scale of 1 to 5 that users gave to movies, along with metadata of the movie such as genre.
Objectives
In this tutorial, you use:
- BigQuery ML to create an explicit recommendations model using the
CREATE MODEL
statement - The
ML.EVALUATE
function to evaluate the ML models - The
ML.WEIGHTS
function to inspect the latent factor weights generated during training. - The
ML.RECOMMEND
function to produce recommendations for a user.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create dataset.On the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the
US
multi-region. For simplicity, store your dataset in the same location.Leave the remaining default settings as they are, and click Create dataset.
Step two: Load the Movielens dataset into BigQuery
The following are the steps to load the 1m movielens dataset into
BigQuery using the
BigQuery command-line tools.
A dataset called movielens
will be created and the relevant movielens
tables will be stored in it.
curl -O 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
unzip ml-1m.zip
bq mk --dataset movielens
sed 's/::/,/g' ml-1m/ratings.dat > ratings.csv
bq load --source_format=CSV movielens.movielens_1m ratings.csv \
user_id:INT64,item_id:INT64,rating:FLOAT64,timestamp:TIMESTAMP
Because the movie titles contain colons, commas and pipes, we need to use a different delimiter. To load the movie titles, a slightly different variant of the last two commands should be used.
sed 's/::/@/g' ml-1m/movies.dat > movie_titles.csv
bq load --source_format=CSV --field_delimiter=@ \
movielens.movie_titles movie_titles.csv \
movie_id:INT64,movie_title:STRING,genre:STRING
Step three: Create your explicit recommendations model
Next, you create an explicit recommendations model using the movielens sample table that was loaded in the previous step. The following GoogleSQL query is used to create the model that will be used to predict a rating for every user-item pair.
#standardSQL CREATE OR REPLACE MODEL `bqml_tutorial.my_explicit_mf_model` OPTIONS (model_type='matrix_factorization', user_col='user_id', item_col='item_id', l2_reg=9.83, num_factors=34) AS SELECT user_id, item_id, rating FROM `movielens.movielens_1m`
In addition to creating the model, running the CREATE MODEL
command trains the
model you create.
Query details
The CREATE MODEL
clause is used to create and train the model named bqml_tutorial.my_explicit_mf_model
.
The OPTIONS(model_type='matrix_factorization', user_col='user_id', ...)
clause
indicates that you are creating a matrix factorization
model. By default, this will create an explicit matrix factorization model unless
feedback_type='IMPLICIT'
is specified. An example of how to create an implicit matrix factorization model
will be explained in Using BigQuery ML to make recommendations for implicit
feedback.
This query's SELECT
statement uses the following columns to generate
recommendations.
user_id
—The user ID (INT64).item_id
—The movie ID (INT64).rating
—The explicit rating from 1 to 5 that theuser_id
gave theitem_id
(FLOAT64).
The FROM
clause—movielens.movielens_1m
—
indicates that you are querying the movielens_1m
table in the movielens
dataset.
This dataset is in your bigquery project if the instructions in step two were followed.
Run the CREATE MODEL query
To run the CREATE MODEL
query to create and train your model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL CREATE OR REPLACE MODEL `bqml_tutorial.my_explicit_mf_model` OPTIONS (model_type='matrix_factorization', user_col='user_id', item_col='item_id', l2_reg=9.83, num_factors=34) AS SELECT user_id, item_id, rating FROM `movielens.movielens_1m`
Click Run.
The query takes about 10 minutes to complete, after which your model (
my_explicit_mf_model
) appears in the navigation panel of the Google Cloud console. Because the query uses aCREATE MODEL
statement to create a model, you do not see query results.
Step four (optional): Get training statistics
To see the results of the model training, you can use the
ML.TRAINING_INFO
function, or you can view the statistics in the Google Cloud console. In this
tutorial, you use the Google Cloud console.
A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.
To see the model training statistics that were generated when you ran the
CREATE MODEL
query:
In the Google Cloud console navigation panel, in the Resources section, expand [PROJECT_ID] > bqml_tutorial and then click my_explicit_mf_model.
Click the Training tab, and then click Table. The results should look like the following:
The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Since you performed matrix factorization, this column is the mean squared error. By default, matrix factorization models will not split the data, so the Evaluation Data Loss column will not be present unless a holdout dataset is specified because splitting the data has a chance of losing all the ratings for a user or an item. As a result, the model will not have latent factor information about missing users or items.
For more details on the
ML.TRAINING_INFO
function, see the BigQuery ML syntax reference.
Step five: Evaluate your model
After creating your model, you evaluate the performance of the recommender using
the ML.EVALUATE
function. The ML.EVALUATE
function evaluates the predicted
ratings against the actual ratings.
The query used to evaluate the model is as follows:
#standardSQL SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.my_explicit_mf_model`, ( SELECT user_id, item_id, rating FROM `movielens.movielens_1m`))
Query details
The top-most SELECT
statement retrieves the columns from your model.
The FROM
clause uses the ML.EVALUATE
function against your model: bqml_tutorial.my_explicit_mf_model
.
This query's nested SELECT
statement and FROM
clause are the same as those
in the CREATE MODEL
query.
You can also call ML.EVALUATE
without providing the input data. It will
use the evaluation metrics calculated during training:
#standardSQL SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.my_explicit_mf_model`)
Run the ML.EVALUATE query
To run the ML.EVALUATE
query that evaluates the model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.my_explicit_mf_model`, ( SELECT user_id, item_id, rating FROM `movielens.movielens_1m`))
(Optional) To set the processing location, click More > Query settings. For Processing location, choose
US
. This step is optional because the processing location is automatically detected based on the dataset's location.Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Because you performed an explicit matrix factorization, the results include the following columns:
mean_absolute_error
mean_squared_error
mean_squared_log_error
median_absolute_error
r2_score
explained_variance
An important metric in the evaluation results is the R2 score. The R2 score is a statistical measure that determines if the linear regression predictions approximate the actual data. 0 indicates that the model explains none of the variability of the response data around the mean. 1 indicates that the model explains all the variability of the response data around the mean.
Step six: Use your model to predict ratings and make recommendations
Find all the item ratings for a set of users
ML.RECOMMEND
does not need to take any additional arguments besides the model,
but can take in an optional table. If the input table only has one column that
matches the name of the input user
or input item
column, then all of the
predicted item ratings for each user
will be outputed and vice versa. Note
that if all of the users
or all of the items
are in the input table, it
will output the same results as passing no optional argument to ML.RECOMMEND
.
The following is an example of a query to fetch all of the predicted movie ratings for 5 users:
#standardSQL SELECT * FROM ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`, ( SELECT user_id FROM `movielens.movielens_1m` LIMIT 5))
Query details
The top-most SELECT
statement retrieves the user
, item
, and predicted_rating
column.
This last column is generated by the ML.RECOMMEND
function. When you use the
ML.RECOMMEND
function, the output column name for the model is
predicted_<rating_column_name>
. For explicit matrix factorization models, predicted_rating
is the estimated value of rating
.
The ML.RECOMMEND
function is used to predict ratings using your model: bqml_tutorial.my_explicit_mf_model
.
This query's nested SELECT
statement selects just the user_id
column from
the original table used for training.
The LIMIT
clause—LIMIT 5
—will randomly filter out 5
user_id
s to send to ML.RECOMMEND
.
Find the ratings for all user-item pairs
Now that you have evaluated your model, the next step is to use it to predict a rating. You use your model to predict the ratings of every user-item combination in the following query:
#standardSQL SELECT * FROM ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`)
Query details
The top-most SELECT
statement retrieves the user
, item
, and predicted_rating
column.
This last column is generated by the ML.RECOMMEND
function. When you use the
ML.RECOMMEND
function, the output column name for the model is
predicted_<rating_column_name>
. For explicit matrix factorization models, predicted_rating
is the estimated value of rating
.
The ML.RECOMMEND
function is used to predict ratings using your model: bqml_tutorial.my_explicit_mf_model
.
One way of saving the result to table is:
#standardSQL CREATE OR REPLACE TABLE `bqml_tutorial.recommend_1m` OPTIONS() AS SELECT * FROM ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`)
If a Query Exceeded Resource Limits
error occurs for ML.RECOMMEND
, retry
with a higher billing tier. In the BigQuery command-line tool, this can be set using
the flag --maximum_billing_tier
.
Generate recommendations
Using the previous recommendations query, we can order by the predicted rating
and output the top predicted items for each user. The following query joins the
item_ids
with the movie_ids
found in the movielens.movie_titles
table
uploaded earlier and outputs the top 5 recommended movies per user.
#standardSQL SELECT user_id, ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating) ORDER BY predicted_rating DESC LIMIT 5) FROM ( SELECT user_id, item_id, predicted_rating, movie_title, genre FROM `bqml_tutorial.recommend_1m` JOIN `movielens.movie_titles` ON item_id = movie_id) GROUP BY user_id
Query details
The inner SELECT
statement performs an inner join on item_id
from the
recommendation results table and movie_id
from the movielens.movie_titles
table. movielens.movie_titles
not only maps movie_id
to a movie name, but
it also includes the genres of the movie as listed by IMDB.
The top level SELECT
statement aggregates the results from the nested
SELECT
statement by using GROUPS BY user_id
to aggregate the movie_title,
genre,
and predicted_rating
in descending order and only keeps the top 5
movies.
Run the ML.RECOMMEND query
To run the ML.RECOMMEND
query that outputs the top 5 recommended movies per user:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL CREATE OR REPLACE TABLE `bqml_tutorial.recommend_1m` OPTIONS() AS SELECT * FROM ML.RECOMMEND(MODEL `bqml_tutorial.my_explicit_mf_model`)
Click Run.
When the query finishes running, (
bqml_tutorial.recommend_1m
) will appear in the navigation panel. Because the query uses aCREATE TABLE
statement to create a table, you do not see the query results.Compose another new query. Enter the following GoogleSQL query in the Query editor text area once the previous query has finished running.
#standardSQL SELECT user_id, ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating) ORDER BY predicted_rating DESC LIMIT 5) FROM ( SELECT user_id, item_id, predicted_rating, movie_title, genre FROM `bqml_tutorial.recommend_1m` JOIN `movielens.movie_titles` ON item_id = movie_id) GROUP BY user_id
(Optional) To set the processing location, click More > Query settings. For Processing location, choose
US
. This step is optional because the processing location is automatically detected based on the dataset's location.Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Since we had additional metadata information about each movie_id
beyond an
INT64
, we can see info like genre about the top 5 recommended movies for each
user. If you don't have a movietitles
equivalent table for your training
data, the results might not be as human interpretable with just number ids or
hashes.
Top genres per factor
If you are curious as to what genre each latent factor might correlate to, you can run the following query:
#standardSQL SELECT factor, ARRAY_AGG(STRUCT(feature, genre, weight) ORDER BY weight DESC LIMIT 10) AS weights FROM ( SELECT * EXCEPT(factor_weights) FROM ( SELECT * FROM ( SELECT factor_weights, CAST(feature AS INT64) as feature FROM ML.WEIGHTS(model `bqml_tutorial.my_explicit_mf_model`) WHERE processed_input= 'item_id') JOIN `movielens.movie_titles` ON feature = movie_id) weights CROSS JOIN UNNEST(weights.factor_weights) ORDER BY feature, weight DESC) GROUP BY factor
Query details
The inner most SELECT
statement gets the item_id
or movie factor weights
array and then joins it against the movielens.movie_titles
table to get the
genre of each item id.
The result of which is then CROSS JOIN
ed with each factor_weights
array
whose result is then ORDER BY feature, weight DESC
.
Finally, the top level SELECT
statement aggregates the results from its inner
statement by factor
and creates an array for each factor ordered by the
weight of each genre.
Run the query
To run the above query that outputs the top 10 movie genres per factor:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT factor, ARRAY_AGG(STRUCT(feature, genre, weight) ORDER BY weight DESC LIMIT 10) AS weights FROM ( SELECT * EXCEPT(factor_weights) FROM ( SELECT * FROM ( SELECT factor_weights, CAST(feature AS INT64) as feature FROM ML.WEIGHTS(model `bqml_tutorial.my_explicit_mf_model`) WHERE processed_input= 'item_id') JOIN `movielens.movie_titles` ON feature = movie_id) weights CROSS JOIN UNNEST(weights.factor_weights) ORDER BY feature, weight DESC) GROUP BY factor
(Optional) To set the processing location, click More > Query settings. For Processing location, choose
US
. This step is optional because the processing location is automatically detected based on the dataset's location.Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (
bqml_tutorial
) and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.