This article is the first part of a multi-part tutorial series that shows you how to implement a machine-learning (ML) recommendation system with TensorFlow 1.x and AI Platform in Google Cloud Platform (GCP). This part shows you how to install the TensorFlow model code on a development system and run the model on the MovieLens dataset.
The recommendation system in the tutorial uses the weighted alternating least squares (WALS) algorithm. WALS is included in the contrib.factorization package of the TensorFlow 1.x code base, and is used to factorize a large matrix of user and item ratings. For more information about WALS, see the overview.
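Before diving into the sample code, it helps to see the idea behind matrix factorization in isolation. The following toy numpy sketch (illustrative only, not part of the tutorial code) shows how a ratings matrix R is approximated by the product of a user-factor matrix X and an item-factor matrix Y, and how a single rating is predicted:

import numpy as np

num_users, num_items, k = 4, 5, 2   # k is the number of latent factors
rng = np.random.RandomState(42)
X = rng.rand(num_users, k)          # one k-dimensional factor vector per user
Y = rng.rand(num_items, k)          # one k-dimensional factor vector per item

# WALS learns X and Y so that X.dot(Y.T) approximates the observed ratings.
# The predicted rating for user u and item i is the dot product of their factors.
R_hat = X.dot(Y.T)
print(R_hat[0, 3])                  # predicted rating of item 3 by user 0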
This article describes the code of the model in detail, including data preprocessing and executing the WALS algorithm in TensorFlow.
The series consists of these parts:
- Overview
- Create the model (Part 1) (this tutorial)
- Train and Tune on AI Platform (Part 2)
- Apply to Data from Google Analytics (Part 3)
- Deploy the Recommendation System (Part 4)
Objectives
- Understand the structure of TensorFlow code used for applying WALS to matrix factorization.
- Run the sample TensorFlow code locally to perform recommendations on the MovieLens dataset.
Costs
This tutorial uses Cloud Storage and AI Platform, and both are billable services. You can use the pricing calculator to estimate the costs for your projected usage. The projected cost for this tutorial is $0.15. If you are a new GCP user, you might be eligible for a free trial.
Before you begin
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Compute Engine and AI Platform APIs.
Running the TensorFlow model
You can run the steps in this section on a Compute Engine instance with at least 7 GB of memory, as explained in the procedure that follows. Alternatively, you can run the steps in this section on a local macOS or Linux system; in that case, you don't have to create a Compute Engine instance.
Creating the Compute Engine instance
In the Google Cloud Console, go to the VM Instances page.
Click Create instance.
Name your instance whatever you like, and pick a zone. If you do not already have a preferred zone, pick one that is geographically close to you.
In the Machine type drop-down list, select n1-standard-2 (2 vCPUs, 7.5 GB memory).
In the Access scopes section, select Allow full access to all Cloud APIs.
Click Create.
Installing the code
Go to the VM instances page.
In the row for your instance, click SSH to open a browser-based terminal that is securely connected to the instance.
In the new terminal window, update your instance's software repositories.
sudo apt-get update
Install git, bzip2, and unzip:
sudo apt-get install -y git bzip2 unzip
Clone the sample code repository:
git clone https://github.com/GoogleCloudPlatform/tensorflow-recommendation-wals
Install miniconda. The sample code requires Python 2.7.
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
export PATH="/home/$USER/miniconda2/bin:$PATH"
Install the Python packages and TensorFlow. The sample code runs on any 1.x version of TensorFlow.
cd tensorflow-recommendation-wals
conda create -y -n tfrec
conda install -y -n tfrec --file conda.txt
source activate tfrec
pip install -r requirements.txt
pip install tensorflow==1.15
Download the MovieLens dataset. There are several versions of the dataset. For development purposes, we recommend using the 100k version, which contains 100,000 ratings from 943 users on 1,682 items. Download it using the following commands:
curl -O 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
unzip ml-100k.zip
mkdir -p data
cp ml-100k/u.data data/
For training purposes, we recommend the 1m dataset, which contains one million ratings. The format of the 1m set is a little different from the 100k set. For example, the ratings file is delimited with :: characters. The sample code lets you use the --delimiter argument to specify that the dataset uses this delimiter.
curl -O 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
unzip ml-1m.zip
mkdir -p data
cp ml-1m/ratings.dat data/
You can also use the 20m dataset with this project. The dataset is supplied in a CSV file. If you use this dataset, you must pass the --headers flag, because the file contains a header line.
curl -O 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
unzip ml-20m.zip
mkdir -p data
cp ml-20m/ratings.csv data/
Understand the model code
The model code is contained in the wals_ml_engine directory. The code's high-level functionality is implemented by the following files:
mltrain.sh
- Launches various types of AI Platform jobs. This shell script accepts arguments for the location of the dataset file, the delimiter used to separate values in the file, and whether the data file has a header line. It's a best practice to create a script that automatically configures and executes AI Platform jobs.
task.py
- Parses the arguments for the AI Platform job and executes training.
model.py
- Loads the dataset.
- Creates two sparse matrices from the data, one for training and one for testing.
- Executes WALS on the training sparse matrix of ratings.
wals.py
- Creates the WALS model.
- Executes the WALS algorithm.
- Calculates the root-mean-square error (RMSE) for a set of row/column factors and a ratings matrix.
How the model preprocesses data
The model code performs data preprocessing to create a sparse ratings matrix and prepare it for matrix factorization. This involves the following steps:
The model code loads the data from a delimited text file. Each row contains a single rating.
ratings_df = pd.read_csv(input_file,
                         sep=args['delimiter'],
                         names=headers,
                         header=header_row,
                         dtype={
                             'user_id': np.int32,
                             'item_id': np.int32,
                             'rating': np.float32,
                             'timestamp': np.int32,
                         })
The code establishes a 0-indexed set of unique IDs for users and items. This guarantees that a unique ID corresponds to specific row and column indexes of the sparse ratings matrix.
The MovieLens 100k data uses 1-based IDs, where the lowest index of the unique set is 1. To normalize, the code subtracts one from each index. From model.py:
ratings = ratings_df.as_matrix(['user_id', 'item_id', 'rating'])
# deal with 1-based user indices
ratings[:,0] -= 1
ratings[:,1] -= 1
The 1m and 20m MovieLens datasets skip some user and item IDs. This creates a problem: you have to map the set of unique user IDs to an index set equal to [0 ... num_users-1] and do the same for item IDs. The item mapping is accomplished using the following numpy code. The code creates an array of size [0..max_item_id] to perform the mapping, so if the maximum item ID is very large, this method might use too much memory.
np_items = ratings_df.item_id.as_matrix()
unique_items = np.unique(np_items)
n_items = unique_items.shape[0]
max_item = unique_items[-1]

# map unique items down to an array 0..n_items-1
z = np.zeros(max_item+1, dtype=int)
z[unique_items] = np.arange(n_items)
i_r = z[np_items]
The code for mapping users is essentially the same as the code for items.
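To make the technique concrete, here is a small standalone sketch of the same mapping applied to user IDs (the IDs below are made up for illustration; see model.py for the exact code):

import numpy as np

# Hypothetical user IDs with gaps (for example, several IDs are unused):
np_users = np.array([1, 3, 3, 8, 10])
unique_users = np.unique(np_users)   # [1, 3, 8, 10]
n_users = unique_users.shape[0]      # 4
max_user = unique_users[-1]          # 10

# map unique users down to an array 0..n_users-1
z = np.zeros(max_user + 1, dtype=int)
z[unique_users] = np.arange(n_users)
u_r = z[np_users]
print(u_r)                           # [0, 1, 1, 2, 3]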
The model code randomly selects a test set of ratings. By default, 10% of the ratings are chosen for the test set. These ratings are removed from the training set and will be used to evaluate the predictive accuracy of the user and item factors.
test_set_size = len(ratings) / TEST_SET_RATIO
test_set_idx = np.random.choice(xrange(len(ratings)),
                                size=test_set_size, replace=False)
test_set_idx = sorted(test_set_idx)

ts_ratings = ratings[test_set_idx]
tr_ratings = np.delete(ratings, test_set_idx, axis=0)
Finally, the code creates a scipy sparse matrix in coordinate form (coo_matrix) that includes the user and item indexes and ratings. The coo_matrix object acts as a wrapper for a sparse matrix. It also validates the user and rating indexes, checking for preprocessing errors.
u_tr, i_tr, r_tr = zip(*tr_ratings)
tr_sparse = coo_matrix((r_tr, (u_tr, i_tr)), shape=(n_users, n_items))
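To make the coordinate format concrete, here is a tiny standalone example (illustrative only, not from the sample code):

import numpy as np
from scipy.sparse import coo_matrix

# Three ratings: user 0 rated item 1 with 4.0, user 1 rated item 0 with 3.0,
# and user 2 rated item 2 with 5.0.
u = np.array([0, 1, 2])
i = np.array([1, 0, 2])
r = np.array([4.0, 3.0, 5.0], dtype=np.float32)

# The (row, col) index pairs and data values are exactly the triplets
# that later feed tf.SparseTensor.
m = coo_matrix((r, (u, i)), shape=(3, 3))
print(m.toarray())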
How the WALS algorithm is implemented in TensorFlow
After the data is preprocessed, the code passes the sparse training matrix into the TensorFlow WALS model to be factorized into row factor X and column factor Y.
The TensorFlow code that executes the model is relatively simple, because it relies on the WALSModel class included in the contrib.factorization_ops module of TensorFlow.
A SparseTensor object is initialized with user IDs and item IDs as indices, and with the ratings as values. From wals.py:
input_tensor = tf.SparseTensor(indices=zip(data.row, data.col),
                               values=(data.data).astype(np.float32),
                               dense_shape=data.shape)
The data variable is the coo_matrix object of training ratings created in the preprocessing step.
The model is instantiated:
model = factorization_ops.WALSModel(num_rows, num_cols, dim,
                                    unobserved_weight=unobs,
                                    regularization=reg,
                                    row_weights=row_wts,
                                    col_weights=col_wts)
The row factor and column factor tensors are created automatically by the WALSModel class, and are retrieved so they can be evaluated after factoring the matrix:
# retrieve the row and column factors
row_factor = model.row_factors[0]
col_factor = model.col_factors[0]
The training process executes the following loop within a TensorFlow session, using the simple_train method in wals.py:
row_update_op = model.update_row_factors(sp_input=input_tensor)[1]
col_update_op = model.update_col_factors(sp_input=input_tensor)[1]

sess.run(model.initialize_op)
sess.run(model.worker_init)
for _ in xrange(num_iterations):
    sess.run(model.row_update_prep_gramian_op)
    sess.run(model.initialize_row_update_op)
    sess.run(row_update_op)
    sess.run(model.col_update_prep_gramian_op)
    sess.run(model.initialize_col_update_op)
    sess.run(col_update_op)
After num_iterations iterations have been executed, the row and column factor tensors are evaluated in the session to produce numpy arrays for each factor:
# evaluate output factor matrices
output_row = row_factor.eval(session=session)
output_col = col_factor.eval(session=session)
These factor arrays are used to calculate the RMSE on the test set of ratings. The two arrays are also saved in the output directory in numpy format.
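Conceptually, the RMSE calculation reduces to comparing the dot product of each matched user and item factor pair with the held-out rating. A minimal sketch of that calculation (the actual implementation lives in wals.py; the function name and signature below are assumed for illustration):

import numpy as np

def rmse_from_factors(output_row, output_col, test_ratings):
    # output_row: user factors, shape (num_users, k)
    # output_col: item factors, shape (num_items, k)
    # test_ratings: array of (user_index, item_index, rating) rows
    users = test_ratings[:, 0].astype(int)
    items = test_ratings[:, 1].astype(int)
    actual = test_ratings[:, 2]
    # predicted rating = dot product of the user and item factor vectors
    predicted = np.sum(output_row[users] * output_col[items], axis=1)
    return np.sqrt(np.mean((predicted - actual) ** 2))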
Training the model
In this context, training the model involves factoring a sparse matrix of ratings into a user factor matrix X and an item factor matrix Y. The saved user and item factors can be used as the base model for a recommendation system.
This system takes a user as input, retrieves the vector of user factors for that user from X, multiplies that vector by all the item factors Y, and returns the top N items according to predicted rating.
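For illustration, that retrieval step can be sketched in a few lines of numpy (a hypothetical helper, not the production code; a real system would also filter out items the user has already rated):

import numpy as np

def top_n_items(user_factors, item_factors, user_idx, n=5):
    # predicted ratings for this user across all items
    scores = item_factors.dot(user_factors[user_idx])
    # indices of the n highest-scoring items, best first
    return np.argsort(scores)[::-1][:n]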
Part 4 of this tutorial set provides more detail on a system that uses the trained model to perform predictions, and shows how to deploy such a system on GCP.
Train the model locally
Training the model locally is useful for development purposes. It allows you to rapidly test code changes and to include breakpoints for easy debugging. To run the model on Cloud Shell or from your local system, run the mltrain.sh script from the wals_ml_engine directory, using the local option.
cd wals_ml_engine
For the MovieLens 100k dataset, specify the path to the 100k data file:
./mltrain.sh local ../data u.data
For the MovieLens 1m dataset, include the --delimiter option and specify the path to the 1m data file:
./mltrain.sh local ../data ratings.dat --delimiter ::
For the MovieLens 20m dataset, use the --delimiter and --headers options:
./mltrain.sh local ../data ratings.csv --headers --delimiter ,
The training job output shows the RMSE calculated on the test set. For the 1m dataset, using the default hyperparameters specified in the source code, the output should look like the following:
INFO:tensorflow:Train Start: <timestamp>
...
INFO:tensorflow:Train Finish: <timestamp>
INFO:tensorflow:train RMSE = 1.29
INFO:tensorflow:test RMSE = 1.34
For more details, refer to Part 2.
The RMSE corresponds to the average error in the predicted ratings compared to the test set. On average, each rating produced by the algorithm on the 1m dataset is within ±1.34 of the actual user rating in the test set. The WALS algorithm performs much better with tuned hyperparameters, as shown in Part 2 of this series.
Cleaning up
If you created a Compute Engine instance for running TensorFlow, you must stop it to avoid incurring charges to your GCP account. This instance is not used in Part 2 of the series; however, it is required for Part 3 and Part 4.
Stopping the Compute Engine instance
- In the Cloud Console, open the Compute Engine VM Instance list page.
- Select the instance name.
- Click Stop and confirm the operation.
Delete the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- The next tutorial, Recommendations in TensorFlow: Train and Tune on AI Platform (Part 2), explains how to train the recommendation model on AI Platform and tune the hyperparameters.