Recommendations in TensorFlow: Create the Model

This article is the first part of a multi-part tutorial series that shows you how to implement a machine-learning (ML) recommendation system with TensorFlow and Cloud Machine Learning Engine in Google Cloud Platform (GCP). This part shows you how to install the TensorFlow model code on a development system and run the model on the MovieLens dataset.

The recommendation system in the tutorial uses the weighted alternating least squares (WALS) algorithm. WALS is included in the contrib.factorization package of the TensorFlow code base, and is used to factorize a large matrix of user and item ratings. For more information about WALS, see the overview.
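
As background, weighted matrix factorization of this kind minimizes a loss of roughly the following form, where r_ij are the observed ratings, x_i and y_j are rows of the user factor matrix X and the item factor matrix Y, w_ij are the observation weights, w_0 is the weight applied to unobserved entries (the unobserved_weight parameter), and λ is the regularization coefficient. This is the standard formulation; see the WALSModel documentation for the exact weighting scheme:

$$\min_{X,Y} \sum_{(i,j)\,\text{observed}} w_{ij}\,(r_{ij} - x_i^\top y_j)^2 \;+\; w_0 \sum_{(i,j)\,\text{unobserved}} (x_i^\top y_j)^2 \;+\; \lambda\,(\lVert X\rVert_F^2 + \lVert Y\rVert_F^2)$$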

This article describes the code of the model in detail, including data preprocessing and executing the WALS algorithm in TensorFlow.

The series consists of these parts:

  • Part 1 (this article): Install the model code and run it locally on the MovieLens dataset.
  • Part 2: Train and tune the model's hyperparameters on Cloud Machine Learning Engine.
  • Parts 3 and 4: Use the trained model to make recommendations and deploy a prediction system on GCP.

Objectives

  • Understand the structure of TensorFlow code used for applying WALS to matrix factorization.
  • Run the sample TensorFlow code locally to perform recommendations on the MovieLens dataset.

Costs

This tutorial uses Cloud Storage and Cloud Machine Learning Engine, and both are billable services. You can use the pricing calculator to estimate the costs for your projected usage. The projected cost for this tutorial is $0.15. If you are a new GCP user, you might be eligible for a free trial.

Before you begin

  1. Select or create a GCP project.

    Go to the Manage resources page

  2. Make sure that billing is enabled for your project.

    Learn how to enable billing

  3. Enable the Compute Engine and Cloud Machine Learning Engine APIs.

    Enable the APIs

  4. (Optional) If you want to run the tutorial on your own computer (not using Cloud Shell), make sure that Python 2.7 is installed on your computer.

Running the TensorFlow model

You can run the steps in this section on a Compute Engine instance with at least 7 GB of memory, as explained in the procedure that follows. Alternatively, you can run the steps in this section on a local macOS or Linux system; in that case, you don't have to create a Compute Engine instance.

Creating the Compute Engine instance

  1. In the Google Cloud Platform Console, go to the VM Instances page.

Go to the VM Instances page

  2. Click Create instance.

  3. Name your instance whatever you like, and pick a zone. If you do not already have a preferred zone, pick one that is geographically close to you.
  4. In the Machine type drop-down list, select n1-standard-2 (2 vCPUs).
  5. In the Access scopes section, select Allow full access to all Cloud APIs.
  6. Click Create.

Installing the code

  1. Connect to the new instance. For details, see Connecting to Instances.

  2. In the new instance, clone the sample code repository:

    git clone https://github.com/GoogleCloudPlatform/tensorflow-recommendation-wals

  3. Install Miniconda. The sample code requires Python 2.7.

    wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
    bash Miniconda2-latest-Linux-x86_64.sh

  4. Install the Python packages and TensorFlow. This tutorial assumes TensorFlow version 1.4.1.

    cd tensorflow-recommendation-wals
    conda create -n tfrec
    conda install -n tfrec --file conda.txt
    source activate tfrec
    pip install -r requirements.txt
    pip install tensorflow==1.4.1

  5. Download the MovieLens dataset. There are several versions of the dataset. A quick way to inspect the downloaded data appears after these steps.

    1. For development purposes, we recommend using the 100k version, which contains 100,000 ratings from 943 users on 1682 items. Download it using the following commands:

      curl -O 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
      unzip ml-100k.zip
      mkdir -p data
      cp ml-100k/u.data data/

    2. For training purposes, we recommend the 1m dataset, which contains one million ratings. The format of the 1m set is a little different from the 100k set. For example, the ratings file is delimited with :: characters. The sample code provides a --delimiter argument that lets you specify this delimiter.

      curl -O 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
      unzip ml-1m.zip
      mkdir -p data
      cp ml-1m/ratings.dat data/

    3. You can also use the 20m dataset with this project. The dataset is supplied in a CSV file. If you use this dataset, you must pass the --headers flag, because the file contains a header line.

      curl -O 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
      unzip ml-20m.zip
      mkdir -p data
      cp ml-20m/ratings.csv data/
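
If you want to confirm what you downloaded, the following minimal Python sketch (not part of the sample code) inspects the 100k dataset, whose u.data file is tab-separated with user ID, item ID, rating, and timestamp columns:

import pandas as pd

# u.data is tab-separated: user_id, item_id, rating, timestamp
df = pd.read_csv('data/u.data', sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])
print(df.head())
print('%d ratings from %d users on %d items' % (
    len(df), df.user_id.nunique(), df.item_id.nunique()))

For the 100k dataset, this should report 100,000 ratings from 943 users on 1682 items.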

Understanding the model code

The model code is contained in the wals_ml_engine directory. The code's high-level functionality is implemented by the following files:

mltrain.sh
  • Launches various types of Cloud Machine Learning Engine jobs. This shell script accepts arguments for the location of the dataset file, the delimiter used to separate values in the file, and whether the data file has a header line. It's a best practice to create a script that automatically configures and executes Cloud Machine Learning Engine jobs.
task.py
  • Parses the arguments for the Cloud Machine Learning Engine job and executes training.
model.py
  • Loads the dataset.
  • Creates two sparse matrices from the data, one for training and one for testing. Executes WALS on the training sparse matrix of ratings.
wals.py
  • Creates the WALS model.
  • Executes the WALS algorithm.
  • Calculates the root-mean-square error (RMSE) for a set of row/column factors and a ratings matrix.

How the model preprocesses data

The model code performs data preprocessing to create a sparse ratings matrix and prepare it for matrix factorization. This involves the following steps (a condensed sketch that combines them appears after the list):

  1. The model code loads the data from a delimited text file. Each row contains a single rating.

    ratings_df = pd.read_csv(input_file,
                             sep=args['delimiter'],
                             names=headers,
                             header=header_row,
                             dtype={
                               'user_id': np.int32,
                               'item_id': np.int32,
                               'rating': np.float32,
                               'timestamp': np.int32,
                             })

  2. The code establishes a 0-indexed set of unique IDs for users and items. This ensures that each unique ID corresponds to a specific row or column index of the sparse ratings matrix.

    • The MovieLens 100k data uses 1-based IDs where the lowest index of the unique set is 1. To normalize, the code subtracts one from each index. From model.py:

      ratings = ratings_df.as_matrix(['user_id', 'item_id', 'rating'])
      # deal with 1-based user indices
      ratings[:,0] -= 1
      ratings[:,1] -= 1

    • The 1m and 20m MovieLens datasets skip some user and item IDs. This creates a problem: the set of unique user IDs must be mapped to an index set equal to [0 ... num_users-1], and the same must be done for item IDs. The item mapping is accomplished using the following numpy code. The code creates an array indexed from 0 to max_item_id to perform the mapping, so if the maximum item ID is very large, this method might use too much memory.

      np_items = ratings_df.item_id.as_matrix()
      unique_items = np.unique(np_items)
      n_items = unique_items.shape[0]
      max_item = unique_items[-1]

      # map unique items down to an array 0..n_items-1
      z = np.zeros(max_item+1, dtype=int)
      z[unique_items] = np.arange(n_items)
      i_r = z[np_items]

    • The code for mapping users is essentially the same as the code for items.

  3. The model code randomly selects a test set of ratings. By default, 10% of the ratings are chosen for the test set. These ratings are removed from the training set and will be used to evaluate the predictive accuracy of the user and item factors.

    test_set_size = len(ratings) / TEST_SET_RATIO
    test_set_idx = np.random.choice(xrange(len(ratings)),
                                    size=test_set_size, replace=False)
    test_set_idx = sorted(test_set_idx)

    ts_ratings = ratings[test_set_idx]
    tr_ratings = np.delete(ratings, test_set_idx, axis=0)

  4. Finally, the code creates a scipy sparse matrix in coordinate form (coo_matrix) that combines the user and item indexes with the ratings. The coo_matrix object acts as a wrapper for a sparse matrix. It also validates the user and item indexes, checking for errors in preprocessing.

    u_tr, i_tr, r_tr = zip(*tr_ratings)
    tr_sparse = coo_matrix((r_tr, (u_tr, i_tr)), shape=(n_users, n_items))
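
Putting these steps together, the preprocessing flow can be condensed into the following sketch. This is a simplified illustration rather than the repository's exact code: the function names, the hard-coded 10% test ratio, and the argument defaults are assumptions based on the snippets above.

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

def _map_to_contiguous(ids):
    # map sparse integer IDs onto the dense index range 0..n_unique-1
    unique = np.unique(ids)
    z = np.zeros(unique[-1] + 1, dtype=int)
    z[unique] = np.arange(unique.shape[0])
    return z[ids], unique.shape[0]

def load_ratings(input_file, delimiter='\t', test_ratio=0.1):
    # step 1: load the delimited ratings file (assumed headerless here)
    headers = ['user_id', 'item_id', 'rating', 'timestamp']
    df = pd.read_csv(input_file, sep=delimiter, names=headers, header=None,
                     dtype={'user_id': np.int32, 'item_id': np.int32,
                            'rating': np.float32, 'timestamp': np.int32})

    # step 2: reindex users and items to 0-based contiguous indexes
    users, n_users = _map_to_contiguous(df.user_id.values)
    items, n_items = _map_to_contiguous(df.item_id.values)
    ratings = df.rating.values

    # step 3: randomly hold out a fraction of the ratings as a test set
    test_idx = np.random.choice(len(df), size=int(len(df) * test_ratio),
                                replace=False)
    is_test = np.zeros(len(df), dtype=bool)
    is_test[test_idx] = True

    # step 4: build sparse train and test matrices in coordinate form
    tr_sparse = coo_matrix((ratings[~is_test],
                            (users[~is_test], items[~is_test])),
                           shape=(n_users, n_items))
    test_sparse = coo_matrix((ratings[is_test],
                              (users[is_test], items[is_test])),
                             shape=(n_users, n_items))
    return tr_sparse, test_sparse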

How the WALS algorithm is implemented in TensorFlow

After the data is preprocessed, the code passes the sparse training matrix into the TensorFlow WALS model to be factorized into row factor X and column factor Y.

The TensorFlow code that executes the model is actually simple, because it relies on the WALSModel class included in the contrib.factorization_ops module of TensorFlow.

  1. A SparseTensor object is initialized with user IDs and item IDs as indices, and with the ratings as values. From wals.py:

    input_tensor = tf.SparseTensor(indices=zip(data.row, data.col),
                                   values=(data.data).astype(np.float32),
                                   dense_shape=data.shape)

    The data variable is the coo_matrix object of training ratings created in the preprocessing step.

  2. The model is instantiated:

    model = factorization_ops.WALSModel(num_rows, num_cols, dim,
                                        unobserved_weight=unobs,
                                        regularization=reg,
                                        row_weights=row_wts,
                                        col_weights=col_wts)

  3. The row factor and column factor tensors are created automatically by the WALSModel class, and are retrieved so that they can be evaluated after the matrix is factored:

    # retrieve the row and column factors
    row_factor = model.row_factors[0]
    col_factor = model.col_factors[0]

  4. The training process executes the following loop within a TensorFlow session using the simple_train method in wals.py:

    row_update_op = model.update_row_factors(sp_input=input_tensor)[1]
    col_update_op = model.update_col_factors(sp_input=input_tensor)[1]

    sess.run(model.initialize_op)
    sess.run(model.worker_init)

    for _ in xrange(num_iterations):
        sess.run(model.row_update_prep_gramian_op)
        sess.run(model.initialize_row_update_op)
        sess.run(row_update_op)
        sess.run(model.col_update_prep_gramian_op)
        sess.run(model.initialize_col_update_op)
        sess.run(col_update_op)

  5. After num_iterations iterations have been executed, the row and column factor tensors are evaluated in the session to produce numpy arrays for each factor:

    # evaluate output factor matrices
    output_row = row_factor.eval(session=session)
    output_col = col_factor.eval(session=session)

These factor arrays are used to calculate the RMSE on the test set of ratings. The two arrays are also saved in the output directory in numpy format.
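
As an illustration, the RMSE calculation can be sketched as follows. This is a simplified stand-in for the RMSE code in wals.py, assuming output_row and output_col are the evaluated factor arrays and test_sparse is the coo_matrix of held-out ratings:

import numpy as np

def rmse(output_row, output_col, test_sparse):
    # predicted rating for each held-out (user, item) pair is the dot
    # product of that user's factor vector and that item's factor vector
    preds = np.sum(output_row[test_sparse.row] * output_col[test_sparse.col],
                   axis=1)
    err = preds - test_sparse.data
    return np.sqrt(np.mean(err ** 2))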

Training the model

In this context, training the model involves factoring a sparse matrix of ratings into a user factor matrix X and an item factor matrix Y. The saved user and item factors can be used as the base model for a recommendation system.

This system takes a user as input, retrieves the vector of user factors for that user from X, multiplies that vector by all the item factors Y, and returns the top N items according to predicted rating.
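
That retrieval step amounts to a single matrix-vector product. The following sketch assumes output_row and output_col are the saved factor arrays; user_idx and num_recommendations are hypothetical names used here for illustration:

import numpy as np

# the k-dimensional factor vector for one user
user_vector = output_row[user_idx]

# predicted rating for every item: (num_items, k) . (k,) -> (num_items,)
predicted_ratings = output_col.dot(user_vector)

# indexes of the top N items, highest predicted rating first; a real
# system would also filter out items the user has already rated
top_items = np.argsort(predicted_ratings)[::-1][:num_recommendations]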

Part 4 of this series provides more detail on a system that uses the trained model to perform predictions, and shows how to deploy such a system on GCP.

Train the model locally

Training the model locally is useful for development purposes. It lets you rapidly test code changes, and you can include breakpoints for easier debugging. To run the model in Cloud Shell or on your local system, run the mltrain.sh script from the code directory (wals_ml_engine) with the local option.

  • For the MovieLens 100k dataset, specify the path to the 100k data file:

    ./mltrain.sh local ../data/u.data

  • For the MovieLens 1m dataset, include the --delimiter option and specify the path to the 1m data file:

    ./mltrain.sh local ../data/ratings.dat --delimiter ::

  • For the MovieLens 20m dataset, use the --delimiter and --headers options:

    ./mltrain.sh local ../data/ratings.csv --headers --delimiter ,

The training job output shows the RMSE calculated on the test set. For the 1m dataset, and using the default hyperparameters specified in the source code, the output should look like the following:

INFO:tensorflow:Train Start: <timestamp>
...
INFO:tensorflow:Train Finish: <timestamp>
INFO:tensorflow:train RMSE = 1.06
INFO:tensorflow:test RMSE = 1.11

The RMSE measures the average error between the predicted ratings and the actual ratings in the test set. With an RMSE of 1.11, a rating produced by the algorithm differs from the actual user rating in the test set by about 1.11 on average for the 1m dataset. The WALS algorithm performs much better with tuned hyperparameters, as shown in Part 2 of this series.

Cleaning up

If you created a Compute Engine instance for running TensorFlow, you must stop it to avoid incurring charges to your GCP account. The instance is not used in Part 2 of the series; however, it is required for Parts 3 and 4.

Stopping the Compute Engine instance

  1. In the GCP Console, open the Compute Engine VM Instances page.
  2. Select the instance name.
  3. Click Stop and confirm the operation.

Delete the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
