Building a Recommendation System in TensorFlow: Overview

This article is an overview for a multi-part tutorial series that shows you how to implement a recommendation system with TensorFlow and AI Platform in Google Cloud Platform (GCP).

This overview does the following:

  • Outlines the theory for recommendation systems based on matrix factorization.
  • Describes the weighted alternating least squares (WALS) algorithm used to perform the matrix factorization.
  • Provides an overview for a set of tutorials that provide step-by-step guidance for implementing a recommendation system on GCP.

The recommendation system in the tutorial uses the weighted alternating least squares (WALS) algorithm. WALS is included in the contrib.factorization package of the TensorFlow code base, and is used to factorize a large matrix of user and item ratings.

Tutorials in this series

The tutorials that go with this overview include the following:

Background theory for recommendations

This article outlines the background theory for matrix factorization-based collaborative filtering as applied to recommendation systems. The following topics are covered in depth, with some links provided for further reading:

  • Collaborative filtering. What is collaborative filtering, and how can you use insights gained from it?
  • Matrix factorization. What is a ratings matrix, what are latent factors, and how can you mathematically transform the ratings matrix to account for them?
  • WALS method for matrix factorization. How do you perform matrix factorization using the WALS method?

Collaborative filtering for recommendation systems

The collaborative filtering technique is a powerful method for generating user recommendations. Collaborative filtering relies only on observed user behavior to make recommendations—no profile data or content access is necessary.

The technique is based on the following observations:

  • Users who interact with items in a similar manner (for example, buying the same products or viewing the same articles) share one or more hidden preferences.
  • Users with shared preferences are likely to respond in the same way to the same items.

Combining these basic observations allows a recommendation engine to function without needing to determine the precise nature of the shared user preferences. All that's required is that the preferences exist and are meaningful. The basic assumption is that similar user behavior reflects similar fundamental preferences, allowing a recommendation engine to make suggestions accordingly.

For example, suppose User 1 has viewed items A, B, C, D, E, and F. User 2 has viewed items A, B, D, E and F, but not C. Because both users viewed five of the same six items, it's likely that they share some basic preferences. User 1 liked item C, and it's probable that User 2 would also like item C if the user were aware of its existence. This is where the recommendation engine steps in: it informs User 2 about item C, piquing that user's interest.

Matrix factorization for collaborative filtering

The collaborative filtering problem can be solved using matrix factorization. Suppose you have a matrix consisting of user IDs and their interactions with your products. Each row corresponds to a unique user, and each column corresponds to an item. The item could be an product in a catalog, an article, or a video. Each entry in the matrix captures a user's rating or preference for a single item. The rating could be explicit, directly generated by user feedback, or it could be implicit, based on user purchases or time spent interacting with an article or video.

If a user has never rated an item or shown any implied interest in it, the matrix entry is zero. Figure 1 shows a representation of a MovieLens rating matrix.

The MovieLens ratings matrix
Figure 1. The MovieLens ratings matrix

Ratings in the MovieLens dataset range from 1 to 5. Empty rating entries have value 0, meaning that a given user hasn't rated the item.

Defining the matrix factorization method

A ratings matrix consists of a matrix R, where entries \(r_{ij}\) are ratings of user \(i\) for item \(j\). For many internet applications, these matrices are large, with millions of users and millions of different items. They are also sparse, meaning that each user has typically rated, viewed, or purchased only a small number of items relative to the entire set. The vast majority of matrix entries, often greater than 99%, are zero.

The matrix factorization method assumes that there is a set of attributes common to all items, with items differing in the degree to which they express these attributes. Furthermore, the matrix factorization method assumes that each user has their own expression for each of these attributes, independent of the items. In this way, a user's item rating can be approximated by summing the user's strength for each attribute weighted by the degree to which the item expresses this attribute. These attributes are sometimes called hidden or latent factors.

Intuitively, it's easy to see that these hypothetical latent factors actually exist. In the case of movies, it's clear that many users prefer certain genres, actors, or directors. These categories represent latent factors that, while obvious, are still quite useful. For example, genre preferences manifest themselves in the movies that users tend to like, and people with similar genre preferences presumably like similar movies. Much of the power of the matrix factorization approach to collaborative filtering derives from the fact that it's not necessary to know the number of genres or actors or other categories that might comprise the entirety of a given user's preferences. It's simply assumed there are an arbitrary number of them.

Transforming the matrix to represent latent factors

To translate the existence of latent factors into the matrix of ratings, you do this: for a set of users U of size u and items I of size i, you pick an arbitrary number k of latent factors and factorize the large matrix R into two much smaller matrices X (the "row factor") and Y (the "column factor"). Matrix X has dimension u × k, and Y has dimension k × i. This is shown in figure 2.

Approximating the ratings matrix with row and column factors
Figure 2. Approximating the ratings matrix with row and column factors

In linear algebra this is called a low-rank approximation. You can view this process as compressing the sparse information in R into the much lower dimensional spaces u × k and k × i. The matrix R', obtained when the low-rank matrices X and Y are multiplied, represents an approximation of R.

Every user is represented by a vector in this k-dimensional space, and each row \( \mathbf{x}_{u}\) in X corresponds to the strength of the user's preferences for these k factors. Similarly, every item represented by a vector in this k-dimensional space has a column \(\mathbf{y}_{i}\) in Y corresponding to the item's expression of the same k factors.

To calculate the rating of user u for item i, you take the dot product of the two vectors:

$$ r_{ui} = \mathbf{x}^{T}_{u} \cdot \mathbf{y}_{i} $$

This product is a real number \(r_{ui}\) and represents an approximation, or in ML terms a prediction, of user u's rating for item i. This is illustrated in figure 3.

Calculating a predicted rating from the row and
  column factors column factors
Figure 3. Calculating a predicted rating from the row and column factors column factors

You can define a loss function measuring the accuracy of the approximation in the following way:

$$ L = \sum_{u,i}(r_{ui} - \mathbf{x}^{T}_{u} \cdot \mathbf{y}_{i})^{2} $$

This equation does the following: over all users and items, sum the squared difference between the approximate rating \(\mathbf{x}^T_{u} \cdot \mathbf{y}_{i}\) and the actual rating from the training set \(r_{ui}\).

It's common practice to add regularization terms to this loss function to help prevent overfitting. Adding L2 regularization terms for both row and column factors gives the following:

$$ L = \sum_{u,i}(r_{ui} - \mathbf{x}^{T}_{u} \cdot \mathbf{y}_{i})^{2} + \lambda\sum_{u}\left\lVert\mathbf{x}_{u}\right\rVert^{2} + \lambda\sum_{i}\left\lVert\mathbf{y}_{i}\right\rVert^{2} $$

Here, \(\lambda \) is a regularization constant, one of the model’s hyperparameters, which we discuss in Part 2 of the tutorial series.

The WALS method of matrix factorization

This section discusses two methods of matrix factorization.

Alternating least squares method

The alternating least squares method of matrix factorization is an iterative method for determining the optimal factors X and Y that best approximate R. In each iteration, one of the row or column factors is held fixed and the other is computed by minimizing the loss function with respect to the other factor.

First, you begin with the row factors:

  1. Set the column factors to constant values.
  2. Take the derivative of the loss function with respect to the row factors.
  3. Set the equation equal to zero.
$$ \frac{\partial L}{\partial \mathbf{x}_{u}} = -2\sum_{i}(r_{ui} - \mathbf{x}^{T}_{u} \cdot \mathbf{y}_{i})\mathbf{y}^{T}_{i} + 2\lambda\mathbf{x}^{T}_{u} $$
$$ 0 = -(\mathbf{r}_{u} - \mathbf{x}^{T}_{u}Y^{T})Y + \lambda\mathbf{x}^{T}_{u} $$
$$ \mathbf{x}^{T}_{u}(Y^{T}Y + \lambda{I}) = \mathbf{r}_{u}Y $$
$$ \mathbf{x}^{T}_{u} = \mathbf{r}_{u}Y(Y^{T}Y + \lambda{I})^{-1} $$

The iteration proceeds by holding the solved-for row factors fixed and solving the analogous equation for the column factors. Alternating row and column factors, the iterative process is repeated until convergence, which typically occurs within a small (< 20) number of iterations even for very large matrices consisting of tens of millions of rows or columns.

Weighted alternating least squares (WALS) method

The weighted version of the algorithm introduces different weights for the zero, or unobserved, entries and the non-zero entries in the matrix:

$$ L^{w} = \mathbf{W} \cdot \sum_{u,i}(r_{ui} - \mathbf{x}^{T}_{u} \cdot \mathbf{y}_{i})^{2} $$

In this equation:

\(w_{ui} = \omega_{0}\) for zero (unobserved) entries in the ratings matrix, and
\(w_{ui} = \omega_{0} + f(c_{i})\) for observed entries, where
\(c_{i} = \sum_{u,i}1 \text{ if } r_{ui} > 0\) the sum of the number of nonzero entries for column i

The weight is scaled by the sum of the nonzero entries in a row to normalize the weight for users who have rated a different number of items.

This type of weighting allows for a more flexible model of the user's preferences and produces better empirical results than the unweighted version. (For more details, see the paper Collaborative Filtering for Implicit Feedback Datasets.) The function \(f\) varies according to the dataset and whether ratings are explicit or implicit.

The MovieLens dataset contains explicit ratings. In this case, a better choice for \(f\) is one that weighs the observed entries linearly:

\(f = \omega_{k}c_{i}\) where \(\omega_{k}\) is the observed weight.

For implicit ratings related to dynamic events, where each rating corresponds to the number of times a video has been watched, an article read, or a web page viewed, the rating itself may have an exponential distribution due to user behavior. There will be many low-value ratings as users click on a page or video and navigate away quickly. There will be fewer large implicit ratings as users read an entire article, watch an entire video, or watch a given scheduled show multiple times. In this case, an appropriate \(f\) is one that weights the observed ratings to account for this distribution:

\(f = \left(\frac{1}{c_{i}}\right)^{e}\) where \(e\) is the feature weight exponent.

WALS compared to other techniques

Many matrix factorization techniques are used for collaborative filtering, including SVD and Stochastic Gradient Descent. In some cases these techniques give better reduced-rank approximations than WALS. It's beyond the scope of this article to compare the strengths and weaknesses of different collaborative filtering methods, but it's worth noting the following advantages of WALS:

  • The weights used in WALS make it suitable for implicit ratings, but WALS can also be applied to explicit rating datasets such as MovieLens.
  • WALS includes algorithmic optimizations that make it easy to incorporate weights and efficiently calculate row and column factor updates. For details, see the paper Collaborative Filtering for Implicit Feedback Datasets. These optimizations are built into the WALS TensorFlow code base.
  • It's relatively straightforward to create a distributed version of the algorithm. Distributed versions can process very large matrices with potentially hundreds of millions of rows. Discussion of the distributed version of WALS is beyond the scope of this article.

What's next

Start working through the tutorials in this series: