Recommendations in TensorFlow: Create the Model

This article is the first part of a multi-part tutorial series that shows you how to implement a machine-learning (ML) recommendation system with TensorFlow and AI Platform in Google Cloud Platform (GCP). This part shows you how to install the TensorFlow model code on a development system and run the model on the MovieLens dataset.

The recommendation system in the tutorial uses the weighted alternating least squares (WALS) algorithm. WALS is included in the contrib.factorization package of the TensorFlow code base, and is used to factorize a large matrix of user and item ratings. For more background on WALS, see the brief sketch that follows.
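
As a brief sketch of the idea (a standard formulation of weighted matrix factorization, not code from this sample): WALS approximates the sparse ratings matrix R as the product of a row-factor matrix X and a column-factor matrix Y, alternating between solving for X with Y held fixed and solving for Y with X held fixed. Each alternation is a closed-form least-squares solve of a weighted, regularized loss:

    \min_{X,Y} \sum_{(i,j)\,\mathrm{observed}} w_{ij} \left( r_{ij} - x_i^\top y_j \right)^2
        + w_0 \sum_{(i,j)\,\mathrm{unobserved}} \left( x_i^\top y_j \right)^2
        + \lambda \left( \lVert X \rVert_F^2 + \lVert Y \rVert_F^2 \right)

Here w_0 and λ correspond to the unobserved_weight and regularization parameters that are passed to the model later in this article.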

This article describes the code of the model in detail, including data preprocessing and executing the WALS algorithm in TensorFlow.

Objectives

  • Understand the structure of TensorFlow code used for applying WALS to matrix factorization.
  • Run the sample TensorFlow code locally to perform recommendations on the MovieLens dataset.

Costs

This tutorial uses Cloud Storage and AI Platform, and both are billable services. You can use the pricing calculator to estimate the costs for your projected usage. The projected cost for this tutorial is $0.15. If you are a new GCP user, you might be eligible for a free trial.

Before you begin

  1. Select or create a GCP project.

  2. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  3. Enable the Compute Engine and AI Platform APIs.

Running the TensorFlow Model

You can run the steps in this section on a Compute Engine instance with at least 7 GB of memory, as explained in the procedure that follows. Alternatively, you can run them on a local macOS or Linux system, in which case you don't need to create a Compute Engine instance.

Creating the Compute Engine instance

  1. In the Google Cloud Platform Console, go to the VM Instances page.

  2. Click Create instance.

  3. Name your instance whatever you like, and pick a zone. If you do not already have a preferred zone, pick one that is geographically close to you.

  4. In the Machine type drop-down list, select n1-standard-2 (2 vCPUs).

  5. In the Access scopes section, select Allow full access to all Cloud APIs.

  6. Click Create.

Installing the code

  1. Go to the VM instances page.

  2. In the row for your instance, click SSH to open a browser-based terminal that is securely connected to the instance.

  3. In the new terminal window, update your instance's software repositories.

    sudo apt-get update
    
  4. Install git, bzip2, and unzip.

    sudo apt-get install -y git bzip2 unzip
    
  5. Clone the sample code repository:

    git clone https://github.com/GoogleCloudPlatform/tensorflow-recommendation-wals
  6. Install miniconda. The sample code requires Python 2.7.

    wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
    bash Miniconda2-latest-Linux-x86_64.sh
  7. Install the Python packages and TensorFlow. This tutorial will run on any 1.x version of TensorFlow.

    cd tensorflow-recommendation-wals
    conda create -n tfrec
    conda install -n tfrec --file conda.txt
    source activate tfrec
    pip install -r requirements.txt
    pip install tensorflow
  8. Download the MovieLens dataset. There are several versions of the dataset.

    • For development purposes, we recommend using the 100k version, which contains 100,000 ratings from 943 users on 1682 items. Download it using the following commands:

      curl -O 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
      unzip ml-100k.zip
      mkdir -p data
      cp ml-100k/u.data data/
    • For training purposes, we recommend the 1m dataset, which contains one million ratings. The format of the 1m set differs slightly from the 100k set; for example, its ratings file is delimited with :: characters. The sample code's --delimiter argument lets you specify that the dataset uses this delimiter.

      curl -O 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
      unzip ml-1m.zip
      mkdir -p data
      cp ml-1m/ratings.dat data/
    • You can also use the 20m dataset with this project. The dataset is supplied in a CSV file. If you use this dataset, you must pass the --headers flag, because the file contains a header line.

      curl -O 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
      unzip ml-20m.zip
      mkdir -p data
      cp ml-20m/ratings.csv data/
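
Whichever version you download, it can help to inspect the raw file before training to confirm the delimiter and columns. The following optional snippet (not part of the sample code) assumes the 100k file copied above; for that dataset it should report 100,000 ratings from 943 users on 1682 items:

    import pandas as pd

    # Optional sanity check: u.data is tab-delimited with four columns.
    df = pd.read_csv('data/u.data', sep='\t',
                     names=['user_id', 'item_id', 'rating', 'timestamp'])
    print(df.head())
    print('%d ratings from %d users on %d items' %
          (len(df), df.user_id.nunique(), df.item_id.nunique()))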

Understand the model code

The model code is contained in the wals_ml_engine directory. The code's high-level functionality is implemented by the following files:

mltrain.sh
  • Launches various types of AI Platform jobs. This shell script accepts arguments for the location of the dataset file, the delimiter used to separate values in the file, and whether the data file has a header line. It's a best practice to create a script that automatically configures and executes AI Platform jobs.

task.py
  • Parses the arguments for the AI Platform job and executes training.

model.py
  • Loads the dataset.
  • Creates two sparse matrices from the data, one for training and one for testing.
  • Executes WALS on the training sparse matrix of ratings.

wals.py
  • Creates the WALS model.
  • Executes the WALS algorithm.
  • Calculates the root-mean-square error (RMSE) for a set of row/column factors and a ratings matrix.

How the model preprocesses data

The model code performs data preprocessing to create a sparse ratings matrix and prepare it for matrix factorization. This involves the following steps:

  1. The model code loads the data from a delimited text file. Each row contains a single rating.

    ratings_df = pd.read_csv(input_file,
                             sep=args['delimiter'],
                             names=headers,
                             header=header_row,
                             dtype={
                               'user_id': np.int32,
                               'item_id': np.int32,
                               'rating': np.float32,
                               'timestamp': np.int32,
                             })
  2. The code establishes a 0-indexed set of unique IDs for users and items. This guarantees that each unique user ID corresponds to a specific row index of the sparse ratings matrix, and each unique item ID corresponds to a specific column index.

    • The MovieLens 100k data uses 1-based IDs where the lowest index of the unique set is 1. To normalize, the code subtracts one from each index. From model.py:

      ratings = ratings_df.as_matrix(['user_id', 'item_id', 'rating'])
      # deal with 1-based user indices
      ratings[:,0] -= 1
      ratings[:,1] -= 1
    • The 1m and 20m MovieLens datasets skip some user and item IDs. This creates a problem: you have to map the set of unique user IDs to a contiguous index set [0 ... num_users-1] and do the same for item IDs. The item mapping is accomplished using the following numpy code. The code creates an array of size max_item_id+1 to perform the mapping, so if the maximum item ID is very large, this method can consume excessive memory. (A toy walkthrough of this mapping appears after this numbered list.)

      np_items = ratings_df.item_id.as_matrix()
      unique_items = np.unique(np_items)
      n_items = unique_items.shape[0]
      max_item = unique_items[-1]
      
      # map unique items down to an array 0..n_items-1
      z = np.zeros(max_item+1, dtype=int)
      z[unique_items] = np.arange(n_items)
      i_r = z[np_items]
    • The code for mapping users is essentially the same as the code for items.

  3. The model code randomly selects a test set of ratings. By default, 10% of the ratings are chosen for the test set. These ratings are removed from the training set and will be used to evaluate the predictive accuracy of the user and item factors.

    test_set_size = len(ratings) / TEST_SET_RATIO
    test_set_idx = np.random.choice(xrange(len(ratings)),
                                    size=test_set_size, replace=False)
    test_set_idx = sorted(test_set_idx)
    
    ts_ratings = ratings[test_set_idx]
    tr_ratings = np.delete(ratings, test_set_idx, axis=0)
  4. Finally, the code creates a scipy sparse matrix in coordinate form (coo_matrix) that contains the user indexes, item indexes, and ratings. The coo_matrix object acts as a wrapper for a sparse matrix, and it also validates the user and item indexes, which helps catch preprocessing errors.

    u_tr, i_tr, r_tr = zip(*tr_ratings)
    tr_sparse = coo_matrix((r_tr, (u_tr, i_tr)), shape=(n_users, n_items))
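
To make the ID mapping in step 2 concrete, here is a minimal, self-contained walkthrough with made-up IDs (illustrative only, not the sample code):

    import numpy as np

    # Item IDs 1, 5, and 9 are mapped to contiguous indices 0, 1, and 2.
    np_items = np.array([1, 5, 9, 5, 1])
    unique_items = np.unique(np_items)      # array([1, 5, 9])
    n_items = unique_items.shape[0]         # 3
    z = np.zeros(unique_items[-1] + 1, dtype=int)
    z[unique_items] = np.arange(n_items)    # z[1] = 0, z[5] = 1, z[9] = 2
    i_r = z[np_items]
    print(i_r)                              # [0 1 2 1 0]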

How the WALS algorithm is implemented in TensorFlow

After the data is preprocessed, the code passes the sparse training matrix into the TensorFlow WALS model to be factorized into row factor X and column factor Y.

The TensorFlow code that executes the model is relatively simple, because it relies on the WALSModel class included in the contrib.factorization_ops module of TensorFlow.

  1. A SparseTensor object is initialized with the user IDs and item IDs as indices, and with the ratings as values. From wals.py:

    input_tensor = tf.SparseTensor(indices=zip(data.row, data.col),
                                   values=(data.data).astype(np.float32),
                                   dense_shape=data.shape)

    The data variable is the coo_matrix object of training ratings created in the preprocessing step.

  2. The model is instantiated:

    model = factorization_ops.WALSModel(num_rows, num_cols, dim,
                                        unobserved_weight=unobs,
                                        regularization=reg,
                                        row_weights=row_wts,
                                        col_weights=col_wts)
  3. The row factor and column factor tensors are created automatically by the WALSModel class, and they are retrieved so that they can be evaluated after the matrix is factored:

    # retrieve the row and column factors
    row_factor = model.row_factors[0]
    col_factor = model.col_factors[0]
  4. The training process executes the following loop within a TensorFlow session using the simple_train method in wals.py:

    row_update_op = model.update_row_factors(sp_input=input_tensor)[1]
    col_update_op = model.update_col_factors(sp_input=input_tensor)[1]
    
    sess.run(model.initialize_op)
    sess.run(model.worker_init)
    for _ in xrange(num_iterations):
        sess.run(model.row_update_prep_gramian_op)
        sess.run(model.initialize_row_update_op)
        sess.run(row_update_op)
        sess.run(model.col_update_prep_gramian_op)
        sess.run(model.initialize_col_update_op)
        sess.run(col_update_op)
  5. After num_iterations iterations have been executed, the row and column factor tensors are evaluated in the session to produce numpy arrays for each factor:

    # evaluate output factor matrices
    output_row = row_factor.eval(session=session)
    output_col = col_factor.eval(session=session)

These factor arrays are used to calculate the RMSE on the test set of ratings. The two arrays are also saved in the output directory in numpy format.
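
As a sketch of how the factor arrays yield the RMSE (a hypothetical helper, not the sample code's exact function): the predicted rating for a user/item pair is the dot product of the corresponding rows of the two arrays, and the test RMSE aggregates the prediction error over the held-out ratings.

    import numpy as np

    def test_rmse(output_row, output_col, test_ratings):
        # output_row: (num_users, dim); output_col: (num_items, dim).
        # test_ratings rows are (user_index, item_index, rating) triples,
        # as produced by the preprocessing step.
        users = test_ratings[:, 0].astype(int)
        items = test_ratings[:, 1].astype(int)
        # Predicted rating for each pair: dot product of its factor rows.
        preds = np.sum(output_row[users] * output_col[items], axis=1)
        return np.sqrt(np.mean((preds - test_ratings[:, 2]) ** 2))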

Training the model

In this context, training the model involves factoring a sparse matrix of ratings into a user factor matrix X and an item factor matrix Y. The saved user and item factors can be used as the base model for a recommendation system.

This system takes a user as input, retrieves the vector of user factors for that user from X, multiplies that vector by all the item factors in Y, and returns the top N items according to predicted rating.
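
A minimal sketch of that lookup, assuming the saved numpy factor arrays (the function name is illustrative):

    import numpy as np

    def top_n_items(row_factors, col_factors, user_idx, n=10):
        # Score every item for this user with one matrix-vector product,
        # then return the indices of the n highest predicted ratings.
        scores = col_factors.dot(row_factors[user_idx])
        return np.argsort(-scores)[:n]

In practice, a recommendation system would also filter out items that the user has already rated before returning the top N.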

Part 4 of this series provides more detail on a system that uses the trained model to perform predictions, and shows how to deploy such a system on GCP.

Train the model locally

Training the model locally is useful for development purposes. It allows you to rapidly test code changes and to include breakpoints for easy debugging. To run the model on Cloud Shell or from your local system, run the mltrain.sh script from the wals_ml_engine directory using the local option.

cd wals_ml_engine
  • For the MovieLens 100k dataset, specify the path to the 100k data file:

    ./mltrain.sh local ../data u.data
  • For the MovieLens 1m dataset, include the --delimiter option and specify the path to the 1m data file:

    ./mltrain.sh local ../data ratings.dat --delimiter ::
  • For the MovieLens 20m dataset, use the --delimiter and --headers options:

    ./mltrain.sh local ../data ratings.csv --headers --delimiter ,

The training job output shows the RMSE calculated on the test set. For the 1m dataset, and using the default hyperparameters specified in the source code, the output should look like the following:

INFO:tensorflow:Train Start: <timestamp>
...
INFO:tensorflow:Train Finish: <timestamp>
INFO:tensorflow:train RMSE = 1.29
INFO:tensorflow:test RMSE = 1.34

The RMSE corresponds to the average deviation of the predicted ratings from the actual ratings in the test set. For example, the test RMSE of 1.34 means that, on average, a predicted rating differs from the actual user rating in the 1m test set by about 1.34. The WALS algorithm performs much better with tuned hyperparameters, as shown in Part 2 of this series.
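
Formally, for a held-out test set T, where each actual rating is r_ui and each prediction is the dot product of the corresponding user and item factors:

    \mathrm{RMSE} = \sqrt{ \frac{1}{\lvert T \rvert} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^2 }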

Cleaning up

If you created a Compute Engine instance to run TensorFlow, you must stop it to avoid incurring charges to your GCP account. The instance is not used in Part 2 of the series, but it is required for Part 3 and Part 4.

Stopping the Compute Engine instance

  1. In the GCP Console, open the Compute Engine VM Instance list page.
  2. Select the instance name.
  3. Click Stop and confirm the operation.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Manage resources page.

  2. In the project list, select the project you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
