Training NCF on Cloud TPU (TF 2.x)


Overview

This is an implementation of the Neural Collaborative Filtering (NCF) framework using a Neural Matrix Factorization (NeuMF) model as described in the Neural Collaborative Filtering paper. The current implementation is based on the authors' NCF code and the Stanford implementation in the MLPerf repository.

NCF is a general framework for collaborative filtering of recommendations in which a neural network architecture is used to model user-item interactions. Unlike traditional models, NCF does not resort to Matrix Factorization (MF) with an inner product on latent features of users and items. It replaces the inner product with a multi-layer perceptron that can learn an arbitrary function from data.

Two implementations of NCF are Generalized Matrix Factorization (GMF) and Multi-Layer Perceptron (MLP). GMF applies a linear kernel to model the latent feature interactions, and MLP uses a nonlinear kernel to learn the interaction function from data. NeuMF fuses GMF and MLP to better model complex user-item interactions, combining the linearity of MF with the non-linearity of MLP when modeling the user-item latent structures. NeuMF allows GMF and MLP to learn separate embeddings, and combines the two models by concatenating their last hidden layers. neumf_model.py defines the architecture details.
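The fusion described above can be illustrated with a small NumPy forward pass. This is a minimal sketch under stated assumptions, not the code in neumf_model.py: the table sizes, layer widths, ReLU activations, sigmoid output, and every name below (neumf_score, gmf_user, mlp_weights, and so on) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration; these are not the MovieLens sizes.
NUM_USERS, NUM_ITEMS = 100, 50
MF_DIM = 8                  # width of the GMF embeddings (assumed)
MLP_LAYERS = [32, 16, 8]    # first entry is the width of the concatenated MLP input


def init(shape):
    """Small random weights, standing in for trained parameters."""
    return rng.normal(scale=0.1, size=shape)


# GMF and MLP keep separate embedding tables, as the text describes.
gmf_user, gmf_item = init((NUM_USERS, MF_DIM)), init((NUM_ITEMS, MF_DIM))
mlp_user = init((NUM_USERS, MLP_LAYERS[0] // 2))
mlp_item = init((NUM_ITEMS, MLP_LAYERS[0] // 2))
mlp_weights = [init((MLP_LAYERS[i], MLP_LAYERS[i + 1]))
               for i in range(len(MLP_LAYERS) - 1)]
mlp_biases = [np.zeros(n) for n in MLP_LAYERS[1:]]
out_w = init((MF_DIM + MLP_LAYERS[-1], 1))


def neumf_score(users, items):
    """Return an interaction score in (0, 1) for each (user, item) pair."""
    # GMF branch: element-wise product of the latent vectors (linear kernel).
    gmf = gmf_user[users] * gmf_item[items]
    # MLP branch: concatenate the embeddings, then apply nonlinear layers.
    hidden = np.concatenate([mlp_user[users], mlp_item[items]], axis=1)
    for w, b in zip(mlp_weights, mlp_biases):
        hidden = np.maximum(hidden @ w + b, 0.0)  # ReLU (assumed activation)
    # Fusion: concatenate the last hidden layer of each branch.
    fused = np.concatenate([gmf, hidden], axis=1)
    logits = fused @ out_w
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))


scores = neumf_score(np.array([0, 1, 2]), np.array([10, 20, 30]))
print(scores)
```

With untrained random weights the scores sit close to 0.5; the point of the sketch is only the data flow through the two branches and the concatenation step.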

The instructions below assume you are already familiar with training a model on Cloud TPU. If you are new to Cloud TPU, refer to the Quickstart for a basic introduction.

Dataset

The MovieLens datasets are used for model training and evaluation. Specifically, we use two datasets: ml-1m (short for MovieLens 1 million) and ml-20m (short for MovieLens 20 million).

ml-1m

The ml-1m dataset contains 1,000,209 anonymous ratings of approximately 3,706 movies made by 6,040 users who joined MovieLens in 2000. All ratings are contained in the file "ratings.dat" without a header row, and are in the following format:

UserID::MovieID::Rating::Timestamp

  • UserIDs range between 1 and 6040.
  • MovieIDs range between 1 and 3952.
  • Ratings are made on a 5-star scale (whole-star ratings only).
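As a rough illustration of the ratings.dat format above, a single line could be parsed and range-checked as follows. The helper name parse_ml1m_line, the Rating tuple, and the sample line are illustrative assumptions and are not part of the tutorial's preprocessing scripts.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_ml1m_line(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line from ratings.dat."""
    user_id, movie_id, rating, timestamp = (int(f) for f in line.strip().split("::"))
    # Ranges taken from the dataset description above.
    if not 1 <= user_id <= 6040:
        raise ValueError(f"UserID out of range: {user_id}")
    if not 1 <= movie_id <= 3952:
        raise ValueError(f"MovieID out of range: {movie_id}")
    if not 1 <= rating <= 5:
        raise ValueError(f"Rating is not a whole star between 1 and 5: {rating}")
    return Rating(user_id, movie_id, rating, timestamp)


# Illustrative input line in the documented format.
parsed = parse_ml1m_line("1::1193::5::978300760\n")
print(parsed)
```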

ml-20m

The ml-20m dataset contains 20,000,263 ratings of 26,744 movies by 138,493 users. All ratings are contained in the file "ratings.csv". Each line of this file after the header row represents a single user's rating of a movie, and has the following format:

userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars). In both datasets, the timestamp is represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970. Each user has at least 20 ratings.
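The ratings.csv layout and the timestamp convention can be sketched with the standard library. This is an illustrative reader only: the function name read_ml20m, the output field names, and the sample row are assumptions, and the tutorial's own preprocessing lives in the scripts used later.

```python
import csv
import io
from datetime import datetime, timezone

# Illustrative sample with the documented header row and one rating.
SAMPLE = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
"""


def read_ml20m(fileobj):
    """Yield one dict per rating, with the timestamp converted to UTC."""
    for row in csv.DictReader(fileobj):
        rating = float(row["rating"])
        # Ratings run from 0.5 to 5.0 in half-star increments.
        if not 0.5 <= rating <= 5.0 or (rating * 2) != int(rating * 2):
            raise ValueError(f"Unexpected rating value: {rating}")
        yield {
            "user_id": int(row["userId"]),
            "movie_id": int(row["movieId"]),
            "rating": rating,
            # Seconds since midnight UTC of January 1, 1970.
            "rated_at": datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        }


rows = list(read_ml20m(io.StringIO(SAMPLE)))
print(rows[0]["rated_at"].isoformat())
```

To read the real file, pass an open file handle for ratings.csv in place of the io.StringIO sample.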

Objectives

  • Create a Cloud Storage bucket to hold your dataset and model output
  • Prepare the MovieLens dataset
  • Set up a Compute Engine VM and Cloud TPU node for training and evaluation
  • Run training and evaluation

Costs

In this document, you use the following billable components of Google Cloud:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Before starting this tutorial, check that your Google Cloud project is correctly set up.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.

Set up your resources

This section provides information on setting up Cloud Storage, VM, and Cloud TPU resources for this tutorial.

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create an environment variable for your project's ID.

    export PROJECT_ID=project-id
    
  3. Configure Google Cloud CLI to use the project where you want to create the Cloud TPU.

    gcloud config set project ${PROJECT_ID}
    

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make API calls with your credentials.

  4. Create a Service Account for the Cloud TPU project.

    gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
    

    The command returns a Cloud TPU Service Account with the following format:

    service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
    
  5. Create a Cloud Storage bucket using the following command:

    gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 gs://bucket-name
    

    This Cloud Storage bucket stores the data you use to train your model and the training results. The gcloud command used in this tutorial to set up the TPU also sets up default permissions for the Cloud TPU Service Account you set up in the previous step. If you want finer-grained permissions, review the access level permissions.

    The bucket location must be in the same region as your virtual machine (VM) and your TPU node. VMs and TPU nodes are located in specific zones, which are subdivisions within a region.

  6. Launch a Compute Engine VM and Cloud TPU using the gcloud command. The command you use depends on whether you are using TPU VMs or TPU nodes. For more information on the two VM architectures, see System Architecture.

    TPU VM

    $ gcloud compute tpus tpu-vm create ncf-tutorial \
    --zone=europe-west4-a \
    --accelerator-type=v3-8 \
    --version=tpu-vm-tf-2.16.1-pjrt
    

    Command flag descriptions

    zone
    The zone where you plan to create your Cloud TPU.
    accelerator-type
    The accelerator type specifies the version and size of the Cloud TPU you want to create. For more information about supported accelerator types for each TPU version, see TPU versions.
    version
    The Cloud TPU software version.

    TPU Node

    $ gcloud compute tpus execution-groups create  \
     --zone=europe-west4-a \
     --name=ncf-tutorial \
     --accelerator-type=v3-8 \
     --machine-type=n1-standard-8 \
     --disk-size=300 \
     --tf-version=2.12.0
    

    Command flag descriptions

    zone
    The zone where you plan to create your Cloud TPU.
    name
    The TPU name. If not specified, defaults to your username.
    accelerator-type
    The type of the Cloud TPU to create.
    machine-type
    The machine type of the Compute Engine VM to create.
    disk-size
    The root volume size of your Compute Engine VM (in GB).
    tf-version
    The version of TensorFlow gcloud installs on the VM.

    For more information on the gcloud command, see the gcloud Reference.

  7. If you are not automatically logged in to the Compute Engine instance, log in by running the following ssh command. When you are logged into the VM, your shell prompt changes from username@projectname to username@vm-name:

    TPU VM

    gcloud compute tpus tpu-vm ssh ncf-tutorial --zone=europe-west4-a
    

    TPU Node

    gcloud compute ssh ncf-tutorial --zone=europe-west4-a
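The steps above require the bucket to be in the same region as the VM and TPU, while the gcloud commands take a zone. As a small sketch, assuming zone names follow the region-letter pattern seen in this tutorial (europe-west4-a), the region can be derived from the zone and compared with the bucket location. The helper names region_of and same_region are hypothetical.

```python
def region_of(zone: str) -> str:
    """Return the region part of a zone name such as 'europe-west4-a'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"Not a zone name of the form region-letter: {zone!r}")
    return region


def same_region(bucket_location: str, zone: str) -> bool:
    """Check that a bucket location matches the region of a zone."""
    return bucket_location.lower() == region_of(zone).lower()


# Values used in the commands above.
print(region_of("europe-west4-a"))
print(same_region("europe-west4", "europe-west4-a"))
```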
    

Prepare the data

  1. Add an environment variable for your storage bucket. Replace bucket-name with your bucket name.

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    
  2. Add an environment variable for the data directory.

    (vm)$ export DATA_DIR=${STORAGE_BUCKET}/ncf_data
    
  3. Set up the model location and set the PYTHONPATH environment variable.

    TPU VM

    (vm)$ git clone https://github.com/tensorflow/models.git
    (vm)$ pip3 install -r models/official/requirements.txt
    
    (vm)$ export PYTHONPATH="${PWD}/models:${PYTHONPATH}"
    

    TPU Node

    (vm)$ export PYTHONPATH="${PYTHONPATH}:/usr/share/models"
    (vm)$ pip3 install -r /usr/share/models/official/requirements.txt
    
  4. Change to the directory that stores the model processing files:

    TPU VM

    (vm)$ cd ~/models/official/recommendation
    

    TPU Node

    (vm)$ cd /usr/share/models/official/recommendation
    
  5. Generate training and evaluation data for the ml-20m dataset in DATA_DIR:

    (vm)$ python3 create_ncf_data.py \
        --dataset ml-20m \
        --num_train_epochs 4 \
        --meta_data_file_path ${DATA_DIR}/metadata \
        --eval_prebatch_size 160000 \
        --data_dir ${DATA_DIR}
    

This script generates and preprocesses the dataset on your VM. Preprocessing converts the data into the TFRecord format required by the model. The download and preprocessing take approximately 25 minutes and generate output similar to the following:

I0804 23:03:02.370002 139664166737728 movielens.py:124] Successfully downloaded /tmp/tmpicajrlfc/ml-20m.zip 198702078 bytes
I0804 23:04:42.665195 139664166737728 data_preprocessing.py:223] Beginning data preprocessing.
I0804 23:04:59.084554 139664166737728 data_preprocessing.py:84] Generating user_map and item_map...
I0804 23:05:20.934210 139664166737728 data_preprocessing.py:103] Sorting by user, timestamp...
I0804 23:06:39.859857 139664166737728 data_preprocessing.py:194] Writing raw data cache.
I0804 23:06:42.375952 139664166737728 data_preprocessing.py:262] Data preprocessing complete. Time: 119.7 sec.
<BisectionDataConstructor(Thread-1, initial daemon)>
General:
  Num users: 138493
  Num items: 26744

Training:
  Positive count:          19861770
  Batch size:              99000
  Batch count per epoch:   1004

Eval:
  Positive count:          138493
  Batch size:              160000
  Batch count per epoch:   866

I0804 23:07:14.137242 139664166737728 data_pipeline.py:887] Negative total vector built. Time: 31.8 seconds
I0804 23:11:25.013135 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 250.9 seconds
I0804 23:15:46.391308 139664166737728 data_pipeline.py:674] Eval construction complete. Time: 261.4 seconds
I0804 23:19:54.345858 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 248.0 seconds
I0804 23:24:09.182484 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 254.8 seconds
I0804 23:28:26.224653 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 257.0 seconds
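The counts in this output can be reproduced with a little arithmetic. The sketch below rests on three assumptions that the log itself does not state: one positive per user is held out for evaluation, each training positive is paired with 4 sampled negatives, and each evaluation positive is ranked against 999 negatives. Under those assumptions the numbers match the log.

```python
import math

# Figures taken from the dataset description and the log above.
total_ratings = 20_000_263
num_users = 138_493
train_batch_size = 99_000
eval_batch_size = 160_000

# Assumptions (not stated in the log): sampling ratios and leave-one-out split.
train_negatives_per_positive = 4
eval_negatives_per_positive = 999

# One held-out positive per user leaves the rest for training.
train_positives = total_ratings - num_users
eval_positives = num_users

train_examples = train_positives * (1 + train_negatives_per_positive)
eval_examples = eval_positives * (1 + eval_negatives_per_positive)

train_batches = math.ceil(train_examples / train_batch_size)
eval_batches = math.ceil(eval_examples / eval_batch_size)

print(train_positives, train_batches, eval_batches)
```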

Set up and start training the Cloud TPU

  1. Set the Cloud TPU name variable.

    TPU VM

    (vm)$ export TPU_NAME=local
    

    TPU Node

    (vm)$ export TPU_NAME=ncf-tutorial
    

Run the training and evaluation

The following steps run a sample training for 3 epochs.

  1. Add an environment variable for the Model directory to save checkpoints and TensorBoard summaries:

    (vm)$ export MODEL_DIR=${STORAGE_BUCKET}/ncf
    
  2. When creating your TPU, if you set the --version parameter to a version ending with -pjrt, set the following environment variables to enable the PJRT runtime:

      (vm)$ export NEXT_PLUGGABLE_DEVICE_USE_C_API=true
      (vm)$ export TF_PLUGGABLE_DEVICE_LIBRARY_PATH=/lib/libtpu.so
    
  3. Run the following command to train the NCF model:

    (vm)$ python3 ncf_keras_main.py \
         --model_dir=${MODEL_DIR} \
         --data_dir=${DATA_DIR} \
         --train_dataset_path=${DATA_DIR}/training_cycle_*/* \
         --eval_dataset_path=${DATA_DIR}/eval_data/* \
         --input_meta_data_path=${DATA_DIR}/metadata \
         --train_epochs=3 \
         --dataset=ml-20m \
         --eval_batch_size=160000 \
         --learning_rate=0.00382059 \
         --beta1=0.783529 \
         --beta2=0.909003 \
         --epsilon=1.45439e-07 \
         --num_factors=64 \
         --hr_threshold=0.635 \
         --keras_use_ctl=true \
         --layers=256,256,128,64 \
         --use_synthetic_data=false \
         --distribution_strategy=tpu \
         --download_if_missing=false
     

Training and evaluation take about 2 minutes and generate final output similar to:

Result is {'loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.10950611>,
'train_finish_time': 1618016422.1377568, 'avg_exp_per_second': 3062557.5070816963}
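The --hr_threshold flag in the training command appears to refer to hit rate (HR), a ranking metric commonly reported for NCF. The exact definition used by ncf_keras_main.py is not given in this tutorial, so the following is only a sketch under the assumption that a "hit" means the held-out positive item ranks within the top K candidates for its user. The function name hit_rate_at_k and the toy scores are hypothetical.

```python
def hit_rate_at_k(positive_scores, negative_scores, k=10):
    """Fraction of users whose held-out positive ranks in the top k.

    positive_scores: one model score per user, for the held-out item.
    negative_scores: one list of scores per user, for sampled negative items.
    """
    if len(positive_scores) != len(negative_scores):
        raise ValueError("Expected one list of negatives per user")
    hits = 0
    for pos, negs in zip(positive_scores, negative_scores):
        # Zero-based rank: number of negatives that outscore the positive.
        rank = sum(1 for n in negs if n > pos)
        if rank < k:
            hits += 1
    return hits / len(positive_scores)


# Toy example: user 0's positive outranks all negatives, user 1's does not.
toy_hr = hit_rate_at_k([0.9, 0.1], [[0.5, 0.4], [0.5, 0.4, 0.3]], k=2)
print(toy_hr)
```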

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  2. Delete your Cloud TPU and Compute Engine resources. The command you use to delete your resources depends upon whether you are using TPU VMs or TPU Nodes. For more information, see System Architecture.

    TPU VM

    $ gcloud compute tpus tpu-vm delete ncf-tutorial \
    --zone=europe-west4-a
    

    TPU Node

    $ gcloud compute tpus execution-groups delete ncf-tutorial \
    --zone=europe-west4-a
    
  3. Verify the resources have been deleted by running the list command that matches your architecture. The deletion might take several minutes. A response like the one below indicates your instances have been successfully deleted.

    TPU VM

    $ gcloud compute tpus tpu-vm list \
    --zone=europe-west4-a
    

    TPU Node

    $ gcloud compute tpus execution-groups list --zone=europe-west4-a
    
    Listed 0 items.
    
  4. Run gsutil as shown, replacing bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://bucket-name
    

What's next

The TensorFlow Cloud TPU tutorials generally train the model using a sample dataset. The results of this training are not usable for inference. To use a model for inference, you can train the model on a publicly available dataset or your own dataset. TensorFlow models trained on Cloud TPUs generally require datasets to be in TFRecord format.

You can use the dataset conversion tool sample to convert an image classification dataset into TFRecord format. If you are not using an image classification model, you will have to convert your dataset to TFRecord format yourself. For more information, see TFRecord and tf.Example.

Hyperparameter tuning

To improve the model's performance with your dataset, you can tune the model's hyperparameters. You can find information about hyperparameters common to all TPU supported models on GitHub. Information about model-specific hyperparameters can be found in the source code for each model. For more information on hyperparameter tuning, see Overview of hyperparameter tuning and Tune hyperparameters.

Inference

Once you have trained your model, you can use it for inference (also called prediction). You can use the Cloud TPU inference converter tool to prepare and optimize a TensorFlow model for inference on Cloud TPU v5e. For more information about inference on Cloud TPU v5e, see Cloud TPU v5e inference introduction.