Training NCF on Cloud TPU (TF 2.x)


This is an implementation of the Neural Collaborative Filtering (NCF) framework with Neural Matrix Factorization (NeuMF) model as described in the Neural Collaborative Filtering paper. The current implementation is based on the code from the authors' NCF code and the Stanford implementation in the MLPerf Repo.

NCF is a general framework for collaborative filtering of recommendations in which a neural network architecture is used to model user-item interactions. Unlike traditional models, NCF does not resort to Matrix Factorization (MF) with an inner product on latent features of users and items. It replaces the inner product with a multi-layer perceptron that can learn an arbitrary function from data.

Two implementations of NCF are Generalized Matrix Factorization (GMF) and Multi-Layer Perceptron (MLP). GMF applies a linear kernel to model the latent feature interactions, and MLP uses a nonlinear kernel to learn the interaction function from data. NeuMF is a fused model of GMF and MLP to better model complex user-item interactions, and unifies the strengths of linearity of MF and non-linearality of MLP for modeling the user-item latent structures. NeuMF allows GMF and MLP to learn separate embeddings, and combines the two models by concatenating their last hidden layer. defines the architecture details.

The instructions below assume you are already familiar with training a model on Cloud TPU. If you are new to Cloud TPU, refer to the Quickstart for a basic introduction.


The MovieLens datasets are used for model training and evaluation. Specifically, we use two datasets: ml-1m (short for MovieLens 1 million) and ml-20m (short for MovieLens 20 million).


ml-1m dataset contains 1,000,209 anonymous ratings of approximately 3,706 movies made by 6,040 users who joined MovieLens in 2000. All ratings are contained in the file "ratings.dat" without a header row, and are in the following format:


  • UserIDs range between 1 and 6040.
  • MovieIDs range between 1 and 3952.
  • Ratings are made on a 5-star scale (whole-star ratings only).


ml-20m dataset contains 20,000,263 ratings of 26,744 movies by 138493 users. All ratings are contained in the file "ratings.csv". Each line of this file after the header row represents one rating of one movie by one user, and has the following format:


The lines within this file are ordered first by userId, then, within user, by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars). In both datasets, the timestamp is represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970. Each user has at least 20 ratings.


  • Create a Cloud Storage bucket to hold your dataset and model output
  • Prepare the MovieLens dataset
  • Set up a Compute Engine VM and Cloud TPU node for training and evaluation
  • Run training and evaluation


This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

Before starting this tutorial, check that your Google Cloud project is correctly set up.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.

Set up your resources

This section provides information on setting up Cloud Storage, VM, and Cloud TPU resources for this tutorial.

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create an environment variable for your project's ID.

    export PROJECT_ID=project-id
  3. Configure gcloud command-line tool to use the project where you want to create the Cloud TPU.

    gcloud config set project ${PROJECT_ID}

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make GCP API calls with your credentials.

  4. Create a Service Account for the Cloud TPU project.

    gcloud beta services identity create --service --project $PROJECT_ID

    The command returns a Cloud TPU Service Account with following format:
  5. Create a Cloud Storage bucket using the following command:

    gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 -b on gs://bucket-name

    This Cloud Storage bucket stores the data you use to train your model and the training results. The ctpu up tool used in this tutorial sets up default permissions for the Cloud TPU Service Account you set up in the previous step. If you want finer-grain permissions, review the access level permissions.

    The bucket location must be in the same region as your virtual machine (VM) and your TPU node. VMs and TPU nodes are located in specific zones, which are subdivisions within a region.

  6. Launch a Compute Engine VM using the ctpu up command.

    $ ctpu up --project=${PROJECT_ID} \
     --zone=europe-west4-a \
     --vm-only \
     --disk-size-gb=300 \
     --machine-type=n1-standard-8 \
     --name=ncf-tutorial \

    Command flag descriptions

    Your GCP project ID
    The zone where you plan to create your Cloud TPU.
    Create a VM only. By default the ctpu up command creates a VM and a Cloud TPU.
    The size of the disk for the VM in GB.
    The machine type of the VM the ctpu up command creates.
    The name of the Compute Engine VM to create.
    The version of Tensorflow ctpu installs on the VM.
  7. The configuration you specified appears. Enter y to approve or n to cancel.

  8. When the ctpu up command has finished executing, verify that your shell prompt has changed from username@projectname to username@vm-name. This change shows that you are now logged into your Compute Engine VM.

    gcloud compute ssh ncf-tutorial --zone=europe-west4-a

    As you continue these instructions, run each command that begins with (vm)$ in your VM session window.

Prepare the data

  1. Add an environment variable for your storage bucket. Replace bucket-name with your bucket name.

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
  2. Add an environment variable for the data directory.

    (vm)$ export DATA_DIR=${STORAGE_BUCKET}/ncf_data
  3. Add an environment variable for the Python path.

    (vm)$ export PYTHONPATH="${PYTHONPATH}:/usr/share/models"
  4. Generate training and evaluation data for the ml-20m dataset in DATA_DIR:

    (vm)$ python3 /usr/share/models/official/recommendation/ \
        --dataset ml-20m \
        --num_train_epochs 4 \
        --meta_data_file_path ${DATA_DIR}/metadata \
        --eval_prebatch_size 160000 \
        --data_dir ${DATA_DIR}

This script generates and preprocesses the dataset on your VM. Preprocessing converts the data into TFRecord format required by the model. The download and pre-processing takes approximately 25 minutes and generates output similar to the following:

I0804 23:03:02.370002 139664166737728] Successfully downloaded /tmp/tmpicajrlfc/ 198702078 bytes
I0804 23:04:42.665195 139664166737728] Beginning data preprocessing.
I0804 23:04:59.084554 139664166737728] Generating user_map and item_map...
I0804 23:05:20.934210 139664166737728] Sorting by user, timestamp...
I0804 23:06:39.859857 139664166737728] Writing raw data cache.
I0804 23:06:42.375952 139664166737728] Data preprocessing complete. Time: 119.7 sec.
%lt;BisectionDataConstructor(Thread-1, initial daemon)>
  Num users: 138493
  Num items: 26744

  Positive count:          19861770
  Batch size:              99000
  Batch count per epoch:   1004

  Positive count:          138493
  Batch size:              160000
  Batch count per epoch:   866

I0804 23:07:14.137242 139664166737728] Negative total vector built. Time: 31.8 seconds
I0804 23:11:25.013135 139664166737728] Epoch construction complete. Time: 250.9 seconds
I0804 23:15:46.391308 139664166737728] Eval construction complete. Time: 261.4 seconds
I0804 23:19:54.345858 139664166737728] Epoch construction complete. Time: 248.0 seconds
I0804 23:24:09.182484 139664166737728] Epoch construction complete. Time: 254.8 seconds
I0804 23:28:26.224653 139664166737728] Epoch construction complete. Time: 257.0 seconds

Set up and start training the Cloud TPU

  1. Run the following command to create your Cloud TPU.

    (vm)$ ctpu up --project=${PROJECT_ID} \
      --tpu-only \
      --tpu-size=v3-8 \
      --zone=europe-west4-a \
      --name=ncf-tutorial \

    Command flag descriptions

    Your GCP project ID
    Creates the Cloud TPU without creating a VM. By default the ctpu up command creates a VM and a Cloud TPU.
    The type of the Cloud TPU to create.
    The zone where you plan to create your Cloud TPU.
    The name of the Cloud TPU to create.
    The version of Tensorflow ctpu installs on the VM.
  2. The configuration you specified appears. Enter y to approve or n to cancel.

    You will see a message: Operation success; not ssh-ing to Compute Engine VM due to --tpu-only flag. Since you previously completed SSH key propagation, you can ignore this message.

  3. Add an environment variable for your Cloud TPU's name.

    (vm)$ export TPU_NAME=ncf-tutorial

Run the training and evaluation

The following script runs a sample training for 3 epochs,

  1. Add an environment variable for the Model directory to save checkpoints and TensorBoard summaries:

    (vm)$ export MODEL_DIR=${STORAGE_BUCKET}/ncf
  2. Run the following command to train the NCF model:

    (vm)$ python3 /usr/share/models/official/recommendation/ \
         --model_dir=${MODEL_DIR} \
         --data_dir=${DATA_DIR} \
         --train_dataset_path=${DATA_DIR}/training_cycle_*/* \
         --eval_dataset_path=${DATA_DIR}/eval_data/* \
         --input_meta_data_path=${DATA_DIR}/metadata \
         --learning_rate=3e-5 \
         --train_epochs=3 \
         --dataset=ml-20m \
         --eval_batch_size=160000 \
         --learning_rate=0.00382059 \
         --beta1=0.783529 \
         --beta2=0.909003 \
         --epsilon=1.45439e-07 \
         --dataset=ml-20m \
         --num_factors=64 \
         --hr_threshold=0.635 \
         --keras_use_ctl=true \
         --layers=256,256,128,64 \
         --use_synthetic_data=false \
         --distribution_strategy=tpu \

The training and evaluation takes about 2 minutes and generates final output similar to:

I0805 21:23:05.134161 139825684965184] Done training epoch 3, epoch loss=0.097
I0805 21:23:06.722786 139825684965184] Done eval epoch 3, hit_rate=0.585
I0805 21:23:16.005549 139825684965184] Saving model as TF checkpoint: gs://gm-bucket-eu/ncf/ctl_checkpoint
I0805 21:23:16.058367 139825684965184] Result is {'loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.09678721>, 'eval_loss': None, 'eval_hit_rate': , 'step_timestamp_log': ['BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTim
estamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp',
 'BatchTimestamp', 'BatchTimestamp', 'BatchTimestamp', 'BatchTime
stamp', 'BatchTimestamp', 'BatchTimestamp'], 'train_finish_time': 1596662568.102817, 'avg_exp_per_second': 4474047.406912873}

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Clean up the Compute Engine VM instance and Cloud TPU resources.

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  2. In your VM or Cloud Shell, run ctpu delete with the --name and --zone flags you used when you set up the Cloud TPU to delete your Cloud TPU:

    $ ctpu delete --project=${PROJECT_ID} \
      --name=ncf-tutorial \
  3. Run the following command to verify the Compute Engine VM and Cloud TPU have been shut down:

    $ ctpu status --project=${PROJECT_ID} \
      --name=ncf-tutorial \

    The deletion might take several minutes. A response like the one below indicates there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "europe-west4-a"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
  4. Run gsutil as shown, replacing bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://bucket-name

What's next

In this tutorial you have trained the NCF model using a sample dataset. The results of this training are (in most cases) not usable for inference. To use a model for inference you can train the data on a publicly available dataset or your own data set. Models trained on Cloud TPUs require datasets to be in TFRecord format.

You can use the dataset conversion tool sample to convert an image classification dataset into TFRecord format. If you are not using an image classification model, you will have to convert your dataset to TFRecord format yourself. For more information, see TFRecord and tf.Example

Hyperparameter tuning

To improve the model's performance with your dataset, you can tune the model's hyperparameters. You can find information about hyperparameters common to all TPU supported models on GitHub. Information about model-specific hyperparameters can be found in the source code for each model. For more information on hyperparameter tuning, see Overview of hyperparameter tuning, Using the Hyperparameter tuning service and Tune hyperparameters.


Once you have trained your model you can use it for inference (also called prediction). AI Platform is a cloud-based solution for developing, training, and deploying machine learning models. Once a model is deployed, you can use the AI Platform Prediction service.