Training using the built-in NCF algorithm

Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in NCF algorithm works, and how to use it.

Overview

This built-in algorithm only does training:

  1. Training: Using the dataset and the model parameters you supplied, AI Platform Training runs training using TensorFlow's NCF implementation.

Limitations

The following features are not supported for training with the built-in NCF algorithm:

  • Automated Data Preprocessing This version of NCF requires input data to be in the form of TFRecords for both training and output. A training application must be made to handle unformatted input automatically.

Supported machine types

The following AI Platform Training scale tiers and machine types are supported:

  • BASIC scale tier
  • BASIC_GPU scale tier
  • BASIC_TPU scale tier
  • CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training.
  • CUSTOM scale tier with any of the following legacy machine types:
    • standard
    • large_model
    • complex_model_s
    • complex_model_m
    • complex_model_l
    • standard_gpu
    • standard_p100
    • standard_v100
    • large_model_v100
    • complex_model_m_gpu
    • complex_model_l_gpu
    • complex_model_m_p100
    • complex_model_m_v100
    • complex_model_l_v100
    • TPU_V2 (8 cores)

It is recommended that a machine type with access to TPUs or GPUs is used.

Format input data

Ensure that input and evaluation data are in the form of TFRecords before training the model.

Check Cloud Storage bucket permissions

To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.

Submit a NCF training job

This section explains how to submit a training job using the built-in NCF algorithm.

You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in NCF algorithm.

Console

  1. Go to the AI Platform Training Jobs page in the Google Cloud console:

    AI Platform Training Jobs page

  2. Click the New training job button. From the options that display below, click Built-in algorithm training.

  3. On the Create a new training job page, select NCF and click Next.

  4. To learn more about all the available parameters, follow the links in the Google Cloud console and refer to the built-in NCF reference for more details.

gcloud

  1. Set environment variables for your job:

       # Specify the name of the Cloud Storage bucket where you want your
       # training outputs to be stored, and the Docker container for
       # your built-in algorithm selection.
       BUCKET_NAME='BUCKET_NAME'
       IMAGE_URI='gcr.io/cloud-ml-algos/ncf:latest'
    
       # Specify the Cloud Storage path to your training input data.
       DATA_DIR='gs://$BUCKET_NAME/ncf_data'
    
       DATE="$(date '+%Y%m%d_%H%M%S')"
       MODEL_NAME='MODEL_NAME'
       JOB_ID="${MODEL_NAME}_${DATE}"
    
       JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"
    

    Replace the following:

    • BUCKET_NAME: The name of the Cloud Storage bucket where you want training outputs to be stored.
    • MODEL_NAME: A name for your model, to identify where model artifacts get stored in your Cloud Storage bucket.
  2. Submit the training job using gcloud ai-platform jobs training submit. Adjust this generic example to work with your dataset:

       gcloud ai-platform jobs submit training $JOB_ID \
          --master-image-uri=$IMAGE_URI --scale-tier=BASIC_TPU --job-dir=$JOB_DIR \
          -- \
          --train_dataset_path=${DATA_DIR}/training_cycle_*/* \
          --eval_dataset_path=${DATA_DIR}/eval_data/* \
          --input_meta_data_path=${DATA_DIR}/metadata \
          --learning_rate=3e-5 \
          --train_epochs=3 \
          --eval_batch_size=160000 \
          --learning_rate=0.00382059 \
          --beta1=0.783529 \
          --beta2=0.909003 \
          --epsilon=1.45439e-07 \
          --num_factors=64 \
          --hr_threshold=0.635 \
          --keras_use_ctl=true \
          --layers=256,256,128,64 \
    
  3. Monitor the status of your training job by viewing logs with gcloud. Refer to gcloud ai-platform jobs describe and gcloud ai-platform jobs stream-logs.

       gcloud ai-platform jobs describe ${JOB_ID}
       gcloud ai-platform jobs stream-logs ${JOB_ID}
    

Further learning resources