Training DLRM on Cloud TPU using PyTorch

This tutorial shows you how to train the Facebook Research DLRM (Deep Learning Recommendation Model) on a Cloud TPU.

Objectives

  • Create and configure the PyTorch environment
  • Run the training job with fake data
  • (Optional) Train on Criteo Kaggle dataset

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud TPU

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

Before starting this tutorial, check that your Google Cloud project is correctly set up.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you have finished with them to avoid unnecessary charges.

Set up a Compute Engine instance

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create a variable for your project's ID, replacing project-id with your Google Cloud project ID.

    export PROJECT_ID=project-id
    
  3. Configure the gcloud command-line tool to use the project where you want to create the Cloud TPU.

    gcloud config set project ${PROJECT_ID}
    

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make GCP API calls with your credentials.

  4. From the Cloud Shell, launch the Compute Engine resource required for this tutorial. Note: Use an n1-highmem-96 machine type if you plan to train on the Criteo Kaggle dataset.

    gcloud compute instances create dlrm-tutorial \
    --zone=us-central1-a \
    --machine-type=n1-standard-64 \
    --image-family=torch-xla \
    --image-project=ml-images  \
    --boot-disk-size=200GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform
    
  5. Connect to the new Compute Engine instance.

    gcloud compute ssh dlrm-tutorial --zone=us-central1-a
    

Launch a Cloud TPU resource

  1. From the Compute Engine virtual machine, launch a Cloud TPU resource using the following command:

    (vm) $ gcloud compute tpus create dlrm-tutorial \
    --zone=us-central1-a \
    --network=default \
    --version=pytorch-1.7  \
    --accelerator-type=v3-8
    
  2. Identify the IP address for the Cloud TPU resource.

    (vm) $ gcloud compute tpus list --zone=us-central1-a
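    The list output includes the TPU's network endpoint. If you want to pull out just the IP address, a small shell filter works. The sample line below is hypothetical and only illustrates the shape of the output; your values will differ:

```shell
# Hypothetical sample row from `gcloud compute tpus list`; the real
# command prints your TPU's actual endpoint and status.
sample="dlrm-tutorial  us-central1-a  v3-8  default  pytorch-1.7  10.240.1.2:8470  READY"
# Keep only the dotted-quad IP address, dropping the :8470 port:
echo "$sample" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
# prints 10.240.1.2
```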
    

Create and configure the PyTorch environment

  1. Activate the conda environment.

    (vm) $ conda activate torch-xla-1.7
    
  2. Configure environment variables for the Cloud TPU resource, replacing ip-address with the IP address you identified in the previous step.

    (vm) $ export TPU_IP_ADDRESS=ip-address
    
    (vm) $ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
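    As a quick sanity check, the assembled XRT_TPU_CONFIG value should have the form tpu_worker;0;<ip>:8470. For example, with a hypothetical address of 10.240.1.2:

```shell
# Hypothetical IP address; substitute the one you identified with
# `gcloud compute tpus list`.
export TPU_IP_ADDRESS=10.240.1.2
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
echo "$XRT_TPU_CONFIG"
# prints tpu_worker;0;10.240.1.2:8470
```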
    

Run the training job with fake data

  1. Get the TPU-compatible DLRM code by running:

    (vm) $ git clone --recursive https://github.com/pytorch-tpu/examples.git
    
  2. Install dependencies.

    (vm) $ pip install onnx
    
  3. Run the model on random data.

    (vm) $ python examples/deps/dlrm/dlrm_tpu_runner.py \
        --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
        --arch-sparse-feature-size=64 \
        --arch-mlp-bot=512-512-64 \
        --arch-mlp-top=1024-1024-1024-1 \
        --arch-interaction-op=dot \
        --lr-num-warmup-steps=10 \
        --lr-decay-start-step=10 \
        --mini-batch-size=2048 \
        --num-batches=1000 \
        --data-generation='random' \
        --numpy-rand-seed=727 \
        --print-time \
        --print-freq=100 \
        --num-indices-per-lookup=100 \
        --use-tpu \
        --num-indices-per-lookup-fixed \
        --tpu-model-parallel-group-len=8 \
        --tpu-metrics-debug \
        --tpu-cores=8
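    The --arch-embedding-size flag above defines eight embedding tables of 1,000,000 rows each, and --arch-sparse-feature-size=64 sets the embedding dimension. A back-of-the-envelope calculation (assuming 4-byte fp32 weights, an assumption rather than a measured value) suggests why the tables are sharded across all eight cores with --tpu-model-parallel-group-len=8:

```python
# Rough embedding-memory estimate for the fake-data run above,
# assuming fp32 (4-byte) weights.
tables = 8            # entries in --arch-embedding-size
rows = 1_000_000      # rows per table
dim = 64              # --arch-sparse-feature-size
bytes_per_weight = 4  # fp32

total_bytes = tables * rows * dim * bytes_per_weight
print(f"total:    {total_bytes / 2**30:.2f} GiB")    # ~1.91 GiB across all tables
print(f"per core: {total_bytes / 8 / 2**20:.0f} MiB")  # sharded over 8 TPU cores
```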
    

(Optional) Train on Criteo Kaggle dataset

These steps are optional. Run them only if you want to train on the Criteo Kaggle dataset.

  1. Download the dataset.

    Download the dataset from the Criteo Kaggle dataset page, following the instructions here. When the download is complete, copy the dac.tar.gz file into a directory named ./criteo-kaggle/. Use the tar -xzvf command to extract the contents of the tar.gz file into the ./criteo-kaggle directory.

     (vm) $ mkdir criteo-kaggle
     (vm) $ cd criteo-kaggle
     (vm) $ # Download dataset from above link here.
     (vm) $ tar -xzvf dac.tar.gz
     (vm) $ cd ..
    
  2. Preprocess the dataset.

    Run the following script to preprocess the Criteo dataset. The script produces a file named kaggleAdDisplayChallenge_processed.npz and takes more than 3 hours to complete.

    (vm) $ python examples/deps/dlrm/dlrm_data_pytorch.py \
        --data-generation=dataset \
        --data-set=kaggle \
        --raw-data-file=criteo-kaggle/train.txt \
        --mini-batch-size=128 \
        --test-mini-batch-size=16384 \
        --test-num-workers=4 
    
  3. Verify the preprocessing was successful.

    You should see the kaggleAdDisplayChallenge_processed.npz file in the criteo-kaggle directory.
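    Beyond checking that the file exists, you can optionally list the arrays inside the archive with NumPy. This is a sketch only; the key names in the archive are whatever dlrm_data_pytorch.py wrote, so none are assumed here:

```python
import numpy as np

def inspect_npz(path):
    """Return a mapping of array name -> shape for a .npz archive."""
    with np.load(path, allow_pickle=True) as data:
        return {name: data[name].shape for name in data.files}

# Example (run on the VM after preprocessing completes):
# print(inspect_npz("criteo-kaggle/kaggleAdDisplayChallenge_processed.npz"))
```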

  4. Run the training script on pre-processed Criteo Kaggle dataset.

    (vm) $ python examples/deps/dlrm/dlrm_tpu_runner.py \
        --arch-sparse-feature-size=16 \
        --arch-mlp-bot="13-512-256-64-16" \
        --arch-mlp-top="512-256-1" \
        --data-generation=dataset \
        --data-set=kaggle \
        --raw-data-file=criteo-kaggle/train.txt \
        --processed-data-file=criteo-kaggle/kaggleAdDisplayChallenge_processed.npz \
        --loss-function=bce \
        --round-targets=True \
        --learning-rate=0.1 \
        --mini-batch-size=128 \
        --print-freq=1024 \
        --print-time \
        --test-mini-batch-size=16384 \
        --test-num-workers=4 \
        --test-freq=101376 \
        --use-tpu \
        --num-indices-per-lookup=1 \
        --num-indices-per-lookup-fixed \
        --tpu-model-parallel-group-len 8 \
        --tpu-metrics-debug \
        --tpu-cores=8
    

    Training should complete in just over 2 hours with an accuracy of 78.75% or higher.

Cleaning up

Perform a cleanup to avoid incurring unnecessary charges to your account after using the resources you created:

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm) $ exit
    

    Your prompt should now be user@projectname, showing you are in the Cloud Shell.

  2. In your Cloud Shell, use the gcloud command-line tool to delete the Compute Engine instance:

    $ gcloud compute instances delete dlrm-tutorial --zone=us-central1-a
    
  3. Use the gcloud command-line tool to delete the Cloud TPU resource:

    $ gcloud compute tpus delete dlrm-tutorial --zone=us-central1-a
    

What's next

Try the PyTorch colabs: