This tutorial shows you how to train Facebook Research DLRM on a Cloud TPU.
Objectives
- Create and configure the PyTorch environment
- Run the training job with fake data
- (Optional) Train on Criteo Kaggle dataset
Costs
In this document, you use the following billable components of Google Cloud:
- Compute Engine
- Cloud TPU
To generate a cost estimate based on your projected usage,
use the pricing calculator.
Before you begin
Before starting this tutorial, check that your Google Cloud project is correctly set up.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Learn how to check if billing is enabled on a project.
This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you have finished with them to avoid unnecessary charges.
Set up a Compute Engine instance
Open a Cloud Shell window.
Create a variable for your project's ID.
export PROJECT_ID=project-id
Configure Google Cloud CLI to use the project where you want to create Cloud TPU.
gcloud config set project ${PROJECT_ID}
The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make Google Cloud API calls with your credentials.
From the Cloud Shell, launch the Compute Engine resource required for this tutorial.
Note: Use an n1-highmem-96 machine type if you plan to train on the Criteo Kaggle dataset.
gcloud compute instances create dlrm-tutorial \
    --zone=us-central1-a \
    --machine-type=n1-standard-64 \
    --image-family=torch-xla \
    --image-project=ml-images \
    --boot-disk-size=200GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform
Connect to the new Compute Engine instance.
gcloud compute ssh dlrm-tutorial --zone=us-central1-a
Launch a Cloud TPU resource
From the Compute Engine virtual machine, launch a Cloud TPU resource using the following command:
(vm) $ gcloud compute tpus create dlrm-tutorial \
    --zone=us-central1-a \
    --network=default \
    --version=pytorch-2.0 \
    --accelerator-type=v3-8
Identify the IP address for the Cloud TPU resource.
(vm) $ gcloud compute tpus describe dlrm-tutorial --zone=us-central1-a
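The describe output lists the node's network endpoint. As a convenience, you can pull the address out with standard text tools; the sketch below parses a sample of the output (the sample text and field layout are illustrative and may vary by gcloud version):

```shell
# Illustrative sample of `gcloud compute tpus describe` output.
# The real output for your node will differ.
describe_output='name: projects/my-project/locations/us-central1-a/nodes/dlrm-tutorial
networkEndpoints:
- ipAddress: 10.240.1.2
  port: 8470
state: READY'

# Pull the first ipAddress field out of the YAML-style output.
TPU_IP_ADDRESS=$(printf '%s\n' "$describe_output" \
  | sed -n 's/.*ipAddress: \([0-9.]*\).*/\1/p' | head -n 1)
echo "$TPU_IP_ADDRESS"
```

In practice you would pipe the real describe command's output through the same sed filter instead of the sample variable.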
Create and configure the PyTorch environment
Start a conda environment.
(vm) $ conda activate torch-xla-2.0
Configure environment variables for the Cloud TPU resource. Replace ip-address with the IP address you identified in the previous step.
(vm) $ export TPU_IP_ADDRESS=ip-address
(vm) $ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
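The XRT_TPU_CONFIG string has the fixed shape tpu_worker;0;<ip>:8470, where 8470 is the TPU node's gRPC port. A small sketch of how the value is composed, using a placeholder IP address (10.240.1.2 is an example, not your node's address):

```shell
# Example composition of XRT_TPU_CONFIG; 10.240.1.2 is a placeholder IP.
TPU_IP_ADDRESS="10.240.1.2"
XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
echo "$XRT_TPU_CONFIG"
```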
Run the training job with fake data
Install dependencies.
(vm) $ pip install onnx
Run the model on random data. This should take 5-10 minutes.
(vm) $ python /usr/share/torch-xla-2.0/tpu-examples/deps/dlrm/dlrm_tpu_runner.py \
    --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot=512-512-64 \
    --arch-mlp-top=1024-1024-1024-1 \
    --arch-interaction-op=dot \
    --lr-num-warmup-steps=10 \
    --lr-decay-start-step=10 \
    --mini-batch-size=2048 \
    --num-batches=1000 \
    --data-generation='random' \
    --numpy-rand-seed=727 \
    --print-time \
    --print-freq=100 \
    --num-indices-per-lookup=100 \
    --use-tpu \
    --num-indices-per-lookup-fixed \
    --tpu-model-parallel-group-len=8 \
    --tpu-metrics-debug \
    --tpu-cores=8
(Optional) Train on Criteo Kaggle dataset
These steps are optional. Run them only if you want to train on the Criteo Kaggle dataset.
Download the dataset.
Download the dataset from the Criteo Kaggle dataset page following the instructions there. When the download is complete, copy the dac.tar.gz file into a directory named ./criteo-kaggle/. Use the tar -xzvf command to extract the contents of the tar.gz file in the ./criteo-kaggle directory.
(vm) $ mkdir criteo-kaggle
(vm) $ cd criteo-kaggle
(vm) $ # Download dataset from above link here.
(vm) $ tar -xzvf dac.tar.gz
(vm) $ cd ..
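Before preprocessing, you can confirm the extraction produced the expected training file. This is a quick sketch; the path assumes the ./criteo-kaggle layout used above:

```shell
# Check that the extracted Criteo training file is in place.
# The path assumes the ./criteo-kaggle directory layout used above.
if [ -f criteo-kaggle/train.txt ]; then
  echo "train.txt found"
else
  echo "train.txt missing: re-check the download and extraction steps"
fi
```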
Preprocess the dataset.
Run this script to preprocess the Criteo dataset. The script produces a file named kaggleAdDisplayChallenge_processed.npz and takes more than 3 hours to complete.
(vm) $ python /usr/share/torch-xla-2.0/tpu-examples/deps/dlrm/dlrm_data_pytorch.py \
    --data-generation=dataset \
    --data-set=kaggle \
    --raw-data-file=criteo-kaggle/train.txt \
    --mini-batch-size=128 \
    --memory-map \
    --test-mini-batch-size=16384 \
    --test-num-workers=4
Verify the preprocessing was successful.
You should see the kaggleAdDisplayChallenge_processed.npz file in the criteo-kaggle directory.
Run the training script on the preprocessed Criteo Kaggle dataset.
(vm) $ python /usr/share/torch-xla-2.0/tpu-examples/deps/dlrm/dlrm_tpu_runner.py \
    --arch-sparse-feature-size=16 \
    --arch-mlp-bot="13-512-256-64-16" \
    --arch-mlp-top="512-256-1" \
    --data-generation=dataset \
    --data-set=kaggle \
    --raw-data-file=criteo-kaggle/train.txt \
    --processed-data-file=criteo-kaggle/kaggleAdDisplayChallenge_processed.npz \
    --loss-function=bce \
    --round-targets=True \
    --learning-rate=0.1 \
    --mini-batch-size=128 \
    --print-freq=1024 \
    --print-time \
    --test-mini-batch-size=16384 \
    --test-num-workers=4 \
    --memory-map \
    --test-freq=101376 \
    --use-tpu \
    --num-indices-per-lookup=1 \
    --num-indices-per-lookup-fixed \
    --tpu-model-parallel-group-len=8 \
    --tpu-metrics-debug \
    --tpu-cores=8
Training takes two or more hours and should reach an accuracy of 78.75% or higher.
Clean up
Perform a cleanup to avoid incurring unnecessary charges to your account after using the resources you created:
Disconnect from the Compute Engine instance, if you have not already done so:
(vm) $ exit
Your prompt should now be user@projectname, showing you are in the Cloud Shell.
In your Cloud Shell, use the Google Cloud CLI to delete the Compute Engine instance:
$ gcloud compute instances delete dlrm-tutorial --zone=us-central1-a
Use the Google Cloud CLI to delete the Cloud TPU resource.
$ gcloud compute tpus delete dlrm-tutorial --zone=us-central1-a
What's next
Try the PyTorch colabs:
- Getting Started with PyTorch on Cloud TPUs
- Training MNIST on TPUs
- Training ResNet18 on TPUs with Cifar10 dataset
- Inference with Pretrained ResNet50 Model
- Fast Neural Style Transfer
- MultiCore Training AlexNet on Fashion MNIST
- Single Core Training AlexNet on Fashion MNIST