Running the Transformer with Tensor2Tensor on Cloud TPU

This tutorial shows you how to train the Transformer model (from Attention Is All You Need) with Tensor2Tensor on a Cloud TPU.

Model description

The Transformer model uses stacks of self-attention layers and feed-forward layers to process sequential input like text. It supports the following variants:

  • transformer (encoder-decoder) for sequence-to-sequence modeling. Example use case: translation.
  • transformer (decoder-only) for single-sequence modeling. Example use case: language modeling.
  • transformer_encoder (encoder-only) for sequence-to-class modeling. Example use case: sentiment classification.

The Transformer is just one of the models in the Tensor2Tensor library. Tensor2Tensor (T2T) is a library of deep learning models and datasets as well as a set of scripts that allow you to train the models and to download and prepare the data.
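If you have tensor2tensor installed, you can list every model, problem, and hyperparameter set registered in the library; the output should include the transformer variants described above:

$ t2t-trainer --registry_help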

Before you begin

Before starting this tutorial, check that your Cloud project is correctly set up, and create a Compute Engine VM and a TPU resource.

In this tutorial, you manually set up the VM and TPU instances, but T2T can also create those instances for you automatically with its --cloud_tpu flag. See T2T's Cloud TPU docs for more information, and the sketch below.
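As a rough sketch, a --cloud_tpu invocation combines that flag with the usual training flags shown later in this tutorial (using the environment variables defined later as well); consult T2T's Cloud TPU docs for the exact set of companion flags your T2T version supports:

$ t2t-trainer \
  --cloud_tpu \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problems=translate_ende_wmt32k_packed \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --train_steps=10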

Set up your Cloud project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. This walkthrough uses billable components of Google Cloud Platform. Check the Cloud TPU pricing page to estimate your costs, and follow the instructions to clean up resources when you've finished with them.

  5. Check your TPU quota to ensure that you have the quota to create a TPU resource.

    Go to TPU quota page

    If you don't have TPU quota, request quota using this link.

  6. Install the gcloud command-line tool via the Cloud SDK.

    Install the Cloud SDK

  7. Install the gcloud beta components, which include the commands necessary to create Cloud TPU resources.

    $ gcloud components install beta

Create a Compute Engine VM and a TPU resource

Set up a Compute Engine virtual machine (VM) instance and attach a TPU resource. This section shows you how to create both from your local machine.

  1. Use the gcloud command-line tool to specify your Cloud Platform project:

    $ gcloud config set project your-cloud-project

    where your-cloud-project is the name of your Cloud Platform project with access to TPU quota.

  2. Specify the zone where you plan to create your VM and TPU resource. For this tutorial, use the us-central1-c zone:

    $ gcloud config set compute/zone us-central1-c

    For reference, Cloud TPU is available in the following zones:

    • us-central1-c
  3. Create the Compute Engine VM instance:

    $ gcloud compute instances create tpu-demo-vm \
      --machine-type=n1-standard-4 \
      --image-project=ml-images \
      --image-family=tf-1-7 \
      --scopes=cloud-platform
    

    where:

    • tpu-demo-vm is a name for identifying the VM instance that you're creating.

    • --machine-type=n1-standard-4 is a standard machine type with 4 virtual CPUs and 15 GB of memory. See Machine Types for more machine types.

    • --image-project=ml-images is a shared collection of images that makes the tf-1-7 image available for your use.

    • --image-family=tf-1-7 is an image with the required pip package for TensorFlow.

    • --scopes=cloud-platform allows the VM to access Cloud Platform APIs.

  4. Create a new Cloud TPU resource. For this example, name the resource demo-tpu. Keep in mind that billing begins as soon as the TPU is created and continues until it is deleted. (Check the Cloud TPU pricing page to estimate your costs.) If you are using a dataset that requires a substantial download and processing phase, hold off on running this command until you are ready to use the TPU:

    $ gcloud beta compute tpus create demo-tpu \
      --range=10.240.1.0/29 --version=1.7

    --range specifies the network range used by the created TPU resource and can be any value in 10.240.*.*/29. For this example, use 10.240.1.0/29.
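After the create command returns, you can confirm the TPU resource from your local machine; its STATUS column should read READY before you use it:

$ gcloud beta compute tpus list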

Connect to your VM

Connect to your Compute Engine VM using the gcloud compute ssh command, with port forwarding for TensorBoard:

$ gcloud compute ssh tpu-demo-vm -- -L 6006:localhost:6006

Alternatively, you can SSH into your Compute Engine VM from the Google Cloud Platform Console: Go to Compute Engine -> VM instances. Find the tpu-demo-vm instance in the list of instances, and click SSH to connect to it.

Add disk space to your VM

T2T conveniently packages data generation for many common open-source datasets in its t2t-datagen script. The script downloads the data, preprocesses it, and makes it ready for training. To do so, it needs local disk space.

You can skip this step if you run t2t-datagen on your local machine (pip install tensor2tensor, then run the t2t-datagen command shown in the sketch below).
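For example, assuming your local gcloud credentials can write to the Cloud Storage bucket you create below, the local equivalent looks like this sketch, where /tmp/t2t_tmp is an arbitrary local scratch directory:

$ pip install tensor2tensor
$ t2t-datagen --problem=translate_ende_wmt32k_packed \
    --data_dir=gs://your-bucket-name/data/ --tmp_dir=/tmp/t2t_tmp

Otherwise, add a disk to the VM as follows: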

  • Follow the Compute Engine guide to add a disk to your Compute Engine VM.
  • Set the disk size to 200 GB (the recommended minimum size).
  • Set When deleting instance to Delete disk to ensure that the disk is removed when you remove the VM.

Make a note of the path to your new disk. For example: /mnt/disks/mnt-dir.
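If the attached disk is not yet formatted and mounted, the commands from the Compute Engine guide look like the following sketch. The device name /dev/sdb is an assumption; check the actual name (for example with lsblk) before formatting:

(vm)$ sudo mkfs.ext4 -m 0 -F -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb
(vm)$ sudo mkdir -p /mnt/disks/mnt-dir
(vm)$ sudo mount -o discard,defaults /dev/sdb /mnt/disks/mnt-dir
(vm)$ sudo chmod a+w /mnt/disks/mnt-dir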

Generate the training dataset

On your Compute Engine VM:

  1. Create the following environment variables:

    (vm)$ TPU_ZONE=your-zone
    (vm)$ STORAGE_BUCKET=gs://your-bucket-name
    (vm)$ DATA_DIR=$STORAGE_BUCKET/data/
    (vm)$ TMP_DIR=/mnt/disks/mnt-dir/t2t_tmp

    where:

    • TPU_ZONE is the zone where you created the TPU resource. For example: us-central1-c. Cloud TPU is available in the following zones:
      • us-central1-c
    • your-bucket-name is the name of the Cloud Storage bucket you want to create.
    • DATA_DIR is a location on Cloud Storage where the generated training data is stored.
    • TMP_DIR is a location on the disk that you added to your Compute Engine VM at the start of the tutorial.
  2. Use the gsutil mb command to create a Cloud Storage bucket:

    (vm)$ gsutil mb -l ${TPU_ZONE} ${STORAGE_BUCKET} 

  3. Create a temporary directory on the disk that you added to your Compute Engine VM at the start of the tutorial:

    (vm)$ mkdir /mnt/disks/mnt-dir/t2t_tmp

  4. Use the t2t-datagen script to generate the training and evaluation data on the Cloud Storage bucket, so that the Cloud TPU can access the data:

    (vm)$ t2t-datagen --problem=translate_ende_wmt32k_packed --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR

You can view the data on Cloud Storage by going to the Google Cloud Platform Console and choosing Storage from the left-hand menu. Click the name of the bucket that you created for this tutorial. You should see sharded files named translate_ende_wmt32k_packed-train and translate_ende_wmt32k_packed-dev.
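You can also list the generated files from the VM's command line:

(vm)$ gsutil ls ${DATA_DIR}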

Give your TPU access to the data

You need to give your TPU read/write access to Cloud Storage objects. To do that, you must grant the required access to the service account used by the TPU. Follow these steps to find the TPU service account and grant the necessary access:

  1. List your TPUs to find their names:

    $ gcloud beta compute tpus list
  2. Use the describe command to find the service account of your TPU, where demo-tpu is the name of your TPU resource:

    $ gcloud beta compute tpus describe demo-tpu
  3. Copy the name of the TPU service account from the output of the describe command. The name has the format of an email address, like 12345-compute@my.serviceaccount.com.

  4. Log in to the Google Cloud Platform Console and choose the project in which you’re using the TPU.

  5. Choose IAM & Admin > IAM.

  6. Click the +Add button to add a member to the project.

  7. Enter the name of the TPU service account in the Members text box.

  8. Click the Roles dropdown list.

  9. Enable the following roles:

    • Project > Viewer
    • Logging > Logs Writer
    • Storage > Storage Object Admin
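Alternatively, you can grant the same three roles from the command line. This sketch assumes your-cloud-project is your project name and uses the example service account address from step 3; substitute the values you copied earlier:

$ SVC_ACCT=12345-compute@my.serviceaccount.com
$ gcloud projects add-iam-policy-binding your-cloud-project \
    --member=serviceAccount:$SVC_ACCT --role=roles/viewer
$ gcloud projects add-iam-policy-binding your-cloud-project \
    --member=serviceAccount:$SVC_ACCT --role=roles/logging.logWriter
$ gcloud projects add-iam-policy-binding your-cloud-project \
    --member=serviceAccount:$SVC_ACCT --role=roles/storage.objectAdmin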

Train an English-German translation model

Run the following commands on your Compute Engine VM:

  1. Use the gcloud command-line tool to specify your Cloud Platform project:

    (vm)$ gcloud config set project your-cloud-project

    where your-cloud-project is the name of your Cloud Platform project with access to TPU quota.

  2. Specify the zone where you plan to create your VM and TPU resource. For this tutorial, use the us-central1-c zone:

    (vm)$ gcloud config set compute/zone us-central1-c

    For reference, Cloud TPU is available in the following zones:

    • us-central1-c

  3. Set up environment variables for the TPU machine's IP and port. To find the IP address, run the following command:

    (vm)$ gcloud beta compute tpus list

    The above command prints the TPU's IP address and port under NETWORK_ENDPOINT:

    NAME       ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT   NETWORK  RANGE          STATUS
    demo-tpu   us-central1-c  v2-8              10.240.1.2:8470    default  10.240.1.0/29  READY
    

    Set these environment variables:

    (vm)$ TPU_IP=10.240.1.2
    (vm)$ TPU_MASTER=grpc://$TPU_IP:8470

  4. Set up an environment variable for the training directory, which must be a Cloud Storage location:

    (vm)$ OUT_DIR=$STORAGE_BUCKET/training/transformer_ende_1

  5. Run t2t-trainer to train and evaluate the model:

    (vm)$ t2t-trainer \
      --model=transformer \
      --hparams_set=transformer_tpu \
      --problems=translate_ende_wmt32k_packed \
      --train_steps=10 \
      --eval_steps=3 \
      --data_dir=$DATA_DIR \
      --output_dir=$OUT_DIR \
      --use_tpu=True \
      --master=$TPU_MASTER

    The above command runs 10 training steps, then 3 evaluation steps. You can (and should) increase the number of training steps by adjusting the --train_steps flag. Translations usually begin to be reasonable after ~40k steps. The model typically converges to its maximum quality after ~250k steps.

  6. View the output in your Cloud Storage bucket by going to the Google Cloud Platform Console and choosing Storage from the left-hand menu. Click the name of the bucket that you created for this tutorial. Within the bucket, navigate to the training directory, for example, /training/transformer_ende_1, to see the model output. You can launch TensorBoard pointing at that directory to see training and evaluation metrics, as shown below.
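For example, because you connected to the VM with port forwarding on port 6006, you can run TensorBoard on the VM and then browse to http://localhost:6006 on your local machine. This assumes TensorBoard on the VM can read gs:// paths, which TensorFlow's Cloud Storage filesystem support typically provides:

(vm)$ tensorboard --logdir=$OUT_DIR --port=6006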

Train a language model

You can use the transformer model for language modeling as well. Run the following command to generate the training data:

(vm)$ t2t-datagen --problem=languagemodel_lm1b8k_packed --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR

Run the following command to train and evaluate the model:

(vm)$ t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problems=languagemodel_lm1b8k_packed \
  --train_steps=10 \
  --eval_steps=8 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --master=$TPU_MASTER

This model converges after approximately 250,000 steps.

Train a sentiment classifier

You can use the transformer_encoder model for sentiment classification. Run the following command to generate the training data:

(vm)$ t2t-datagen --problem=sentiment_imdb --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR

Run the following command to train and evaluate the model:

(vm)$ t2t-trainer \
  --model=transformer_encoder \
  --hparams_set=transformer_tiny_tpu \
  --problems=sentiment_imdb \
  --train_steps=10 \
  --eval_steps=2 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --master=$TPU_MASTER

This model achieves approximately 85% accuracy after about 2,000 steps.

Clean up

When you've finished with the tutorial, clean up the VM and TPU resource to avoid incurring extra charges to your Google Cloud Platform account.

If you haven't set the project and zone for this session, do so before running the cleanup procedure.

  1. Use the gcloud command-line tool to delete your Cloud TPU resource:

    (vm)$ gcloud beta compute tpus delete demo-tpu
  2. Disconnect from the Compute Engine VM instance:

    (vm)$ exit
  3. Use the gcloud command-line tool to delete your Compute Engine instance:

    $ gcloud compute instances delete tpu-demo-vm
  4. Go to the VPC Networking page in the Google Cloud Platform Console.

    Go to the VPC Networking page
  5. Select the VPC peering entry that Google automatically created as part of the Cloud TPU setup. The ID of the entry starts with cp-to-tp-peering.

  6. At the top of the VPC Networking page, click Delete to delete the selected peering entry.

  7. Go to the Network Routes page in the Google Cloud Platform Console.

    Go to the Network Routes page
  8. Select the route that Google automatically created as part of the Cloud TPU setup. The ID of the route starts with peering-route.

  9. At the top of the Network Routes page, click Delete to delete the selected route.
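To verify the cleanup, list the remaining resources; demo-tpu and tpu-demo-vm should no longer appear:

$ gcloud beta compute tpus list
$ gcloud compute instances list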

When you've finished examining the data, use the gsutil command to delete any Cloud Storage buckets you created during this tutorial. (See the Cloud Storage pricing guide for free storage limits and other pricing information.) Replace your-bucket-name with the name of your Cloud Storage bucket:

$ gsutil rm -r gs://your-bucket-name
