Training Transformer on Cloud TPU (TF 2.x)

Transformer is a neural network architecture that solves sequence-to-sequence problems using attention mechanisms. Unlike traditional neural seq2seq models, Transformer does not use recurrent connections. The attention mechanism learns dependencies between tokens in two sequences. Because attention weights apply to all tokens in the sequences, the Transformer model can easily capture long-distance dependencies.

Transformer's overall structure follows the standard encoder-decoder pattern. The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the previously generated decoder tokens as inputs.

The model also applies embeddings to the input and output tokens, and adds a constant positional encoding that carries information about the position of each token.
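
As a point of reference, the constant positional encoding in the original Transformer is sinusoidal: each position maps to a vector of sines and cosines at different frequencies, so it can be computed ahead of time. The following is a minimal NumPy sketch, illustrative only and not the tutorial model's own code:

    import numpy as np

    def positional_encoding(max_length, d_model):
        """Sinusoidal encoding: PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
        positions = np.arange(max_length)[:, np.newaxis]        # shape (max_length, 1)
        dims = np.arange(d_model)[np.newaxis, :]                # shape (1, d_model)
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float64(d_model))
        angles = positions * angle_rates                        # shape (max_length, d_model)
        angles[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions: sine
        angles[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions: cosine
        return angles

    # Example: encodings for 64 positions at the "big" model dimension of 1024.
    print(positional_encoding(64, 1024).shape)  # (64, 1024)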

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud TPU

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

If you plan to train on a TPU Pod slice, review this document, which explains the special considerations for training on a Pod slice.

Before starting this tutorial, follow the steps below to check that your Google Cloud project is correctly set up.

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create a variable for your project's ID.

    export PROJECT_ID=project-id
    
  3. Configure the gcloud command-line tool to use the project where you want to create the Cloud TPU.

    gcloud config set project ${PROJECT_ID}
    
  4. Create a Cloud Storage bucket using the following command:

    gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 -b on gs://bucket-name
    
  5. Launch a Compute Engine VM using the ctpu up command. This example sets the zone to europe-west4-a, but you can set it to whatever zone you are planning to use for the Compute Engine VM and Cloud TPU.

    ctpu up --vm-only \
     --disk-size-gb=300 \
     --machine-type=n1-standard-8 \
     --zone=europe-west4-a \
     --tf-version=2.2 \
     --name=transformer-tutorial
    

    Command flag descriptions

    vm-only
    Create a VM only. By default the ctpu up command creates a VM and a Cloud TPU.
    disk-size-gb
    The size of the disk for the VM in GB.
    machine-type
    The machine type of the VM the ctpu up command creates.
    zone
    The zone where you plan to create your Cloud TPU.
    tf-version
    The version of TensorFlow ctpu installs on the VM.
    name
    The name of the Compute Engine VM and Cloud TPU to create.

    For more information on the CTPU utility, see the CTPU Reference.

  6. The configuration you specified appears. Enter y to approve or n to cancel.

  7. When the ctpu up command has finished executing, verify that your shell prompt has changed from username@projectname to username@vm-name. This change shows that you are now logged into your Compute Engine VM. If you are not connected to the VM, you can connect by running the following command:

    gcloud compute ssh transformer-tutorial --zone=europe-west4-a
    

As you continue these instructions, run each command that begins with (vm)$ in your VM session window.

Generate the training dataset

On your Compute Engine VM:

  1. Create the following environment variables. Replace bucket-name with your bucket name:

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    
    (vm)$ export GCS_DATA_DIR=$STORAGE_BUCKET/data/transformer
    (vm)$ export PARAM_SET=big
    (vm)$ export MODEL_DIR=$STORAGE_BUCKET/transformer/model_$PARAM_SET
    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/models"
    (vm)$ export DATA_DIR=${HOME}/transformer/data
    (vm)$ export VOCAB_FILE=${DATA_DIR}/vocab.ende.32768
    
  2. Change directory to the training directory:

    (vm)$ cd /usr/share/models/official/nlp/transformer
  3. Download and prepare the datasets:

    (vm)$ python3 data_download.py --data_dir=${DATA_DIR}
    (vm)$ gsutil cp -r ${DATA_DIR} ${GCS_DATA_DIR}
    

    data_download.py downloads and preprocesses the training and evaluation WMT datasets. After the data is downloaded and extracted, the training data is used to generate a vocabulary of subtokens. The evaluation and training strings are tokenized, and the resulting data is sharded, shuffled, and saved as TFRecords.

    1.75GB of compressed data is downloaded. In total, the raw files (compressed, extracted, and combined files) take up 8.4GB of disk space. The resulting TFRecord and vocabulary files are 722MB. The script takes around 40 minutes to run, with the bulk of the time spent downloading and ~15 minutes spent on preprocessing.
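
    If you want to sanity-check the generated shards, you can read a few records back with tf.data. In the sketch below, the file-name pattern and the feature keys ("inputs"/"targets") are assumptions about the script's output layout; adjust them to match what data_download.py actually wrote under DATA_DIR.

    import os
    import tensorflow as tf

    # Hypothetical sanity check of the generated training shards.
    data_dir = os.path.expanduser("~/transformer/data")
    files = tf.io.gfile.glob(os.path.join(data_dir, "*train*"))

    # Assumed schema: variable-length int64 subtoken IDs for source and target.
    features = {
        "inputs": tf.io.VarLenFeature(tf.int64),
        "targets": tf.io.VarLenFeature(tf.int64),
    }

    for raw in tf.data.TFRecordDataset(files[:1]).take(2):
        example = tf.io.parse_single_example(raw, features)
        print("source subtokens:", tf.sparse.to_dense(example["inputs"]).numpy()[:10])
        print("target subtokens:", tf.sparse.to_dense(example["targets"]).numpy()[:10])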

Train an English-German translation model on a single Cloud TPU

Run the following commands on your Compute Engine VM:

  1. Run the following command to create your Cloud TPU.

    (vm)$ ctpu up --tpu-only \
      --tpu-size=v3-8  \
      --zone=europe-west4-a \
      --tf-version=2.2 \
      --name=transformer-tutorial
    

    Command flag descriptions

    tpu-only
    Create a Cloud TPU only. By default the ctpu up command creates a VM and a Cloud TPU.
    tpu-size
    Specifies the type of Cloud TPU, for example v3-8.
    zone
    The zone where you plan to create your Cloud TPU. This should be the same zone you used for the Compute Engine VM. For example, europe-west4-a.
    tf-version
    The version of TensorFlow ctpu installs on the VM.
    name
    The name of the Cloud TPU to create.

    For more information on the CTPU utility, see the CTPU Reference.

    The configuration you specified appears. Enter y to approve or n to cancel.

    You will see a message: Operation success; not ssh-ing to Compute Engine VM due to --tpu-only flag. Since you previously completed SSH key propagation, you can ignore this message.

  2. Set the Cloud TPU name variable. This will either be a name you specified with the --name parameter to ctpu up or the default, your username:

    (vm)$ export TPU_NAME=transformer-tutorial
    
  3. Run the training script:

    (vm)$ python3 transformer_main.py \
        --tpu=${TPU_NAME} \
        --model_dir=${MODEL_DIR} \
        --data_dir=${GCS_DATA_DIR} \
        --vocab_file=${GCS_DATA_DIR}/vocab.ende.32768 \
        --bleu_source=${GCS_DATA_DIR}/newstest2014.en \
        --bleu_ref=${GCS_DATA_DIR}/newstest2014.de \
        --batch_size=6144 \
        --train_steps=2000 \
        --static_batch=true \
        --use_ctl=true \
        --param_set=big \
        --max_length=64 \
        --decode_batch_size=32 \
        --decode_max_length=97 \
        --padded_decode=true \
        --distribution_strategy=tpu

    Command flag descriptions

    tpu
    The name of the Cloud TPU. This is set by specifying the environment variable (TPU_NAME).
    model_dir
    The directory where checkpoints and summaries are stored during model training. If the folder is missing, the program creates one. When using a Cloud TPU, the model_dir must be a Cloud Storage path (`gs://...`). You can reuse an existing folder to load current checkpoint data and to store additional checkpoints as long as the previous checkpoints were created using Cloud TPU of the same size and TensorFlow version.
    data_dir
    The Cloud Storage path of the training input. In this example, it is set to the GCS_DATA_DIR directory that contains the preprocessed WMT dataset.
    vocab_file
    A file that contains the vocabulary for translation.
    bleu_source
    A file that contains source sentences for translation.
    bleu_ref
    A file that contains the reference for the translation sentences.
    train_steps
    The number of steps to train the model. One step processes one batch of data. This includes both a forward pass and back propagation.
    batch_size
    The training batch size.
    static_batch
    Specifies whether the batches in the dataset have static shapes.
    use_ctl
    Specifies whether the script runs with a custom training loop.
    param_set
    The parameter set to use when creating and training the model. The parameters define the input shape, model configuration, and other settings.
    max_length
    The maximum length of an example in the dataset.
    decode_batch_size
    The global batch size used for Transformer auto-regressive decoding on a Cloud TPU.
    decode_max_length
    The maximum sequence length of the decode/eval data. This is used by the Transformer auto-regressive decoding on a Cloud TPU to minimize the amount of required data padding.
    padded_decode
    Specifies whether the auto-regressive decoding runs with input data padded to the decode_max_length. For TPU/XLA-GPU runs, this flag must be set due to the static shape requirement.
    distribution_strategy
    To train the Transformer model on a Cloud TPU, set distribution_strategy to tpu. A rough sketch of the custom training loop implied by these flags follows this list.
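
Together, use_ctl=true and distribution_strategy=tpu mean training is driven by a custom training loop under a TPU distribution strategy, roughly along the lines of the sketch below. This is not the tutorial script's actual code: the model shown is a tiny stand-in, and the only value taken from the tutorial is the TPU name.

    import tensorflow as tf

    # Rough sketch of a custom training loop (use_ctl) on a TPU strategy; illustrative only.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="transformer-tutorial")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)  # TPUStrategy API in TF 2.2

    with strategy.scope():
        # Tiny stand-in model; the real script builds the full Transformer here.
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(32768, 64),
            tf.keras.layers.Dense(32768),
        ])
        optimizer = tf.keras.optimizers.Adam()

    @tf.function
    def train_step(iterator):
        """Runs one step on each TPU core; gradients are aggregated by the strategy."""
        def step_fn(inputs):
            features, labels = inputs
            with tf.GradientTape() as tape:
                logits = model(features, training=True)
                loss = tf.reduce_mean(
                    tf.keras.losses.sparse_categorical_crossentropy(
                        labels, logits, from_logits=True))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            return loss
        return strategy.run(step_fn, args=(next(iterator),))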

By default, the model will evaluate after every 2000 steps. In order to train to convergence, change train_steps to 200000. You can increase the number of training steps or specify how often to run evaluations by setting these parameters:

  • --train_steps: Sets the total number of training steps to run.
  • --steps_between_evals: Number of training steps to run between evaluations.

Training and evaluation takes approximately 7 minutes on a v3-8 Cloud TPU. When the training and evaluation complete, a message similar to the following appears:

INFO:tensorflow:Writing to file /tmp/tmpf2gn8vpa
I1125 21:22:30.734232 140328028010240 translate.py:182] Writing to file /tmp/tmpf2gn8vpa
I1125 21:22:42.785628 140328028010240 transformer_main.py:121] Bleu score (uncased): 0.01125154594774358
I1125 21:22:42.786558 140328028010240 transformer_main.py:122] Bleu score (cased): 0.01123994225054048

Compute BLEU score during model evaluation

Use these flags to compute the BLEU score when the model evaluates (an independent way to check a score is sketched after this list):

  • --bleu_source: Path to file containing text to translate.
  • --bleu_ref: Path to file containing the reference translation.
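
If you want to double-check a reported BLEU score outside the training script, a library such as sacrebleu (an extra pip install, not part of this tutorial) can score a file of translations against the reference file. Tokenization differences mean the numbers may not match the script's exactly, and the file names below are hypothetical:

    import sacrebleu

    # Hypothetical files: model translations and the newstest2014 German references.
    with open("translations.de") as f:
        hyps = [line.strip() for line in f]
    with open("newstest2014.de") as f:
        refs = [line.strip() for line in f]

    # corpus_bleu takes a list of hypotheses and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    print(f"BLEU: {bleu.score:.2f}")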

From here, you can either conclude this tutorial and clean up your GCP resources, or you can further explore running the model on a Cloud TPU Pod.

Scaling your model with Cloud TPU Pods

You can get results faster by scaling your model with Cloud TPU Pods. The fully supported Transformer model can work with the following Pod slices:

  • v2-32
  • v3-32

  1. In your Cloud Shell, run ctpu delete with the --tpu-only flag and the --zone flag you used when you set up the Cloud TPU. This deletes only your Cloud TPU.

     (vm)$ ctpu delete --tpu-only --zone=europe-west4-a
     

  2. Run the ctpu up command, using the --tpu-size parameter to specify the Pod slice you want to use. For example, the following command uses a v2-32 Pod slice.

    (vm)$ ctpu up --tpu-only \
    --tpu-size=v2-32  \
    --zone=europe-west4-a \
    --tf-version=2.2 \
    --name=transformer-tutorial

    Command flag descriptions

    tpu-only
    Create a Cloud TPU only. By default the ctpu up command creates a VM and a Cloud TPU.
    tpu-size
    Specifies the type of Cloud TPU, for example v2-32.
    zone
    The zone where you plan to create your Cloud TPU. This should be the same zone you used for the Compute Engine VM. For example, europe-west4-a.
    tf-version
    The version of TensorFlow ctpu installs on the VM.
    name
    The name of the Cloud TPU to create.

    For more information on the CTPU utility, see the CTPU Reference.

    If you are not connected to your Compute Engine VM, connect by running the following command:

    gcloud compute ssh transformer-tutorial --zone=europe-west4-a
    
  3. Export the TPU name:

    (vm)$ export TPU_NAME=transformer-tutorial
    
  4. Export the model directory variable:

    (vm)$ export MODEL_DIR=${STORAGE_BUCKET}/transformer/model_${PARAM_SET}_pod
    
  5. Change directory to the training directory:

    (vm)$ cd /usr/share/models/official/nlp/transformer
    
  6. Run the Pod training script:

    (vm)$ python3 transformer_main.py \
         --tpu=${TPU_NAME} \
         --model_dir=${MODEL_DIR} \
         --data_dir=${GCS_DATA_DIR} \
         --vocab_file=${GCS_DATA_DIR}/vocab.ende.32768 \
         --bleu_source=${GCS_DATA_DIR}/newstest2014.en \
         --bleu_ref=${GCS_DATA_DIR}/newstest2014.de \
         --batch_size=24576 \
         --train_steps=2000 \
         --static_batch=true \
         --use_ctl=true \
         --param_set=big \
         --max_length=64 \
         --decode_batch_size=32 \
         --decode_max_length=97 \
         --padded_decode=true \
         --steps_between_evals=2000 \
         --distribution_strategy=tpu
    

    Command flag descriptions

    tpu
    The name of the Cloud TPU. This is set by specifying the environment variable (TPU_NAME).
    model_dir
    The directory where checkpoints and summaries are stored during model training. If the folder is missing, the program creates one. When using a Cloud TPU, the model_dir must be a Cloud Storage path (`gs://...`). You can reuse an existing folder to load current checkpoint data and to store additional checkpoints as long as the previous checkpoints were created using TPU of the same size and TensorFlow version.
    data_dir
    The Cloud Storage path of the training input. In this example, it is set to the GCS_DATA_DIR directory that contains the preprocessed WMT dataset.
    vocab_file
    A file that contains the vocabulary for translation.
    bleu_source
    A file that contains source sentences for translation.
    bleu_ref
    A file that contains the reference for the translation sentences.
    batch_size
    The training batch size.
    train_steps
    The number of steps to train the model. One step processes one batch of data. This includes both a forward pass and back propagation.
    static_batch
    Specifies whether the batches in the dataset have static shapes.
    use_ctl
    Specifies whether the script runs with a custom training loop.
    param_set
    The parameter set to use when creating and training the model. The parameters define the input shape, model configuration, and other settings.
    max_length
    The maximum length of an example in the dataset.
    decode_batch_size
    The global batch size used for Transformer auto-regressive decoding on a Cloud TPU.
    decode_max_length
    The maximum sequence length of the decode/eval data. This is used by the Transformer auto-regressive decoding on a Cloud TPU to minimize the amount of required data padding.
    padded_decode
    Specifies whether the auto-regressive decoding runs with input data padded to the decode_max_length. For TPU/XLA-GPU runs, this flag must be set due to the static shape requirement.
    steps_between_evals
    The number of training steps to run between evaluations.
    distribution_strategy
    To train the Transformer model on a TPU, set distribution_strategy to tpu.

This training script trains for 2000 steps and runs evaluation every 2000 steps. This particular training and evaluation takes approximately 8 minutes on a v2-32 Cloud TPU Pod.

In order to train to convergence, change train_steps to 200000. You can increase the number of training steps or specify how often to run evaluations by setting these parameters:

  • --train_steps: Sets the total number of training steps to run.
  • --steps_between_evals: Number of training steps to run between evaluations.
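
The Pod command above uses a larger global batch size (24576 versus 6144 on the v3-8) because a v2-32 slice has four times as many TPU cores, so the per-core batch size stays the same. A quick check of the arithmetic:

    # Per-core batch size check (a v3-8 has 8 TPU cores, a v2-32 has 32).
    per_core_v3_8 = 6144 / 8       # batch_size used on the single v3-8 device
    per_core_v2_32 = 24576 / 32    # batch_size used on the v2-32 Pod slice
    print(per_core_v3_8, per_core_v2_32)  # 768.0 768.0 -> per-core batch is unchanged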

When the training and evaluation complete, a message similar to the following appears:

I0509 00:27:59.984464 140553148962624 translate.py:184] Writing to file /tmp/tmp_rk3m8jp
I0509 00:28:11.189308 140553148962624 transformer_main.py:119] Bleu score (uncased): 1.3239131309092045
I0509 00:28:11.189623 140553148962624 transformer_main.py:120] Bleu score (cased): 1.2855342589318752

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  1. In your Cloud Shell, run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:

    $ ctpu delete --zone=europe-west4-a \
      --name=transformer-tutorial
    
  2. Run the following command to verify the Compute Engine VM and Cloud TPU have been shut down:

    $ ctpu status --zone=europe-west4-a
    

    The deletion might take several minutes. A response like the one below indicates there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "europe-west4-a"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
    
  3. Run gsutil as shown, replacing bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://bucket-name
    

What's next