Training Transformer on Cloud TPU (TF 2.x)

Transformer is a neural network architecture that solves sequence-to-sequence problems using attention mechanisms. Unlike traditional neural seq2seq models, Transformer does not use recurrent connections. The attention mechanism learns dependencies between tokens in two sequences. Because the attention weights apply to all tokens in the sequences, the Transformer model can easily capture long-distance dependencies.

Transformer's overall structure follows the standard encoder-decoder pattern. The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the tokens it has already generated as inputs.

The model also applies embeddings to the input and output tokens and adds a fixed positional encoding, which injects information about each token's position in the sequence.
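
The fixed encoding used by the original Transformer is sinusoidal; the short NumPy sketch below shows one way to compute it (illustrative only, not code taken from this tutorial's scripts):

    import numpy as np

    def positional_encoding(max_len, d_model):
        """Sinusoidal positional encoding: even dimensions use sine, odd use cosine."""
        positions = np.arange(max_len)[:, np.newaxis]      # shape (max_len, 1)
        dims = np.arange(d_model)[np.newaxis, :]           # shape (1, d_model)
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
        angles = positions * angle_rates                    # shape (max_len, d_model)
        encoding = np.zeros((max_len, d_model))
        encoding[:, 0::2] = np.sin(angles[:, 0::2])
        encoding[:, 1::2] = np.cos(angles[:, 1::2])
        return encoding

    # The encoding is added to the token embeddings before the first encoder layer.
    print(positional_encoding(max_len=64, d_model=512).shape)  # (64, 512)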

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud TPU

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

If you plan to train on a TPU Pod slice, review the documentation that explains the special considerations for training on a Pod slice.

Before starting this tutorial, follow the steps below to check that your Google Cloud project is correctly set up.

  1. Open a Cloud Shell window.

  2. Create a variable for your project's name.

    export PROJECT_NAME=project-name
    
  3. Configure the gcloud command-line tool to use the project where you want to create your Cloud TPU.

    gcloud config set project $PROJECT_NAME
    
  4. Create a Cloud Storage bucket using the following command:

    gsutil mb -p $PROJECT_NAME -c standard -l europe-west4 -b on gs://bucket-name
    
  5. Launch a Compute Engine VM using the ctpu up command. This example sets the zone to europe-west4-a, but you can set it to whatever zone you are planning to use for the Compute Engine VM and Cloud TPU.

    ctpu up --vm-only \
    --disk-size-gb=300 \
    --machine-type=n1-standard-8 \
    --zone=europe-west4-a \
    --tf-version=2.1
  6. The configuration you specified appears. Enter y to approve or n to cancel.

  7. When the ctpu up command has finished executing, verify that your shell prompt has changed from username@projectname to username@vm-name. This change shows that you are now logged into your Compute Engine VM. If you are not connected to the Compute Engine instance, you can connect by running the following command, replacing vm-name with the name of your VM:

    gcloud compute ssh vm-name --zone=europe-west4-a
    

As you continue these instructions, run each command that begins with (vm)$ in your VM session window.

Generate the training dataset

On your Compute Engine VM:

  1. Create the following environment variables:

    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/models"
    (vm)$ export PARAM_SET=big
    (vm)$ export DATA_DIR=$HOME/transformer/data
    (vm)$ export VOCAB_FILE=$DATA_DIR/vocab.ende.32768
    
    Parameter   Description
    PARAM_SET   The parameter set to use when creating and training the model. Options are base and big (default).
    DATA_DIR    The directory where the training data is stored.
    VOCAB_FILE  Path to the subtoken vocabulary file. If data_download.py is used, the file is placed in data_dir.
  2. Change directory to the training directory:

    (vm)$ cd /usr/share/models/official/transformer
  3. Generate the training dataset and copy it to your Cloud Storage bucket, replacing bucket-name with the name of the bucket you created earlier:

    (vm)$ python3 data_download.py --data_dir=$DATA_DIR
    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    (vm)$ export GCS_DATA_DIR=$STORAGE_BUCKET/data/transformer
    (vm)$ export MODEL_DIR=$STORAGE_BUCKET/transformer/model_$PARAM_SET
    (vm)$ gsutil cp -r $DATA_DIR $GCS_DATA_DIR

  4. Change to the directory that contains the training script:

    (vm)$ cd v2

data_download.py downloads and preprocesses the training and evaluation WMT datasets. After the data is downloaded and extracted, the training data is used to generate a vocabulary of subtokens. The evaluation and training strings are tokenized, and the resulting data is sharded, shuffled, and saved as TFRecords.

1.75 GB of compressed data is downloaded. In total, the raw files (compressed, extracted, and combined) take up 8.4 GB of disk space. The resulting TFRecord and vocabulary files are 722 MB. The script takes around 40 minutes to run, with the bulk of the time spent downloading and ~15 minutes spent on preprocessing.
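
If you want to sanity-check the generated files before copying them to Cloud Storage, a short Python snippet such as the one below lists the shards and counts the records in one of them (a rough sketch; the exact shard naming is an assumption, so list the data directory to confirm it):

    import glob
    import os
    import tensorflow as tf

    data_dir = os.path.expanduser("~/transformer/data")

    # Shard naming is an assumption; adjust the pattern to match the actual files.
    shards = sorted(glob.glob(os.path.join(data_dir, "*train*")))
    print(len(shards), "training shards found")

    # Count the records in the first shard to confirm it is readable.
    if shards:
        count = sum(1 for _ in tf.data.TFRecordDataset(shards[0]))
        print("first shard contains", count, "examples")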

Train an English-German translation model on a single Cloud TPU

To run the Transformer model on a TPU, you must set --distribution_strategy=tpu, --tpu=$TPU_NAME, and --use_ctl=true, where $TPU_NAME is the name of your Cloud TPU. By default, ctpu up gives the Cloud TPU the same name as your Compute Engine VM.
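
Internally, setting --distribution_strategy=tpu makes the training script resolve the TPU by name and wrap the model in a TPUStrategy. The TensorFlow 2.1-era pattern looks roughly like the sketch below (illustrative only, not the script's exact code; tpu-name is a placeholder):

    import tensorflow as tf

    # Resolve the Cloud TPU by name and initialize the TPU system.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="tpu-name")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)

    # Variables created under the strategy scope are replicated across TPU cores.
    strategy = tf.distribute.experimental.TPUStrategy(resolver)
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])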

Run the following commands on your Compute Engine VM:

  1. Run the following command to create your Cloud TPU.

    (vm)$ ctpu up --tpu-only \
    --tpu-size=v3-8 \
    --zone=europe-west4-a \
    --tf-version=2.1

    Parameter   Description
    tpu-size    The type of Cloud TPU to create, for example v3-8.
    zone        The zone where you plan to create your Cloud TPU. This should be the same zone you used for the Compute Engine VM, for example europe-west4-a.

  2. The configuration you specified appears. Enter y to approve or n to cancel.

    You will see a message: Operation success; not ssh-ing to Compute Engine VM due to --tpu-only flag. Since you previously completed SSH key propagation, you can ignore this message.

  3. Set the Cloud TPU name variable. This will either be a name you specified with the --name parameter to ctpu up or the default, your username:

    (vm)$ export TPU_NAME=tpu-name
    
  4. Run the training script:

    (vm)$ python3 transformer_main.py \
         --tpu=$TPU_NAME \
         --model_dir=$MODEL_DIR \
         --data_dir=$GCS_DATA_DIR \
         --vocab_file=$GCS_DATA_DIR/vocab.ende.32768 \
         --bleu_source=$GCS_DATA_DIR/newstest2014.en \
         --bleu_ref=$GCS_DATA_DIR/newstest2014.de \
         --batch_size=6144 \
         --train_steps=2000 \
         --static_batch=true \
         --use_ctl=true \
         --param_set=big \
         --max_length=64 \
         --decode_batch_size=32 \
         --decode_max_length=97 \
         --padded_decode=true \
         --distribution_strategy=tpu
    

By default, the model evaluates after every 2000 steps. To train to convergence, change train_steps to 200000.

This training takes approximately 7 minutes on a v3-8 Cloud TPU. When the training and evaluation complete, a message similar to the following appears:

INFO:tensorflow:Writing to file /tmp/tmpf2gn8vpa
I1125 21:22:30.734232 140328028010240 translate.py:182] Writing to file /tmp/tmpf2gn8vpa
I1125 21:22:42.785628 140328028010240 transformer_main.py:121] Bleu score (uncased): 0.01125154594774358
I1125 21:22:42.786558 140328028010240 transformer_main.py:122] Bleu score (cased): 0.01123994225054048

Customizing training schedule

  • Training with steps:
    • --train_steps: Sets the total number of training steps to run.
    • --steps_between_evals: Number of training steps to run between evaluations.

Compute BLEU score during model evaluation

Use these flags to compute the BLEU score when the model evaluates:

  • --bleu_source: Path to file containing text to translate.
  • --bleu_ref: Path to file containing the reference translation.

When running English-to-German translation, use the flags --bleu_source=$DATA_DIR/newstest2014.en and --bleu_ref=$DATA_DIR/newstest2014.de.
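
The uncased score shown in the log output is BLEU computed after lowercasing both the translations and the references. A rough offline equivalent is sketched below (not part of this tutorial; the sacrebleu package and the translations.de file name are assumptions for illustration):

    import sacrebleu

    # Hypothetical file names: model output and the WMT reference translations.
    with open("translations.de") as f:
        hypotheses = [line.strip() for line in f]
    with open("newstest2014.de") as f:
        references = [line.strip() for line in f]

    # Uncased BLEU: lowercase both sides before scoring.
    bleu = sacrebleu.corpus_bleu(
        [h.lower() for h in hypotheses],
        [[r.lower() for r in references]],
    )
    print(bleu.score)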

From here, you can either conclude this tutorial and clean up your GCP resources, or you can further explore running the model on a Cloud TPU Pod.

Scaling your model with Cloud TPU Pods

You can get results faster by scaling your model with Cloud TPU Pods. The fully supported Transformer model can work with the following Pod slices:

  • v2-32
  • v3-32
  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be user@projectname, showing you are in the Cloud Shell.

  2. In your Cloud Shell, run ctpu delete with the --tpu-only flag and the --zone flag you used when you set up the Cloud TPU. This deletes only your Cloud TPU.

     $ ctpu delete --tpu-only --zone=europe-west4-a
     

  3. Reconnect to the Compute Engine instance by running the following command, replacing vm-name with the name of your VM:

    gcloud compute ssh vm-name --zone=europe-west4-a
    
  4. Run the ctpu up command, using the --tpu-size parameter to specify the Pod slice you want to use. For example, the following command uses a v2-32 Pod slice.

    (vm)$ ctpu up --tpu-only \
    --tpu-size=v2-32 \
    --zone=europe-west4-a \
    --tf-version=2.1

    Then set the Cloud TPU name variable:

    (vm)$ export TPU_NAME=tpu-name
  5. Export storage bucket variables:

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    (vm)$ export GCS_DATA_DIR=$STORAGE_BUCKET/data/transformer
    (vm)$ export MODEL_DIR=$STORAGE_BUCKET/transformer/model_$PARAM_SET
    
  6. Change directory to the training directory:

    (vm)$ cd /usr/share/models/official/transformer/v2
    
  7. Run the Pod training script:

    (vm)$ python3 transformer_main.py \
         --tpu=$TPU_NAME \
         --model_dir=$MODEL_DIR \
         --data_dir=$GCS_DATA_DIR \
         --vocab_file=$GCS_DATA_DIR/vocab.ende.32768 \
         --bleu_source=$GCS_DATA_DIR/newstest2014.en \
         --bleu_ref=$GCS_DATA_DIR/newstest2014.de \
         --batch_size=24576 \
         --train_steps=200000 \
         --static_batch=true \
         --use_ctl=true \
         --param_set=big \
         --max_length=64 \
         --decode_batch_size=32 \
         --decode_max_length=97 \
         --padded_decode=true \
         --distribution_strategy=tpu
    

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

Clean up the Compute Engine VM instance and Cloud TPU resources.

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be user@projectname, showing you are in the Cloud Shell.

  2. In your Cloud Shell, run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:

    $ ctpu delete --zone=europe-west4-a
    
  3. Run the following command to verify the Compute Engine VM and Cloud TPU have been shut down:

    $ ctpu status --zone=europe-west4-a
    

    The deletion might take several minutes. A response like the one below indicates there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "europe-west4-a"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
    
  4. Run gsutil as shown, replacing bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://bucket-name
    

What's next

  • Learn more about ctpu, including how to install it on a local machine.
  • Explore more Tensor2Tensor models for TPU.
  • Experiment with more TPU samples.
  • Explore the TPU tools in TensorBoard.