BERT Fine-Tuning with Cloud TPU: Sentence and Sentence-Pair Classification Tasks (TF 2.x)

This tutorial shows you how to train the Bidirectional Encoder Representations from Transformers (BERT) model on Cloud TPU.

BERT is a method of pre-training language representations. Pre-training refers to how BERT is first trained on a large source of text, such as Wikipedia. You can then apply the training results to other Natural Language Processing (NLP) tasks, such as question answering and sentiment analysis. With BERT and Cloud TPU, you can train a variety of NLP models in about 30 minutes.

Objectives

  • Create a Cloud Storage bucket to hold your dataset and model output.
  • Run the training job.
  • Verify the output results.

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

This section describes how to set up a Cloud Storage bucket and a Compute Engine VM.

  1. Open a Cloud Shell window.

  2. Create a variable for your project's name.

    export PROJECT_NAME=project-name
    
  3. Configure the gcloud command-line tool to use the project where you want to create the Cloud TPU.

    gcloud config set project ${PROJECT_NAME}
    
  4. Create a Cloud Storage bucket using the following command:

    gsutil mb -p ${PROJECT_NAME} -c standard -l europe-west4 -b on gs://bucket-name
    

    This Cloud Storage bucket stores the data you use to train your model and the training results.

    In order for the Cloud TPU to read and write to the storage bucket, the service account for your project needs read/write or Admin permissions on it. See the section on storage buckets for how to view and set those permissions.

  5. Launch a Compute Engine VM and Cloud TPU using the ctpu up command.

    $ ctpu up --tpu-size=v3-8 \
      --name=bert-tutorial \
      --machine-type=n1-standard-8 \
      --zone=europe-west4-a \
      --tf-version=2.2
    
  6. The configuration you specified appears. Enter y to approve or n to cancel.

  7. When the ctpu up command has finished executing, verify that your shell prompt has changed from username@projectname to username@vm-name. This change shows that you are now logged into your Compute Engine VM. If you are not connected to the VM, connect by running the following command:

    gcloud compute ssh bert-tutorial --zone=europe-west4-a
    

    As you continue these instructions, run each command that begins with (vm)$ in your VM session window.

    (vm)$ export TPU_NAME=bert-tutorial
    

Prepare the dataset

  1. From your Compute Engine virtual machine (VM), install the packages listed in requirements.txt.

    (vm)$ sudo pip3 install -r /usr/share/models/official/requirements.txt
    
  2. Optional: Download download_glue_data.py.

    This tutorial uses the General Language Understanding Evaluation (GLUE) benchmark to evaluate and analyze the performance of the model. The GLUE data is provided for this tutorial at gs://cloud-tpu-checkpoints/bert/classification.

    If you want to work with raw GLUE data and create TFRecords, follow the dataset processing instructions on GitHub.

Define parameter values

Next, define several parameter values that are required when you train and evaluate your model:

(vm)$ export STORAGE_BUCKET=gs://bucket-name
(vm)$ export PYTHONPATH="${PYTHONPATH}:/usr/share/models"
(vm)$ export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
(vm)$ export MODEL_DIR=${STORAGE_BUCKET}/bert-output
(vm)$ export GLUE_DIR=gs://cloud-tpu-checkpoints/bert/classification
(vm)$ export TASK=mnli

| Parameter | Description |
| --- | --- |
| STORAGE_BUCKET | This is the name of the Cloud Storage bucket that you created in the **Before you begin** section. |
| BERT_BASE_DIR | This is the directory that contains files used for training. |
| MODEL_DIR | This is the directory that contains the model files. This tutorial uses a folder within the Cloud Storage bucket. You do not have to create this folder beforehand. The script creates the folder if it does not already exist. |
| GLUE_DIR | This is the directory that contains the GLUE training data. |
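
Because the training command depends on every one of these variables, it can help to confirm they are set before launching a job. The following is a minimal sketch, not part of the tutorial's tooling: `check_bert_env` is a hypothetical helper name, and the rule that each directory variable must start with `gs://` is an assumption based on the values shown above.

```shell
# Hypothetical helper: report variables that are unset or empty, and
# directory variables that are not Cloud Storage (gs://) paths.
# The body runs in a subshell, so it does not change the caller's variables.
check_bert_env() (
  status=0
  for name in STORAGE_BUCKET BERT_BASE_DIR MODEL_DIR GLUE_DIR TASK; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}"
      status=1
    elif [ "${name}" != "TASK" ]; then
      case "${value}" in
        gs://*) ;;  # looks like a Cloud Storage path
        *) echo "not a gs:// path: ${name}=${value}"; status=1 ;;
      esac
    fi
  done
  exit "${status}"
)
```

For example, running `check_bert_env && echo "environment looks complete"` in the VM session prints the confirmation only when all five variables pass the checks.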

Train the model

From your Compute Engine VM, run the following command.

(vm)$ python3 /usr/share/models/official/nlp/bert/run_classifier.py \
  --mode='train_and_eval' \
  --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
  --train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
  --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
  --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
  --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=1 \
  --model_dir=${MODEL_DIR} \
  --distribution_strategy=tpu \
  --tpu=${TPU_NAME}

| Parameter | Description |
| --- | --- |
| mode | When set to `train_and_eval`, this script trains and evaluates the model. When set to `export_only`, this script exports a saved model. |
| input_meta_data_path | The path to a file that contains metadata about the dataset to be used for training and evaluation. |
| train_data_path, eval_data_path | The Cloud Storage paths of the training and evaluation input. In this example, they point to the TFRecord files for the selected task in GLUE_DIR. |
| tpu | The name of the Cloud TPU, taken from the TPU_NAME variable. |
| model_dir | Specifies the directory where checkpoints and summaries are stored during model training. If the folder is missing, the program creates one. When using a Cloud TPU, the model_dir must be a Cloud Storage path (`gs://...`). You can reuse an existing folder to load current checkpoint data and to store additional checkpoints, as long as the previous checkpoints were created using a TPU of the same size and the same TensorFlow version. |
| distribution_strategy | To run the BERT model on a TPU, you must set `distribution_strategy` to `tpu`. |
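
To see which dataset files the command above reads, you can expand the path flags by hand. The sketch below only prints the three paths built from a GLUE directory and a task name; `glue_input_paths` is a hypothetical helper, and this tutorial only confirms the `mnli` files, so treat any other task name as an assumption to check against the bucket.

```shell
# Hypothetical helper: print the dataset paths that the training command
# builds from GLUE_DIR and TASK (metadata, training set, evaluation set).
glue_input_paths() {
  # $1 = GLUE directory, $2 = task name
  printf '%s\n' \
    "${1}/${2}_meta_data" \
    "${1}/${2}_train.tf_record" \
    "${1}/${2}_eval.tf_record"
}

# Paths used in this tutorial.
glue_input_paths "gs://cloud-tpu-checkpoints/bert/classification" "mnli"
```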

Verify your results

The training takes approximately 2 minutes on a v3-8 TPU. When the script completes, you should see results similar to the following:

Training Summary:
{'total_training_steps': 12271, 'train_loss': 0.0, 'last_train_metrics': 0.0, 'eval_metrics': 0.8608226180076599}

To increase accuracy, set --num_train_epochs=3. Training then takes approximately one hour.
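
The `total_training_steps` value in the summary can be reproduced with integer arithmetic. This sketch rests on two assumptions that the tutorial does not state: that the MNLI training set contains 392,702 examples (the size published for the GLUE benchmark), and that the script drops the final partial batch of each epoch. `total_training_steps` is a hypothetical helper name.

```shell
# Assumption: MNLI training set size from the GLUE benchmark,
# not stated in this tutorial.
TRAIN_EXAMPLES=392702
# Matches --train_batch_size in the training command.
TRAIN_BATCH_SIZE=32

# Hypothetical helper: whole batches per epoch times the number of epochs.
total_training_steps() {
  # $1 = number of training epochs
  echo $(( (TRAIN_EXAMPLES / TRAIN_BATCH_SIZE) * $1 ))
}

total_training_steps 1   # prints 12271, matching the summary above
total_training_steps 3   # prints 36813
```

Under these assumptions, three epochs run three times as many steps as the single-epoch job.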

Clean up

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  2. In your Cloud Shell, run ctpu delete with the --name and --zone flags you used when you set up the Compute Engine VM and Cloud TPU. This deletes both your VM and your Cloud TPU.

    $ ctpu delete --name=bert-tutorial --zone=europe-west4-a
    
  3. Run ctpu status to confirm that no instances remain allocated, which avoids unnecessary charges for TPU usage. The deletion might take several minutes. A response like the one below indicates that there are no more allocated instances:

    $ ctpu status --name=bert-tutorial --zone=europe-west4-a
    
    2018/04/28 16:16:23 WARNING: Setting zone to "europe-west4-a"
    No instances currently exist.
        Compute Engine VM:     --
        Cloud TPU:             --
    
  4. Run gsutil as shown, replacing bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://bucket-name