Getting started with the built-in BERT algorithm

This tutorial shows you how to train the Bidirectional Encoder Representations from Transformers (BERT) model on AI Platform Training.

BERT is a method of pre-training language representations. Pre-training refers to how BERT is first trained on a large source of text, such as Wikipedia. You can then apply the training results to other Natural Language Processing (NLP) tasks, such as question answering and sentiment analysis. With BERT and AI Platform Training, you can train a variety of NLP models in about 30 minutes.

Objectives

  • Create a Cloud Storage bucket to hold your model output.
  • Run the training job.
  • Verify the output results.

Before starting this tutorial, check that your Google Cloud project is correctly set up.

Complete the following steps to set up a Google Cloud account, enable the required APIs, and install and activate the Cloud SDK:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the AI Platform Training & Prediction and Compute Engine APIs.

  5. Install and initialize the Cloud SDK.

Prepare the data

This tutorial does not require you to preprocess or download any data. All of the data and model checkpoints you need are available in public storage buckets. If you are interested in how the dataset was prepared, see the Cloud TPU tutorial, which covers creating this dataset from the command line.

Submit a training job

To submit a job, you must specify some basic training arguments and some basic arguments related to the BERT algorithm.

General arguments for the training job:

Training job arguments

| Argument | Description |
|---|---|
| job-id | Unique ID for your training job. You can use this to find logs for the status of your training job after you submit it. |
| job-dir | Cloud Storage path where AI Platform Training saves training files after completing a successful training job. |
| scale-tier | Specifies machine types for training. Use BASIC to select a configuration of just one machine. |
| master-image-uri | Container Registry URI used to specify which Docker container to use for the training job. Use the container for the built-in BERT algorithm defined earlier as IMAGE_URI. |
| region | Specify the available region in which to run your training job. For this tutorial, you can use the region us-central1. |
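
As a rough illustration of the general arguments above, the following Python sketch builds a timestamped job ID and a matching job directory path. The bucket name `my-bert-bucket` and the `bert` prefix are placeholders, not values from this tutorial. The naming rule it checks (letters, digits, and underscores, starting with a letter) is an assumption about job ID constraints that you should confirm against the AI Platform Training documentation.

```python
import re
from datetime import datetime, timezone

# Assumed constraint: job IDs start with a letter and contain only
# letters, digits, and underscores. Verify this against the official docs.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def make_job_id(prefix="bert", now=None):
    """Return a job ID made unique by a UTC timestamp suffix."""
    now = now or datetime.now(timezone.utc)
    job_id = f"{prefix}_{now:%Y%m%d_%H%M%S}"
    if not JOB_ID_PATTERN.match(job_id):
        raise ValueError(f"invalid job ID: {job_id!r}")
    return job_id


def make_job_dir(bucket, job_id):
    """Return a Cloud Storage path to use as job-dir for this job."""
    return f"gs://{bucket}/{job_id}"


if __name__ == "__main__":
    job_id = make_job_id()
    print(job_id)
    # "my-bert-bucket" is a hypothetical bucket name.
    print(make_job_dir("my-bert-bucket", job_id))
```

Deriving the job directory from the job ID keeps the output of each run in its own folder, which makes it easier to match logs to artifacts later.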

Arguments specific to the built-in BERT algorithm training with the provided dataset:

Algorithm arguments

| Argument | Value | Description |
|---|---|---|
| mode | train_and_eval | Whether to run fine-tuning training or export the model. |
| train_dataset_path | gs://cloud-tpu-checkpoints/bert/classification/mnli_train.tf_record | Cloud Storage path where the training data is stored. |
| eval_dataset_path | gs://cloud-tpu-checkpoints/bert/classification/mnli_eval.tf_record | Cloud Storage path where the evaluation data is stored. |
| input_meta_data_path | gs://cloud-tpu-checkpoints/bert/classification/mnli_meta_data | Cloud Storage path where the input schema is stored. |
| bert_config_file | gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16/bert_config.json | Cloud Storage path where the BERT config file is stored. |
| init_checkpoint | gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16/bert_model.ckpt | Starting checkpoint for fine-tuning (usually a pre-trained BERT model). |
| train_batch_size | 32 | Batch size for training. |
| eval_batch_size | 32 | Batch size for evaluation. |
| learning_rate | 2e-5 | Learning rate used by the Adam optimizer. |
| num_train_epochs | 1 | Number of training epochs to run (only available in train_and_eval mode). |
| steps_per_loop | 1000 | Number of steps per graph-mode loop. |

For a detailed list of all other BERT algorithm flags, refer to the built-in BERT reference.
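
To keep the algorithm argument values in one place, you could hold them in a dictionary and render them as command-line style flags. The sketch below does only that. The `--name=value` flag format and the `to_flags` helper are assumptions for illustration, so check the built-in BERT reference for the exact syntax your submission method expects.

```python
# Values copied from the algorithm arguments listed in this tutorial.
BERT_ROOT = "gs://cloud-tpu-checkpoints/bert"
MODEL_DIR = f"{BERT_ROOT}/keras_bert/uncased_L-24_H-1024_A-16"

ALGORITHM_ARGS = {
    "mode": "train_and_eval",
    "train_dataset_path": f"{BERT_ROOT}/classification/mnli_train.tf_record",
    "eval_dataset_path": f"{BERT_ROOT}/classification/mnli_eval.tf_record",
    "input_meta_data_path": f"{BERT_ROOT}/classification/mnli_meta_data",
    "bert_config_file": f"{MODEL_DIR}/bert_config.json",
    "init_checkpoint": f"{MODEL_DIR}/bert_model.ckpt",
    "train_batch_size": "32",
    "eval_batch_size": "32",
    # Kept as a string so the value is rendered exactly as written.
    "learning_rate": "2e-5",
    "num_train_epochs": "1",
    "steps_per_loop": "1000",
}


def to_flags(args):
    """Render a mapping as a list of --name=value strings (assumed format)."""
    return [f"--{name}={value}" for name, value in args.items()]


if __name__ == "__main__":
    for flag in to_flags(ALGORITHM_ARGS):
        print(flag)
```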

Run the training job

  1. Navigate to the AI Platform > Jobs page.

  2. At the top of the page, click the "New training job" button and select "Built-in algorithm training".

  3. Select BERT as your training algorithm.

  4. Use the browse button to mark the training and evaluation datasets in your Cloud Storage bucket and choose the output directory.

  5. On the next page, use the argument values above to configure the training job.

  6. Give your training job a name and use the BASIC_TPU machine type.

  7. Click "Submit" to start your job.

Understand your job directory

After the successful completion of a training job, AI Platform Training creates a trained model in your Cloud Storage bucket, along with some other artifacts. You can find the following directory structure within your JOB_DIR:

  • model/ (a TensorFlow SavedModel directory)
    • saved_model.pb
    • assets/
    • variables/
  • summaries/ (logging from training and evaluation)
    • eval/
    • train/
  • various checkpoint files (created and used during training)
    • checkpoint
    • ctl_checkpoint-1.data-00000-of-00002
    • ...
    • ctl_checkpoint-1.index

Confirm that the directory structure in your JOB_DIR matches the structure described in the preceding list:

gsutil ls -a $JOB_DIR/*
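
If you prefer to check the listing programmatically, a small helper can compare object paths against the expected layout. This is a minimal sketch: `missing_artifacts` is a hypothetical helper, and it assumes you have already captured the object paths under your job directory (for example, from a recursive listing) as a list of strings. The `assets/` directory is left out of the check on the assumption that it may be empty and therefore not appear in a listing.

```python
# Entries expected under JOB_DIR, based on the directory structure above.
# A trailing slash means "at least one object under this prefix".
EXPECTED = (
    "model/saved_model.pb",
    "model/variables/",
    "summaries/eval/",
    "summaries/train/",
    "checkpoint",
)


def missing_artifacts(job_dir, paths):
    """Return the expected entries that no listed path satisfies."""
    root = job_dir.rstrip("/") + "/"
    relative = [p[len(root):] for p in paths if p.startswith(root)]
    missing = []
    for entry in EXPECTED:
        if entry.endswith("/"):
            found = any(r.startswith(entry) for r in relative)
        else:
            found = entry in relative
        if not found:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    # Hypothetical listing; replace with the paths from your own bucket.
    listing = [
        "gs://my-bucket/job/model/saved_model.pb",
        "gs://my-bucket/job/model/variables/variables.index",
        "gs://my-bucket/job/summaries/train/events.out",
        "gs://my-bucket/job/checkpoint",
    ]
    print(missing_artifacts("gs://my-bucket/job", listing))
```

An empty result means every expected entry was found; any names returned point to the part of the output worth inspecting first.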

Deploy the trained model

AI Platform Prediction organizes your trained models using model and version resources. An AI Platform Prediction model is a container for the versions of your machine learning model.

To deploy a model, you create a model resource in AI Platform Prediction, create a version of that model, then use the model and version to request online predictions.

Learn more about how to deploy models to AI Platform Prediction.

Console

  1. On the Jobs page, you can find a list of all your training jobs. Click the name of the training job you just submitted.

  2. On the Job details page, you can view the general progress of your job, or click View logs for a more detailed view of its progress.

  3. When the job is successful, the Deploy model button appears at the top. Click Deploy model.

  4. Select "Deploy as new model", and enter a model name. Next, click Confirm.

  5. On the Create version page, enter a version name, such as v1, and leave all other fields at their default settings. Click Save.

  6. On the Model details page, your version name displays. The version takes a few minutes to create. When the version is ready, a checkmark icon appears by the version name.

  7. Click the version name (v1) to navigate to the Version details page. In the next step of this tutorial, you send a prediction request.

Get online predictions

When you request predictions, you must format input data as JSON in a manner that the model expects. Current BERT models do not automatically preprocess inputs.

Console

  1. On the Version details page for v1, the version you just created, you can send a sample prediction request.

    Select the Test & Use tab.

  2. Copy the following sample to the input field:

      {
        "instances": [
          {
            "input_mask": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            "input_type_ids":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            "input_word_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
          }
        ]
      }

  3. Click Test.

    Wait a moment, and a prediction vector should be returned.
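
Typing 128-element arrays by hand is error-prone, so it can help to generate the request body instead. The sketch below pads a list of token IDs to a fixed length and builds the three fields used in the sample request. Several things here are assumptions: the sequence length of 128 is inferred from the sample, `build_instance` and `build_request` are hypothetical helpers, and the mask and type-ID conventions (1 for real tokens, 0 for padding; all zeros for a single segment) follow common BERT practice rather than anything this tutorial specifies.

```python
import json

MAX_SEQ_LENGTH = 128  # Inferred from the 128-element arrays in the sample.


def build_instance(token_ids, max_len=MAX_SEQ_LENGTH):
    """Pad token IDs to max_len and build the three model input fields."""
    if len(token_ids) > max_len:
        raise ValueError(f"got {len(token_ids)} tokens, limit is {max_len}")
    padding = [0] * (max_len - len(token_ids))
    return {
        # Assumed convention: 1 marks a real token, 0 marks padding.
        "input_mask": [1] * len(token_ids) + padding,
        # Assumed convention: a single-segment input uses type ID 0 throughout.
        "input_type_ids": [0] * max_len,
        "input_word_ids": list(token_ids) + padding,
    }


def build_request(token_id_lists):
    """Return the JSON request body for one or more token ID sequences."""
    instances = [build_instance(ids) for ids in token_id_lists]
    return json.dumps({"instances": instances})


if __name__ == "__main__":
    # An empty token list reproduces the all-zero sample request.
    print(build_request([[]]))
```

Because the model does not preprocess inputs, real requests would need token IDs produced by a tokenizer that matches the vocabulary of the checkpoint you fine-tuned.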

What's next

In this tutorial, you trained the BERT model using a sample dataset. In most cases, the results of this training are not usable for inference. To use a model for inference, you can train the model on a publicly available dataset or on your own dataset. Models trained on Cloud TPU require datasets to be in TFRecord format.

You can use the dataset conversion tool sample to convert an image classification dataset into TFRecord format. If you are not using an image classification model, you will have to convert your dataset to TFRecord format yourself. For more information, see TFRecord and tf.Example.