Training using the built-in BERT algorithm

Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in BERT algorithm works, and how to use it.

Overview

This built-in algorithm can do both training and model exporting:

  1. Training: Using the dataset and model parameters you supply, AI Platform Training runs training with TensorFlow's BERT implementation.
  2. Exporting: Using the checkpoint you supply, AI Platform Training produces a serialized model in the job directory you specify. You can then deploy this model to AI Platform for serving predictions.

Limitations

The following features are not supported for training with the built-in BERT algorithm:

  • Automated data preprocessing: This version of BERT requires input data to be in the form of TFRecords for both training and evaluation. To handle unformatted input automatically, you would need to build your own training application.

Supported machine types

The following AI Platform Training scale tiers and machine types are supported:

  • BASIC scale tier
  • BASIC_TPU scale tier
  • CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training.
  • CUSTOM scale tier with any of the following legacy machine types:
    • standard
    • large_model
    • complex_model_s
    • complex_model_m
    • complex_model_l
    • standard_gpu
    • standard_p100
    • standard_v100
    • large_model_v100
    • complex_model_m_gpu
    • complex_model_l_gpu
    • complex_model_m_p100
    • complex_model_m_v100
    • complex_model_l_v100
    • TPU_V2 (8 cores)

We recommend using a machine type with access to TPUs.
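
If you choose the CUSTOM scale tier with a Compute Engine machine type, you can pass the machine type directly on the command line. The following is a minimal sketch, assuming an illustrative n1-standard-16 master; the algorithm-specific arguments that follow the -- separator are omitted here and shown in the full submission example later on this page.

    # A sketch of the scale-tier flags only; append the algorithm arguments
    # after the -- separator, as in the full submission example below.
    # The machine type shown here is illustrative.
    gcloud ai-platform jobs submit training $JOB_ID \
       --master-image-uri=$IMAGE_URI \
       --scale-tier=CUSTOM \
       --master-machine-type=n1-standard-16 \
       --job-dir=$JOB_DIR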

Format input data

Ensure that your training and evaluation data are in the form of TFRecords before training the model.
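
For example, you might stage pre-generated TFRecord files in your bucket and confirm that they are where the job expects them. A minimal sketch, assuming hypothetical local file names and a bert_data prefix:

    # Stage pre-generated TFRecord files in Cloud Storage and confirm the
    # upload; the local file names and bucket layout are illustrative.
    gsutil cp ./data/mnli_train.tf_record gs://${BUCKET_NAME}/bert_data/
    gsutil cp ./data/mnli_eval.tf_record gs://${BUCKET_NAME}/bert_data/
    gsutil ls gs://${BUCKET_NAME}/bert_data/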

Check Cloud Storage bucket permissions

To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.
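
For example, to grant access to a bucket in a different project, you can add the AI Platform service agent to the bucket's IAM policy. A minimal sketch, assuming the service account follows the service-PROJECT_NUMBER@cloud-ml.google.iam.gserviceaccount.com pattern; substitute your own project number and bucket name:

    # Grant the AI Platform Training service agent read/write access to
    # the bucket. PROJECT_NUMBER and BUCKET_NAME are placeholders, and the
    # service account address pattern is an assumption to verify.
    gsutil iam ch \
      serviceAccount:service-PROJECT_NUMBER@cloud-ml.google.iam.gserviceaccount.com:objectAdmin \
      gs://BUCKET_NAME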

Submit a BERT training job

This section explains how to submit a training job using the built-in BERT algorithm.

You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in BERT algorithm.

Console

  1. Go to the AI Platform Training Jobs page in the Google Cloud console:

    AI Platform Training Jobs page

  2. Click the New training job button. From the options that display below, click Built-in algorithm training.

  3. On the Create a new training job page, select BERT and click Next.

  4. To learn more about all the available parameters, follow the links in the Google Cloud console and refer to the built-in BERT reference for more details.

gcloud

  1. Set environment variables for your job, replacing the placeholder values (such as BUCKET_NAME and MODEL_NAME) with your own:

       # Specify the name of the Cloud Storage bucket where you want your
       # training outputs to be stored, and the Docker container for
       # your built-in algorithm selection.
       BUCKET_NAME='BUCKET_NAME'
       IMAGE_URI='gcr.io/cloud-ml-algos/bert:latest'
    
       DATE="$(date '+%Y%m%d_%H%M%S')"
       MODEL_NAME='MODEL_NAME'
       JOB_ID="${MODEL_NAME}_${DATE}"
    
       JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${MODEL_NAME}/${DATE}"
       BERT_BASE_DIR='gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16'
       MODEL_DIR="gs://${BUCKET_NAME}/bert-output"
       GLUE_DIR='gs://cloud-tpu-checkpoints/bert/classification'
       TASK='mnli'
    
  2. Submit the training job using gcloud ai-platform jobs submit training. Adjust this generic example to work with your dataset:

       gcloud ai-platform jobs submit training $JOB_ID \
          --master-image-uri=$IMAGE_URI --scale-tier=BASIC_TPU --job-dir=$JOB_DIR \
          -- \
          --mode='train_and_eval' \
          --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
          --train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
          --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
          --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
          --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
          --train_batch_size=32 \
          --eval_batch_size=32 \
          --learning_rate=2e-5 \
          --num_train_epochs=1 \
          --steps_per_loop=1000
    
  3. Monitor the status of your training job with gcloud: use gcloud ai-platform jobs describe to get details about the job's current state, and gcloud ai-platform jobs stream-logs to stream its logs.

       gcloud ai-platform jobs describe ${JOB_ID}
       gcloud ai-platform jobs stream-logs ${JOB_ID}
    
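  4. Optionally, once the job succeeds, inspect the exported model in your job directory and deploy it to AI Platform for serving. The following is a minimal sketch: the exact output subdirectory under ${JOB_DIR}, the runtime version, and the region are assumptions to verify against your own job's output.

       # List the job outputs, then create a model resource and a version
       # that points at the exported SavedModel. The ${JOB_DIR}/model path,
       # runtime version, and region below are illustrative assumptions.
       gsutil ls -r ${JOB_DIR}

       gcloud ai-platform models create ${MODEL_NAME} --regions=us-central1
       gcloud ai-platform versions create v1 \
          --model=${MODEL_NAME} \
          --origin=${JOB_DIR}/model \
          --runtime-version=2.1 \
          --framework=tensorflow \
          --python-version=3.7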

Further learning resources