Tune text models by using RLHF tuning

Reinforcement learning from human feedback (RLHF) uses feedback gathered from humans to tune a model. RLHF is recommended when the output of your model is complex and difficult to describe. The human feedback takes the form of choices between different output options, and those choices provide a better tuning signal than the labeled prompts used by supervised tuning when the desired output is hard to describe. If your model's output isn't difficult to define, consider tuning your text model by using supervised tuning.

This page provides detailed information about tuning a text model by using RLHF tuning. You learn which text models support RLHF tuning, how to create a dataset, and how to tune a text model by using RLHF tuning. You also learn how to view and load models tuned with RLHF. For more details about RLHF tuning in Vertex AI, see RLHF model tuning.

Workflow for RLHF model tuning

The RLHF model tuning workflow on Vertex AI includes the following steps:

  1. Prepare your human preference dataset.
  2. Prepare your prompt dataset.
  3. Upload your datasets to a Cloud Storage bucket. The datasets don't need to be in the same bucket.
  4. Create an RLHF model tuning job.

After model tuning completes, the tuned model is deployed to a Vertex AI endpoint. The name of the endpoint is the same as the name of the tuned model. Tuned models are available to select in Vertex AI Studio when you want to create a new prompt.
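
If you want to query the tuned model programmatically, the following sketch shows one way to list and load tuned versions of text-bison@002 with the Vertex AI SDK for Python. The project ID, region, and tuned model resource name are placeholders, not values from this page.

# Minimal sketch: query a tuned text model after the tuning job deploys it.
# The project ID, region, and tuned model resource name are placeholders.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-project", location="us-central1")

# List tuned versions of the base model, then load one by its resource name.
base_model = TextGenerationModel.from_pretrained("text-bison@002")
print(base_model.list_tuned_model_names())

tuned_model = TextGenerationModel.get_tuned_model(
    "projects/my-project/locations/us-central1/models/1234567890123456789"
)
response = tuned_model.predict("Create a description for Plantation Palms.")
print(response.text)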

Supported models

The following text models support RLHF tuning on Vertex AI:

  • The text generation foundation model, text-bison@002. For more information, see Text generation model.
  • The chat generation foundation model, chat-bison@001. For more information, see Chat generation model.
  • The t5-small, t5-large, t5-xl, and t5-xxl Flan text-to-text transfer transformer (Flan-T5) models. Flan-T5 models can be fine-tuned to perform tasks such as text classification, language translation, and question answering. For more information, see Flan-T5 checkpoints.

The following text models support RLHF tuning as a self-administered Vertex AI Pipelines job.

Code models don't support RLHF tuning.

Prepare RLHF tuning datasets

RLHF tuning requires that you prepare two datasets and one optional dataset. All datasets are in JSON Lines (JSONL) format and need to be uploaded to a Cloud Storage bucket. The dataset format used to tune a text generation model is different from the dataset format for tuning a text chat model.

Prompt dataset

A dataset that contains unlabeled prompts. Prompts can be the same prompts from the preference dataset, or they can be different. Each line in the prompt dataset contains the following fields:

text-bison dataset

The text generation dataset includes one field:

  • input_text - a required field that contains the prompt.

Example

{
  "input_text": "Create a description for Plantation Palms."
}

chat-bison dataset

The chat generation dataset includes two fields:

  • messages - an array of author-content pairs. The author field refers to the author of the message and alternates between user and assistant. The content field is the content of the message. The content can't be empty, and the first and last author must be set to user.

  • context - (optional) additional context for the model to use when it responds to a prompt.

Example

{
  "context": "You are a pirate dog named Captain Barktholomew.",
  "messages": [
    {
      "author": "user",
      "content": "Hi"
    },
    {
      "author": "assistant",
      "content": "Argh! What brings ye to my ship?"
    },
    {
      "author": "user",
      "content": "What's your name?"
    }
  ]
}

To learn more, you can download and view this sample prompt dataset.
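
If you assemble the prompt dataset programmatically, a minimal Python sketch such as the following writes one JSON object per line. The prompts and the file name are only examples.

# Minimal sketch: write a text-bison prompt dataset in JSON Lines format.
# The prompts and the file name are illustrative examples.
import json

prompts = [
    {"input_text": "Create a description for Plantation Palms."},
    {"input_text": "Write a short tagline for a beachfront resort."},
]

with open("prompt_dataset.jsonl", "w") as f:
    for example in prompts:
        f.write(json.dumps(example) + "\n")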

Human preference dataset

The human preference dataset contains preferences from humans. Each line in the human preference dataset records the preference between two options that were presented to a human. We recommend that the human preference dataset includes 5,000 to 10,000 examples. Each line in the human preference dataset contains one example preference that includes the prompt dataset fields for the model being tuned plus the following fields:

  • candidate_0 and candidate_1 - each of these fields contains one candidate response; together they hold the two responses shown to the human. The human helps tune the model by choosing which of the two responses they prefer.
  • choice - contains an integer, 0 or 1, that indicates which candidate the human preferred. A 0 indicates the human chose candidate_0, and a 1 indicates the human chose candidate_1.

An example of a row in the human preference dataset is the following:

{"input_text": "Create a description for Plantation Palms.", "candidate_0": "Enjoy some fun in the sun at Gulf Shores.", "candidate_1": "A Tranquil Oasis of Natural Beauty.", "choice": 0}

To learn more, you can download and view this sample human preference dataset.
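
Before you upload the file, it can help to confirm that every line has the expected fields. The following sketch validates a text-bison preference dataset, assuming the field names described above; the file name is a placeholder.

# Minimal sketch: validate a text-bison human preference dataset in JSONL format.
# Checks that each line has the prompt field, both candidates, and a 0 or 1 choice.
import json

required_fields = {"input_text", "candidate_0", "candidate_1", "choice"}

with open("preference_dataset.jsonl") as f:  # placeholder file name
    for line_number, line in enumerate(f, start=1):
        example = json.loads(line)
        missing = required_fields - example.keys()
        if missing:
            raise ValueError(f"Line {line_number} is missing fields: {missing}")
        if example["choice"] not in (0, 1):
            raise ValueError(f"Line {line_number} has an invalid choice value")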

Evaluation dataset (optional)

A dataset that includes unlabeled prompts for prediction after the model is tuned. If the evaluation dataset is provided, then inference is performed on it after the tuning job completes. The format of the evaluation dataset is the same as the format of the prompt dataset. However, the prompts in an evaluation dataset need to be different from the prompts in the prompt dataset.

To learn more, you can download and view this sample evaluation dataset.

Reward model

The human preference dataset is used to train a reward model. Vertex AI creates and then uses the reward model during RLHF tuning. Reward models are created in a private Cloud Storage bucket in a customer tenant project. A customer tenant project is an internal project that's unique to a customer. You can't access a reward model, and it's deleted after the tuning job completes. For more information, see Tenant project.

Maintain consistency with production data

The examples in your datasets should match your expected production traffic. If your dataset contains specific formatting, keywords, instructions, or information, the production data should be formatted in the same way and contain the same instructions.

For example, if the examples in your dataset include a "question:" and a "context:", production traffic should also be formatted to include a "question:" and a "context:" in the same order as they appear in the dataset examples. If you exclude the context, the model won't recognize the pattern, even if the exact question appeared in an example in the dataset.
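
One practical way to keep the two consistent is to build prompts for the dataset and for production traffic with the same helper. The helper below is purely illustrative and assumes the "question:"/"context:" format described above.

# Illustrative sketch: format prompts the same way for tuning examples and production.
def build_prompt(question: str, context: str) -> str:
    return f"question: {question}\ncontext: {context}"

# Use the same helper when writing dataset examples and when serving real traffic.
print(build_prompt("What amenities does the resort offer?", "Plantation Palms brochure text."))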

Upload tuning datasets to Cloud Storage

To run a tuning job, you need to upload one or more datasets to a Cloud Storage bucket. You can either create a new Cloud Storage bucket or use an existing one to store dataset files. The region of the bucket doesn't matter, but we recommend that you use a bucket that's in the same Google Cloud project where you plan to tune your model.

After your bucket is ready, upload your dataset file to the bucket.
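
If you prefer to upload from Python instead of the Google Cloud console or the gcloud CLI, a sketch with the Cloud Storage client library looks like the following. The project ID, bucket name, object paths, and local file names are placeholders.

# Minimal sketch: upload tuning dataset files to a Cloud Storage bucket.
# The project ID, bucket name, object paths, and local file names are placeholders.
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-tuning-datasets")

for local_file, object_name in [
    ("prompt_dataset.jsonl", "rlhf/prompt_dataset.jsonl"),
    ("preference_dataset.jsonl", "rlhf/preference_dataset.jsonl"),
]:
    bucket.blob(object_name).upload_from_filename(local_file)
    print(f"Uploaded gs://{bucket.name}/{object_name}")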

Create an RLHF tuning job

You can perform RLHF tuning by using the Google Cloud console or the Vertex AI SDK for Python.

Vertex AI SDK for Python

To learn how to use the Vertex AI SDK for Python to tune your models with RLHF, open and run the RLHF tuning notebook in Colab, GitHub, or Vertex AI Workbench.
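
As a rough outline of what the notebook does, the sketch below submits the RLHF tuning pipeline with aiplatform.PipelineJob. The template path and parameter names are assumptions that mirror the console fields described in the next section; verify them against the notebook before you run anything.

# Rough sketch: submit an RLHF tuning job as a Vertex AI pipeline run.
# The template path and parameter names are assumptions based on the console
# fields on this page; confirm them against the sample notebook before use.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                      # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my-tuning-datasets",  # placeholder bucket
)

job = aiplatform.PipelineJob(
    display_name="rlhf-text-bison-tuning",
    template_path=(  # assumed RLHF pipeline template path; verify before use
        "https://us-kfp.pkg.dev/ml-pipeline/large-language-model-pipelines/"
        "rlhf-train-template/latest"
    ),
    parameter_values={
        "large_model_reference": "text-bison@002",
        "prompt_dataset": "gs://my-tuning-datasets/rlhf/prompt_dataset.jsonl",
        "preference_dataset": "gs://my-tuning-datasets/rlhf/preference_dataset.jsonl",
        "reward_model_train_steps": 1000,
        "reinforcement_learning_train_steps": 1000,
        "model_display_name": "rlhf-tuned-text-bison",
    },
)
job.run()  # blocks until the pipeline finishes; use job.submit() to return immediately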

Google Cloud console

To tune a text model in the Google Cloud console by using RLHF tuning, perform the following steps:

  1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Studio page.

    Go to Vertex AI Studio

  2. Click the Tune and distill tab.
  3. Click Create tuned model.
  4. Select Reinforcement learning from human feedback (RLHF).
  5. Configure model details:
    • Tune model name: Enter a name for your tuned model.
    • Base model: Select the foundation model that you want to tune.
    • Region: Enter the region where model tuning takes place. Supported regions are:
      • us-central1: Uses 8 Nvidia A100 80GB GPUs.
      • europe-west4: Uses 64 cores of the TPU v3 pod.
    • Output directory: Enter the Cloud Storage location where artifacts are stored when your model is tuned.
  6. Expand Advanced Options to configure advanced settings.
    • Reward train steps: Enter the number of steps to use when training the reward model. The reward model is used to tune your model. The default value is 1000.
    • Reward learning rate multiplier: Enter a float value that affects the learning rate when training the reward model. To increase the default learning rate, enter a higher value. To decrease the default learning rate, enter a lower value. The default value is 1.0.
    • Reinforcement train steps: Enter the number of steps to perform when tuning the base model using reinforcement learning. The default value is 1000.
    • Reinforcement learning rate multiplier: Enter a float value that affects the learning rate when tuning with reinforcement learning. To increase the default learning rate, enter a higher value. To decrease the default learning rate, enter a lower value. The default value is 1.0.
  7. Click Continue.
  8. In Human preference dataset, upload or choose a human preference dataset used to create a reward model. If you want to upload your dataset file, select Upload JSONL file to Cloud Storage. If your dataset file is already in a Cloud Storage bucket, select Existing JSONL file on Cloud Storage.

    Upload a JSONL file

    • In Select JSONL file, click Browse and select your dataset file.
    • In Dataset location, click Browse and select the Cloud Storage bucket where you want to store your dataset file.

    Use an existing JSONL file

    In Cloud Storage file path, click Browse and select the Cloud Storage bucket where your dataset file is located.

  9. In Prompt dataset, if you want to upload your dataset file, select Upload JSONL file to Cloud Storage. Otherwise, if your prompt dataset file is already in a Cloud Storage bucket, select Existing JSONL file on Cloud Storage.

    Upload a JSONL file

    • In Select JSONL file, click Browse and select your dataset file.
    • In Dataset location, click Browse and select the Cloud Storage bucket where you want to store your dataset file.

    Use an existing JSONL file

    In Cloud Storage file path, click Browse and select the Cloud Storage bucket where your dataset file is located.

  10. (Optional) To evaluate your tuned model, do the following:
    1. Click Enable model evaluation.
    2. In Evaluation dataset, click Browse.
    3. Navigate to the Cloud Storage bucket that contains your evaluation dataset and select your evaluation dataset.
    For more information, see Evaluation dataset.
  11. Click Start tuning.

Check tuning operation status

To check the status of your model tuning job, in the Google Cloud console, go to the Vertex AI Pipelines page. This page shows the status of text and code model tuning jobs.

Go to Pipelines

Alternatively, you can configure email notifications for Vertex AI Pipelines so you are notified by email when the model tuning job finishes or fails.
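
You can also check status from code. The following sketch lists pipeline jobs with the Vertex AI SDK for Python and prints their states; the project ID and region are placeholders.

# Minimal sketch: check the status of tuning pipeline runs from Python.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# Print the display name and state of each pipeline job in the project.
for job in aiplatform.PipelineJob.list():
    print(job.display_name, job.state)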

What's next