Prepare supervised tuning data

This document describes how to define a supervised tuning dataset for a Gemini model.

About supervised tuning datasets

A supervised tuning dataset is used to fine-tune a pre-trained model to a specific task or domain. The input data should be similar to what you expect the model to encounter in real-world use. The output labels should represent the correct answers or outcomes for each input.

Training dataset

To tune a model, you provide a training dataset. A training dataset must include at least 16 examples. For best results, we recommend that you provide at least 100 to 500 examples. In general, the more examples you provide, the better the results. There is no limit on the number of examples in a training dataset.
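
Because these limits apply per file, it can help to check the example count before you start a tuning job. The following is a minimal sketch, assuming one JSON object per non-empty line as the JSONL format requires; the function name is illustrative, not part of any SDK.

```python
MIN_TRAIN_EXAMPLES = 16  # minimum stated above; 100 to 500+ is recommended

def count_examples(path: str) -> int:
    """Count tuning examples: one non-empty line per example in a JSONL file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())
```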

Validation dataset

If possible, also provide a validation dataset. A validation dataset helps you measure the effectiveness of a tuning job. A validation dataset can contain up to 256 examples.

For sample datasets, see Sample datasets on this page.

Dataset format

Your model tuning dataset must be in the JSON Lines (JSONL) format, where each line contains a single tuning example. Before tuning your model, you must upload your dataset to a Cloud Storage bucket.

Each conversation example in a tuning dataset is composed of a required messages field.

  • The messages field consists of an array of role-content pairs. The role field identifies the author of the message and is set to system, user, or model. The system role is optional and can appear only as the first element of the messages list. The user and model roles are required and must alternate.

  • The content field is the content of the message.

  • For each example, the maximum combined token length of context and messages is 32,768 tokens. Additionally, each content field for a model turn shouldn't exceed 8,192 tokens.
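
The structural rules above can be checked before upload. The following sketch validates one JSONL line against the role and ordering constraints described on this page; the function is illustrative and not part of any official SDK, and it doesn't check token lengths.

```python
import json

VALID_ROLES = {"system", "user", "model"}

def check_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL tuning example."""
    problems = []
    example = json.loads(line)
    messages = example.get("messages")
    if not messages:
        return ["missing required 'messages' field"]
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if role == "system" and i != 0:
            problems.append(f"message {i}: 'system' can only be first")
        if "content" not in message:
            problems.append(f"message {i}: missing 'content' field")
    # user and model turns must alternate after the optional system message
    turns = [m.get("role") for m in messages if m.get("role") != "system"]
    for prev, cur in zip(turns, turns[1:]):
        if prev == cur:
            problems.append(f"role '{prev}' repeats without alternating")
    return problems
```

An empty returned list means the example passed every check.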

Example single turn

For more details on the following Gemini schema, see Migrate tuning from PaLM API to Gemini API.

{"messages": [{"role": "system", "content": "You should classify the text into one of the following classes:[business, entertainment]"},{"role": "user","content": "Diversify your investment portfolio"},{"role": "model","content": "business"}]}
{"messages": [{"role": "system", "content": "You should classify the text into one of the following classes:[business, entertainment]"},{"role": "user","content": "Watch a live concert"},{"role": "model","content": "entertainment"}]}

Example multi turn

For more details on the following Gemini schema, see Migrate tuning from PaLM API to Gemini API.

{
  "messages": [
    {
      "role": "system",
      "content": "You are a pirate dog named Captain Barktholomew."
    },
    {
      "role": "user",
      "content": "Hi"
    },
    {
      "role": "model",
      "content": "Argh! What brings ye to my ship?"
    },
    {
      "role": "user",
      "content": "What's your name?"
    },
    {
      "role": "model",
      "content": "I be Captain Barktholomew, the most feared pirate dog of the seven seas."
    }
  ]
}

Sample datasets

You can use a sample dataset to learn how to tune a gemini-1.0-pro-002 model.

To use these datasets, specify the URIs in the applicable parameters when creating a text model supervised tuning job.

For example:

...
"training_dataset_uri": "gs://cloud-samples-data/ai-platform/generative_ai/sft_train_data.jsonl",
...
"validation_dataset_uri": "gs://cloud-samples-data/ai-platform/generative_ai/sft_validation_data.jsonl",
...

Maintain consistency with production data

The examples in your datasets should match your expected production traffic. If your dataset contains specific formatting, keywords, instructions, or information, the production data should be formatted in the same way and contain the same instructions.

For example, if the examples in your dataset include a "question:" and a "context:", production traffic should also be formatted to include a "question:" and a "context:" in the same order as in the dataset examples. If you exclude the context, the model won't recognize the pattern, even if the exact question appeared in an example in the dataset.
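
As a sketch of that rule, sharing one formatter between training-data generation and serving keeps the layout from drifting. The helper below is hypothetical, not part of any SDK; only the "question:"/"context:" layout comes from the example above.

```python
# Hypothetical helper: one formatter used at both training and serving time
# guarantees the "question:" and "context:" layout stays identical.
def format_prompt(question: str, context: str) -> str:
    return f"question: {question}\ncontext: {context}"

# Training time: the formatted text becomes the "user" content of an example.
train_user_content = format_prompt(
    "What is the capital of France?",
    "France is a country in Western Europe.",
)

# Serving time: production traffic goes through the same function.
prod_user_content = format_prompt(
    "What is the capital of Spain?",
    "Spain is a country in Southwestern Europe.",
)
```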

Upload tuning datasets to Cloud Storage

To run a tuning job, you need to upload one or more datasets to a Cloud Storage bucket. You can either create a new Cloud Storage bucket or use an existing one to store dataset files. The region of the bucket doesn't matter, but we recommend that you use a bucket that's in the same Google Cloud project where you plan to tune your model.

After your bucket is ready, upload your dataset file to the bucket.

What's next