Prepare supervised fine-tuning data for Gemini models

This page explains how to format a dataset for supervised fine-tuning of a Gemini model. You can tune text, image, audio, and document data types.

About supervised fine-tuning datasets

A supervised fine-tuning dataset adapts a pre-trained model to a specific task or domain. The input data should be similar to what you expect the model to encounter in real-world use, and the output labels should represent the correct answers for each input. A complete dataset includes a training dataset; we recommend that you also include a validation dataset.

For limitations on datasets, such as maximum token counts and file sizes, see About supervised fine-tuning for Gemini models.

Dataset format

You can provide your tuning dataset in one of the following formats:

| Option | Description | Use case |
| --- | --- | --- |
| Multimodal dataset on Vertex AI (Preview) | A managed dataset in Vertex AI that supports various data types and provides data management features. For more information, see Multimodal dataset on Vertex AI. | Recommended for managing large or complex multimodal datasets within the Google Cloud ecosystem. |
| JSON Lines (JSONL) | A text file where each line is a separate JSON object representing a single training example. The file must be uploaded to a Cloud Storage bucket. | A simple, flexible format suitable for text-based or simple multimodal tasks, especially when data is prepared outside of Google Cloud. |

Dataset structure and fields

This section describes the JSONL data format. Each line in the JSONL file is a single training example with the following structure:
{
  "systemInstruction": {
    "role": string,
    "parts": [
      {
        "text": string
      }
    ]
  },
  "contents": [
    {
      "role": string,
      "parts": [
        {
          // Union field data can be only one of the following:
          "text": string,
          "fileData": {
            "mimeType": string,
            "fileUri": string
          }
        }
      ]
    }
  ]
}
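To make the schema concrete, the following sketch writes two hypothetical training examples in this JSONL layout. The instruction text, prompts, and the Cloud Storage URI are placeholders for illustration, not values from this page.

```python
import json

# Two hypothetical training examples following the JSONL schema above.
# The systemInstruction text and the gs:// URI are placeholders.
examples = [
    {
        "systemInstruction": {
            "parts": [{"text": "Answer as concisely as possible."}]
        },
        "contents": [
            {"role": "user", "parts": [{"text": "What is supervised fine-tuning?"}]},
            {"role": "model", "parts": [{"text": "Training a pre-trained model on labeled examples."}]},
        ],
    },
    {
        # A multimodal example: one part is text, another references a file.
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": "Describe this image."},
                    {
                        "fileData": {
                            "mimeType": "image/jpeg",
                            "fileUri": "gs://my-bucket/images/example.jpg",
                        }
                    },
                ],
            },
            {"role": "model", "parts": [{"text": "A short description of the image."}]},
        ],
    },
]

# Each example becomes exactly one line in the JSONL file.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Note that a part carries either `text` or `fileData`, never both, matching the union field in the schema.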
Each JSON object can contain the following fields:

- systemInstruction: (Optional) An instruction for the model to steer its behavior, such as "Answer as concisely as possible." See Supported models. The text strings count toward the token limit. The role field is ignored.
- contents: (Required) A conversation with the model. For single-turn queries, this is a single object. For multi-turn queries, this is a repeated field that contains the conversation history and the latest request. Each content object contains the following:
  - role: (Optional) The author of the message. Supported values are:
    - user: The message is from the user.
    - model: The message is from the model. This is used to provide context in multi-turn conversations.
  - parts: (Required) A list of ordered parts that make up a single message. Each part can have a different IANA MIME type. A part can be one of the following types:
    - text: A text prompt or code snippet.
    - fileData: Data stored in a file, specified by a mimeType and a fileUri pointing to a file in Cloud Storage.
    - functionCall: A call to a function, containing the function's name and parameters. See Function calling.
    - functionResponse: The result of a functionCall, used as context for the model. See Function calling.
- tools: (Optional) A set of tools the model can use to interact with external systems. See Function calling.

For limits on inputs, such as the maximum number of tokens or images, see the model specifications on the Google models page. To compute the number of tokens in your request, see Get token count.

Dataset example
Each conversation example in a tuning dataset is composed of a required messages field and an optional context field.

The messages field consists of an array of role-content pairs:

- The role field refers to the author of the message and is set to either system, user, or model. The system role is optional and can occur only as the first element of the messages list. The user and model roles are required and can repeat in an alternating manner.
- The content field is the content of the message.

For each example, the maximum token length for context and messages combined is 131,072 tokens. Additionally, each content field for the model role shouldn't exceed 8,192 tokens.

{
  "messages": [
    {
      "role": string,
      "content": string
    }
  ]
}
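The structural rules above (an optional system message only in the first position, then alternating user and model messages) can be checked locally before tuning. This is an illustrative sketch of such a check, not an official validator; the function name is hypothetical.

```python
def validate_messages(example):
    """Check the documented constraints on a messages-format example:
    - an optional "system" role may appear only as the first element
    - the remaining messages alternate between "user" and "model"
    Illustrative local check only; it does not count tokens.
    """
    turns = example["messages"]
    # Strip a leading system message, if present.
    if turns and turns[0]["role"] == "system":
        turns = turns[1:]
    # "system" anywhere else is invalid.
    if any(m["role"] == "system" for m in turns):
        return False
    # The user and model roles must alternate, starting with user.
    expected = "user"
    for message in turns:
        if message["role"] != expected:
            return False
        expected = "model" if expected == "user" else "user"
    return True
```

For example, a messages list of system, user, model passes, while two consecutive user messages fail.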
Best practices for data creation

Follow prompt design best practices

Your training data should follow the best practices for prompt design. Each example should provide a detailed description of the task and the desired output format.

Maintain consistency with production data

The examples in your datasets should match your expected production traffic. If your dataset contains specific formatting, keywords, instructions, or information, the production data should be formatted in the same way and contain the same instructions. For example, if the examples in your dataset include a "question:" and a "context:", production traffic should also be formatted to include a "question:" and a "context:" in the same order as they appear in the dataset examples. If you exclude the context, the model will not recognize the pattern, even if the exact question was in an example in the dataset.

Upload the dataset to Cloud Storage

To run a tuning job, you need to upload one or more dataset files to a Cloud Storage bucket. You can either create a new Cloud Storage bucket or use an existing one to store dataset files. The region of the bucket doesn't matter, but we recommend that you use a bucket that's in the same Google Cloud project where you plan to tune your model. After your bucket is ready, upload your dataset file to the bucket.
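Before uploading, it can help to verify locally that every line of the dataset file parses as JSON and has a non-empty contents field. A minimal sketch, assuming the JSONL format described above; the function name and file path are placeholders:

```python
import json

def check_jsonl(path):
    """Verify that each non-blank line of a JSONL dataset file is a JSON
    object with a non-empty "contents" list. Returns the example count;
    raises ValueError naming the offending line otherwise."""
    count = 0
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                example = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"line {lineno}: invalid JSON: {e}")
            if not isinstance(example.get("contents"), list) or not example["contents"]:
                raise ValueError(f"line {lineno}: missing or empty 'contents'")
            count += 1
    return count
```

Catching a malformed line locally is cheaper than discovering it after the tuning job starts.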
What's next
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-18 UTC.