This document describes how to define a supervised fine-tuning dataset for a Gemini
model. You can tune text, image, audio, and document data types. A supervised fine-tuning dataset is used to fine-tune a pre-trained model to a
specific task or domain. The input data should be similar to what
you expect the model to encounter in real-world use. The output labels should
represent the correct answers or outcomes for each input. Training dataset To tune a model, you provide a training dataset. For best results, we recommend
that you start with 100 examples. You can scale up to thousands of examples if
needed. The quality of the dataset is far more important than the quantity. Validation dataset We strongly recommend that you provide a validation dataset. A validation dataset
helps you measure the effectiveness of a tuning job. Limitations For limitations on datasets, such as maximum input and output tokens,
maximum validation dataset size, and maximum training dataset file size, see
About supervised fine-tuning for Gemini models. We support the following data formats: JSON Lines (JSONL) format, where each line contains a single tuning example.
Before tuning your model, you must
upload your dataset to a Cloud Storage bucket. The example contains data with the following parameters: Required: The content of the current conversation with the model. For single-turn queries, this is a single instance. For multi-turn queries, this is a repeated field that contains conversation history and the latest request. Optional: See Supported models. Instructions for the model to steer it toward better performance. For example, "Answer as concisely as possible" or "Don't use technical terms in your response". The The Optional. A piece of code that enables the system to interact with external systems to perform an action, or set of actions, outside of knowledge and scope of the model. See Function calling. The base structured data type containing multi-part content of a message. This class consists of two main properties: Optional: The identity of the entity that
creates the message. The following values are supported: The For non-multi-turn conversations, this field can be left blank or unset. A list of ordered parts that make up a single message. Different parts may have different IANA MIME types. For limits on the inputs, such as the maximum number of tokens or the number of images, see the model specifications on the Google models page. To compute the number of tokens in your request, see Get token count. A data type containing media that is part of a multi-part Optional: A text prompt or code snippet. Optional: Data stored in a file. Optional: It contains a string representing the See Function calling. Optional: The result output of a See Function calling. Each conversation example in a tuning dataset is composed of a required
The For each example, the maximum token length for The examples in your datasets should match your expected production traffic. If
your dataset contains specific formatting, keywords, instructions, or
information, the production data should be formatted in the same way and contain
the same instructions. For example, if the examples in your dataset include a To run a tuning job, you need to upload one or more datasets to a
Cloud Storage bucket. You can either
create a new Cloud Storage bucket
or use an existing one to store dataset files. The region of the bucket doesn't
matter, but we recommend that you use a bucket that's in the same
Google Cloud project where you plan to tune your model. After your bucket is ready,
upload your dataset file
to the bucket. Once you have your training dataset and you've trained the model, it's time to design
prompts. It's important to follow the best practice of prompt design in your training dataset to give detailed description of the task to be performed and how the output
should look like.About supervised fine-tuning datasets
Dataset format
Dataset example for Gemini
{
"systemInstruction": {
"role": string,
"parts": [
{
"text": string
}
]
},
"contents": [
{
"role": string,
"parts": [
{
// Union field data can be only one of the following:
"text": string,
"fileData": {
"mimeType": string,
"fileUri": string
}
}
]
}
]
}
Parameters
Parameters
contents
Content
systemInstruction
Content
text
strings count toward the token limit.role
field of systemInstruction
is ignored and doesn't affect the performance of the model.
tools
Contents
role
and parts
. The role
property
denotes the individual producing the content, while the parts
property contains
multiple elements, each representing a segment of data within a message.
Parameters
role
string
user
: This indicates that the message is sent by a real person, typically a user-generated message.model
: This indicates that the message is generated by the model.model
value is used to insert messages from the model into the conversation during multi-turn conversations.
parts
part
Parts
Content
message.
Parameters
text
string
fileData
fileData
functionCall
FunctionCall
.FunctionDeclaration.name
field and a structured JSON object containing any parameters for the function call predicted by the model.
functionResponse
FunctionResponse
.FunctionCall
that contains a string representing the FunctionDeclaration.name
field and a structured JSON object containing any output from the function call. It is used as context to the model.Dataset example
messages
field and an optional context
field.messages
field consists of an array of role-content pairs:
role
field
refers to the author of the message and is set to either system
, user
, or
model
. The system
role is optional and can only occur at the first element
of the messages list. The user
and model
roles are required and can repeat in
an alternating manner.content
field is the content of the message.context
and messages
combined
is 131,072 tokens. Additionally, each content
field for the model
field shouldn't
exceed 8,192 tokens.{
"messages": [
{
"role": string,
"content": string
}
]
}
Maintain consistency with production data
"question:"
and a
"context:"
, production traffic should also be formatted to include a
"question:"
and a "context:"
in the same order as it appears in the dataset
examples. If you exclude the context, the model will not recognize the pattern,
even if the exact question was in an example in the dataset.Upload tuning datasets to Cloud Storage
Follow the best practice of prompt design
What's next
Prepare supervised fine-tuning data for Gemini models
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-21 UTC.