This page provides recommended steps to help you fine-tune a text model with Gemini. This guide covers classification, sentiment analysis, and extraction use cases.
Limitations
- Gemini models don't support confidence scores.
- Gemini can't supply numerical scores for sentiment analysis. This limits the ability to estimate the sentiment based on a numeric threshold.
To enhance the reliability of your AI model's outputs, consider incorporating methods like self-consistency for confidence scoring. This technique generates multiple outputs from the model for the same input and then employs a majority voting system to determine the most likely output. The confidence score can be represented as a ratio reflecting the proportion of times the majority output was generated.
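A minimal sketch of this approach with the Vertex AI Python SDK is shown below; the classify_with_confidence helper and the sample_count value are illustrative, not part of the SDK:

```python
from collections import Counter

from vertexai.generative_models import GenerativeModel


def classify_with_confidence(prompt: str, sample_count: int = 5) -> tuple[str, float]:
    """Sample the model several times and use majority voting as a confidence proxy."""
    model = GenerativeModel(model_name="gemini-1.5-flash-001")
    outputs = []
    for _ in range(sample_count):
        # A non-zero temperature is needed so repeated samples can differ.
        response = model.generate_content(
            prompt, generation_config={"temperature": 0.7}
        )
        outputs.append(response.text.strip())
    label, votes = Counter(outputs).most_common(1)[0]
    # The ratio of majority votes to total samples serves as the confidence score.
    return label, votes / sample_count
```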
For sentiment analysis tasks, consider using an LLM as a rater to provide verbal confidence scores. Prompt the model to analyze the sentiment and then return a verbal score (for example, "very likely") instead of a numeric score. Verbal scores are generally more interpretable and less prone to bias in the context of LLMs.
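For example, a prompt along these lines asks for a verbal rather than numeric score; the wording and rating scale here are illustrative, not prescribed by Gemini:

```
Classify the sentiment of the following message as positive or negative, and state how
confident you are in that classification using one of: very likely, likely, unsure,
unlikely, very unlikely.

Message: The battery lasts two days and the screen is easy to read outdoors.
Sentiment:
Confidence:
```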
Text model tuning with Gemini
There are two options for tuning text models for classification, extraction, and sentiment analysis with Gemini: prompting with a pre-trained model and customized fine-tuning.
Prompting with pre-trained Gemini models: Prompting is the art of crafting effective instructions to guide AI models like Gemini in generating the outputs you want. It involves designing prompts that clearly convey the task, the format you want, and any relevant context. You can use Gemini's capabilities with minimal setup. Prompting is best suited for:
- Limited labeled data: If you have a small amount of labeled data or can't afford a lengthy fine-tuning process.
- Rapid prototyping: When you need to quickly test a concept or get a baseline performance without heavy investment in fine-tuning.
Customized fine-tuning of Gemini models: For more tailored results, Gemini lets you fine-tune its models on your specific datasets. Fine-tuning retrains the base model on your own labeled dataset, adapting its weights to your task and data so that the model excels in your specific domain. Fine-tuning is most effective when:
- You have labeled data: A sizable dataset to train on (think 100 examples or more), which allows the model to deeply learn your task's specifics.
- Complex or unique tasks: For scenarios where advanced prompting strategies are not sufficient, and a model tailored to your data is essential.
Try both approaches if possible to see which yields better results for your specific use case. We recommend starting with prompting to find the optimal prompt, then moving on to fine-tuning (if required) to further boost performance or fix recurring errors.
While adding more examples can help, it is important to evaluate where the model makes mistakes before adding more data. Regardless of the approach, high-quality, well-labeled data is crucial for good performance, and quality matters more than quantity. Also, the data you use for fine-tuning should reflect the type of data the model will encounter in production. For model development, deployment, and management, see the main Vertex AI documentation.
Prompting with Gemini
You use the language understanding capabilities of Gemini models by providing them with a few examples of your task (classification, extraction, sentiment analysis) in the prompt itself. The model learns from these examples and applies that knowledge to your new input.
Common prompting techniques that can be employed to optimize results include:
- Zero-shot prompting: Directly instructing the model to perform a task without providing specific examples (see the short illustration after this list).
- Few-shot prompting: Providing a few examples alongside the instructions to guide the model's understanding.
- Chain-of-thought prompting: Breaking down complex tasks into smaller steps and guiding the model through each step sequentially.
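For instance, a zero-shot prompt for the wine classification task used later in this guide states only the instruction and the new input, with no worked examples (illustrative):

```
Classify the following as red wine or white wine. Answer with only the category name.

Name: Riesling
Type:
```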
While prompt design is flexible, certain strategies can guide a model's output. Thorough testing and evaluation are essential for optimizing performance.
Large language models (LLMs), trained on massive amounts of text data, learn language patterns and relationships. Given a prompt, these models predict the most likely continuation, similar to advanced autocompletion. When crafting prompts, consider the factors that influence a model's prediction.
Prepare input data
There are a few different prompting techniques you can use. The following examples demonstrate how to use a few-shot prompting technique:
Classification
Request
```
Classify the following as red wine or white wine:

Name: Chardonnay
Type: White wine
Name: Cabernet
Type: Red wine
Name: Moscato
Type: White wine
Name: Riesling
Type:
```
Response
White wine
Extraction
Request
```
Extract the technical specifications from the text below in a JSON format.

INPUT: Google Nest Wifi, network speed up to 1200Mpbs, 2.4GHz and 5GHz frequencies, WP3 protocol
OUTPUT: { "product":"Google Nest Wifi", "speed":"1200Mpbs", "frequencies": ["2.4GHz", "5GHz"], "protocol":"WP3" }

Google Pixel 7, 5G network, 8GB RAM, Tensor G2 processor, 128GB of storage, Lemongrass
```
Response
{ "product": "Google Pixel 7", "network": "5G", "ram": "8GB", "processor": "Tensor G2", "storage": "128GB", "color": "Lemongrass" }
Sentiment analysis
Request
```
Classify the sentiment of the message. Please only print the category name without anything else.

Message: I had to compare two versions of Hamlet for my Shakespeare class and unfortunately I picked this version. Everything from the acting (the actors deliver most of their lines directly to the camera) to the camera shots (all medium or close up shots...no scenery shots and very little back ground in the shots) were absolutely terrible. I watched this over my spring break and it is very safe to say that I feel that I was gypped out of 114 minutes of my vacation. Not recommended by any stretch of the imagination.
Category: negative

Message: This Charles outing is decent but this is a pretty low-key performance. Marlon Brando stands out. There's a subplot with Mira Sorvino and Donald Sutherland that forgets to develop and it hurts the film a little. I'm still trying to figure out why Charlie want to change his name.
Category: negative

Message: My family has watched Arthur Bach stumble and stammer since the movie first came out. We have most lines memorized. I watched it two weeks ago and still get tickled at the simple humor and view-at-life that Dudley Moore portrays. Liza Minelli did a wonderful job as the side kick - though I'm not her biggest fan. This movie makes me just enjoy watching movies. My favorite scene is when Arthur is visiting his fiancée's house. His conversation with the butler and Susan's father is side-spitting. The line from the butler, "Would you care to wait in the Library" followed by Arthur's reply, "Yes I would, the bathroom is out of the question", is my NEWMAIL notification on my computer.
```
Response
Positive
Get a prediction response
Here is sample Python code for the classification example. For more information, see the Overview of Generative AI on Vertex AI.
```python
from vertexai.generative_models import GenerativeModel

# Load the Gemini model used for the few-shot classification prompt.
model = GenerativeModel(model_name="gemini-1.5-flash-001")

response = model.generate_content(
    """Classify the following as red wine or white wine:

<examples>
Name: Chardonnay
Type: White wine
Name: Cabernet
Type: Red wine
Name: Moscato
Type: White wine
</examples>

Name: Riesling
Type: """
)

print(response.text)
```
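A similar call works for the extraction example. In this sketch, parsing the response with json.loads is an illustrative addition; without further configuration the model is not guaranteed to return valid JSON, so the parse can fail:

```python
import json

from vertexai.generative_models import GenerativeModel

model = GenerativeModel(model_name="gemini-1.5-flash-001")

response = model.generate_content(
    """Extract the technical specifications from the text below in a JSON format.

INPUT: Google Nest Wifi, network speed up to 1200Mpbs, 2.4GHz and 5GHz frequencies, WP3 protocol
OUTPUT: { "product":"Google Nest Wifi", "speed":"1200Mpbs", "frequencies": ["2.4GHz", "5GHz"], "protocol":"WP3" }

Google Pixel 7, 5G network, 8GB RAM, Tensor G2 processor, 128GB of storage, Lemongrass
OUTPUT: """
)

# May raise ValueError if the model adds text around the JSON object.
specs = json.loads(response.text)
print(specs["product"])
```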
Fine-tuning with Gemini 1.0 Pro
Fine-tuning lets you adapt Gemini 1.0 Pro to your specific needs. Follow these steps to fine-tune Gemini 1.0 Pro with your own data:
Prepare training data
Convert your training data into Gemini's fine-tuning format, which uses a JSONL file structure. Each line in the file should represent a single training example. Each training example should follow this structure:
{"messages": [{"role": "system", "content": "<system_context>"},, {"role": "user", "content": "<user_input>"}, {"role": "model", "content": "<desired_output>"}]}
Here's an example with two entries, one training example per line:
{"messages": [{"role": "system", "content": "You should classify the text into one of the following classes:[business, entertainment]"}, {"role": "user", "content": "Diversify your investment portfolio"}, {"role": "model", "content": "business"}]}
{"messages": [{"role": "system", "content": "You should classify the text into one of the following classes:[business, entertainment]"}, {"role": "user", "content": "Watch a live concert"}, {"role": "model", "content": "entertainment"}]}
For comprehensive instructions and more examples, refer to the official Gemini dataset preparation guide.
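If you assemble the file programmatically, a minimal sketch like the following produces the expected JSONL structure; the file name and the labeled_examples list are illustrative placeholders:

```python
import json

# Illustrative labeled data; replace with your own examples.
labeled_examples = [
    ("Diversify your investment portfolio", "business"),
    ("Watch a live concert", "entertainment"),
]

system_instruction = (
    "You should classify the text into one of the following classes:"
    "[business, entertainment]"
)

with open("train.jsonl", "w") as f:
    for text, label in labeled_examples:
        record = {
            "messages": [
                {"role": "system", "content": system_instruction},
                {"role": "user", "content": text},
                {"role": "model", "content": label},
            ]
        }
        # One JSON object per line, as required by the JSONL format.
        f.write(json.dumps(record) + "\n")
```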
Execute the fine-tuning pipeline
To start your Gemini fine-tuning job, follow this step-by-step guide using the user interface, Python, or the REST API. During setup, select the Gemini model version, configure fine-tuning hyperparameters, and specify overall settings. We recommend experimenting with different values for epochs, learning rate multiplier, and adapter sizes. For datasets with 500 to 1000 examples, consider starting with the following configurations to get a good understanding of model learning and find optimal parameter settings for your task:
- epochs=2, learning_rate_multiplier=1, adapter_size=1
- epochs=4, learning_rate_multiplier=1, adapter_size=1 (default)
- epochs=6, learning_rate_multiplier=1, adapter_size=4
- epochs=12, learning_rate_multiplier=4, adapter_size=4
- epochs=12, learning_rate_multiplier=4, adapter_size=1
By evaluating the performance of one or two different configurations, you can identify which parameters and modifications are most effective in improving performance. If the target level of performance is not achieved, you can continue experimenting with the most promising configurations.
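As a sketch, one of these configurations could be launched in Python with the Vertex AI SDK's supervised tuning module. The project ID, location, and Cloud Storage URI are placeholders, and the exact module path and the adapter_size argument may differ across SDK versions:

```python
import time

import vertexai
from vertexai.tuning import sft

vertexai.init(project="your-project-id", location="us-central1")

# Launch a supervised fine-tuning job with one of the suggested configurations.
tuning_job = sft.train(
    source_model="gemini-1.0-pro-002",
    train_dataset="gs://your-bucket/train.jsonl",
    epochs=4,
    learning_rate_multiplier=1,
    adapter_size=1,
)

# Poll until the job finishes, then inspect the tuned model endpoint.
while not tuning_job.has_ended:
    time.sleep(60)
    tuning_job.refresh()

print(tuning_job.tuned_model_endpoint_name)
```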
Evaluation tools and techniques
Consistent and comparable model evaluation is essential for understanding performance and making informed decisions. Here are some techniques and tools to remember during model evaluation:
- Maintain a consistent evaluation methodology: Use the same evaluation metrics and methods for both fine-tuned and prompted models. This facilitates direct, unbiased comparison. Whenever possible, use the same evaluation dataset used during model development and deployment. This ensures fair comparison across model types and helps identify quality discrepancies.
- Vertex AI Generative AI evaluation service: Offers low-latency, synchronous evaluations on small data batches. Suitable for on-demand evaluations, rapid iteration, and experimentation. Integrates with other Vertex AI services through the Python SDK.
- Recommendation for classification, extraction, and sentiment analysis: Include the exact_match metric when using the evaluation service.
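As an illustrative sketch, the evaluation service in the Vertex AI Python SDK can compute exact_match over a small batch; the sample data below is made up, and the module path may be under vertexai.preview.evaluation in older SDK versions:

```python
import pandas as pd

import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# Each row pairs a model response with the reference (ground-truth) label.
eval_dataset = pd.DataFrame(
    {
        "response": ["business", "entertainment", "business"],
        "reference": ["business", "entertainment", "entertainment"],
    }
)

eval_task = EvalTask(dataset=eval_dataset, metrics=["exact_match"])
result = eval_task.evaluate()

# Summary metrics include the mean exact_match score across rows.
print(result.summary_metrics)
```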
Deployment
To deploy your tuned model, see Deploy a tuned model.
What's next
- Learn more about model tuning in the Introduction to tuning.
- See Gemini for AutoML text users to learn more about key tooling differences.