Model tuning for Gemini text models

This page provides recommended steps to help you fine-tune a text model with Gemini. This guide covers classification, sentiment analysis, and extraction use cases.

Limitations

  • Gemini models don't support confidence scores.
  • Gemini can't supply numerical scores for sentiment analysis. This limits the ability to estimate the sentiment based on a numeric threshold.

To enhance the reliability of your AI model's outputs, consider incorporating methods like self-consistency for confidence scoring. This technique generates multiple outputs from the model for the same input and then employs a majority voting system to determine the most likely output. The confidence score can be represented as a ratio reflecting the proportion of times the majority output was generated.
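
For example, the following is a minimal sketch of self-consistency voting with the Vertex AI Python SDK. The prompt reuses the classification example shown later in this guide, and the number of samples and the temperature are illustrative choices, not recommendations:

from collections import Counter

from vertexai.generative_models import GenerativeModel

# Illustrative prompt: the same classification task used later in this guide.
prompt = """Classify the following as red wine or white wine:

Name: Riesling
Type: """

model = GenerativeModel(model_name="gemini-1.5-flash-001")

# Sample the model several times with a non-zero temperature so the outputs
# can differ, then keep the majority answer; the vote ratio is the confidence.
num_samples = 5
outputs = []
for _ in range(num_samples):
    response = model.generate_content(
        prompt, generation_config={"temperature": 0.7}
    )
    outputs.append(response.text.strip())

majority_answer, votes = Counter(outputs).most_common(1)[0]
confidence = votes / num_samples
print(f"{majority_answer} (confidence: {confidence:.2f})")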

For sentiment analysis tasks, consider using an LLM as a rater to provide verbal confidence scores. Prompt the model to analyze the sentiment and then return a verbal score (for example, "very likely") instead of a numeric score. Verbal scores are generally more interpretable and less prone to bias in the context of LLMs.
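
For example, a prompt along the following lines (hypothetical wording) asks for a verbal confidence label alongside the sentiment; a typical response might be "positive (very likely)":

   Classify the sentiment of the message as positive or negative, and state
   how confident you are in that classification using one of: very likely,
   likely, unsure.

   Message: The battery lasts all day and the screen is gorgeous.
   Sentiment: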

Text model tuning with Gemini

There are two options for tuning text models for classification, extraction, and sentiment analysis with Gemini: prompting a pre-trained model and customized fine-tuning.

  • Prompting with pre-trained Gemini models: Prompting is the art of crafting effective instructions to guide AI models like Gemini in generating the outputs you want. It involves designing prompts that clearly convey the task, the format you want, and any relevant context. You can use Gemini's capabilities with minimal setup. It's best suited for:

    • Limited labeled data: If you have a small amount of labeled data or can't afford a lengthy fine-tuning process.
    • Rapid prototyping: When you need to quickly test a concept or get a baseline performance without heavy investment in fine-tuning.
  • Customized fine-tuning of Gemini models: For more tailored results, Gemini lets you fine-tune its models on your specific datasets. Fine-tuning retrains the base model on your own labeled dataset, adapting its weights to your task and data so the model excels in your specific domain. Fine-tuning is most effective when:

    • You have labeled data: A sizable dataset to train on (think 100 examples or more), which allows the model to deeply learn your task's specifics.
    • Complex or unique tasks: For scenarios where advanced prompting strategies are not sufficient, and a model tailored to your data is essential.

Try both approaches if possible to see which yields better results for your specific use case. We recommend starting with prompting to find the optimal prompt. Then, move on to fine-tuning (if required) to further boost performance or fix recurring errors.

While adding more examples might be beneficial, it is important to evaluate where the model makes mistakes before adding more data. Regardless of the approach, high-quality, well-labeled data is crucial for good performance; data quality matters more than quantity. Also, the data you use for fine-tuning should reflect the type of data the model will encounter in production. For model development, deployment, and management, see the main Vertex AI documentation.

Prompting with Gemini

You use the language understanding capabilities of Gemini models by providing them with a few examples of your task (classification, extraction, sentiment analysis) in the prompt itself. The model learns from these examples and applies that knowledge to your new input.

Common prompting techniques that can be employed to optimize results include:

  • Zero-shot prompting: Directly instructing the model to perform a task without providing specific examples.
  • Few-shot prompting: Providing a few examples alongside the instructions to guide the model's understanding.
  • Chain-of-thought prompting: Breaking down complex tasks into smaller steps and guiding the model through each step sequentially.
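
For example, a chain-of-thought prompt for sentiment analysis might walk the model through intermediate steps before the final label (hypothetical wording):

   Message: The plot was predictable, but the soundtrack and the cinematography
   were stunning.

   Think step by step:
   1. List the positive and negative points in the message.
   2. Decide which points matter most to the reviewer.
   3. Based on those points, classify the overall sentiment as positive or negative.
   Sentiment: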

While prompt design is flexible, certain strategies can guide a model's output. Thorough testing and evaluation are essential for optimizing performance.

Large language models (LLMs), trained on massive amounts of text data, learn language patterns and relationships. Given a prompt, these models predict the most likely continuation, similar to advanced autocompletion. Thus, when crafting prompts, consider the factors that influence a model's prediction.

The process of prompt engineering is illustrated in the following diagram:

[Diagram: The prompt engineering process]

Prepare input data

There are a few different prompting techniques you can use. The following examples demonstrate how to use a few-shot prompting technique:

Classification

Request

               Classify the following as red wine or white wine:
               
                  Name: Chardonnay
                  Type: White wine
                  Name: Cabernet
                  Type: Red wine
                  Name: Moscato
                  Type: White wine
               
               Name: Riesling
               Type: 
            

Response

         White wine
         

Extraction

Request

            Extract the technical specifications from the text below in a JSON format.

               INPUT: Google Nest Wifi, network speed up to 1200Mpbs, 2.4GHz and 5GHz frequencies, WP3 protocol
               OUTPUT: {
                  "product": "Google Nest Wifi",
                  "speed": "1200Mpbs",
                  "frequencies": ["2.4GHz", "5GHz"],
                  "protocol": "WP3"
               }

               Google Pixel 7, 5G network, 8GB RAM, Tensor G2 processor, 128GB of storage, Lemongrass

Response

         {
         "product": "Google Pixel 7",
         "network": "5G",
         "ram": "8GB",
         "processor": "Tensor G2",
         "storage": "128GB",
         "color": "Lemongrass"
         }
         

Sentiment analysis

Request

            Classify the sentiment of the message. Please only print the category name without anything else.
            
               Message: I had to compare two versions of Hamlet for my Shakespeare class and unfortunately I picked this version. Everything from the acting (the actors deliver most of their lines directly to the camera) to the camera shots (all medium or close up shots...no scenery shots and very little back ground in the shots) were absolutely terrible. I watched this over my spring break and it is very safe to say that I feel that I was gypped out of 114 minutes of my vacation. Not recommended by any stretch of the imagination.
               Category: negative
               Message: This Charles outing is decent but this is a pretty low-key performance. Marlon Brando stands out. There's a subplot with Mira Sorvino and Donald Sutherland that forgets to develop and it hurts the film a little. I'm still trying to figure out why Charlie want to change his name.
               Category: negative
               Message: My family has watched Arthur Bach stumble and stammer since the movie first came out. We have most lines memorized. I watched it two weeks ago and still get tickled at the simple humor and view-at-life that Dudley Moore portrays. Liza Minelli did a wonderful job as the side kick - though I'm not her biggest fan. This movie makes me just enjoy watching movies. My favorite scene is when Arthur is visiting his fiancée's house. His conversation with the butler and Susan's father is side-spitting. The line from the butler, "Would you care to wait in the Library" followed by Arthur's reply, "Yes I would, the bathroom is out of the question", is my NEWMAIL notification on my computer.
               Category:

Response

         Positive
         

Get a prediction response

Here is sample Python code that gets a prediction response for the Classification example. For more information, see the Overview of Generative AI on Vertex AI.

from vertexai.generative_models import GenerativeModel

model = GenerativeModel(model_name="gemini-1.5-flash-001")
response = model.generate_content(
    """Classify the following as red wine or white wine:

<examples>
Name: Chardonnay
Type: White wine
Name: Cabernet
Type: Red wine
Name: Moscato
Type: White wine
</examples>

Name: Riesling
Type: """
)
print(response.text)

Fine-tuning with Gemini 1.0 Pro

Fine-tuning lets you adapt Gemini 1.0 Pro to your specific needs. Follow these steps to fine-tune Gemini 1.0 Pro with your own data:

Prepare training data

Convert your training data into Gemini's fine-tuning format, which uses a JSONL file structure. Each line in the file should represent a single training example. Each training example should follow this structure:

{"messages": [{"role": "system", "content": "<system_context>"},, {"role": "user", "content": "<user_input>"}, {"role": "model", "content": "<desired_output>"}]}

Here's an example with two training examples, one per line:

{"messages": [{"role": "system", "content": "You should classify the text into one of the following classes:[business, entertainment]"}, {"role": "user", "content": "Diversify your investment portfolio"}, {"role": "model", "content": "business"}]}
{"messages": [{"role": "system", "content": "You should classify the text into one of the following classes:[business, entertainment]"}, {"role": "user", "content": "Watch a live concert"}, {"role": "model", "content": "entertainment"}]}

For comprehensive instructions and more examples, refer to the official Gemini dataset preparation guide.

Execute the fine-tuning pipeline

To start your Gemini fine-tuning job, follow this step-by-step guide using the user interface, Python, or the REST API. During setup, select the Gemini model version, configure fine-tuning hyperparameters, and specify overall settings. We recommend experimenting with different values for epochs, learning rate multiplier, and adapter size. For datasets with 500 to 1,000 examples, consider starting with the following configurations to get a good understanding of how the model learns and to find optimal parameter settings for your task:

  • epochs=2, learning_rate_multiplier=1, adapter_size=1
  • epochs=4, learning_rate_multiplier=1, adapter_size=1 (default)
  • epochs=6, learning_rate_multiplier=1, adapter_size=4
  • epochs=12, learning_rate_multiplier=4, adapter_size=4
  • epochs=12, learning_rate_multiplier=4, adapter_size=1

By evaluating the performance of one or two different configurations, you can identify which parameters and modifications are most effective in improving performance. If the target level of performance is not achieved, you can continue experimenting with these promising configurations.
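
As a starting point, the following sketch shows what launching such a job can look like with the Vertex AI SDK for Python (the vertexai.preview.tuning module). The project ID, bucket path, and display name are placeholders, and parameter names can vary across SDK versions, so check the current SDK reference:

import vertexai
from vertexai.preview.tuning import sft

# Placeholders: replace with your own project, region, and Cloud Storage path.
vertexai.init(project="your-project-id", location="us-central1")

sft_tuning_job = sft.train(
    source_model="gemini-1.0-pro-002",
    train_dataset="gs://your-bucket/train.jsonl",
    # One of the starting configurations suggested above.
    epochs=4,
    learning_rate_multiplier=1,
    adapter_size=1,
    tuned_model_display_name="classification-tuned-gemini",  # placeholder name
)
# The job runs asynchronously; monitor it in the console or poll sft_tuning_job.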

Evaluation tools and techniques

Consistent and comparable model evaluation is essential for understanding performance and making informed decisions. Here are some techniques and tools to remember during model evaluation:

  1. Maintain a consistent evaluation methodology: Use the same evaluation metrics and methods for both fine-tuned and prompted models. This facilitates direct, unbiased comparison. Whenever possible, use the same evaluation dataset used during model development and deployment. This ensures fair comparison across model types and helps identify quality discrepancies.
  2. Vertex AI Generative AI evaluation service: Offers low-latency, synchronous evaluations on small data batches. Suitable for on-demand evaluations, rapid iteration, and experimentation. Integrates with other Vertex AI services through the Python SDK.
  3. Recommendation for classification, extraction, and sentiment analysis: Include the exact_match metric when using the evaluation service.
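
For example, the following is a minimal sketch of an exact_match evaluation with the Vertex AI Generative AI evaluation service. The dataset contents and the experiment name are hypothetical; adapt them to your task:

import pandas as pd

from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

# Hypothetical evaluation set: prompts paired with their expected labels.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Classify the following as red wine or white wine:\nName: Riesling\nType: ",
        ],
        "reference": ["White wine"],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match"],
    experiment="wine-classification-eval",  # hypothetical experiment name
)

# Run the same task against the prompted model and, later, your tuned model
# so that the exact_match scores are directly comparable.
result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-flash-001"))
print(result.summary_metrics)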

Deployment

To deploy your tuned model, see Deploy a tuned model.

What's next