Tune models overview

Tuning a foundation model can improve its performance. Foundation models are trained for general purposes and sometimes don't perform tasks as well as you'd like them to. This might be because the tasks you want the model to perform are specialized and therefore difficult to teach the model by using prompt design alone.

In these cases, you can use model tuning to improve a model's performance on specific tasks. Model tuning can also help the model adhere to specific output requirements when instructions alone aren't sufficient. This page provides an overview of model tuning, describes the tuning options available on Vertex AI, and helps you determine when to use each tuning option.

Model tuning overview

Model tuning works by providing a model with a training dataset that contains many examples of a unique task. For unique or niche tasks, you can get significant improvements in model performance by tuning the model on a modest number of examples. After you tune a model, fewer examples are required in its prompts.
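To make this concrete, the sketch below builds a small supervised tuning dataset as a JSON Lines file, with one record per line that pairs an input prompt with the output you want the model to produce. The field names and file path are illustrative only; the exact schema depends on the model you tune.

```python
import json

# Illustrative supervised tuning examples: each record pairs a prompt with the
# desired model output. Field names vary by model family; these are assumptions.
examples = [
    {"input_text": "Classify the sentiment: 'The battery barely lasts an hour.'",
     "output_text": "negative"},
    {"input_text": "Classify the sentiment: 'Setup took two minutes and it just works.'",
     "output_text": "positive"},
]

# Tuning datasets are usually supplied as JSON Lines (one JSON object per line).
with open("tuning_examples.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```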

Vertex AI supports the following methods to tune foundation models.

Gemini

Supervised tuning

Supervised tuning for Gemini models improves the performance of the model by teaching it a new skill. Data that contains hundreds of labeled examples is used to teach the model to mimic a desired behavior or task. Each labeled example demonstrates what you want the model to output during inference.

When you run a supervised tuning job, the model learns additional parameters that help it encode the necessary information to perform the desired task or learn the desired behavior. These parameters are used during inference. The output of the tuning job is a new model that combines the newly learned parameters with the original model.

Supervised tuning of a text model is a good option when the output of your model isn't complex and is relatively easy to define. Supervised tuning is recommended for classification, sentiment analysis, entity extraction, summarization of content that's not complex, and writing domain-specific queries. For code models, supervised tuning is the only option.
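As a rough sketch, a supervised tuning job for a Gemini model can be started with the Vertex AI SDK for Python, as shown below. The project ID, Cloud Storage path, and model version are placeholders, and parameter names can vary between SDK releases, so treat this as an illustration rather than a copy-paste recipe.

```python
import vertexai
from vertexai.tuning import sft

# Placeholder project, location, and Cloud Storage path; replace with your own values.
vertexai.init(project="my-project", location="us-central1")

# Start a supervised tuning job from a prepared JSONL dataset in Cloud Storage.
tuning_job = sft.train(
    source_model="gemini-1.0-pro-002",
    train_dataset="gs://my-bucket/gemini_tuning_examples.jsonl",
)

# The job runs asynchronously; when it finishes, it produces a tuned model that
# you can call in the same way as the base model.
print(tuning_job.resource_name)
```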

PaLM

Supervised tuning

Supervised tuning for PaLM models improves the performance of the model by teaching it a new skill. Data that contains hundreds of labeled examples is used to teach the model to mimic a desired behavior or task. Each labeled example demonstrates what you want the model to output during inference.

When you run a supervised tuning job, the model learns additional parameters that help it encode the necessary information to perform the desired task or learn the desired behavior. These parameters are used during inference. The output of the tuning job is a new model that combines the newly learned parameters with the original model.

Supervised tuning of a text model is a good option when the output of your model isn't complex and is relatively easy to define. Supervised tuning is recommended for classification, sentiment analysis, entity extraction, summarization of content that's not complex, and writing domain-specific queries. For code models, supervised tuning is the only option.
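For PaLM text models, the language_models surface of the Vertex AI SDK for Python exposed a tune_model method. The snippet below is a sketch under that assumption; the project ID, dataset path, locations, and step count are placeholders, and the exact signature depends on your SDK version.

```python
import vertexai
from vertexai.language_models import TextGenerationModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="my-project", location="us-central1")

# Load the base PaLM text model and launch a supervised tuning job on a JSONL
# dataset of input_text/output_text pairs stored in Cloud Storage.
model = TextGenerationModel.from_pretrained("text-bison@002")
model.tune_model(
    training_data="gs://my-bucket/palm_tuning_examples.jsonl",
    train_steps=100,
    tuning_job_location="europe-west4",
    tuned_model_location="us-central1",
)
```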

Reinforcement learning from human feedback (RLHF) tuning

Reinforcement learning from human feedback (RLHF) for PaLM models uses preferences specified by humans to optimize a language model. By using human feedback to tune your models, you can make the models better align with human preferences and reduce undesired outcomes in scenarios where people have complex intuitions about a task. For example, RLHF can help with an ambiguous task, such as how to write a poem about the ocean, by offering a human two poems about the ocean and letting that person choose their preferred one.

RLHF tuning is a good option when the output of your model is complex and isn't easily achieved with supervised tuning. RLHF tuning is recommended for question answering, summarization of complex content, and content creation, such as a rewrite. RLHF tuning isn't supported by code models.
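The human preferences used for RLHF tuning are typically collected as records that pair a prompt with two candidate responses and the rater's choice. The sketch below writes such records as JSON Lines; the field names are assumptions for illustration, so check the Vertex AI RLHF documentation for the exact schema.

```python
import json

# Illustrative human-preference records: a prompt, two candidate responses,
# and the index of the candidate the human rater preferred.
preference_examples = [
    {
        "input_text": "Write a short poem about the ocean.",
        "candidate_0": "Waves fold and unfold, a slow grey breathing under the gulls.",
        "candidate_1": "The ocean is big and has a lot of water in it.",
        "choice": 0,  # the rater preferred candidate_0
    },
]

with open("preference_examples.jsonl", "w") as f:
    for example in preference_examples:
        f.write(json.dumps(example) + "\n")
```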

Model distillation

Model distillation for PaLM models is a good option if you have a large model that you want to make smaller without degrading its ability to do what you want. The process of distilling a model creates a new, smaller trained model that costs less to use and has lower latency than the original model.
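Vertex AI runs distillation as a managed workflow, but the underlying idea is that a smaller student model is trained to match the output distribution of the larger teacher in addition to the ground-truth labels. The generic loss below only illustrates that idea; it is not the Vertex AI distillation pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation loss: a weighted mix of the usual
    cross-entropy on the labels and a KL term that pulls the student's
    softened output distribution toward the teacher's."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1 - alpha) * kl
```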

LoRA and QLoRA recommendations for LLMs

You can also use Low-Rank Adaptation of Large Language Models (LoRA) to tune LLMs on Vertex AI.

This section provides recommendations for using LoRA and its more memory-efficient version, QLoRA.

LoRA tuning recommendations

The following table summarizes recommendations for tuning LLMs by using LoRA or QLoRA:

| Specification | Recommended | Details |
| --- | --- | --- |
| GPU memory efficiency | QLoRA | QLoRA has about 75% smaller peak GPU memory usage compared to LoRA. |
| Speed | LoRA | LoRA is about 66% faster than QLoRA in terms of tuning speed. |
| Cost efficiency | LoRA | While both methods are relatively inexpensive, LoRA is up to 40% less expensive than QLoRA. |
| Higher max sequence length | QLoRA | A higher max sequence length increases GPU memory consumption. Because QLoRA uses less GPU memory, it can support higher max sequence lengths. |
| Accuracy improvement | Same | Both methods offer similar accuracy improvements. |
| Higher batch size | QLoRA | QLoRA supports much higher batch sizes; see the batch size recommendations for tuning openLLaMA-7B that follow this table. |

Batch size recommendations for tuning openLLaMA-7B on the following GPUs:

  • 1 x A100 40G: a batch size of 2 is recommended for LoRA, and a batch size of 24 is recommended for QLoRA.
  • 1 x L4: LoRA fails with an out-of-memory (OOM) error even at a batch size of 1; a batch size of 12 is recommended for QLoRA.
  • 1 x V100: LoRA fails with an out-of-memory (OOM) error even at a batch size of 1; a batch size of 8 is recommended for QLoRA.
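On Vertex AI these methods are configured through the managed tuning pipelines, but the sketch below shows what a LoRA setup and a QLoRA setup typically look like with the open-source Hugging Face transformers, peft, and bitsandbytes libraries. The model checkpoint and hyperparameters are illustrative; the key difference is that QLoRA quantizes the frozen base weights to 4 bits before attaching the same low-rank adapters, which is where the memory savings come from.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Illustrative public checkpoint; substitute the model you want to tune.
model_id = "openlm-research/open_llama_7b"

# LoRA: load the base model in half precision and attach low-rank adapters.
lora_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# QLoRA: same adapters, but the frozen base weights are quantized to 4 bits.
qlora_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

# The adapter configuration is identical in both cases; only the base model
# loading differs. In practice you would pick one setup per tuning run.
adapter_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(lora_model, adapter_config)
qlora_model = get_peft_model(qlora_model, adapter_config)
```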

What's next