Tune and distill PaLM models

This page gives you an overview of tuning text and chat models and distilling text models. You learn about the types of tuning available and how distilling works. You also learn about the benefits of tuning and distillation, and scenarios for when you might want to tune or distill a text model.

Tune models

You can choose one of the following methods to tune a text model:

  • Supervised tuning - The text generation and text chat models support supervised tuning. Supervised tuning of a text model is a good option when the output of your model isn't complex and is relatively easy to define. Supervised tuning is recommended for classification, sentiment analysis, entity extraction, summarization of content that's not complex, and writing domain-specific queries. For code models, supervised tuning is the only option. To learn how to tune a text model with supervised tuning, see Tune text models with supervised tuning. For a minimal example of what a supervised tuning call can look like, see the sketch after this list.

  • Reinforcement learning from human feedback (RLHF) tuning - The text generation foundation model and some Flan text-to-text transfer transformer (Flan-T5) models support RLHF tuning. RLHF tuning is a good option when the output of your model is complex. RLHF works well on models with sequence-level objectives that aren't easily differentiated with supervised tuning. RLHF tuning is recommended for question answering, summarization of complex content, and content creation, such as a rewrite. To learn how to tune a text model with RLHF tuning, see Tune text models with RLHF tuning.
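
The following is a minimal sketch of a supervised tuning call using the Vertex AI SDK for Python. The project ID, bucket path, and step count are placeholders; check the SDK reference for the full parameter list.

```python
# Minimal supervised tuning sketch using the Vertex AI SDK for Python.
# PROJECT_ID and the Cloud Storage path are placeholders.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="PROJECT_ID", location="us-central1")

model = TextGenerationModel.from_pretrained("text-bison@002")

# training_data points to a JSONL file of
# {"input_text": ..., "output_text": ...} examples.
model.tune_model(
    training_data="gs://BUCKET_NAME/tuning_data.jsonl",
    train_steps=100,
    tuning_job_location="europe-west4",
    tuned_model_location="us-central1",
)
```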

Benefits of text model tuning

Tuned text models are trained on more examples than can fit in a prompt. Because of this, after a pretrained model is tuned, you can provide fewer examples in the prompt than you would with the original pretrained model. Requiring fewer examples has the following benefits (the example after this list shows the difference):

  • Lower latency in requests.
  • Fewer tokens used per request.
  • Lower cost of inference, because latency and token usage both drop.
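
To make the token savings concrete, the following comparison is illustrative (invented prompts, not measured token counts): a base model needs the labeled examples in every request, while a tuned model has already learned them.

```python
# Illustrative prompts only; the examples and savings shown are hypothetical.

# Base model: few-shot examples travel with every request.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.
Review: "The battery died within a day." Sentiment: negative
Review: "Setup took thirty seconds." Sentiment: positive
Review: "The screen scratches far too easily." Sentiment: negative
Review: "Great value for the price." Sentiment:"""

# Tuned model: the examples were learned during tuning, so the prompt shrinks.
tuned_prompt = 'Review: "Great value for the price." Sentiment:'
```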

Model distillation

In addition to supervised and RLHF tuning, Vertex AI supports model distillation. Distillation is the process of training a smaller student model to mimic the behavior of a larger teacher model, giving you comparable behavior from a model with a much smaller footprint.

There are multiple types of model distillation, including:

  • Response-based: Train the student model on the response probabilities of the teacher model, as in the sketch after this list.
  • Feature-based: Train the student model to mimic the inner layers of the teacher model.
  • Relation-based: Train the student model on relationships in the input or output data of the teacher model.
  • Self-distillation: The teacher and student models have the same architecture and the model teaches itself.
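
As an illustration of the response-based variant, here is a minimal sketch of a standard distillation loss in PyTorch. This is the generic textbook formulation, not the code Vertex AI runs: the student's softened output distribution is pushed toward the teacher's.

```python
# Generic response-based distillation loss; not the Vertex AI implementation.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```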

Benefits of distilling step-by-step

The benefits of distilling step-by-step include:

  • Improved accuracy: Distilling step-by-step has been shown to outperform standard few-shot prompting on LLMs.
  • Comparable quality from a smaller model: A distilled LLM can achieve results on your specific end tasks that are similar to results from much larger LLMs.
  • Fewer data constraints: You can use distilling step-by-step with an unlabeled prompt dataset of only a few thousand examples.
  • Smaller hosting footprint.
  • Reduced inference latency.

Distilling step-by-step using Vertex AI

Vertex AI supports a form of response-based distillation called distilling step-by-step (DSS). DSS is a method for training smaller, task-specific models through chain-of-thought (CoT) prompting.

To use DSS, you need a small training dataset that consists of inputs and labels. If labels aren't available, the teacher model generates them. The DSS process extracts rationales from the teacher model, then uses them to train the small model on both a rationale-generation task and the usual prediction task. This lets the small model build intermediate reasoning before it reaches its final prediction.
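
To illustrate the two-task setup, here is a hypothetical sketch of how one teacher output could become two training examples. The task prefixes and field names are assumptions made for illustration, not the schema Vertex AI uses.

```python
# Hypothetical DSS-style example construction; prefixes and field names
# are illustrative, not the Vertex AI training schema.
def build_dss_examples(question: str, rationale: str, label: str) -> list[dict]:
    return [
        # Rationale-generation task: learn to produce intermediate reasoning.
        {"input_text": f"[rationale] {question}", "output_text": rationale},
        # Prediction task: learn to produce the final answer directly.
        {"input_text": f"[label] {question}", "output_text": label},
    ]

examples = build_dss_examples(
    question="A pack holds 12 pencils. How many pencils are in 3 packs?",
    rationale="Each pack holds 12 pencils, so 3 packs hold 3 * 12 = 36 pencils.",
    label="36",
)
```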

The following diagram shows how distilling step-by-step uses CoT prompting to extract rationales from a large language model (LLM). The rationales are then used to train smaller, task-specific models.

Distilling step-by-step (DSS) process diagram.
Source: Google Research.

Quota

Each Google Cloud project requires enough quota to run one tuning job, and one tuning job uses 8 GPUs. If your project doesn't have enough quota for one tuning job, or if you want to run multiple concurrent tuning jobs in your project, you need to request additional quota.

The following table shows the type and amount of quota to request, depending on the region that you specified for tuning:

Region | Resource quota | Amount per concurrent job
------ | -------------- | -------------------------
us-central1 | Restricted image training Nvidia A100 80GB GPUs per region | 8
us-central1 | Restricted image training CPUs for A2 CPU types per region | 96
europe-west4 | Restricted image training TPU V3 pod cores per region | 64

Pricing

When you tune or distill a foundation model, you pay the cost of running the tuning or distillation pipeline. When you deploy a tuned or distilled foundation model to a Vertex AI endpoint, you aren't charged for hosting. For serving predictions, you pay the same price as for the untuned foundation model (for tuned models) or for the student model (for distilled models). To learn which foundation models can be tuned and distilled, see Foundation models. For pricing details, see Pricing for Generative AI on Vertex AI.

What's next