Create distilled text models

Distilling step by step uses a large teacher model to train a smaller student model to perform certain tasks better, with improved reasoning capabilities. The trained, distilled model can do the same things you care about in the larger teacher model at a lower cost and with lower latency.

When you distill a foundation model, you use a teacher model and a student model:

  • The teacher model is the large model that can do what you want. However, because of its size, the teacher model might cost more to use and have more latency than a smaller model.

  • The student model is smaller than the teacher model. The training and distilling process uses labeled examples and rationales generated by the teacher model to tune the student model. The performance and reasoning capabilities of the resulting distilled model are better than those of the original student model.

You specify a teacher model and a student model when you create a distillation job.

Workflow for tuning and distilling a model

The distilling workflow on Vertex AI includes the following steps:

  1. Prepare your model tuning dataset.
  2. Specify the teacher model.
  3. Specify the student model.
  4. Upload the model tuning dataset to a Cloud Storage bucket.
  5. Create a model distilling job.

After model distillation completes, the distilled model is deployed to a Vertex AI endpoint. The name of the endpoint is the same as the name of the distilled model. Distilled models are available to select in Vertex AI Studio when you want to create a new prompt.

Supported models

You can specify the following for the teacher model:

  • text-unicorn@001

You can specify the following for the student model:

  • text-bison@002

Dataset format

Distillation works on a labeled or an unlabeled dataset. If you have a high-quality labeled dataset with hundreds of examples, then we recommend that you use that. Otherwise, you can use an unlabeled prompt dataset. If you use an unlabeled dataset, then the teacher model generates the labels and the rationale for distillation. We recommend more than 1,000 examples if you use an unlabeled dataset.

The labeled or unlabeled distillation dataset must be in JSON Lines (JSONL) format where each line contains a single tuning example. Before you distill your model, you upload your dataset to a Cloud Storage bucket.

Each dataset example contains an input_text field with the model prompt and an optional output_text field that contains an example response that the distilled model is expected to produce.

The maximum token length for input_text is 7,168 and the maximum token length for output_text is 1,024. If either field exceeds the maximum token length, the excess tokens are truncated.

The maximum number of examples that a dataset for a text generation model can contain is 10,000.
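The format rules above can be checked locally before you upload a dataset. The following is a minimal sketch, not an official validator: the function name validate_distillation_dataset is hypothetical, and token limits are not checked because that would require the model tokenizer.

```python
import json

MAX_EXAMPLES = 10_000  # dataset limit stated above for text generation models


def validate_distillation_dataset(lines):
    """Validate JSONL lines and return (labeled_count, unlabeled_count).

    Hypothetical helper: raises ValueError on the first malformed example.
    Token lengths are not checked here.
    """
    labeled = 0
    unlabeled = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines between examples
        try:
            example = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number}: not valid JSON ({err})") from err
        if not isinstance(example, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        prompt = example.get("input_text")
        if not isinstance(prompt, str) or not prompt:
            raise ValueError(f"line {number}: input_text is required")
        if "output_text" in example:
            if not isinstance(example["output_text"], str):
                raise ValueError(f"line {number}: output_text must be a string")
            labeled += 1
        else:
            unlabeled += 1
    if labeled + unlabeled > MAX_EXAMPLES:
        raise ValueError(f"dataset has more than {MAX_EXAMPLES} examples")
    return labeled, unlabeled
```

To check a local file, pass the open file object, for example validate_distillation_dataset(open("train.jsonl", encoding="utf-8")).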

Dataset example

{"input_text": "question: How many people live in Beijing? context: With over 21 million residents, Beijing is the world's most populous national capital city and is China's second largest city after Shanghai. It is located in Northern China, and is governed as a municipality under the direct administration of the State Council with 16 urban, suburban, and rural districts. Beijing is mostly surrounded by Hebei Province with the exception of neighboring Tianjin to the southeast; together, the three divisions form the Jingjinji megalopolis and the national capital region of China.", "output_text": "over 21 million people"}

{"input_text": "question: How many parishes are there in Louisiana? context: The U.S. state of Louisiana is divided into 64 parishes (French: paroisses) in the same manner that 48 other states of the United States are divided into counties, and Alaska is divided into boroughs.", "output_text": "64"}

Include instructions in examples

For tasks such as classification, it is possible to create a dataset of examples that don't contain instructions. However, excluding instructions from the examples in the dataset leads to worse performance after distillation than including instructions, especially for smaller datasets.
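As an illustration, a classification example that includes instructions could be generated with a small helper. This is only a sketch: the function name with_instructions is hypothetical, and the instruction wording is one possible phrasing for this kind of task.

```python
import json


def with_instructions(text, label, classes):
    """Return one JSONL line whose prompt states the task and the classes.

    Hypothetical helper for illustration only.
    """
    prompt = (
        "Classify the following text into one of the following classes: "
        f"[{', '.join(classes)}] Text: {text}"
    )
    # json.dumps keeps the whole example on a single line, as JSONL requires.
    return json.dumps({"input_text": prompt, "output_text": label})


line = with_instructions("5 stocks to buy now", "business", ["business", "entertainment"])
print(line)
```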

Excludes instructions:

{"input_text": "5 stocks to buy now", "output_text": "business"}

Includes instructions:

{"input_text": "Classify the following text into one of the following classes: [business, entertainment] Text: 5 stocks to buy now", "output_text": "business"}

Sample datasets

You can use a sample dataset to get started with distilling. The following is a classification task dataset that contains sample medical transcriptions for various medical specialties. The data is from mtsamples.com as made available on Kaggle.

  • Sample distillation dataset URI:

    gs://cloud-samples-data/vertex-ai/model-evaluation/peft_train_sample.jsonl

  • Sample eval dataset URI:

    gs://cloud-samples-data/vertex-ai/model-evaluation/peft_eval_sample.jsonl

To use these datasets, specify the URIs in the applicable parameters when creating a text model distillation job.

For example:

...
"dataset_uri": "gs://cloud-samples-data/vertex-ai/model-evaluation/peft_train_sample.jsonl",
...
"evaluation_data_uri": "gs://cloud-samples-data/vertex-ai/model-evaluation/peft_eval_sample.jsonl",
...

Maintain consistency with production data

The examples in your datasets should match your expected production traffic. If your dataset contains specific formatting, keywords, instructions, or information, the production data should be formatted in the same way and contain the same instructions.

For example, if the examples in your dataset include a "question:" and a "context:", production traffic should also be formatted to include a "question:" and a "context:" in the same order as it appears in the dataset examples. If you exclude the context, the model will not recognize the pattern, even if the exact question was in an example in the dataset.
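One way to keep the two formats aligned is to build dataset examples and production prompts with the same function. The following sketch assumes the "question:" and "context:" layout described above; the names format_qa_prompt and training_example are hypothetical.

```python
import json


def format_qa_prompt(question, context):
    """Build a prompt in the same layout for the dataset and for production."""
    if not question or not context:
        # Omitting the context would break the pattern the model was tuned on.
        raise ValueError("both question and context are required")
    return f"question: {question} context: {context}"


def training_example(question, context, answer):
    """Return one JSONL line for the distillation dataset."""
    return json.dumps(
        {"input_text": format_qa_prompt(question, context), "output_text": answer}
    )


# The same function formats a production request.
production_prompt = format_qa_prompt(
    "How many parishes are there in Louisiana?",
    "The U.S. state of Louisiana is divided into 64 parishes.",
)
```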

Upload distilling datasets to Cloud Storage

To run a tuning job, you need to upload one or more datasets to a Cloud Storage bucket. You can either create a new Cloud Storage bucket or use an existing one to store dataset files. The region of the bucket doesn't matter, but we recommend that you use a bucket that's in the same Google Cloud project where you plan to tune your model.

After your bucket is ready, upload your dataset file to the bucket.
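If you prefer to upload the file from code, the following sketch uses the google-cloud-storage client library. It assumes that the library is installed and that Application Default Credentials are configured; the helper names split_gcs_uri and upload_dataset are hypothetical.

```python
def split_gcs_uri(uri):
    """Split gs://BUCKET/OBJECT into (bucket, object_name)."""
    if not uri.startswith("gs://"):
        raise ValueError("Cloud Storage URIs must start with gs://")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    if not bucket or not blob:
        raise ValueError("expected gs://BUCKET/OBJECT")
    return bucket, blob


def upload_dataset(local_path, gcs_uri, client=None):
    """Upload a local JSONL file to the given Cloud Storage URI."""
    if client is None:
        # Imported lazily so that the URI helper works without the library.
        from google.cloud import storage

        client = storage.Client()
    bucket_name, blob_name = split_gcs_uri(gcs_uri)
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
    return gcs_uri
```

The returned URI is the value that you later pass as the dataset URI of the distillation job.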

Distilling region settings

You can specify two Google Cloud region settings when you configure a distillation job. One region is where the pipeline that tunes your model runs. The other region is where the model tuning portion of the distillation process runs and the distilled model is uploaded.

Pipeline job region

The pipeline job region is the region where the pipeline job runs. If the optional model upload region isn't specified, then the model is uploaded and deployed to the pipeline job region. Intermediate data, such as the transformed dataset, is stored in the pipeline job region. To learn which regions you can use for the pipeline job region, see Supported pipeline job and model upload regions.

You must specify the pipeline job region using one of the following methods.

  • If you create a distillation job by sending a POST request using the pipelineJobs.create method, then you use the URL to specify the region where the pipeline job runs. In the following URL, replace both instances of PIPELINE_JOB_REGION with the region where the pipeline runs:

     https://PIPELINE_JOB_REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/PIPELINE_JOB_REGION/pipelineJobs
    
  • If you use the Google Cloud console to create a distillation job, then you specify the pipeline job region in the Region control when you create your distillation job. In the Google Cloud console, the Region control specifies both the pipeline job region and the model upload region. When you use the Google Cloud console to create a distillation job, both regions are always the same.

Model upload region

You use the optional tuned_model_location parameter to specify where your distilled model is uploaded. If the model upload region isn't specified, then the distilled model is uploaded to the pipeline job region. You can use one of the Supported pipeline job and model upload regions for your model upload region. You can specify the model upload region using one of the following methods:

  • If you create a distillation job by sending a POST request using the pipelineJobs method, then you can use the location parameter to specify the model upload region.

  • If you use the Google Cloud console to create a distillation job, then you specify the model upload region in the Region control when you create your distillation job. In the Google Cloud console, the Region control specifies both the model upload region and the pipeline job region. When you use the Google Cloud console to create a distillation job, both regions are always the same.

Tuning region

The region you choose is where Vertex AI distills the model and then uploads the distilled model.

The tuning region is where the computation for the tuning portion of the distillation job occurs. This region is determined by the accelerator type you choose.

  • us-central1 - If you choose this region, then 8 Nvidia A100 80GB GPUs are used.
  • europe-west4 - If you choose this region, then 64 cores of the TPU v3 pod are used.
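The relationship between accelerator type and tuning region can be written as a small lookup table. This sketch only restates the two options listed above; the names ACCELERATORS and tuning_region are hypothetical.

```python
# Accelerator type -> where the tuning computation runs and on what hardware.
ACCELERATORS = {
    "GPU": {"tuning_region": "us-central1", "hardware": "8 Nvidia A100 80GB GPUs"},
    "TPU": {"tuning_region": "europe-west4", "hardware": "64 cores of the TPU v3 pod"},
}


def tuning_region(accelerator_type="GPU"):
    """Return the region where tuning runs for the given accelerator type."""
    try:
        return ACCELERATORS[accelerator_type]["tuning_region"]
    except KeyError:
        raise ValueError(f"unknown accelerator type: {accelerator_type!r}") from None
```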

Supported pipeline job and model upload regions

You can use one of the following regions to specify the model upload region and to specify the pipeline job region:

  • us-central1
  • europe-west4
  • asia-southeast1
  • us-west1
  • europe-west3
  • europe-west2
  • asia-northeast1
  • us-east4
  • us-west4
  • northamerica-northeast1
  • europe-west9
  • europe-west1
  • asia-northeast3
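When you build the request URL in code, you can check the region against this list first. The following is a minimal sketch; the names SUPPORTED_REGIONS and pipeline_jobs_url are hypothetical, and the set simply copies the list above, so it can go out of date.

```python
# Copied from the list of supported regions above.
SUPPORTED_REGIONS = frozenset({
    "us-central1", "europe-west4", "asia-southeast1", "us-west1",
    "europe-west3", "europe-west2", "asia-northeast1", "us-east4",
    "us-west4", "northamerica-northeast1", "europe-west9", "europe-west1",
    "asia-northeast3",
})


def pipeline_jobs_url(project_id, pipeline_job_region):
    """Build the pipelineJobs URL; the region appears in the host and the path."""
    if pipeline_job_region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {pipeline_job_region!r}")
    return (
        f"https://{pipeline_job_region}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{pipeline_job_region}/pipelineJobs"
    )
```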

Create a text model distilling job

You can create a text model distilling job by using the Google Cloud console or the API. For guidance on model distilling configurations, see Recommended configurations.

REST

To create a model distillation job, send a POST request by using the pipelineJobs method. Note that some of the parameters are not supported by all of the models. Ensure that you only include the applicable parameters for the model that you're distilling.

Before using any of the request data, make the following replacements:

  • PIPELINEJOB_DISPLAYNAME: A display name for the pipelineJob.
  • OUTPUT_DIR: The URI of the bucket to output pipeline artifacts to.
  • PROJECT_ID: Your project ID.
  • MODEL_DISPLAYNAME: A display name for the distilled model uploaded by the pipelineJob.
  • DATASET_URI: URI of your dataset file.
  • PIPELINE_JOB_REGION: The region where the pipeline tuning job runs. This is also the default region for where the tuned model is uploaded. If you want to upload your model to a different region, then use the location parameter to specify the tuned model upload region. For more information, see Model upload region.
  • MODEL_UPLOAD_REGION: (optional) The region where the tuned model is uploaded. If you don't specify a model upload region, then the tuned model uploads to the same region where the pipeline job runs. For more information, see Model upload region.
  • ACCELERATOR_TYPE: (optional, default GPU) The type of accelerator to use for model tuning. The valid options are:
    • GPU: Uses eight A100 80 GB GPUs for tuning. Make sure you have enough quota. If you choose GPU, then VPC‑SC is supported. CMEK is supported if the tuning location and model upload location are us-central1. For more information, see Supervised tuning region settings. If you choose GPU, then your model tuning computations happen in the us-central1 region.
    • TPU: Uses 64 cores of the TPU v3 pod for tuning. Make sure you have enough quota. CMEK isn't supported, but VPC‑SC is supported. If you choose TPU, then your model tuning computations happen in the europe-west4 region.
  • TEACHER_MODEL_REFERENCE: Name of the teacher model to use for distilling. The supported model is text-unicorn@001.
  • STUDENT_MODEL_REFERENCE: Name of the student model to use for distilling. The supported model is text-bison@002.
  • STEPS: The number of steps to run for model tuning. The default value is 300. The batch size varies by tuning location and model size. For 8k models, such as text-bison@002, chat-bison@002, code-bison@002, and codechat-bison@002:
    • us-central1 has a batch size of 8.
    • europe-west4 has a batch size of 24.
    For 32k models, such as text-bison-32k, chat-bison-32k, code-bison-32k, and codechat-bison-32k:
    • us-central1 has a batch size of 8.
    • europe-west4 has a batch size of 8.

    For example, if you're training text-bison@002 in europe-west4, there are 240 examples in a training dataset, and you set steps to 20, then the number of training examples processed is the product of 20 steps and the batch size of 24, or 480 training examples. In this case, there are two epochs in the training process because it goes through the examples two times. In us-central1, if there are 240 examples in a training dataset and you set steps to 15, then the number of training examples processed is the product of 15 steps and the batch size of 8, or 120 training examples. In this case, there are 0.5 epochs because the process goes through only half of the examples.

  • LEARNING_RATE_MULTIPLIER: A multiplier to apply to the recommended learning rate. To use the recommended learning rate, use 1.0.
  • EVAL_DATASET_URI: (optional) The URI of the JSONL file that contains the evaluation dataset for batch prediction and evaluation. Evaluation isn't supported for chat-bison. For more information, see Dataset format for tuning a code model. The evaluation dataset requires between ten and 250 examples.
  • EVAL_INTERVAL: (optional, default 20) The number of tuning steps between each evaluation. An evaluation interval isn't supported for chat models. Because the evaluation runs on the entire evaluation dataset, a smaller evaluation interval results in a longer tuning time. For example, if steps is 200 and EVAL_INTERVAL is 100, then you will get only two data points for the evaluation metrics. This parameter requires that the evaluation_data_uri is set.
  • ENABLE_EARLY_STOPPING: (optional, default true) A boolean that, if set to true, stops tuning before completing all the tuning steps if model performance, as measured by the accuracy of predicted tokens, does not improve enough between evaluation runs. If false, tuning continues until all the tuning steps are complete. This parameter requires that the evaluation_data_uri is set. Enable early stopping isn't supported for chat models.
  • TENSORBOARD_RESOURCE_ID: (optional) The ID of a Vertex AI TensorBoard instance. The Vertex AI TensorBoard instance is used to create an experiment after the tuning job completes. The Vertex AI TensorBoard instance needs to be in the same region as the tuning pipeline.
  • ENCRYPTION_KEY_NAME: (optional) The fully qualified name of a customer-managed encryption key (CMEK) that you want to use for data encryption. A CMEK is available only in us-central1. If you use us-central1 and don't specify a CMEK, then a Google-managed encryption key is used. A Google-managed encryption key is used by default in all other available regions. For more information, see CMEK overview.
  • TEMPLATE_URI: The URI for the distilling template, https://us-kfp.pkg.dev/ml-pipeline/distillation/distillation/v1.0.0.
  • SERVICE_ACCOUNT: (optional) The service account that Vertex AI uses to run your pipeline job. By default, your project's Compute Engine default service account (PROJECT_NUMBER‑compute@developer.gserviceaccount.com) is used. Learn more about attaching a custom service account.
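The arithmetic behind STEPS and EVAL_INTERVAL can be sketched in a few lines. The batch sizes are the ones listed above; the names BATCH_SIZES, epochs, and evaluation_points are hypothetical.

```python
# (context length, tuning region) -> batch size, as listed for STEPS above.
BATCH_SIZES = {
    ("8k", "us-central1"): 8,
    ("8k", "europe-west4"): 24,
    ("32k", "us-central1"): 8,
    ("32k", "europe-west4"): 8,
}


def epochs(steps, batch_size, dataset_size):
    """Number of passes over the dataset: examples processed / dataset size."""
    return steps * batch_size / dataset_size


def evaluation_points(steps, eval_interval):
    """Number of evaluation data points collected during tuning."""
    return steps // eval_interval


# text-bison@002 in europe-west4: 20 steps * 24 = 480 examples, two epochs.
print(epochs(20, BATCH_SIZES[("8k", "europe-west4")], 240))
# text-bison@002 in us-central1: 15 steps * 8 = 120 examples, half an epoch.
print(epochs(15, BATCH_SIZES[("8k", "us-central1")], 240))
# 200 steps with an evaluation every 100 steps gives two data points.
print(evaluation_points(200, 100))
```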

HTTP method and URL:

POST https://PIPELINE_JOB_REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/PIPELINE_JOB_REGION/pipelineJobs

Request JSON body:

{
  "displayName": "PIPELINEJOB_DISPLAYNAME",
  "runtimeConfig": {
    "gcsOutputDirectory": "gs://OUTPUT_DIR",
    "parameterValues": {
      "project": "PROJECT_ID",
      "model_display_name": "MODEL_DISPLAYNAME",
      "dataset_uri": "gs://DATASET_URI",
      "location": "MODEL_UPLOAD_REGION",
      "accelerator_type": "ACCELERATOR_TYPE",
      "teacher_model_reference": "TEACHER_MODEL_REFERENCE",
      "student_model_reference": "STUDENT_MODEL_REFERENCE",
      "train_steps": STEPS,
      "learning_rate_multiplier": LEARNING_RATE_MULTIPLIER,
      "evaluation_data_uri": "gs://EVAL_DATASET_URI",
      "evaluation_interval": EVAL_INTERVAL,
      "enable_early_stopping": ENABLE_EARLY_STOPPING,
      "enable_checkpoint_selection": "ENABLE_CHECKPOINT_SELECTION",
      "tensorboard_resource_id": "TENSORBOARD_RESOURCE_ID",
      "encryption_spec_key_name": "ENCRYPTION_KEY_NAME"
    }
  },
  "encryptionSpec": {
    "kmsKeyName": "ENCRYPTION_KEY_NAME"
  },
  "serviceAccount": "SERVICE_ACCOUNT",
  "templateUri": "TEMPLATE_URI"
}
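Because only the applicable parameters should be included, it can be convenient to generate the request body from code and leave out optional values that you don't set. The following sketch uses the parameter names from the request body above; the helper name build_request_body is hypothetical, and which parameters the pipeline template treats as required is an assumption.

```python
import json

# Template URI given for TEMPLATE_URI above.
TEMPLATE_URI = "https://us-kfp.pkg.dev/ml-pipeline/distillation/distillation/v1.0.0"


def build_request_body(
    display_name,
    output_dir,
    project_id,
    model_display_name,
    dataset_uri,
    teacher_model="text-unicorn@001",
    student_model="text-bison@002",
    **optional_parameters,
):
    """Return the request body as a dict, omitting unset optional parameters."""
    parameter_values = {
        "project": project_id,
        "model_display_name": model_display_name,
        "dataset_uri": dataset_uri,
        "teacher_model_reference": teacher_model,
        "student_model_reference": student_model,
    }
    # Keep only the optional parameters that the caller actually provided.
    parameter_values.update(
        {name: value for name, value in optional_parameters.items() if value is not None}
    )
    return {
        "displayName": display_name,
        "runtimeConfig": {
            "gcsOutputDirectory": output_dir,
            "parameterValues": parameter_values,
        },
        "templateUri": TEMPLATE_URI,
    }


body = build_request_body(
    "distill-demo",
    "gs://my-bucket/output",
    "my-project",
    "my-distilled-model",
    "gs://my-bucket/train.jsonl",
    train_steps=300,
    evaluation_data_uri=None,  # not set, so it is left out of the body
)
request_json = json.dumps(body, indent=2)
```

You can write request_json to a file named request.json and send it with either of the commands that follow. The bucket, project, and display names in the usage example are placeholders.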

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://PIPELINE_JOB_REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/PIPELINE_JOB_REGION/pipelineJobs"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://PIPELINE_JOB_REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/PIPELINE_JOB_REGION/pipelineJobs" | Select-Object -Expand Content

If the request succeeds, you receive a JSON response that describes the created pipeline job.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from __future__ import annotations

from typing import Optional

from google.auth import default
import vertexai
from vertexai.preview.language_models import TextGenerationModel, TuningEvaluationSpec

# Use Application Default Credentials with the cloud-platform scope.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])


def distill_model(
    project_id: str,
    location: str,
    dataset: str,
    teacher_model: str,
    train_steps: int = 300,
    evaluation_dataset: Optional[str] = None,
):
    """Distill a new model.

    Args:
      project_id: GCP Project ID, used to initialize vertexai
      location: GCP Region, used to initialize vertexai
      dataset: GCS URI of jsonl file.
      teacher_model: Name of the teacher model.
      train_steps: Number of training steps to use when tuning the model.
      evaluation_dataset: GCS URI of jsonl file of evaluation data.

    Returns:
      The distillation job object returned by distill_from.
    """
    vertexai.init(project=project_id, location=location, credentials=credentials)

    eval_spec = TuningEvaluationSpec(evaluation_data=evaluation_dataset)

    # The student model is the base model that the distillation job tunes.
    student_model = TextGenerationModel.from_pretrained("text-bison@002")
    distillation_job = student_model.distill_from(
        teacher_model=teacher_model,
        dataset=dataset,
        # Optional:
        train_steps=train_steps,
        evaluation_spec=eval_spec,
    )

    return distillation_job

Console

To distill a text model using the Google Cloud console, perform the following steps:

  1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Studio page.

    Go to Vertex AI Studio

  2. Click the Tune and distill tab.
  3. Click Create distilled model.
  4. Configure model details:
    • Model name: Enter a name for your distilled model.
    • Teacher model: Select the model that you want to use for the teacher model.
    • Student model: Select the model that you want to use for the student model.
    • Region: Select the region where the pipeline tuning job runs and where the tuned model is deployed.
    • Working directory: Enter the Cloud Storage location where artifacts are stored when your model is tuned.
  5. Expand Advanced Options to configure advanced settings.
    • Train steps: Enter the number of steps to run for model tuning. The default value is 300. The batch size varies by tuning location and model size. For 8k models, such as text-bison@002, chat-bison@002, code-bison@002, and codechat-bison@002:
      • us-central1 has a batch size of 8.
      • europe-west4 has a batch size of 24.
      For 32k models, such as text-bison-32k, chat-bison-32k, code-bison-32k, and codechat-bison-32k:
      • us-central1 has a batch size of 8.
      • europe-west4 has a batch size of 8.

      For example, if you're training text-bison@002 in europe-west4, there are 240 examples in a training dataset, and you set steps to 20, then the number of training examples processed is the product of 20 steps and the batch size of 24, or 480 training examples. In this case, there are two epochs in the training process because it goes through the examples two times. In us-central1, if there are 240 examples in a training dataset and you set steps to 15, then the number of training examples processed is the product of 15 steps and the batch size of 8, or 120 training examples. In this case, there are 0.5 epochs because the process goes through only half of the examples.

    • Learning rate multiplier: Enter the step size at each iteration. The default value is 1.
    • Accelerator type: (optional) Enter the type of accelerator to use for model tuning. The valid options are:
      • GPU: Uses eight A100 80 GB GPUs for tuning. Make sure you have enough quota. If you choose GPU, then VPC‑SC is supported. CMEK is supported if the tuning location and model upload location are us-central1. For more information, see Supervised tuning region settings. If you choose GPU, then your model tuning computations happen in the us-central1 region.
      • TPU: Uses 64 cores of the TPU v3 pod for tuning. Make sure you have enough quota. CMEK isn't supported, but VPC‑SC is supported. If you choose TPU, then your model tuning computations happen in the europe-west4 region.
    • Add a TensorBoard instance: (optional) The ID of a Vertex AI TensorBoard instance. The Vertex AI TensorBoard instance is used to create an experiment after the tuning job completes. The Vertex AI TensorBoard instance needs to be in the same region as the tuning pipeline.
    • Encryption: (optional) Choose to use a Google-managed encryption key or a customer-managed encryption key (CMEK). A CMEK is available for encryption only in the us-central1 region. In all other available regions, a Google-managed encryption key is used. For more information, see CMEK overview.
    • Service account: (optional) Choose a user-managed service account. A service account determines which Google Cloud resources your service code can access. If you don't choose a service account, then a Google-managed service account is used that includes permissions appropriate for most models.
  6. Click Continue.
  7. If you want to upload your distillation dataset file, select Upload JSONL file to Cloud Storage. If your dataset file is already in a Cloud Storage bucket, select Existing JSONL file on Cloud Storage.

    Upload a JSONL file

    • In Select JSONL file, click Browse and select your dataset file.
    • In Dataset location, click Browse and select the Cloud Storage bucket where you want to store your dataset file.

    Use an existing JSONL file

    In Cloud Storage file path, click Browse and select the Cloud Storage bucket where your dataset file is located.

  8. (Optional) To evaluate your distilled model, select Enable model evaluation and configure your model evaluation:
    • Evaluation dataset: (optional) The URI of the JSONL file that contains the evaluation dataset for batch prediction and evaluation. Evaluation isn't supported for codechat-bison. For more information, see Dataset format for tuning a code model. The evaluation dataset requires between ten and 250 examples.
    • Evaluation interval: (optional, default 20) The number of tuning steps between each evaluation. An evaluation interval isn't supported for chat models. Because the evaluation runs on the entire evaluation dataset, a smaller evaluation interval results in a longer tuning time. For example, if steps is 200 and EVAL_INTERVAL is 100, then you will get only two data points for the evaluation metrics. This parameter requires that the evaluation_data_uri is set.
    • Enable early stopping: (optional, default true) A boolean that, if set to true, stops tuning before completing all the tuning steps if model performance, as measured by the accuracy of predicted tokens, does not improve enough between evaluation runs. If false, tuning continues until all the tuning steps are complete. This parameter requires that the evaluation_data_uri is set. Enable early stopping isn't supported for chat models.
    • Enable checkpoint selection: When enabled, Vertex AI selects and returns the checkpoint with the best model evaluation performance from all checkpoints created during the tuning job. When disabled, the final checkpoint created during the tuning job is returned. Each checkpoint refers to a snapshot of the model during a tuning job.
    • TensorBoard instance: (optional) The ID of a Vertex AI TensorBoard instance. The Vertex AI TensorBoard instance is used to create an experiment after the tuning job completes. The Vertex AI TensorBoard instance needs to be in the same region as the tuning pipeline.
  9. Click Start distillation.

Recommended configurations

The following table shows the recommended configurations for distilling a foundation model by task:

Task            No. of examples in dataset  Train steps
Classification  100+                        200-1000
Summarization   100-500+                    1000-1500
Extractive QA   100+                        200-800

For train steps, you can try more than one value to get the best performance on a particular dataset, for example, 100, 200, 500.
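The recommendations can also be encoded as a simple check that you run before starting a job. This sketch only restates the recommended values by task; the names RECOMMENDED and check_config are hypothetical, and the warnings are advisory rather than limits enforced by Vertex AI.

```python
# Task -> recommended minimum dataset size and train-step range.
RECOMMENDED = {
    "classification": {"min_examples": 100, "train_steps": (200, 1000)},
    "summarization": {"min_examples": 100, "train_steps": (1000, 1500)},
    "extractive_qa": {"min_examples": 100, "train_steps": (200, 800)},
}


def check_config(task, num_examples, train_steps):
    """Return a list of warnings for values outside the recommended ranges."""
    recommended = RECOMMENDED[task]
    low, high = recommended["train_steps"]
    warnings = []
    if num_examples < recommended["min_examples"]:
        warnings.append(
            f"{num_examples} examples is below the recommended "
            f"{recommended['min_examples']}+"
        )
    if not low <= train_steps <= high:
        warnings.append(
            f"{train_steps} train steps is outside the recommended {low}-{high}"
        )
    return warnings


# Try several step counts for one dataset, as suggested above.
for steps in (100, 200, 500):
    print(steps, check_config("classification", 150, steps))
```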

View a list of distilled models

You can view a list of models in your current project, including your distilled models, by using the Google Cloud console or the Vertex AI SDK for Python.

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


from typing import Sequence

import vertexai
from vertexai.language_models import TextGenerationModel


def list_tuned_models(
    project_id: str,
    location: str,
) -> Sequence[str]:
    """List tuned models."""

    vertexai.init(project=project_id, location=location)
    # Tuned and distilled models are listed through their base model.
    model = TextGenerationModel.from_pretrained("text-bison@002")
    tuned_model_names = model.list_tuned_model_names()
    print(tuned_model_names)

    return tuned_model_names

Console

To view your distilled models in the Google Cloud console, go to the Vertex AI Model Registry page.

Go to Vertex AI Model Registry

Load a distilled text model

The following sample code uses the Vertex AI SDK for Python to load a distilled text generation model:

import vertexai
from vertexai.preview.language_models import TextGenerationModel

model = TextGenerationModel.get_tuned_model(TUNED_MODEL_NAME)

Replace TUNED_MODEL_NAME with the qualified resource name of your distilled model. This name is in the format projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID. You can find the model ID of your distilled model in Vertex AI Model Registry.
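Before you pass a name to get_tuned_model, you can check that it follows the format above. The following sketch uses a regular expression; the function name parse_model_name is hypothetical, and the pattern assumes that no component contains a slash.

```python
import re

# projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID
_MODEL_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/models/(?P<model>[^/]+)$"
)


def parse_model_name(name):
    """Split a qualified model resource name into its three components."""
    match = _MODEL_NAME.match(name)
    if match is None:
        raise ValueError(
            "expected projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID"
        )
    return match.group("project"), match.group("location"), match.group("model")
```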

Tuning and evaluation metrics

You can configure a model tuning job to collect and report model tuning and model evaluation metrics, which can then be visualized by using Vertex AI TensorBoard.

Model tuning metrics

You can configure a model tuning job to collect the following tuning metrics for chat-bison, code-bison, codechat-bison, and text-bison:
  • /train_total_loss: Loss for the tuning dataset at a training step.
  • /train_fraction_of_correct_next_step_preds: The token accuracy at a training step. A single prediction consists of a sequence of tokens. This metric measures the accuracy of the predicted tokens when compared to the ground truth in the tuning dataset.
  • /train_num_predictions: Number of predicted tokens at a training step.

Model evaluation metrics

You can configure a model tuning job to collect the following evaluation metrics for code-bison and text-bison:

  • /eval_total_loss: Loss for the evaluation dataset at an evaluation step.
  • /eval_fraction_of_correct_next_step_preds: The token accuracy at an evaluation step. A single prediction consists of a sequence of tokens. This metric measures the accuracy of the predicted tokens when compared to the ground truth in the evaluation dataset.
  • /eval_num_predictions: Number of predicted tokens at an evaluation step.

The metrics visualizations are available after the model tuning job completes. If you specify only a Vertex AI TensorBoard instance ID and not an evaluation dataset when you create the tuning job, only the visualizations for the tuning metrics are available.
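To make the token accuracy metrics concrete, the following sketch computes the fraction of predicted tokens that match the ground truth for a single sequence. It illustrates the definition given above and is not the implementation that Vertex AI uses; the function name token_accuracy and the integer token IDs are hypothetical.

```python
def token_accuracy(predicted_tokens, truth_tokens):
    """Fraction of positions where the predicted token equals the ground truth."""
    if len(predicted_tokens) != len(truth_tokens):
        raise ValueError("sequences must have the same length")
    if not truth_tokens:
        raise ValueError("sequences must not be empty")
    correct = sum(p == t for p, t in zip(predicted_tokens, truth_tokens))
    return correct / len(truth_tokens)


# Three of the four predicted tokens match the ground truth.
print(token_accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
```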