Text chat

The PaLM 2 for Chat (chat-bison) foundation model is a large language model (LLM) that excels at language understanding, language generation, and conversations. This chat model is fine-tuned to conduct natural multi-turn conversations, and is ideal for text tasks about code that require back-and-forth interactions.

For text tasks that can be completed with one API response (without the need for continuous conversation), use the Text model.

To explore this model in the console, see the PaLM 2 for Chat model card in the Model Garden.
Go to the Model Garden

Use cases

  • Customer Service: Instruct the model to respond as customer service agents that only talk about your company's product

  • Technical Support: Instruct the model to interact with customers as a call center agent with specific parameters about how to respond and what not to say

  • Personas and characters: Instruct the model to respond in the style of a specific person ("...in the style of Shakespeare")

  • Website companion: Create a conversational assistant for shopping, travel, and other use cases

For more information, see Design chat prompts.

HTTP request

POST https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/chat-bison:predict

For more information, see the predict method.

Model versions

To use the latest model version, specify the model name without a version number, for example chat-bison.

To use a stable model version, specify the model version number, for example chat-bison@002. Each stable version is available for six months after the release date of the subsequent stable version.

The following table contains the available stable model versions:

chat-bison model Release date Discontinuation date
chat-bison@002 December 6, 2023 April 9, 2025

For more information, see Model versions and lifecycle.

Request body

{
  "instances": [
    {
      "context":  string,
      "examples": [
        {
          "input": { "content": string },
          "output": { "content": string }
        }
      ],
      "messages": [
        {
          "author": string,
          "content": string,
        }
      ],
    }
  ],
  "parameters": {
    "temperature": number,
    "maxOutputTokens": integer,
    "topP": number,
    "topK": integer,
    "groundingConfig": string,
    "stopSequences": [ string ],
    "candidateCount": integer
    "logprobs": integer,
    "presencePenalty": float,
    "frequencyPenalty": float,
    "seed": integer
  }
}

For chat API calls, the context, examples, and messages combine to form the prompt. The following table shows the parameters that you need to configure for the Vertex AI PaLM API for text:

Parameter Description Acceptable values

context

(optional)

Context shapes how the model responds throughout the conversation. For example, you can use context to specify words the model can or cannot use, topics to focus on or avoid, or the response format or style. Text

examples

(optional)

Examples for the model to learn how to respond to the conversation.
[{
  "input": {"content": "provide content"},
  "output": {"content": "provide content"}
}]

messages

(required)

Conversation history provided to the model in a structured alternate-author form. Messages appear in chronological order: oldest first, newest last. When the history of messages causes the input to exceed the maximum length, the oldest messages are removed until the entire prompt is within the allowed limit.
[{
  "author": "user",
  "content": "user message"
}]

temperature

The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature.

0.0–1.0

Default: 0.0

maxOutputTokens

Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

Specify a lower value for shorter responses and a higher value for potentially longer responses.

1–2048

Default: 1024

topK

Top-K changes how the model selects tokens for output. A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.

For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P with the final token selected using temperature sampling.

Specify a lower value for less random responses and a higher value for more random responses.

1–40

Default: 40

topP

Top-P changes how the model selects tokens for output. Tokens are selected from the most (see top-K) to least probable until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the top-P value is 0.5, then the model will select either A or B as the next token by using temperature and excludes C as a candidate.

Specify a lower value for less random responses and a higher value for more random responses.

0.0–1.0

Default: 0.95

stopSequences

Specifies a list of strings that tells the model to stop generating text if one of the strings is encountered in the response. If a string appears multiple times in the response, then the response truncates where it's first encountered. The strings are case-sensitive.

For example, if the following is the returned response when stopSequences isn't specified:

public static string reverse(string myString)

Then the returned response with stopSequences set to ["Str", "reverse"] is:

public static string

default: []

groundingConfig

Grounding lets you reference specific data when using language models. When you ground a model, the model can reference internal, confidential, and otherwise specific data from your repository and include the data in the response. Only data stores from Vertex AI Search are supported.

Path should follow format: projects/{project_id}/locations/global/collections/{collection_name}/dataStores/{DATA_STORE_ID}

candidateCount

The number of response variations to return. For each request, you're charged for the output tokens of all candidates, but are only charged once for the input tokens.

Specifying multiple candidates is a Preview feature that works with generateContent (streamGenerateContent is not supported). The following models are supported:

  • Gemini 1.5 Flash: 1-8, default: 1
  • Gemini 1.5 Pro: 1-8, default: 1
  • Gemini 1.0 Pro: 1-8, default: 1

1–4

Default: 1

logprobs

Returns the log probabilities of the top candidate tokens at each generation step. The model's chosen token might not be the same as the top candidate token at each step. Specify the number of candidates to return by using an integer value in the range of 1-5.

0-5

frequencyPenalty

Positive values penalize tokens that repeatedly appear in the generated text, decreasing the probability of repeating content. The minimum value is -2.0. The maximum value is up to, but not including, 2.0.

Minimum value: -2.0

Maximum value: 2.0

presencePenalty

Positive values penalize tokens that already appear in the generated text, increasing the probability of generating more diverse content. The minimum value is -2.0. The maximum value is up to, but not including, 2.0.

Minimum value: -2.0

Maximum value: 2.0

seed

When seed is fixed to a specific value, the model makes a best effort to provide the same response for repeated requests. Deterministic output isn't guaranteed. Also, changing the model or parameter settings, such as the temperature, can cause variations in the response even when you use the same seed value. By default, a random seed value is used.

This is a preview feature.

Optional

Sample request

REST

To test a text chat by using the Vertex AI API, send a POST request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

For other fields, see the Request body table below.

HTTP method and URL:

POST https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/chat-bison:predict

Request JSON body:

{
  "instances": [{
      "context":  "CONTEXT",
      "examples": [
       { 
          "input": {"content": "EXAMPLE_INPUT"},
          "output": {"content": "EXAMPLE_OUTPUT"}
       }],
      "messages": [
       { 
          "author": "AUTHOR",
          "content": "CONTENT",
       }],
   }],
  "parameters": {
    "temperature": TEMPERATURE,
    "maxOutputTokens": MAX_OUTPUT_TOKENS,
    "topP": TOP_P,
    "topK": TOP_K
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/chat-bison:predict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/chat-bison:predict" | Select-Object -Expand Content

You should receive a JSON response similar to the sample response.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from vertexai.language_models import ChatModel, InputOutputTextPair

chat_model = ChatModel.from_pretrained("chat-bison@002")

parameters = {
    "temperature": 0.2,
    "max_output_tokens": 256,
    "top_p": 0.95,
    "top_k": 40,
}

chat_session = chat_model.start_chat(
    context="My name is Miles. You are an astronomer, knowledgeable about the solar system.",
    examples=[
        InputOutputTextPair(
            input_text="How many moons does Mars have?",
            output_text="The planet Mars has two moons, Phobos and Deimos.",
        ),
    ],
)

response = chat_session.send_message(
    "How many planets are there in the solar system?", **parameters
)
print(response.text)
# Example response:
# There are eight planets in the solar system:
# Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';
const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction service client
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects.
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};
const publisher = 'google';
const model = 'chat-bison@001';

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function callPredict() {
  // Configure the parent resource
  const endpoint = `projects/${project}/locations/${location}/publishers/${publisher}/models/${model}`;

  const prompt = {
    context:
      'My name is Miles. You are an astronomer, knowledgeable about the solar system.',
    examples: [
      {
        input: {content: 'How many moons does Mars have?'},
        output: {
          content: 'The planet Mars has two moons, Phobos and Deimos.',
        },
      },
    ],
    messages: [
      {
        author: 'user',
        content: 'How many planets are there in the solar system?',
      },
    ],
  };
  const instanceValue = helpers.toValue(prompt);
  const instances = [instanceValue];

  const parameter = {
    temperature: 0.2,
    maxOutputTokens: 256,
    topP: 0.95,
    topK: 40,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  console.log('Get chat prompt response');
  const predictions = response.predictions;
  console.log('\tPredictions :');
  for (const prediction of predictions) {
    console.log(`\t\tPrediction : ${JSON.stringify(prediction)}`);
  }
}

callPredict();

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.cloud.aiplatform.v1beta1.EndpointName;
import com.google.cloud.aiplatform.v1beta1.PredictResponse;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceSettings;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Send a Predict request to a large language model to test a chat prompt
public class PredictChatPromptSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String instance =
        "{\n"
            + "   \"context\":  \"My name is Ned. You are my personal assistant. My favorite movies"
            + " are Lord of the Rings and Hobbit.\",\n"
            + "   \"examples\": [ { \n"
            + "       \"input\": {\"content\": \"Who do you work for?\"},\n"
            + "       \"output\": {\"content\": \"I work for Ned.\"}\n"
            + "    },\n"
            + "    { \n"
            + "       \"input\": {\"content\": \"What do I like?\"},\n"
            + "       \"output\": {\"content\": \"Ned likes watching movies.\"}\n"
            + "    }],\n"
            + "   \"messages\": [\n"
            + "    { \n"
            + "       \"author\": \"user\",\n"
            + "       \"content\": \"Are my favorite movies based on a book series?\"\n"
            + "    }]\n"
            + "}";
    String parameters =
        "{\n"
            + "  \"temperature\": 0.3,\n"
            + "  \"maxDecodeSteps\": 200,\n"
            + "  \"topP\": 0.8,\n"
            + "  \"topK\": 40\n"
            + "}";
    String project = "YOUR_PROJECT_ID";
    String publisher = "google";
    String model = "chat-bison@001";

    predictChatPrompt(instance, parameters, project, publisher, model);
  }

  static void predictChatPrompt(
      String instance, String parameters, String project, String publisher, String model)
      throws IOException {
    PredictionServiceSettings predictionServiceSettings =
        PredictionServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (PredictionServiceClient predictionServiceClient =
        PredictionServiceClient.create(predictionServiceSettings)) {
      String location = "us-central1";
      final EndpointName endpointName =
          EndpointName.ofProjectLocationPublisherModelName(project, location, publisher, model);

      Value.Builder instanceValue = Value.newBuilder();
      JsonFormat.parser().merge(instance, instanceValue);
      List<Value> instances = new ArrayList<>();
      instances.add(instanceValue.build());

      Value.Builder parameterValueBuilder = Value.newBuilder();
      JsonFormat.parser().merge(parameters, parameterValueBuilder);
      Value parameterValue = parameterValueBuilder.build();

      PredictResponse predictResponse =
          predictionServiceClient.predict(endpointName, instances, parameterValue);
      System.out.println("Predict Response");
    }
  }
}

Response body

{
  "predictions": [
    {
      "candidates": [
        {
          "author": string,
          "content": string
        }
      ],
      "citationMetadata": {
        "citations": [
          {
            "startIndex": integer,
            "endIndex": integer,
            "url": string,
            "title": string,
            "license": string,
            "publicationDate": string
          }
        ]
      },
      "logprobs": {
        "tokenLogProbs": [ float ],
        "tokens": [ string ],
        "topLogProbs": [ { map<string, float> } ]
      },
      "safetyAttributes": {
        "categories": [ string ],
        "blocked": false,
        "scores": [ float ],
        "errors": [ int ]
      }
    }
  ],
  "metadata": {
    "tokenMetadata": {
      "input_token_count": {
        "total_tokens": integer,
        "total_billable_characters": integer
      },
      "output_token_count": {
        "total_tokens": integer,
        "total_billable_characters": integer
      }
    }
  }
}
Response element Description
content Text content of the chat message.
candidates The chat result generated from given message.
categories The display names of Safety Attribute categories associated with the generated content. Order matches the Scores.
author Author tag for the turn.
scores The confidence scores of the each category, higher value means higher confidence.
blocked A flag indicating if the model's input or output was blocked.
startIndex Index in the prediction output where the citation starts (inclusive). Must be >= 0 and < end_index.
endIndex Index in the prediction output where the citation ends (exclusive). Must be > start_index and < len(output).
url URL associated with this citation. If present, this URL links to the webpage of the source of this citation. Possible URLs include news websites, GitHub repos, etc.
title Title associated with this citation. If present, it refers to the title of the source of this citation. Possible titles include news titles, book titles, etc.
license License associated with this recitation. If present, it refers to the license of the source of this citation. Possible licenses include code licenses, e.g., mit license.
publicationDate Publication date associated with this citation. If present, it refers to the date at which the source of this citation was published. Possible formats are YYYY, YYYY-MM, YYYY-MM-DD.
safetyAttributes A collection of categories and their associated confidence scores. 1-1 mapping to candidates.
input_token_count Number of input tokens. This is the total number of tokens across all messages, examples, and context.
output_token_count Number of output tokens. This is the total number of tokens in content across all candidates in the response.
tokens The sampled tokens.
tokenLogProbs The sampled tokens' log probabilities.
topLogProb The most likely candidate tokens and their log probabilities at each step.
logprobs Results of the `logprobs` parameter. 1-1 mapping to `candidates`.

Sample response

{
  "predictions": [
    {
      "citationMetadata": {
        "citations": []
      },
      "safetyAttributes": {
        "scores": [
          0.1
        ],
        "categories": [
          "Finance"
        ],
        "blocked": false
      },
      "candidates": [
        {
          "author": "AUTHOR",
          "content": "RESPONSE"
        }
      ]
    }
  ]
}

Stream response from Generative AI models

The parameters are the same for streaming and non-streaming requests to the APIs.

To view sample code requests and responses using the REST API, see Examples using the streaming REST API.

To view sample code requests and responses using the Vertex AI SDK for Python, see Examples using Vertex AI SDK for Python for streaming.