Vertex AI features a growing list of foundation models that you can test, deploy, and customize for use in your applications. Each foundation model is fine-tuned for specific use cases and is offered at different price points. This page summarizes the models that are available and gives you guidance on which models to use.
To learn more about all AI models and APIs on Vertex AI, see Explore AI models and APIs.
Model naming scheme
Foundation model names have three components: use case, model size, and version number. The naming convention is in the format <use case>-<model size>@<version number>. For example, text-bison@001 represents the Bison text model, version 001.
The model sizes are as follows:
- Bison: The best value in terms of capability and cost.
- Gecko: The smallest and lowest cost model for simple tasks.
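As a quick illustration of this convention, the following Python sketch (a hypothetical helper, not part of the Vertex AI SDK) splits a model name into its three components:

def parse_model_name(name: str) -> dict:
    # Split "<use case>-<model size>@<version number>", e.g. "text-bison@001".
    base, _, version = name.partition("@")
    use_case, _, size = base.partition("-")
    return {"use_case": use_case, "model_size": size, "version": version}

print(parse_model_name("text-bison@001"))
# {'use_case': 'text', 'model_size': 'bison', 'version': '001'}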
Foundation models
The following table gives you an overview of the foundation models that are available in Vertex AI.
Model name | Description | Model properties |
---|---|---|
text-bison@001 | Fine-tuned to follow natural language instructions and suitable for a variety of language tasks. | Max input tokens: 8,192. Max output tokens: 1,024. Training data: Up to Feb 2023. |
textembedding-gecko@001 (model tuning not supported) | Returns model embeddings for text inputs. | Accepts up to 3,072 input tokens and outputs 768-dimensional vector embeddings. |
chat-bison@001 (model tuning not supported) | Fine-tuned for multi-turn conversation use cases. | Max input tokens: 4,096. Max output tokens: 1,024. Training data: Up to Feb 2023. Max turns: 2,500. |
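The text and chat models have dedicated sample code later on this page. For the embeddings model, a minimal Python sketch might look like the following; it assumes the same preview SDK used in those samples, and the printed dimensionality comes from the table above:

from vertexai.preview.language_models import TextEmbeddingModel

# Minimal sketch: request embeddings for one input string.
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
embeddings = model.get_embeddings(["Hello"])
for embedding in embeddings:
    print(len(embedding.values))  # 768-dimensional vector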
Language support
PaLM models currently only support English.
Parameter definitions
Requests to the Vertex AI PaLM API require different parameter configurations based on the model type.
Text model parameters
Parameter | Description | Acceptable values |
---|---|---|
prompt | Text input used to generate the model response. Prompts can include a preamble, questions, suggestions, instructions, or examples. | Text |
temperature | The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a more deterministic and less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic: the highest-probability response is always selected. For most use cases, try starting with a temperature of 0.2. | Float |
maxOutputTokens | Maximum number of tokens that can be generated in the response. Specify a lower value for shorter responses and a higher value for longer responses. A token may be smaller than a word; a token is approximately four characters, and 100 tokens correspond to roughly 60-80 words. | Integer |
topK | Top-K changes how the model selects tokens for output. A top-K of 1 means the selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the 3 most probable tokens (using temperature). For each token selection step, the top K tokens with the highest probabilities are sampled; the tokens are then further filtered based on topP, with the final token selected using temperature sampling. Specify a lower value for less random responses and a higher value for more random responses. | Integer |
topP | Top-P changes how the model selects tokens for output. Tokens are selected from the most probable to the least probable (within the top K; see the topK parameter) until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have probabilities of 0.3, 0.2, and 0.1 and the top-P value is 0.5, then the model selects either A or B as the next token (using temperature) and doesn't consider C. The default top-P value is 0.95. Specify a lower value for less random responses and a higher value for more random responses. (See the sketch following this table for how topK, topP, and temperature interact.) | Float |
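The topK, topP, and temperature descriptions above refer to a single filtering order: top-K first, then top-P, then temperature sampling. The following Python sketch is a simplified illustration of that order; it is not the actual Vertex AI decoder.

import random

def sample_next_token(probs, top_k, top_p, temperature):
    """Simplified illustration of top-K -> top-P -> temperature sampling."""
    # 1. Keep the top_k most probable candidate tokens.
    candidates = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 2. Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in candidates:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # 3. A temperature of 0 is greedy decoding: always pick the most probable token.
    if temperature == 0:
        return kept[0][0]
    # Otherwise sharpen or flatten the remaining distribution and sample from it.
    weights = [p ** (1.0 / temperature) for _, p in kept]
    return random.choices([token for token, _ in kept], weights=weights)[0]

# The example from the topP row: A=0.3, B=0.2, C=0.1 with top_p=0.5 keeps only A and B.
print(sample_next_token({"A": 0.3, "B": 0.2, "C": 0.1}, top_k=40, top_p=0.5, temperature=0.2))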
Sample code for text
REST
MODEL_ID="text-bison"
PROJECT_ID=PROJECT_ID
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/${MODEL_ID}:predict -d \
$'{
  "instances": [
    { "prompt": "Hello" }
  ],
  "parameters": {
    "temperature": 0.2,
    "maxOutputTokens": 256,
    "topK": 40,
    "topP": 0.95
  }
}'
Python
from vertexai.preview.language_models import TextGenerationModel

def text_summarization_example(temperature=0.2):
    """Summarization example with a large language model."""
    model = TextGenerationModel.from_pretrained("text-bison")
    response = model.predict(
        "Hello",
        temperature=temperature,
        top_k=40,
        top_p=0.95,
        max_output_tokens=256,
    )
    print(f"Response from Model: {response.text}")
Chat model parameters
For chat API calls, the context, examples, and messages combine to form the prompt.
Parameter | Description | Acceptable values |
---|---|---|
context (optional) | Context shapes how the model responds throughout the conversation. For example, you can use context to specify words the model can or cannot use, topics to focus on or avoid, or the response format or style. | Text |
examples (optional) | List of structured messages that show the model how to respond in the conversation. | List[Structured Message], for example: {"input": {"content": "provide content"}, "output": {"content": "provide content"}} |
messages (required) | Conversation history provided to the model in a structured alternate-author form. Messages appear in chronological order: oldest first, newest last. When the history of messages causes the input to exceed the maximum length, the oldest messages are removed until the entire prompt is within the allowed limit (see the sketch following this table). | List[Structured Message], for example: {"author": "user", "content": "user message"} |
temperature | The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a more deterministic and less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic: the highest-probability response is always selected. For most use cases, try starting with a temperature of 0.2. | Float |
maxOutputTokens | Maximum number of tokens that can be generated in the response. Specify a lower value for shorter responses and a higher value for longer responses. A token may be smaller than a word; a token is approximately four characters, and 100 tokens correspond to roughly 60-80 words. | Integer |
topK | Top-K changes how the model selects tokens for output. A top-K of 1 means the selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the 3 most probable tokens (using temperature). For each token selection step, the top K tokens with the highest probabilities are sampled; the tokens are then further filtered based on topP, with the final token selected using temperature sampling. Specify a lower value for less random responses and a higher value for more random responses. | Integer |
topP | Top-P changes how the model selects tokens for output. Tokens are selected from the most probable to the least probable (within the top K; see the topK parameter) until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have probabilities of 0.3, 0.2, and 0.1 and the top-P value is 0.5, then the model selects either A or B as the next token (using temperature) and doesn't consider C. The default top-P value is 0.95. Specify a lower value for less random responses and a higher value for more random responses. | Float |
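The trimming behavior described for messages can be pictured with a short Python sketch. The service performs this trimming itself, so this is only a mental model; the four-characters-per-token figure from the maxOutputTokens row is used as a rough stand-in for real tokenization.

def trim_history(messages, max_input_tokens=4096):
    # Rough token estimate: about four characters per token.
    def estimated_tokens(msgs):
        return sum(len(m["content"]) // 4 for m in msgs)

    trimmed = list(messages)
    # Drop the oldest messages first until the conversation fits the input limit.
    while trimmed and estimated_tokens(trimmed) > max_input_tokens:
        trimmed.pop(0)
    return trimmed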
Sample code for chat
REST
MODEL_ID="chat-bison"
PROJECT_ID=PROJECT_ID
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/${MODEL_ID}:predict -d \
'{
  "instances": [{
    "context": "My name is Ned. You are my personal assistant. My favorite movies are Lord of the Rings and Hobbit.",
    "examples": [
      {
        "input": {"content": "Who do you work for?"},
        "output": {"content": "I work for Ned."}
      },
      {
        "input": {"content": "What do I like?"},
        "output": {"content": "Ned likes watching movies."}
      }
    ],
    "messages": [
      {
        "author": "user",
        "content": "Are my favorite movies based on a book series?"
      },
      {
        "author": "bot",
        "content": "Yes, your favorite movies, The Lord of the Rings and The Hobbit, are based on book series by J.R.R. Tolkien."
      },
      {
        "author": "user",
        "content": "When were these books published?"
      }
    ]
  }],
  "parameters": {
    "temperature": 0.3,
    "maxDecodeSteps": 200,
    "topP": 0.8,
    "topK": 40
  }
}'
Python
from vertexai.preview.language_models import ChatModel, InputOutputTextPair

chat_model = ChatModel.from_pretrained("chat-bison")
chat = chat_model.start_chat(
    # Optional:
    context="My name is Ned. You are my personal assistant. My favorite movies are Lord of the Rings and Hobbit.",
    examples=[
        InputOutputTextPair(
            input_text="Who do you work for?",
            output_text="I work for Ned.",
        ),
        InputOutputTextPair(
            input_text="What do I like?",
            output_text="Ned likes watching movies.",
        ),
    ],
)
print(chat.send_message("Are my favorite movies based on a book series?"))
print(chat.send_message("When were these books published?"))
What's next
- Try a quickstart tutorial using Generative AI Studio or the Vertex AI API.
- Learn how to test text prompts.
- Learn how to test chat prompts.
- Explore pretrained models in Model Garden.
- Learn how to tune a foundation model.
- Learn about responsible AI best practices and Vertex AI's safety filters.