Google foundation models

Vertex AI features a growing list of foundation models that you can test, deploy, and customize for use in your AI-based applications. Foundation models are fine-tuned for specific use cases and offered at different price points. This page summarizes the models that are available in the various APIs and gives you guidance on which models to choose by use case.

To learn more about all AI models and APIs on Vertex AI, see Explore AI models and APIs.

Gemini models

The following table summarizes the models available in the Gemini API:

Gemini 1.5 Pro (Preview)
(gemini-1.5-pro)
Description: The most versatile model, delivering top-tier quality across a range of production workloads with a balance of exceptional capability and value. Supports text, images, audio, and video (frames only, or frames with audio) as inputs.
Specifications:
- Max input tokens: 1,048,576
- Max output tokens: 8,192
- Max images per prompt: 300
- Max video length (frames only): approximately 1 hour
- Max video length (frames and audio): approximately 45 minutes
- Max videos per prompt: 10
- Max audio length: approximately 8 hours

Gemini 1.0 Pro
(gemini-1.0-pro)
Description: The best-performing model for a wide range of text-only tasks. Supports only text as input. Supports supervised tuning.
Specifications:
- Max total tokens (input and output): 32,760
- Max output tokens: 8,192
- Training data: up to Feb 2023

Gemini 1.0 Pro Vision
(gemini-1.0-pro-vision)
Description: The best-performing image and video understanding model for a broad range of applications. Supports text, image, and video as inputs.
Specifications:
- Max total tokens (input and output): 16,384
- Max output tokens: 2,048
- Max images per prompt: 16
- Max video length: 2 minutes
- Max videos per prompt: 1
- Training data: up to Feb 2023

Gemini 1.0 Ultra
(GA with allowlist)
Description: Google's most capable text model, optimized for complex tasks such as instruction following, code, and reasoning. Supports only text as input.
Specifications:
- Max input tokens: 8,192
- Max output tokens: 2,048

Gemini 1.0 Ultra Vision
(GA with allowlist)
Description: Google's most capable multimodal vision model, optimized to support joint text, image, and video inputs.
Specifications:
- Max input tokens: 8,192
- Max output tokens: 2,048

Gemini models support the following languages:
Arabic (ar), Bengali (bn), Bulgarian (bg), Chinese simplified and traditional (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hebrew (iw), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Latvian (lv), Lithuanian (lt), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovenian (sl), Spanish (es), Swahili (sw), Swedish (sv), Thai (th), Turkish (tr), Ukrainian (uk), Vietnamese (vi).
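One practical use of the limits above is routing a request to the cheapest model whose context window fits it. The following sketch encodes the documented limits in plain Python; the helper name and the selection policy (and the choice to reserve the max output tokens out of the 1.0 models' shared input-plus-output budget) are illustrative assumptions, not part of any Google SDK.

```python
# Hypothetical helper: pick a Gemini model ID from the limits documented
# in the table above. Model IDs are real; the routing policy is assumed.
GEMINI_MAX_INPUT_TOKENS = {
    "gemini-1.5-pro": 1_048_576,
    "gemini-1.0-pro": 32_760 - 8_192,          # shared budget minus max output
    "gemini-1.0-pro-vision": 16_384 - 2_048,   # shared budget minus max output
}

def choose_gemini_model(input_tokens: int, needs_media: bool) -> str:
    """Return the first documented model whose input budget fits the prompt."""
    if needs_media:
        preference = ["gemini-1.0-pro-vision", "gemini-1.5-pro"]
    else:
        preference = ["gemini-1.0-pro", "gemini-1.5-pro"]
    for model in preference:
        if input_tokens <= GEMINI_MAX_INPUT_TOKENS[model]:
            return model
    raise ValueError("prompt exceeds every documented context window")
```

A short text-only prompt would route to gemini-1.0-pro, while anything past the 1.0 budgets falls through to gemini-1.5-pro's 1,048,576-token window.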

Embeddings models

The following table summarizes the models available in the Embeddings API:

Embeddings for text
(textembedding-gecko@001, textembedding-gecko@002, textembedding-gecko@003, text-embedding-preview-0409)
Description: Returns embeddings for English text inputs. Supports supervised tuning of the textembedding-gecko models (English only).
Specifications:
- Max input tokens: 3,072 (textembedding-gecko@001); 2,048 (all others)
- Embedding dimension: up to 768 (text-embedding-preview-0409); 768 (all others)

Embeddings for text multilingual
(textembedding-gecko-multilingual@001, text-multilingual-embedding-preview-0409)
Description: Returns embeddings for text inputs in over 100 languages. Supports supervised tuning of the textembedding-gecko-multilingual model.
Specifications:
- Max input tokens: 2,048
- Embedding dimension: up to 768 (text-multilingual-embedding-preview-0409); 768 (all others)

Embeddings for multimodal
(multimodalembedding)
Description: Returns embeddings for text, image, and video inputs, so you can compare content across modalities. Converts text, image, and video into the same vector space. English only. Video supports only 1,408 dimensions.
Specifications:
- Max input tokens: 32
- Max image size: 20 MB
- Max video length: 2 minutes
- Embedding dimension: 128, 256, 512, or 1,408 for text and image input; 1,408 for video input

Text multilingual embedding models support the following languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
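Because these models map inputs into a shared vector space, you compare content by measuring the distance between the returned vectors, typically with cosine similarity. The sketch below uses short toy vectors as stand-ins for the real 768-dimensional embeddings the API returns; the function itself is standard math, not an SDK call.

```python
import math

def cosine_similarity(a, b):
    """Compare two embedding vectors (e.g. 768-dimensional outputs of
    textembedding-gecko). Returns a value in [-1, 1]; higher means the
    inputs are more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 4-dimensional stand-ins for real 768-dimensional embeddings:
query = [0.1, 0.3, 0.5, 0.1]
doc_a = [0.1, 0.29, 0.52, 0.09]   # nearly the same direction as the query
doc_b = [0.9, 0.05, 0.01, 0.04]   # a very different direction
assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

The same comparison works across modalities with the multimodal model, since text, image, and video embeddings share one vector space (at 1,408 dimensions for video).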

Imagen model

The following table summarizes the models available in the Imagen API:

Imagen 2
(imagegeneration@006)
Description: Supports image generation and editing to create high-quality images in seconds. The editing feature supports object removal and insertion, outpainting, and product editing.
Specifications:
- Max image output: 4
- Aspect ratios (for generation): 1:1, 9:16, 16:9, 3:4, 4:3
- Resolution: approximately 1,500 pixels (varies by aspect ratio)

The Imagen model supports the following languages:
English, Chinese (simplified), Chinese (traditional), Hindi, Japanese, Korean, Portuguese, and Spanish.
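A client can validate a generation request against the limits above before sending it. This pre-flight check is a hypothetical sketch, not part of the Imagen API; only the limits themselves come from the table.

```python
# Illustrative pre-flight validation against the imagegeneration@006
# limits documented above. The function name is an assumption.
SUPPORTED_ASPECT_RATIOS = {"1:1", "9:16", "16:9", "3:4", "4:3"}
MAX_IMAGE_OUTPUT = 4

def validate_imagen_request(prompt: str, number_of_images: int, aspect_ratio: str) -> None:
    """Raise ValueError if the request exceeds the documented limits."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    if not 1 <= number_of_images <= MAX_IMAGE_OUTPUT:
        raise ValueError(f"number_of_images must be between 1 and {MAX_IMAGE_OUTPUT}")
    if aspect_ratio not in SUPPORTED_ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio!r}")
```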

Code models

The following table summarizes the models available in the Codey APIs:

Codey for Code Generation
(code-bison)
Description: A model fine-tuned to generate code from a natural language description of the desired code. For example, it can generate a unit test for a function. Supports supervised tuning.
Specifications:
- Max input tokens: 6,144
- Max output tokens: 1,024

Codey for Code Generation 32k
(code-bison-32k)
Description: Similar capabilities to code-bison, but with a longer context window. Supports supervised tuning.
Specifications:
- Max tokens (input and output): 32,768
- Max output tokens: 8,192

Codey for Code Chat
(codechat-bison)
Description: A model fine-tuned for chatbot conversations that help with code-related questions. Supports supervised tuning.
Specifications:
- Max input tokens: 6,144
- Max output tokens: 1,024

Codey for Code Chat 32k
(codechat-bison-32k)
Description: Similar capabilities to codechat-bison, but with a longer context window. Supports supervised tuning.
Specifications:
- Max tokens (input and output): 32,768
- Max output tokens: 8,192

Codey for Code Completion
(code-gecko)
Description: A model fine-tuned to suggest code completions based on the context of the code that's already written.
Specifications:
- Max input tokens: 2,048
- Max output tokens: 64
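For completion, the code nearest the cursor carries the most signal, so a common way to stay inside code-gecko's 2,048-token input limit is to keep only the tail of the file. The sketch below is an illustration, not SDK behavior; the 4-characters-per-token figure is a rough heuristic assumption, not a documented tokenizer property.

```python
# Hypothetical helper: trim a completion prefix to fit code-gecko's
# documented 2,048-token input limit, keeping the most recent code.
CODE_GECKO_MAX_INPUT_TOKENS = 2048
CHARS_PER_TOKEN_ESTIMATE = 4  # rough heuristic, not a documented value

def trim_prefix(source: str, max_tokens: int = CODE_GECKO_MAX_INPUT_TOKENS) -> str:
    """Keep the last `max_tokens`-worth of characters of the source."""
    budget = max_tokens * CHARS_PER_TOKEN_ESTIMATE
    return source[-budget:] if len(source) > budget else source
```

In production you would count tokens with the model's actual tokenizer (or the API's token-counting endpoint) rather than estimating by character count.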

MedLM models

The following table summarizes the models available in the MedLM API:

MedLM-medium
(medlm-medium)
Description: A suite of models for the medical domain that supports HIPAA compliance. This model helps healthcare practitioners with medical question-answering tasks and with summarizing healthcare and medical documents.
Specifications:
- Max tokens (input and output): 32,768
- Max output tokens: 8,192
- Languages: English

MedLM-large
(medlm-large)
Description: A higher-quality variation of MedLM.
Specifications:
- Max input tokens: 8,192
- Max output tokens: 1,024
- Languages: English

Explore all models in Model Garden

Model Garden is a platform that helps you discover, test, customize, and deploy Google proprietary and select OSS models and assets. To explore the generative AI models and APIs that are available on Vertex AI, go to Model Garden in the Google Cloud console.


To learn more about Model Garden, including available models and capabilities, see Explore AI models in Model Garden.

What's next