Overview of Generative AI on Vertex AI

Generative AI on Vertex AI (also known as genAI or gen AI) gives you access to many large generative AI models so you can evaluate, tune, and deploy them for use in your AI-powered applications. This page gives you an overview of the generative AI workflow on Vertex AI, the features and models that are available, and directs you to resources for getting started.

Generative AI workflow

The following diagram shows a high level overview of the generative AI workflow.

Generative AI workflow diagram



The generative AI workflow typically starts with prompting. A prompt is a request sent to a generative AI model to elicit a response back. Depending on the model, a prompt can contain text, images, videos, audio, documents, and other modalities or even multiple modalities (multimodal).

Creating a prompt to get the desired response from the model is a practice called prompt design. While prompt design is a process of trial and error, there are prompt design principles and strategies that you can use to nudge the model to behave in the desired way. Vertex AI Studio offers a prompt management tool to help you manage your prompts.

Foundation models

Foundation models

Prompts are sent to a generative AI model for response generation. Vertex AI has a variety of generative AI foundation models that are accessible through a managed API, including the following:

  • Gemini API: Advanced reasoning, multiturn chat, code generation, and multimodal prompts.
  • Imagen API: Image generation, image editing, and visual captioning.
  • MedLM: Medical question answering and summarization. (Private GA)

The models differ in size, modality, and cost. You can explore Google models, as well as open models and models from Google partners, in Model Garden.

Model customization

Model customization

You can customize the default behavior of Google's foundation models so that they consistently generate the desired results without using complex prompts. This customization process is called model tuning. Model tuning helps you reduce the cost and latency of your requests by allowing you to simplify your prompts.

Vertex AI also offers model evaluation tools to help you evaluate the performance of your tuned model. After your tuned model is production-ready, you can deploy it to an endpoint and monitor performance like in standard MLOps workflows.

Request augmentation


Vertex AI offers multiple request augmentation methods that give the model access to external APIs and real-time information.

  • Grounding: Connects model responses to a source of truth, such as your own data or web search, helping to reduce hallucinations.
  • RAG: Connects models to external knowledge sources, such as documents and databases, to generate more accurate and informative responses.
  • Function calling: Lets the model interact with external APIs to get real-time information and perform real-world tasks.

Citation check

Citation check

After the response is generated, Vertex AI checks whether citations need to be included with the response. If a significant amount of the text in the response comes from a particular source, that source is added to the citation metadata in the response.

Responsible AI and safety

Responsible AI and safety

The last layer of checks that the prompt and response go through before being returned is the safety filters. Vertex AI checks both the prompt and response for how much the prompt or response belongs to a safety category. If the threshold is exceed for one or more categories, the response is blocked and Vertex AI returns a fallback response.



If the prompt and response passes the safety filter checks, the response is returned. Typically, the response is returned all at once. However, you can also receive responses progressively as it generates by enabling streaming.

Generative AI APIs and models

The generative AI models available in Vertex AI, also called foundation models, are categorized by the type of content that it's designed to generate. This content includes text, chat, image, code, video, multimodal data, and embeddings. Each model is exposed through a publisher endpoint that's specific to your Google Cloud project so there's no need to deploy the foundation model unless you need to tune it for a specific use case.

Gemini API offerings

The Vertex AI Gemini API contains the publisher endpoints for the Gemini models developed by Google DeepMind.

  • Gemini 1.5 Pro (Preview) supports multimodal prompts. You can include text, images, audio, video, and PDF files in your prompt requests and get text or code responses. Gemini 1.5 Pro (Preview) can process larger collections of images, larger text documents, and longer videos than Gemini 1.0 Pro Vision.
  • Gemini 1.0 Pro is designed to handle natural language tasks, multiturn text and code chat, and code generation.
  • Gemini 1.0 Pro Vision supports multimodal prompts. You can include text, images, video, and PDFs in your prompt requests and get text or code responses.

The following table shows some differences between the Gemini models that can help you choose which is best for you:

Gemini model Modalities Context window
Gemini 1.0 Pro / Gemini 1.0 Pro Vision
  • Text, code, PDF (Gemini 1.0 Pro Vision)
  • Up to 16 images
  • Video up to 2 minutes
  • 8,192 tokens in
  • 2,048 tokens out
Gemini 1.5 Pro (Preview)
  • Text, code, images, audio, video, PDF
  • Up to 3,000 images
  • Audio up to 8.4 hours
  • Video with audio up to 1 hour
  • 1M tokens in
  • 8,192 tokens out

PaLM API offerings

The Vertex AI PaLM API contains the publisher endpoints for Google's Pathways Language Model 2 (PaLM 2), which are large language models (LLMs) that generate text and code in response to natural language prompts.

  • PaLM API for text is fine-tuned for language tasks such as classification, summarization, and entity extraction.
  • PaLM API for chat is fine-tuned for multi-turn chat, where the model keeps track of previous messages in the chat and uses it as context for generating new responses.

Other Generative AI offerings

  • Text embedding generates vector embeddings for input text. You can use embeddings for tasks like semantic search, recommendation, classification, and outlier detection.

  • Multimodal embedding generates vector embeddings based on image and text inputs. These embeddings can later be used for other subsequent tasks like image classification or content recommendations.

  • Imagen, our text-to-image foundation model, lets you generate and customize studio-grade images at scale.

  • Partner models are a curated list of generative AI models developed by Google's partner companies. These generative AI models are offered as managed APIs. For example, Anthropic provides its Claude models as a service on Vertex AI.

  • Open models, like Llama, are available for you to deploy on Vertex AI or other platforms.

  • MedLM is a family of foundation models fine-tuned for the healthcare industry.

Certifications and security controls

Vertex AI supports CMEK, VPC Service Controls, Data Residency, and Access Transparency. There are some limitations for Generative AI features. For more information, see Generative AI security controls.

Get started