Overview of multimodal models


  • Try the Gemini API

    After you're set up on Google Cloud, try some multimodal prompts in the Vertex AI Studio, or in a notebook tutorial by using the Python SDK or REST API.

  • Vertex AI Gemini API reference

    Learn about the endpoints, parameters, and return values of the Vertex AI Gemini API.

  • Multimodal prompt design

    Learn best practices for designing multimodal prompts and see example prompts.


Multimodal models

Example prompt (a photo of a plate of cookies plus a text instruction) and response:

Prompt: Give me a recipe for these cookies.

Response:

**INGREDIENTS**
- 1 c. (2 sticks) unsalted butter, softened
- 3/4 c. granulated sugar
- 3/4 c. brown sugar, packed
- 1 tsp. vanilla extract
- 2 large eggs
- 2 1/4 c. all-purpose flour
- 1 tsp. baking soda
- 1 tsp. salt
...

A multimodal model is a model that is capable of processing information from multiple modalities, including images, videos, and text. For example, you can send the model a photo of a plate of cookies and ask it to give you a recipe for those cookies.
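
For example, a minimal sketch of that cookie-recipe prompt with the Vertex AI Python SDK might look like the following. The project ID and Cloud Storage path are placeholders, not values from this page.

import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Placeholder project and region; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.0-pro-vision")

# A photo of the cookies, stored in a Cloud Storage bucket you control.
cookies_photo = Part.from_uri("gs://your-bucket/cookies.jpg", mime_type="image/jpeg")

response = model.generate_content([cookies_photo, "Give me a recipe for these cookies."])
print(response.text)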

Gemini models

The following Gemini models are available:

  • Gemini 1.5 Pro: (Preview) Created to be multimodal (text, images, audio, PDFs, code, videos) and to scale across a wide range of tasks with up to 1M input tokens.
  • Gemini 1.0 Pro: Designed to handle natural language tasks, multiturn text and code chat, and code generation.
  • Gemini 1.0 Pro Vision: Supports multimodal prompts. You can include text, images, and video in your prompt requests and get text or code responses.

Gemini 1.5 Pro use cases

Gemini 1.5 Pro (Preview) supports text generation from a prompt that includes one of, or a combination of, the following modalities: text, code, PDFs, images, audio, and video. Its use cases include, but are not limited to, the following:

  • Summarization: Create a shorter version of a document that incorporates pertinent information from the original text. For example, you might want to summarize a chapter from a textbook, or create a succinct product description from a long paragraph that describes the product in detail.
  • Visual information seeking: Use external knowledge combined with information extracted from the input image or video to answer questions.
  • Object recognition: Answer questions related to fine-grained identification of objects in images and videos.
  • Digital content understanding: Answer questions and extract information from visual content like infographics, charts, figures, tables, and web pages.
  • Structured content generation: Generate responses based on multimodal inputs in formats like HTML and JSON.
  • Captioning and description: Generate descriptions of images and videos with varying levels of detail.
  • Long-form content: Process long-form content of up to 1M tokens across text, code, images, video, and audio.
  • Reasoning: Compositionally infer new information without memorization or retrieval.
  • Audio: Analyze speech files for summarization, transcription, and Q&A.
  • Audio and video: Summarize a video file with audio and return chapters with timestamps (see the example that follows this list).
  • Multimodal processing: Process multiple types of input media at the same time, such as video and audio input.
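
As one illustration of the audio and video use cases above, the following sketch asks Gemini 1.5 Pro (Preview) to summarize a video file and return chapters with timestamps. The model ID, project, and Cloud Storage path are placeholders; use the preview model ID that is available in your project.

import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")

# Placeholder preview model ID; check which 1.5 Pro version your project can use.
model = GenerativeModel("gemini-1.5-pro-preview-0409")

# A video file (with its audio track) stored in Cloud Storage.
video = Part.from_uri("gs://your-bucket/lecture.mp4", mime_type="video/mp4")

response = model.generate_content([
    video,
    "Summarize this video and return chapters with timestamps.",
])
print(response.text)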

Gemini 1.0 Pro use cases

Gemini 1.0 Pro supports text and code generation from a text prompt. Its use cases include, but are not limited to, the following:

  • Summarization: Create a shorter version of a document that incorporates pertinent information from the original text. For example, you might want to summarize a chapter from a textbook, or create a succinct product description from a long paragraph that describes the product in detail.
  • Question answering: Provide answers to questions in text. For example, you might automate the creation of a Frequently Asked Questions (FAQ) document from knowledge base content.
  • Classification: Assign a label describing the provided text. For example, apply labels that describe whether a block of text is grammatically correct.
  • Sentiment analysis: A form of classification that identifies the sentiment of text. The sentiment is turned into a label that's applied to the text. For example, the sentiment of text might be polarities like positive or negative, or sentiments like anger or happiness.
  • Entity extraction: Extract pieces of information, such as names, dates, and places, from the provided text.
  • Content creation: Generate text by specifying a set of requirements and background. For example, you might want to draft an email under a given context using a certain tone.
  • Code generation: Generate code based on a description. For example, you can ask the model to write a function that checks whether a year is a leap year (see the example that follows this list).
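
As a quick illustration of the code generation use case, the following sketch sends a text-only prompt to Gemini 1.0 Pro with the Python SDK. The project ID is a placeholder.

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")

# Text-only prompt; Gemini 1.0 Pro doesn't accept image or video input.
response = model.generate_content(
    "Write a Python function that checks whether a year is a leap year."
)
print(response.text)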

Gemini 1.0 Pro Vision use cases

Gemini 1.0 Pro Vision supports text generation using text, images, and video as input. Its use cases include, but are not limited to, the following:

  • Info seeking: Combine world knowledge with information extracted from the images and videos.
  • Object recognition: Answer questions related to fine-grained identification of objects in images and videos.
  • Digital content understanding: Answer questions by extracting information from content like infographics, charts, figures, tables, and web pages.
  • Structured content generation: Generate responses in formats like HTML and JSON based on provided prompt instructions.
  • Captioning and description: Generate descriptions of images and videos with varying levels of detail.
  • Extrapolation: Make guesses about what's not shown in an image, or what happens before or after a video.
  • Photo object detection: Detect an object in an image and return a text description of the object.
  • Return information about items in an image: Provide an image that contains multiple grocery items, and Gemini 1.0 Pro Vision can return an estimate of how much you should pay for them.
  • Understand screens and interfaces: Extract information from appliance screens, user interfaces, and layouts. For example, you might use a picture of an appliance with Gemini 1.0 Pro Vision to get instructions for how to use the appliance.
  • Understand technical diagrams: Decipher an entity-relationship (ER) diagram, understand the relationships between tables, and identify requirements for optimization in a specific environment like BigQuery.
  • Make a recommendation based on multiple images: For example, you might use pictures of eyeglasses to get a recommendation about which pair would fit your face best.
  • Generate a video description: Detect what is shown in a video. For example, provide a video of a vacation destination to get a description of the destination, the top 5 things to do there, and suggestions for how to get there (see the example that follows this list).
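
As an illustration of the video description use case, the following sketch passes a video stored in Cloud Storage to Gemini 1.0 Pro Vision with the Python SDK. The project ID and bucket path are placeholders.

import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.0-pro-vision")

# Placeholder video file stored in Cloud Storage.
video = Part.from_uri("gs://your-bucket/vacation.mp4", mime_type="video/mp4")

response = model.generate_content([
    video,
    "Describe this destination, list the top 5 things to do there, "
    "and suggest how to get there.",
])
print(response.text)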

To learn more about how to design prompts for various uses, see Multimodal prompt design.

See also: Model strengths and limitations

Programming language SDKs

The Vertex AI Gemini API supports the following SDKs:

Python

import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Initialize the SDK with your project and region, then load the model.
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel(model_name="gemini-1.0-pro-vision")

# Combine a text question with an image stored in Cloud Storage.
img = Part.from_uri("gs://your-bucket/image.jpg", mime_type="image/jpeg")
response = model.generate_content(["What is this?", img])

Node.js

const {VertexAI} = require('@google-cloud/vertexai');

// Initialize Vertex AI with your Cloud project and location
const vertexAI = new VertexAI({project: projectId, location: location});
const generativeVisionModel = vertexAI.getGenerativeModel({model: 'gemini-1.0-pro-vision'});

// Send a text part and a base64-encoded image part in a single request
const result = await generativeVisionModel.generateContent({
  contents: [{
    role: 'user',
    parts: [
      {text: 'What is this?'},
      {inlineData: {data: imgDataInBase64, mimeType: 'image/png'}},
    ],
  }],
});

Java

public static void main(String[] args) throws Exception {
  try (VertexAI vertexAI = new VertexAI(PROJECT_ID, LOCATION)) {
    GenerativeModel model = new GenerativeModel("gemini-1.0-pro-vision", vertexAI);

    // Combine a text question and an image reference in one content entry.
    List<Content> contents = new ArrayList<>();
    contents.add(ContentMaker.fromMultiModalData(
        "What is this?",
        PartMaker.fromMimeTypeAndData("image/jpeg", IMAGE_URI)));

    GenerateContentResponse response = model.generateContent(contents);
  }
}

Go

// Assumes a client created with genai.NewClient(ctx, projectID, location)
// and imageBytes holding the raw JPEG bytes.
model := client.GenerativeModel("gemini-1.0-pro-vision")
img := genai.ImageData("jpeg", imageBytes)
prompt := genai.Text("What is this?")
resp, err := model.GenerateContent(ctx, img, prompt)

What's the difference between the Vertex AI Gemini API and the Google AI Gemini API?

The Vertex AI Gemini API and Google AI Gemini API both let you incorporate the capabilities of Gemini models into your applications. The platform that's right for you depends on your goals.

The Vertex AI Gemini API is designed for developers and enterprises for use in scaled deployments. It offers features such as enterprise security, data residency, performance, and technical support. If you're an existing Google Cloud customer or deploy medium to large scale applications, you're in the right place.

If you're a hobbyist, student, or developer who is new to Google Cloud, try the Google AI Gemini API, which is suitable for experimentation, prototyping, and small deployments. If you're looking for a way to use Gemini directly from your mobile and web apps, see the Google AI SDKs for Android, Swift, and web.

Vertex AI Gemini API documentation

Select one of the following topics to learn more about the Vertex AI Gemini API.

Get started with the Vertex AI Gemini API


Migrate to the Vertex AI Gemini API


Learn how to use core features

  • Send multimodal prompt requests

    Learn how to send multimodal prompt requests by using the Cloud Console, Python SDK, or the REST API.

  • Send chat prompt requests

    Learn how to send single-turn and multi-turn chat prompts by using the Cloud Console, Python SDK, or the REST API. A minimal chat sketch follows this list.

  • Function calling

    Learn how to get the model to output JSON for calling external functions.
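
As a minimal sketch of a multi-turn chat with the Python SDK (the project ID is a placeholder; see the pages above for complete examples):

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")

# start_chat keeps earlier turns as context for later messages.
chat = model.start_chat()
print(chat.send_message("What is a multimodal model?").text)
print(chat.send_message("Summarize that answer in one sentence.").text)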