Generate text from multimodal prompt

This sample demonstrates how to generate text from a multimodal prompt using the Gemini model. The prompt consists of two images and a text prompt. The model returns a text response listing the objects that appear in both images.

Code sample

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
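
If you prefer not to rely on environment variables, the client can also be configured explicitly. The following is a minimal sketch, not part of the official sample; it assumes the google-genai package is installed (pip install google-genai), and your-project-id and us-central1 are placeholders for your own project and region:

from google import genai
from google.genai.types import HttpOptions

# Explicit Vertex AI configuration; project and location are placeholders.
client = genai.Client(
    vertexai=True,
    project="your-project-id",  # placeholder: your Google Cloud project ID
    location="us-central1",     # placeholder: your Vertex AI region
    http_options=HttpOptions(api_version="v1"),
)

The main sample below instead calls genai.Client() with no Vertex AI arguments, which relies on the SDK's environment-variable configuration (GOOGLE_GENAI_USE_VERTEXAI, GOOGLE_CLOUD_PROJECT, and GOOGLE_CLOUD_LOCATION).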

from google import genai
from google.genai.types import HttpOptions, Part

client = genai.Client(http_options=HttpOptions(api_version="v1"))
# TODO(developer): Update the file paths below to point to your images.
image_path_1 = "path/to/your/image1.jpg"
image_path_2 = "path/to/your/image2.jpg"
with open(image_path_1, "rb") as f:
    image_1_bytes = f.read()
with open(image_path_2, "rb") as f:
    image_2_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents=[
        "Generate a list of all the objects contained in both images.",
        Part.from_bytes(data=image_1_bytes, mime_type="image/jpeg"),
        Part.from_bytes(data=image_2_bytes, mime_type="image/jpeg"),
    ],
)
print(response.text)
# Example response:
# Here's a list of the objects that appear in both images:
# ...
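
If your images are stored in Cloud Storage rather than on the local file system, you can reference them by URI instead of reading raw bytes. A minimal sketch, assuming hypothetical object paths gs://your-bucket/image1.jpg and gs://your-bucket/image2.jpg:

from google import genai
from google.genai.types import HttpOptions, Part

client = genai.Client(http_options=HttpOptions(api_version="v1"))

# The gs:// paths are placeholders; replace them with your own objects.
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents=[
        "Generate a list of all the objects contained in both images.",
        Part.from_uri(file_uri="gs://your-bucket/image1.jpg", mime_type="image/jpeg"),
        Part.from_uri(file_uri="gs://your-bucket/image2.jpg", mime_type="image/jpeg"),
    ],
)
print(response.text)

Passing a URI avoids transferring the image bytes through your application; the service reads the objects directly from Cloud Storage.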

What's next

To search and filter code samples for other Google Cloud products, see the Google Cloud sample browser.