Introduction to multimodal classes in the Vertex AI SDK

You can use the Vertex AI SDK for Python to programmatically create solutions using Gemini model classes. You can use the Vertex AI SDK to load a multimodal model, such as Gemini Pro or Gemini Pro Vision. After you load a model, you can use it to generate text from a text prompt, an image, or a combination of a text prompt and an image. For more details about Gemini models, see Gemini API models.

The Gemini model classes represented in the Vertex AI SDK are in addition to classes that help you create Vertex AI solutions that aren't related to generative AI and language models. For information about how to use the Vertex AI SDK to automate data ingestion, train models, and get predictions on Vertex AI, see Introduction to the Vertex AI SDK for Python.

Install the Vertex AI SDK

To install the Vertex AI SDK for Python, run the following command:

pip install --upgrade google-cloud-aiplatform

For more information, see Install the Vertex AI SDK for Python. To view the language model section in the Vertex AI SDK reference guide, see Package language models.
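
To confirm that the SDK is available in your environment, you can print the installed version of the google-cloud-aiplatform package. This is an optional check, not part of the official setup steps:

# Optional check: print the installed Vertex AI SDK version.
from google.cloud import aiplatform

print(aiplatform.__version__)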

Authenticate the Vertex AI SDK

After you install the Vertex AI SDK for Python, you need to authenticate. The following instructions explain how to authenticate with the Vertex AI SDK when you're working locally and when you're working in Colaboratory:

  • If you're developing locally, do the following to set up Application Default Credentials (ADC) in your local environment:

    1. Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init
      
    2. Create local authentication credentials for your Google Account:

      gcloud auth application-default login
      

      A login screen is displayed. After you sign in, your credentials are stored in the local credential file used by ADC. For more information about working with ADC in a local environment, see Local development environment.

  • If you're working in Colaboratory, run the following command in a Colab cell to authenticate:

    from google.colab import auth
    auth.authenticate_user()
    

    This command opens a window where you can complete the authentication.
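
After you authenticate, you typically initialize the Vertex AI SDK with your Google Cloud project and region before you load a model. The following is a minimal sketch; replace the placeholder project ID and region with your own values:

import vertexai

# Replace with your project ID and a supported region.
vertexai.init(project="your-project-id", location="us-central1")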

Load a Gemini model

To use the Vertex AI SDK to reference a multimodal model, you import GenerativeModel from vertexai.preview.generative_models, then use GenerativeModel to load the model. The following sample code shows you how to load the Gemini Pro and the Gemini Pro Vision models:

from vertexai.preview.generative_models import GenerativeModel

# Load Gemini Pro
gemini_pro_model = GenerativeModel("gemini-1.0-pro")

# Load Gemini Pro Vision
gemini_pro_vision_model = GenerativeModel("gemini-1.0-pro-vision")
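
If you want to control output settings such as temperature or the maximum number of output tokens, you can pass a GenerationConfig when you load the model. The following is a sketch that isn't part of the samples above; the parameter values are arbitrary examples:

from vertexai.preview.generative_models import GenerationConfig, GenerativeModel

# Example values only; tune these for your use case.
config = GenerationConfig(temperature=0.2, max_output_tokens=256)
gemini_pro_model = GenerativeModel("gemini-1.0-pro", generation_config=config)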

GenerativeModel class code samples

The GenerativeModel class represents a Gemini model. You can use it to load the Gemini Pro or Gemini Pro Vision model. The GenerativeModel class includes methods to help you generate content from text, images, and video. The following code samples demonstrate how to use the GenerativeModel class.

Generate content using a text prompt

The following code sample uses the Gemini Pro multimodal model to generate a text answer to a question about cars:

from vertexai.preview.generative_models import GenerativeModel

gemini_pro_model = GenerativeModel("gemini-1.0-pro")
model_response = gemini_pro_model.generate_content("Why do cars have four wheels?")
print("model_response\n",model_response)

The response to this sample code might be similar to the following. The returned text is truncated for brevity.

candidates {
  content {
    parts {
      text: "1. **Stability:** Four wheels provide a wider base of support,
        which increases the vehicle\'s stability. This is especially important
        when cornering or driving on uneven surfaces.\n\n2...."
    }
  }
}
usage_metadata {
  prompt_token_count: 7
  candidates_token_count: 323
  total_token_count: 330
}
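
The printed response includes the candidates and the usage metadata. If you only want the generated text or the token counts, you can read them from the response object directly, as in the following sketch:

# Print only the generated text of the first candidate.
print(model_response.text)

# Print the total token count reported in the usage metadata.
print(model_response.usage_metadata.total_token_count)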

Generate content using more than one text prompt

The following code sample uses the Gemini Pro multimodal model to generate text using more than one text prompt:

from vertexai.preview.generative_models import GenerativeModel

gemini_pro_model = GenerativeModel("gemini-1.0-pro")
model_response = gemini_pro_model.generate_content(["What is x multiplied by 2?", "x = 42"])
print("model_response\n",model_response)

The response to this sample code might be similar to the following:

candidates {
  content {
    parts {
      text: "84"
    }
  }
}
usage_metadata {
  prompt_token_count: 13
  candidates_token_count: 2
  total_token_count: 15
}
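
You can also stream a text response so that you receive it in chunks as it's generated, instead of waiting for the complete response. The following is a minimal sketch; the prompt shown is illustrative only:

from vertexai.preview.generative_models import GenerativeModel

gemini_pro_model = GenerativeModel("gemini-1.0-pro")
streamed_response = gemini_pro_model.generate_content(
    "Write a short poem about the ocean.", stream=True
)

# Each chunk contains a portion of the generated text.
for chunk in streamed_response:
    print(chunk.text)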

Generate a description of an image

The following code sample uses the Gemini Pro Vision multimodal model to generate text that describes a flower:

from vertexai.preview.generative_models import GenerativeModel
from vertexai.preview.generative_models import Part

gemini_pro_vision_model = GenerativeModel("gemini-1.0-pro-vision")
image = Part.from_uri("gs://cloud-samples-data/ai-platform/flowers/daisy/10559679065_50d2b16f6d.jpg", mime_type="image/jpeg")
model_response = gemini_pro_vision_model.generate_content(["what is this image?", image])
print("model_response\n",model_response)

The response to this sample code might be similar to the following:

candidates {
  content {
    role: "model"
    parts {
      text: " The image is a photograph of a daisy flower in a field of fallen leaves."
    }
  }
  finish_reason: STOP
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
}
usage_metadata {
  prompt_token_count: 263
  candidates_token_count: 16
  total_token_count: 279
}
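
The sample above reads the image from a Cloud Storage URI. If the image is on your local disk instead, you can load it with the Image class and pass it the same way. The following is a sketch; the file path is a placeholder:

from vertexai.preview.generative_models import GenerativeModel, Image

gemini_pro_vision_model = GenerativeModel("gemini-1.0-pro-vision")

# Placeholder path; point this at an image file on your machine.
local_image = Image.load_from_file("path/to/daisy.jpg")
model_response = gemini_pro_vision_model.generate_content(["what is this image?", local_image])
print("model_response\n", model_response)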

Generate content from text and an image

The following code sample uses the Gemini Pro Vision multimodal model to generate content using a text prompt and a picture of a flower:

from vertexai import generative_models
from vertexai.generative_models import GenerativeModel

image = generative_models.Part.from_uri("gs://cloud-samples-data/ai-platform/flowers/daisy/10559679065_50d2b16f6d.jpg", mime_type="image/jpeg")
gemini_pro_vision_model = GenerativeModel("gemini-1.0-pro-vision")
model_response = gemini_pro_vision_model.generate_content(["What is shown in this image?", image])
print("model_response\n",model_response)

The response to this sample code might be similar to the following:

candidates {
  content {
    role: "model"
    parts {
      text: " This is an image of a white daisy growing in a pile of brown, dead leaves."
    }
  }
  finish_reason: STOP
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
}
usage_metadata {
  prompt_token_count: 265
  candidates_token_count: 18
  total_token_count: 283
}
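
For multi-turn conversations, the GenerativeModel class also provides a chat session that keeps track of the message history. The following sketch uses the Gemini Pro model; the prompts are illustrative only:

from vertexai.preview.generative_models import GenerativeModel

gemini_pro_model = GenerativeModel("gemini-1.0-pro")
chat = gemini_pro_model.start_chat()

# Each send_message call includes the previous turns as context.
print(chat.send_message("What is x multiplied by 2 if x = 42?").text)
print(chat.send_message("Now add 10 to that result.").text)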

Generate content from a video

You can use streaming to receive generated content in chunks as the model produces it, instead of waiting for the complete response. The following code sample uses the Gemini Pro Vision multimodal model to stream generated content from a text prompt and a video of an advertisement for a movie.

from vertexai import generative_models
from vertexai.generative_models import GenerativeModel

gemini_pro_vision_model = GenerativeModel("gemini-1.0-pro-vision")
response = gemini_pro_vision_model.generate_content([
  "What is in the video? ",
  generative_models.Part.from_uri("gs://cloud-samples-data/video/animals.mp4", mime_type="video/mp4"),
], stream=True)

for chunk in response:
  print(chunk.text)

The response to this sample code might be similar to the following:

The video is an advertisement for the movie Zootopia. It features a tiger, an
otter, and a sloth. The tiger is shown looking at the camera , while the otter
and the sloth are shown swimming in a pool. The video is set to the song "Try
Everything" by Shakira."

What's next