Tutorial: Vertex AI API in express mode

Vertex AI in express mode lets you quickly try out core generative AI features that are available on Vertex AI. This tutorial shows you how to perform the following tasks by using the Vertex AI API in express mode:

  • Install and initialize the Vertex AI SDK for Python for express mode.
  • Send a request to the Gemini for Google Cloud API, including the following:
    • Streaming request
    • Non-streaming request
    • Function calling request

Install and initialize the Vertex AI SDK for Python for express mode

The Vertex AI SDK for Python lets you use Google's generative AI models and features to build AI-powered applications. When using Vertex AI in express mode, install and initialize the google-cloud-aiplatform package to authenticate using your generated API key.

Install

To install the Vertex AI SDK for Python for express mode, run the following commands:

# Developer TODO: If you're using Colab, uncomment the following lines:
# from google.colab import auth
# auth.authenticate_user()

!pip install google-cloud-aiplatform

!pip install --force-reinstall -qq "numpy<2.0"

If you're using Colab, ignore any dependency conflicts and restart the runtime after installation.

Initialize

Configure the API key for express mode and environment variables. For details on getting an API key, see Vertex AI in express mode overview.

import base64
from google.cloud.aiplatform.preview import vertexai
from google.cloud.aiplatform.preview.vertexai.generative_models import GenerativeModel, Part, SafetySetting, FinishReason
import google.cloud.aiplatform.preview.vertexai.generative_models as generative_models

# Developer TODO: Replace API_KEY with your API key.
API_KEY = "API_KEY"

vertexai.init(api_key=API_KEY)
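Hardcoding the key is convenient in a notebook, but for any shared or committed code you might read it from an environment variable instead. The following is a minimal sketch; the variable name VERTEX_AI_API_KEY is an arbitrary choice for this example, not a name the SDK looks for:

```python
import os

def load_api_key(var_name: str = "VERTEX_AI_API_KEY") -> str:
    """Return the API key from the environment, failing fast if it's unset."""
    api_key = os.environ.get(var_name)
    if not api_key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return api_key

# Then initialize the SDK with the loaded key:
# vertexai.init(api_key=load_api_key())
```

Failing fast with a clear message avoids a confusing authentication error later in the request path.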

Send a request to the Gemini for Google Cloud API

You can send either streaming or non-streaming requests to the Gemini for Google Cloud API. Streaming requests return the response in chunks as the request is being processed. To a human user, streamed responses reduce the perception of latency. Non-streaming requests return the response in one chunk after the request is processed.

Streaming request

To send a streaming request, set stream=True and print the response in chunks.

def generate():
  model = GenerativeModel(
    "gemini-1.5-flash-001",
  )
  responses = model.generate_content(
      ["""Explain bubble sort to me"""],
      generation_config=generation_config,
      safety_settings=safety_settings,
      stream=True,
  )

  for chunk in responses:
    print(chunk.text, end="")


generation_config = {
    "max_output_tokens": 8192,
    "temperature": 1,
    "top_p": 0.95,
}

safety_settings = [
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
]

generate()

Non-streaming request

The following code sample defines a function that sends a non-streaming request to the gemini-1.5-flash-001 model. It shows you how to configure basic request parameters and safety settings.

def generate():
  model = GenerativeModel(
    "gemini-1.5-flash-001",
  )
  response = model.generate_content(
      ["""Explain bubble sort to me"""],
      generation_config=generation_config,
      safety_settings=safety_settings,
  )

  print(response.text)


generation_config = {
    "max_output_tokens": 8192,
    "temperature": 1,
    "top_p": 0.95,
}

safety_settings = [
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
    SafetySetting(
        category=SafetySetting.HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=SafetySetting.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    ),
]

generate()

Function calling request

The following code sample declares a function as a tool, sends a prompt to the model, and prints the model's response, which contains the JSON arguments required to make the function call.

from google.cloud.aiplatform.preview import vertexai
from google.cloud.aiplatform.preview.vertexai.generative_models import (
    FunctionDeclaration,
    GenerativeModel,
    GenerationConfig,
    Part,
    Tool,
)

# Specify a function declaration and parameters for an API request.
get_current_weather_func = FunctionDeclaration(
        name="get_current_weather",
        description="Get the current weather in a given location",
        # Function parameters are specified in OpenAPI JSON schema format.
        parameters={
            "type": "object",
            "properties": {"location": {"type": "string", "description": "Location"}},
        },
    )

# Define a tool that includes the above get_current_weather_func.
weather_tool = Tool(
        function_declarations=[get_current_weather_func],
    )


gemini_model = GenerativeModel("gemini-1.5-flash-001", tools=[weather_tool])
model_response = gemini_model.generate_content("What is the weather in Boston?")

print("model_response\n", model_response)
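The response contains the name of the function the model wants called and its arguments; executing that function is your application's job. The following is a minimal sketch of the dispatch step using a plain dict of handlers. The handler body and its return value are made up for illustration; in the real flow you would take the name and args from the function call part of model_response rather than passing them in by hand:

```python
def get_current_weather(location: str) -> dict:
    """Stand-in handler for illustration; a real one would call a weather API."""
    return {"location": location, "temperature_c": 21, "conditions": "sunny"}

# Map declared function names to local handlers.
HANDLERS = {"get_current_weather": get_current_weather}

def dispatch(function_name: str, args: dict) -> dict:
    """Look up and run the handler for a function call returned by the model."""
    handler = HANDLERS.get(function_name)
    if handler is None:
        raise ValueError(f"Model requested an undeclared function: {function_name}")
    return handler(**args)

result = dispatch("get_current_weather", {"location": "Boston"})
```

After running the function, you would send the result back to the model (for example, with Part.from_function_response) so it can compose a natural-language answer from the tool output.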

Clean up

This tutorial doesn't create any Google Cloud resources, so no cleanup is required to avoid charges.

What's next