Gemini 2.0 (experimental)

Gemini 2.0 Flash is now available as an experimental preview release through the Vertex AI Gemini API and Vertex AI Studio. The model introduces new features and enhanced core capabilities:

  • Multimodal Live API: This new API helps you create real-time vision and audio streaming applications with tool use.
  • Speed and performance: Gemini 2.0 Flash has a significantly improved time to first token (TTFT) over Gemini 1.5 Flash.
  • Quality: The model maintains quality comparable to larger models like Gemini 1.5 Pro.
  • Improved agentic experiences: Gemini 2.0 delivers improvements to multimodal understanding, coding, complex instruction following, and function calling. These improvements work together to support better agentic experiences.
  • New Modalities: Gemini 2.0 introduces native image generation and controllable text-to-speech capabilities, enabling image editing, localized artwork creation, and expressive storytelling.

To support the new model, we're also shipping an all new SDK that supports simple migration between the Gemini Developer API and the Gemini API on Vertex AI.

For Gemini 2.0 technical details, see Google models.

Google Gen AI SDK (experimental)

The new Google Gen AI SDK provides a unified interface to Gemini 2.0 through both the Gemini Developer API and the Gemini API on Vertex AI. With a few exceptions, code that runs on one platform will run on both. This means that you can prototype an application using the Developer API and then migrate the application to Vertex AI without rewriting your code.

The Gen AI SDK also supports the Gemini 1.5 models.

The new SDK is available in Python and Go, with Java and JavaScript coming soon.

You can start using the SDK as shown below.

  1. Install the new SDK: pip install google-genai
  2. Then import the library, initialize a client, and generate content:
from google import genai

# Replace the `project` and `location` values with appropriate values for
# your project.
client = genai.Client(
    vertexai=True, project='YOUR_CLOUD_PROJECT', location='us-central1'
)
response = client.models.generate_content(
    model='gemini-2.0-flash-exp', contents='How does AI work?'
)
print(response.text)

(Optional) Set environment variables

Alternatively, you can initialize the client using environment variables. First set the appropriate values and export the variables:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=YOUR_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True

Then you can initialize the client without any args:

client = genai.Client()

Multimodal Live API

The Multimodal Live API enables low-latency bidirectional voice and video interactions with Gemini. Using the Multimodal Live API, you can provide end users with the experience of natural, human-like voice conversations, and with the ability to interrupt the model's responses using voice commands. The model can process text, audio, and video input, and it can provide text and audio output.

The Multimodal Live API is available in the Gemini API as the BidiGenerateContent method and is built on WebSockets.

For more information, see the Multimodal Live API reference guide.

For a text-to-text example to help you get started with the Multimodal Live API, see the following:

from google import genai

client = genai.Client()
model_id = "gemini-2.0-flash-exp"
config = {"response_modalities": ["TEXT"]}

async with client.aio.live.connect(model=model_id, config=config) as session:
    message = "Hello? Gemini, are you there?"
    print("> ", message, "\n")
    await session.send(message, end_of_turn=True)

    async for response in session.receive():
        print(response.text)

Features:

  • Audio input with audio output
  • Audio and video input with audio output
  • A selection of voices; see Multimodal Live API voices
  • Session duration of up to 15 minutes for audio or up to 2 minutes of audio and video

To learn about additional capabilities of the Multimodal Live API, see Multimodal Live API capabilities.

Language:

  • English only

Limitations:

Search as a tool

Using Grounding with Google Search, you can improve the accuracy and recency of responses from the model. Starting with Gemini 2.0, Google Search is available as a tool. This means that the model can decide when to use Google Search. The following example shows how to configure Search as a tool.

from google import genai
from google.genai.types import Tool, GenerateContentConfig, GoogleSearch

client = genai.Client()
model_id = "gemini-2.0-flash-exp"

google_search_tool = Tool(
    google_search = GoogleSearch()
)

response = client.models.generate_content(
    model=model_id,
    contents="When is the next total solar eclipse in the United States?",
    config=GenerateContentConfig(
        tools=[google_search_tool],
        response_modalities=["TEXT"],
    )
)

for each in response.candidates[0].content.parts:
    print(each.text)
# Example response:
# The next total solar eclipse visible in the contiguous United States will be on ...

# To get grounding metadata as web content.
print(response.candidates[0].grounding_metadata.search_entry_point.rendered_content)

The Search-as-a-tool functionality also enables multi-turn searches and multi-tool queries (for example, combining Grounding with Google Search and code execution).

Search as a tool enables complex prompts and workflows that require planning, reasoning, and thinking:

  • Grounding to enhance factuality and recency and provide more accurate answers
  • Retrieving artifacts from the web to do further analysis on
  • Finding relevant images, videos, or other media to assist in multimodal reasoning or generation tasks
  • Coding, technical troubleshooting, and other specialized tasks
  • Finding region-specific information or assisting in translating content accurately
  • Finding relevant websites for further browsing

Bounding box detection

In this experimental launch, we are providing developers with a powerful tool for object detection and localization within images and video. By accurately identifying and delineating objects with bounding boxes, developers can unlock a wide range of applications and enhance the intelligence of their projects.

Key Benefits:

  • Simple: Integrate object detection capabilities into your applications with ease, regardless of your computer vision expertise.
  • Customizable: Produce bounding boxes based on custom instructions (e.g. "I want to see bounding boxes of all the green objects in this image"), without having to train a custom model.

Technical Details:

  • Input: Your prompt and associated images or video frames.
  • Output: Bounding boxes in the [y_min, x_min, y_max, x_max] format. The top left corner is the origin. The x and y axis go horizontally and vertically, respectively. Coordinate values are normalized to 0-1000 for every image.
  • Visualization: AI Studio users will see bounding boxes plotted within the UI. Vertex AI users should visualize their bounding boxes through custom visualization code.

Speech generation (early access/allowlist)

Gemini 2.0 supports a new multimodal generation capability: text to speech. Using the text-to-speech capability, you can prompt the model to generate high quality audio output that sounds like a human voice (say "hi everyone"), and you can further refine the output by steering the voice.

Image generation (early access/allowlist)

Gemini 2.0 supports the ability to output text with in-line images. This lets you use Gemini to conversationally edit images or generate multimodal outputs (for example, a blog post with text and images in a single turn). Previously this would have required stringing together multiple models.

Image generation is available as a private experimental release. It supports the following modalities and capabilities:

  • Text to image
    • Example prompt: "Generate an image of the Eiffel tower with fireworks in the background."
  • Text to image(s) and text (interleaved)
    • Example prompt: "Generate an illustrated recipe for a paella."
  • Image(s) and text to image(s) and text (interleaved)
    • Example prompt: (With an image of a furnished room) "What other color sofas would work in my space? can you update the image?"
  • Image editing (text and image to image)
    • Example prompt: "Edit this image to make it look like a cartoon"
    • Example prompt: [image of a cat] + [image of a pillow] + "Create a cross stitch of my cat on this pillow."
  • Multi-turn image editing (chat)
    • Example prompts: [upload an image of a blue car.] "Turn this car into a convertible." "Now change the color to yellow."
  • Watermarking
    • All generated images include a SynthID watermark.

Limitations:

  • Generation of people and editing of uploaded images of people are not allowed.
  • For best performance, use the following languages: EN, es-MX, ja-JP, zh-CN, hi-IN.
  • Image generation does not support audio or video inputs.
  • Image generation may not always trigger:
    • The model may output text only. Try asking for image outputs explicitly (e.g. "generate an image", "provide images as you go along", "update the image").
    • The model may stop generating partway through. Try again or try a different prompt.

Pricing

You aren't billed for the usage of experimental Google models.