Gemini 2.0 Flash supports response generation in multiple modalities, including text, speech, and images.
Text generation
Gemini 2.0 Flash supports text generation using the Google Cloud console, REST API, and supported SDKs. For more information, see our text generation guide.
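As a minimal sketch of the REST path (the request body shape mirrors the speech and image examples later on this page; the prompt text here is only an illustration), a text-only request needs just a contents array, since the response modality defaults to text:

```shell
# Build a minimal text-only request body for generateContent.
# Send it with the same curl command shown in the speech and image
# sections below, substituting your own project ID.
cat << EOF > request.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        { "text": "Write a two-sentence product description for a smart mug." }
      ]
    }
  ]
}
EOF
```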
Speech generation (early access/allowlist)
Gemini 2.0 supports a new multimodal generation capability: text to speech.
Using the text-to-speech capability, you can prompt the model to generate high-quality audio output that sounds like a human voice (say "hi everyone"), and you can further refine the output by steering the voice.
Generate speech
The following sections cover how to generate speech using either Vertex AI Studio or the API.
For guidance and best practices for prompting, see Design multimodal prompts.
Using Vertex AI Studio
To use speech generation:
- Open Vertex AI Studio > Freeform.
- Select gemini-2.0-flash-exp from the Models drop-down menu.
- In the Response panel, select Audio from the drop-down menu.
- Write a description of the speech you want to generate in the text area of the Prompt panel.
- Click the Prompt button.
Gemini will generate speech based on your description. This process should take a few seconds, but may be comparatively slower depending on capacity.
Using the API
Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:
cat << EOF > request.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        { "text": "Say, 'How are you?'" }
      ]
    }
  ],
  "generation_config": {
    "response_modalities": [ "AUDIO" ]
  },
  "safety_settings": [
    { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE" },
    { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE" },
    { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE" },
    { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE" }
  ]
}
EOF
Then execute the following command to send your REST request:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/gemini-2.0-flash-exp:generateContent" \
  -d @request.json
Gemini will generate audio based on your description. This process should take a few seconds, but may be comparatively slower depending on capacity.
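The audio itself comes back base64-encoded inside the response JSON. The following sketch decodes it with the Python standard library; the response shape shown here (candidates[0].content.parts[].inlineData.data) is an assumption based on the generateContent response format, and the sample payload is a stand-in rather than real model output:

```shell
# Stand-in for a real response: in practice, save the curl output with
# curl ... -d @request.json > response.json
cat << EOF > response.json
{
  "candidates": [
    {
      "content": {
        "parts": [
          { "inlineData": { "mimeType": "audio/wav", "data": "UklGRg==" } }
        ]
      }
    }
  ]
}
EOF

# Decode the base64 audio bytes into a playable file.
python3 - <<'PY'
import base64, json

with open("response.json") as f:
    response = json.load(f)

for part in response["candidates"][0]["content"]["parts"]:
    if "inlineData" in part:
        with open("output.wav", "wb") as out:
            out.write(base64.b64decode(part["inlineData"]["data"]))
PY
```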
Image generation (early access/allowlist)
Gemini 2.0 supports the ability to output text with in-line images. This lets you use Gemini to conversationally edit images or generate multimodal outputs (for example, a blog post with text and images in a single turn). Previously this would have required stringing together multiple models.
Image generation is available as a private experimental release. It supports the following modalities and capabilities:
- Text to image
- Example prompt: "Generate an image of the Eiffel tower with fireworks in the background."
- Text to image(s) and text (interleaved)
- Example prompt: "Generate an illustrated recipe for a paella. Create images to go alongside the text as you generate the recipe."
- Image(s) and text to image(s) and text (interleaved)
- Example prompt: (With an image of a furnished room) "What other color sofas would work in my space? Can you update the image?"
- Image editing (text and image to image)
- Example prompt: "Edit this image to make it look like a cartoon"
- Example prompt: [image of a cat] + [image of a pillow] + "Create a cross stitch of my cat on this pillow."
- Multi-turn image editing (chat)
- Example prompts: [upload an image of a blue car.] "Turn this car into a convertible." "Now change the color to yellow."
- Watermarking
- All generated images include a SynthID watermark.
Limitations:
- Generation of people and editing of uploaded images of people are not allowed.
- For best performance, use the following languages: en, es-MX, ja-JP, zh-CN, hi-IN.
- Image generation does not support audio or video inputs.
- Image generation may not always trigger:
- The model may output text only. Try asking for image outputs explicitly (e.g. "generate an image", "provide images as you go along", "update the image").
- The model may stop generating partway through. Try again or try a different prompt.
Generate images
The following sections cover how to generate images using either Vertex AI Studio or the API.
For guidance and best practices for prompting, see Design multimodal prompts.
Using Vertex AI Studio
To use image generation:
- Open Vertex AI Studio > Freeform.
- Select gemini-2.0-flash-exp from the Models drop-down menu.
- In the Response panel, select Image and text from the drop-down menu.
- Write a description of the image you want to generate in the text area of the Prompt panel.
- Click the Prompt button.
Gemini will generate an image based on your description. This process should take a few seconds, but may be comparatively slower depending on capacity.
Using the API
Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:
cat << EOF > request.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        { "text": "Generate an image of a cat." }
      ]
    }
  ],
  "generation_config": {
    "response_modalities": [ "IMAGE", "TEXT" ]
  },
  "safety_settings": [
    { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE" },
    { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE" },
    { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE" },
    { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE" }
  ]
}
EOF
Then execute the following command to send your REST request:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/gemini-2.0-flash-exp:generateContent" \
  -d @request.json
Gemini will generate an image based on your description. This process should take a few seconds, but may be comparatively slower depending on capacity.
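Because image responses can interleave text and image parts, the extraction step loops over all parts, printing text and saving each image. As with the audio example, the response shape (inlineData for binary parts, text for text parts) is an assumption about the generateContent response format, and the sample payload below is a stand-in rather than real model output:

```shell
# Stand-in for a real response saved from curl output.
cat << EOF > response.json
{
  "candidates": [
    {
      "content": {
        "parts": [
          { "text": "Here is your cat:" },
          { "inlineData": { "mimeType": "image/png", "data": "iVBORw0KGgo=" } }
        ]
      }
    }
  ]
}
EOF

# Print text parts and write each image part to a numbered PNG file.
python3 - <<'PY'
import base64, json

with open("response.json") as f:
    response = json.load(f)

image_index = 0
for part in response["candidates"][0]["content"]["parts"]:
    if "text" in part:
        print(part["text"])
    elif "inlineData" in part:
        name = f"image_{image_index}.png"
        with open(name, "wb") as out:
            out.write(base64.b64decode(part["inlineData"]["data"]))
        image_index += 1
PY
```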