Supported models

The following tables show the models that support Provisioned Throughput, the throughput for each generative AI scale unit (GSU) and the burndown rates for each model.

Google models

Provisioned Throughput only supports models that you call directly from your project using the specific model ID and not a model alias. To use Provisioned Throughput to make API calls to a model, you must use the specific model version ID (for example, gemini-2.0-flash-001) and not a model version alias.

Moreover, Provisioned Throughput doesn't support models that are called by other Vertex AI products, such as Vertex AI Agents and Vertex AI Search. For example, if you make API calls to Gemini 2.0 Flash while using Vertex AI Search, your Provisioned Throughput order for Gemini 2.0 Flash won't guarantee the calls made by Vertex AI Search.

Provisioned Throughput doesn't support batch prediction calls.

The following table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. Your per-second throughput is defined as your prompt input and generated output across all requests per second.

To find out how many tokens your workload requires, refer to the SDK tokenizer or the countTokens API.

Model	Per-second throughput per GSU	Units	Minimum GSU purchase increment	Burndown rates
Gemini 2.5 Flash with Live API Latest supported version: `gemini-live-2.5-flash`	1620	Tokens	1	1 input text token = 1 input text token 1 input audio token = 6 input text tokens 1 input video token = 6 input text tokens 1 input session memory token = 1 input text token 1 output text token = 4 input text tokens 1 output audio token = 24 input text tokens
Gemini 2.5 Flash Image Latest supported version: `gemini-2.5-flash-image`	2690	Tokens	1	1 input text token = 1 token 1 input image token = 1 token 1 output text token = 9 tokens 1 output image token = 100 tokens
Gemini 2.5 Flash-Lite Latest supported version (GA): `gemini-2.5-flash-lite` Latest supported version (preview): `gemini-2.5-flash-lite-preview-09-2025`	8070	Tokens	1	1 input text token = 1 token 1 input image token = 1 token 1 input video token = 1 token 1 input audio token = 3 tokens 1 output response text token = 4 tokens 1 output reasoning text token = 4 tokens
Gemini 2.5 Flash with Live API native audio Latest supported version: `gemini-live-2.5-flash-preview-native-audio-09-2025` (preview)	1620	Tokens	1	1 input text token = 1 token 1 input audio token = 6 tokens 1 input video token = 6 tokens 1 input image token = 6 tokens 1 input session memory token = 1 token 1 output text token = 4 tokens 1 output audio token = 24 tokens
Gemini 2.5 Pro Latest supported version: `gemini-2.5-pro`	650	Tokens	1	Less than or equal to 200,000 input tokens: 1 input text token = 1 token 1 input image token = 1 token 1 input video token = 1 token 1 input audio token = 1 token 1 output response text token = 8 tokens 1 output reasoning text token = 8 tokens Greater than 200,000 input tokens: 1 input text token = 2 tokens 1 input image token = 2 tokens 1 input video token = 2 tokens 1 input audio token = 2 tokens 1 output response text token = 12 tokens 1 output reasoning text token = 12 tokens
Gemini 2.5 Flash Latest supported version (GA): `gemini-2.5-flash` Latest supported version (preview): `gemini-2.5-flash-preview-09-2025`	2690	Tokens	1	1 input text token = 1 token 1 input image token = 1 token 1 input video token = 1 token 1 input audio token = 4 tokens 1 output response text token = 9 tokens 1 output reasoning text token = 9 tokens
Gemini 2.0 Flash Latest supported version: `gemini-2.0-flash-001`	3360	Tokens	1	1 input text token = 1 token 1 input image token = 1 token 1 input video token = 1 token 1 input audio token = 7 tokens 1 output text token = 4 tokens
Gemini 2.0 Flash-Lite Latest supported version: `gemini-2.0-flash-lite-001`	6720	Tokens	1	1 input text token = 1 token 1 input image token = 1 token 1 input video token = 1 token 1 input audio token = 1 token 1 output text token = 4 tokens
Veo 3.1 preview Latest supported version: `veo-3.1-generate-preview` (preview)	0.0040	Video seconds	34	1 output video second = 1 output video second
	0.0040	Video+audio seconds	67	1 output video+audio second = 2 output video seconds
Veo 3.1 Fast preview Latest supported version: `veo-3.1-fast-generate-preview` (preview)	0.0080	Video seconds	17	1 output video second = 1 output video second
	0.0080	Video+audio seconds	25	1 output video+audio second = 1.45 output video seconds
Veo 3 Latest supported version: `veo-3.0-generate-001`	0.0040	Video seconds	34	1 output video second = 1 output video second
Veo 3 Latest supported version: `veo-3.0-generate-001`	0.0040	Video+audio seconds	67	1 output video+audio second = 2 output video seconds
Veo 3 Fast Latest supported version: `veo-3.0-fast-generate-001`	0.0080	Video seconds	17	1 output video second = 1 output video second
	0.0080	Video+audio seconds	25	1 output video+audio second = 1.45 output video seconds
Imagen 4 Ultra Generate `imagen-4.0-ultra-generate-001`	0.015	Images	1	Only output images count toward your Provisioned Throughput quota.
Imagen 4 Generate `imagen-4.0-generate-001`	0.02	Images	1	Only output images count toward your Provisioned Throughput quota.
Imagen 4 Fast Generate `imagen-4.0-fast-generate-001`	0.04	Images	1	Only output images count toward your Provisioned Throughput quota.
Imagen 3 Generate 002 `imagen-3.0-generate-002`	0.02	Images	1	Only output images count toward your Provisioned Throughput quota.
Imagen 3 Generate 001 `imagen-3.0-generate-001`	0.025	Images	1	Only output images count toward your Provisioned Throughput quota.
Imagen 3 Fast	0.05	Images	1	Only output images count toward your Provisioned Throughput quota.
MedLM medium Caution: MedLM is deprecated. Access to MedLM will no longer be available on or after September 29, 2025.	2,000	Characters	1	1 input char = 1 char 1 output char = 2 chars
MedLM large Caution: MedLM is deprecated. Access to MedLM will no longer be available on or after September 29, 2025.	200	Characters	1	1 input char = 1 char 1 output char = 3 chars
MedLM large 1.5 Caution: MedLM is deprecated. Access to MedLM will no longer be available on or after September 29, 2025.	200	Characters	1	1 input char = 1 char 1 output char = 3 chars

For information about a model's capabilities and input or output limits, see the documentation for the model.

Request access: The model gemini-live-2.5-flash is in private GA. For information about access to this release, see the access request page.

You can upgrade to new models as they are made available. For information about model availability and discontinuation dates, see Google models.

For more information about supported locations, see Available locations.

Global endpoint model support

Provisioned Throughput supports the global endpoint for the following models:

Model	Latest supported model version
Gemini 2.5 Flash Image	`gemini-2.5-flash-image`
Gemini 2.5 Flash-Lite	`gemini-2.5-flash-lite-preview-09-2025` (preview) `gemini-2.5-flash-lite` (GA)
Gemini 2.5 Pro	`gemini-2.5-pro`
Gemini 2.5 Flash	`gemini-2.5-flash-preview-09-2025` (preview) `gemini-2.5-flash` (GA)
Gemini 2.0 Flash	`gemini-2.0-flash-001`
Gemini 2.0 Flash-Lite	`gemini-2.0-flash-lite-001`

Traffic that exceeds the Provisioned Throughput quota uses the global endpoint, by default.

To assign Provisioned Throughput to the global endpoint of a model, select global as the region when you place a Provisioned Throughput order.

Supervised fine-tuned model support

The following is supported for Google models that support supervised fine-tuning:

Provisioned Throughput can be applied to both base models and supervised fine-tuned versions of those base models.
Supervised fine-tuned model endpoints and their corresponding base model count towards the same Provisioned Throughput quota.

For example, Provisioned Throughput purchased for gemini-2.0-flash-lite-001 for a specific project prioritizes requests that are made from supervised fine-tuned versions of gemini-2.0-flash-lite-001 created within that project. Use the appropriate header to control traffic behavior.

Partner models

The following table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, which is defined as a total of input and output tokens across all requests per second.

Model	Throughput per GSU (tokens/sec)	Minimum GSU purchase	GSU purchase increment	Burndown rates
Anthropic's Claude Sonnet 4.5	350	25	1	Less than 200,000 input tokens: 1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token Greater than or equal to 200,000 input tokens: 1 input token = 2 token 1 output token = 7.5 tokens 1 cache write token = 2.5 tokens 1 cache hit token = 0.2 token
Anthropic's Claude Opus 4.1	70	35	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude Haiku 4.5	1050	8	1	Less than 200,000 input tokens: 1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude Opus 4	70	35	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude Sonnet 4	350	25	1	Less than 200,000 input tokens: 1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token Greater than or equal to 200,000 input tokens: 1 input token = 2 token 1 output token = 7.5 tokens 1 cache write token = 2.5 tokens 1 cache hit token = 0.2 token
Anthropic's Claude 3.7 Sonnet	350	25	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude 3.5 Sonnet v2 (deprecated)	350	25	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude 3.5 Haiku	2,000	10	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude 3 Opus	70	35	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude 3 Haiku	4,200	5	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token
Anthropic's Claude 3.5 Sonnet (deprecated)	350	25	1	1 input token = 1 token 1 output token = 5 tokens 1 cache write token = 1.25 tokens 1 cache hit token = 0.1 token

For information about supported locations, see Anthropic Claude region availability. To order Provisioned Throughput for Anthropic models, contact your Google Cloud account representative.

What's next

Calculate Provisioned Throughput requirements.