Provisioned Throughput is a fixed-cost monthly subscription service that reserves throughput for supported generative AI models on Vertex AI. To reserve your throughput, you must specify the model and available locations in which the model runs.
This page explains when to use Provisioned Throughput, how it works, and how to subscribe.
Supported models
The following tables show the models that support Provisioned Throughput, the throughput for each generative AI scale unit (GSU), and the burndown rates for each model.
Google models
This table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. The Google models are measured in characters per second, which is defined as your prompt input and generated text output characters across all requests per second.
Model | Throughput per GSU (chars/sec) | Minimum GSU purchase increment | Burndown rates |
---|---|---|---|
Gemini 1.5 Flash | Context window ≤ 128,000: 54,000. Context window > 128,000: 27,000 | 1 | Context window ≤ 128,000: 1 input char = 1 char, 1 output char = 4 chars, 1 image = 1,067 chars, 1 video per second = 1,067 chars, 1 audio per second = 107 chars. Context window > 128,000: 1 input char = 2 chars, 1 output char = 8 chars, 1 image = 2,134 chars, 1 video per second = 2,134 chars, 1 audio per second = 214 chars |
Gemini 1.5 Pro | 800 | 1 | Context window ≤ 128,000: 1 input char = 1 char, 1 output char = 3 chars, 1 image = 1,052 chars, 1 video per second = 1,052 chars, 1 audio per second = 100 chars. Context window > 128,000: 1 input char = 2 chars, 1 output char = 6 chars, 1 image = 2,104 chars, 1 video per second = 2,104 chars, 1 audio per second = 200 chars |
Gemini 1.0 Pro | 8,000 | 1 | 1 input char = 1 char, 1 output char = 3 chars, 1 image = 20,000 chars, 1 video per second = 16,000 chars |
Imagen 3 | 0.025 (measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 3 Fast | 0.05 (measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 2 | 0.05 (measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 2 Edit | 0.05 (measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
MedLM-medium | 2,000 | 1 | 1 input char = 1 char, 1 output char = 2 chars |
MedLM-large | 200 | 1 | 1 input char = 1 char, 1 output char = 3 chars |
For more information about supported locations, see Available locations.
You can upgrade to new models as they are made available. For information about availability and discontinuation dates, see Google models.
Google legacy models
See Legacy models that support Provisioned Throughput.
Partner models
This table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, which is defined as a total of input and output tokens across all requests per second.
Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
---|---|---|---|---|
Anthropic's Claude 3.5 Sonnet v2 | 350 | 25 | 1 | 1 input token = 1 token, 1 output token = 5 tokens |
Anthropic's Claude 3.5 Sonnet | 350 | 25 | 1 | 1 input token = 1 token, 1 output token = 5 tokens |
Anthropic's Claude 3 Opus | 70 | 35 | 1 | 1 input token = 1 token, 1 output token = 5 tokens |
Anthropic's Claude 3 Haiku | 4,200 | 5 | 1 | 1 input token = 1 token, 1 output token = 5 tokens |
Anthropic's Claude 3 Sonnet | 350 | 25 | 1 | 1 input token = 1 token, 1 output token = 5 tokens |
For more information about supported locations, see Available locations.
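As a rough illustration of how the partner table's values combine, the following Python sketch sizes an order for a token-based model. The per-GSU throughput, output burndown rate, minimum purchase, and increment are taken from the Claude 3.5 Sonnet v2 row. The function name and the workload numbers (1,000 input tokens, 200 output tokens, 5 or 1 queries per second) are hypothetical, and rounding up before applying the minimum purchase is an assumption about how an order would be sized, not a documented formula.

```python
import math

def estimate_partner_gsus(input_tokens, output_tokens, qps,
                          tokens_per_gsu=350, output_burndown=5,
                          minimum_gsus=25, increment=1):
    """Estimate GSUs for a token-based model (illustrative only)."""
    # Inputs count 1:1; outputs are multiplied by the output burndown rate.
    tokens_per_query = input_tokens + output_tokens * output_burndown
    tokens_per_second = tokens_per_query * qps
    # Round up to the next purchase increment, then apply the minimum purchase.
    raw_gsus = tokens_per_second / tokens_per_gsu
    rounded = math.ceil(raw_gsus / increment) * increment
    return max(rounded, minimum_gsus)

print(estimate_partner_gsus(1000, 200, qps=5))  # 29
print(estimate_partner_gsus(1000, 200, qps=1))  # 25 (minimum purchase applies)
```

In the second call the workload needs fewer than 6 GSUs, so the minimum purchase of 25 determines the order size.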
When to use Provisioned Throughput
If any of the following considerations apply to your use case, consider using Provisioned Throughput:
- Your critical workloads consistently require high throughput. Throughput measurement depends on the model.
- You are building real-time generative AI production applications, such as chatbots and agents.
- Your throughput needs exceed 20,000 characters per second.
- You want to provide a consistent and predictable experience for users of your applications.
- You want deterministic generative AI costs by paying a fixed monthly price with control of overages.
Provisioned Throughput is one of two ways to consume your generative AI models. The second way is pay-as-you-go, which is also referred to as on-demand.
How Provisioned Throughput is measured
This section explains the concepts of generative AI scale unit (GSU) and burndown rates. Provisioned Throughput is calculated and priced using GSUs and burndown rates.
A generative AI scale unit (GSU) is a measure of throughput for your prompts and responses. This amount specifies how much throughput to provision a model with.
To produce a standard unit across models, all inputs and outputs are converted to input characters per second (throughput) using model-specific ratios called burndown rates.
Different models use different amounts of throughput. For information about the minimum GSU purchase amount and increments for each model, see Supported models and burndown rates in this document.
This equation demonstrates how throughput is calculated:
inputs_per_query = inputs_across_modalities_converted_using_burndown_rates
outputs_per_query = outputs_across_modalities_converted_using_burndown_rates
throughput_per_second = (inputs_per_query + outputs_per_query) * queries_per_second
The calculated throughput per second determines how many GSUs you need for your use case.
Example of estimating your Provisioned Throughput needs
To estimate your Provisioned Throughput needs, use the estimation tool in the Google Cloud console. The following example illustrates the process of estimating the amount of Provisioned Throughput for your model. The region isn't considered in the estimation calculations.
Gather your requirements.
In this example, your requirement is to ensure that you can send 2,000 characters with 2 images and receive 300 characters of output for 10 queries per second using gemini-1.5-flash.
This step implies that you understand your use case, because you have identified the size of your inputs and outputs, the number of queries per second (QPS), and your model.
To estimate your throughput, specify your model. In this example, your model is gemini-1.5-flash.
Specify the type of input, and identify the burndown rate. Use the burndown rates table to identify the burndown rate based on your type of input.
An image's burndown rate for the gemini-1.5-flash model is 1,067 characters (for a context window of 128,000 or less).
Calculate your throughput.
Multiply the number of images by the burndown rate for the input type for your specific model.
2 images * 1,067 input characters per image = 2,134 input characters
Your total number of output characters is 300. Return to the burndown rates table, and find the burndown rate for output characters (four characters per output character) for your specific model (gemini-1.5-flash).
300 output characters * 4 characters per output character = 1,200 converted input characters
Add your totals together.
2,000 input characters + 2,134 converted input characters for the images + 1,200 converted input characters for the output = 5,334 converted input characters per query
Multiply the characters per query by your expected queries per second to get the total throughput per second.
5,334 converted input characters per query * 10 QPS = 53,340 total converted input characters per second
Calculate your GSUs.
The GSUs are the total throughput per second divided by throughput per GSU from the burndown table.
53,340 total converted input chars per second ÷ 54,000 throughput per GSU = 0.988 GSUs
The minimum GSU purchase increment for gemini-1.5-flash is 1, so purchasing 1 GSU meets your requirement.
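The arithmetic in this example can be sketched in Python as follows. The burndown rates and the 54,000 chars/sec per GSU are copied from the gemini-1.5-flash row of the Google models table for a context window of 128,000 or less. The function and dictionary names are hypothetical, and this is only an approximation of what the estimation tool in the console computes.

```python
import math

# Burndown rates for gemini-1.5-flash (context window <= 128,000),
# in converted input characters per unit, from the Google models table.
FLASH_BURNDOWN = {
    "input_char": 1,
    "output_char": 4,
    "image": 1067,
    "video_second": 1067,
    "audio_second": 107,
}
FLASH_CHARS_PER_GSU = 54_000

def converted_chars_per_query(usage, burndown):
    """Sum each modality's quantity multiplied by its burndown rate."""
    return sum(quantity * burndown[kind] for kind, quantity in usage.items())

def estimate_gsus(usage, qps, burndown, chars_per_gsu):
    """Return (throughput per second, fractional GSUs, whole GSUs to order)."""
    throughput = converted_chars_per_query(usage, burndown) * qps
    fractional = throughput / chars_per_gsu
    return throughput, fractional, math.ceil(fractional)

# The workload from the example: 2,000 chars + 2 images in, 300 chars out, 10 QPS.
usage = {"input_char": 2000, "image": 2, "output_char": 300}
throughput, fractional, gsus = estimate_gsus(
    usage, 10, FLASH_BURNDOWN, FLASH_CHARS_PER_GSU
)
print(throughput, round(fractional, 3), gsus)  # 53340 0.988 1
```

Rounding up with `math.ceil` reflects the purchase increment of 1 GSU for this model.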
What to consider before subscribing
To help you decide whether you want to subscribe to Provisioned Throughput, review this list of details about the subscription:
You can't cancel your order.
Your Provisioned Throughput purchase is a commitment, which means that you can't cancel the order. However, you can increase the number of purchased GSUs. If you accidentally purchase a commitment or there's a problem with your configuration, contact your Google Cloud account representative for assistance.
You can auto-renew your subscription.
When you submit your order, you can choose to auto-renew your subscription at the end of its term, or let the subscription expire. You can cancel the auto-renew process. To cancel your subscription before it auto renews, cancel the auto renewal 30 days prior to the start of the next term.
If you need assistance with this process, contact your Google Cloud account representative.
You can change your model version or region with notice.
Provisioned Throughput is enabled after you've chosen your project, region, model, and version. You can change your model version within the same model publisher or region with a 10-business-day notice by contacting your Google Cloud account representative for assistance. For example, you can switch between Google's models. You can switch between partner A's models. You can switch between partner B's models. You can't switch between Google, partner A, and partner B's models.
There is no downtime when you switch to Provisioned Throughput from pay-as-you-go.
There is no downtime when you switch between models for a Provisioned Throughput order. However, acquiring the throughput requires lead time.
By default, the overage is billed as pay-as-you-go.
If your throughput exceeds your Provisioned Throughput order amount, overages are processed and billed as pay-as-you-go. You can control overages on a per-request basis. For more information, see Use the REST API.
Requests are prioritized.
Requests from Provisioned Throughput customers are prioritized and serviced first before on-demand requests.
You must commit to a minimum usage and payment.
Minimum usage is dependent on the generative AI model that you select. Any usage beyond the purchased throughput rate isn't assured and is serviced on a reasonable-efforts basis.
Throughput doesn't accumulate.
Any unused throughput doesn't accumulate or carry over to the next month.
Provisioned Throughput is measured on characters or tokens per second.
Provisioned Throughput is measured on characters or tokens per second, not on queries per minute (QPM). As a result, measuring Provisioned Throughput depends on your use case's query size and QPM.
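To make the last point concrete, here is a small Python sketch that converts a query rate in QPM and a per-query size into characters per second. The function name and both workloads are hypothetical; the per-query sizes are assumed to already be converted using the burndown rates.

```python
def chars_per_second(queries_per_minute, converted_chars_per_query):
    """Convert a per-minute query rate and a per-query size to chars/sec."""
    return queries_per_minute / 60 * converted_chars_per_query

# Same QPM, different query sizes: the required throughput differs.
print(chars_per_second(600, 1_000))  # 10000.0
print(chars_per_second(600, 5_334))  # 53340.0
```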
Purchase Provisioned Throughput
This section provides the permissions you must have to place or to view a Provisioned Throughput order, and the instructions for placing and viewing your orders.
Permissions
To subscribe to Provisioned Throughput, you must have one of the following permissions assigned to your project, which lets you list and place new orders.
- aiplatform.googleapis.com/provisionedThroughputAdmin: Specific to Provisioned Throughput.
- aiplatform.googleapis.com/admin: Gives administrative rights to every resource in Vertex AI.
This role lets you only list your orders:
aiplatform.googleapis.com/viewer
Place a Provisioned Throughput order
Before you place your order to use Imagen models, submit the Request to grant permissions form to be granted permissions. If you expect your QPM to exceed 30,000, then to maximize your Provisioned Throughput order, request an increase to your default Vertex AI system quota using the following information:
- Service: The Vertex AI API.
- Name: Online prediction requests per minute per region
- Service type: A quota.
- Dimensions: The region where you ordered Provisioned Throughput.
- Value: This is your chosen online-prediction traffic limit.
Follow these steps to purchase a Provisioned Throughput subscription:
Console
- In the Google Cloud console, go to the Provisioned Throughput page.
- To start a new order, click Create.
- Enter an Order name.
- Select the Model.
- Select the Region.
- Enter the Number of generative AI scale units (GSUs) that you must purchase. If you must estimate the number of GSUs, click the Estimation tool.
  - Select your Model.
  - Enter the number of Queries per second.
  - Enter the number of Input characters per query.
  - Enter the number of Input images per query.
  - Enter the number of Video seconds per query.
  - Enter the number of Audio seconds per query.
  - Enter the number of Output characters per query.
  - If you want to use the values that you entered into the estimation tool, click Use calculated.
- Select your Term.
- Select your Renewal option.
- Click Continue.
- In the Summary section, review the price and throughput estimates for your order. Read the terms listed and linked in the form.
- To finalize your order, click Confirm.
Check order status
After you submit your Provisioned Throughput order, the order status might appear as one of the following:
- Pending review: You placed your order. Because approval depends on available capacity to provision your order, your order is waiting for review and approval. For more information about the status of your pending order, contact your Google Cloud account representative.
- Active: Google has approved and provisioned your order and billing starts.
- Expired: Your order has expired.
View Provisioned Throughput orders
Follow these steps to view your Provisioned Throughput orders:
Console
- In the Google Cloud console, go to the Provisioned Throughput page.
- Select the Region. Your list of orders appears.
Use Provisioned Throughput
This section explains how to control overages or bypass Provisioned Throughput and how to monitor the usage of Provisioned Throughput.
Control overages or bypass Provisioned Throughput
Use the REST API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.
Read through each option to determine what you must do to meet your use case.
Default behavior
If you exceed your purchased amount of throughput, the overages go to on-demand and are billed at the pay-as-you-go rate. After your Provisioned Throughput order is active, the default behavior takes place automatically. You don't have to change your code to begin consuming your order.
This curl example demonstrates the default behavior.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
$URL \
-d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
Use only Provisioned Throughput
If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests which exceed the Provisioned Throughput order amount return an error 429.
This curl example demonstrates how you can use the REST API to use your Provisioned Throughput subscription only, with overages returning an error 429.
Set the X-Vertex-AI-LLM-Request-Type header to dedicated.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Vertex-AI-LLM-Request-Type: dedicated" \
$URL \
-d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
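Because requests over the order amount return an error 429 in this mode, a client usually needs some way to handle that response. The following Python sketch shows one possible approach, retrying with exponential backoff, using only the standard library. The function name, the retry count, and the delays are hypothetical choices rather than documented recommendations; the URL and access token are supplied by the caller, as with `$URL` in the curl example.

```python
import time
import urllib.error
import urllib.request

def post_dedicated(url, access_token, body, max_attempts=4, base_delay=1.0,
                   opener=urllib.request.urlopen, sleep=time.sleep):
    """POST with X-Vertex-AI-LLM-Request-Type: dedicated, retrying on 429."""
    request = urllib.request.Request(
        url,
        data=body,  # JSON request body as bytes
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "X-Vertex-AI-LLM-Request-Type": "dedicated",
        },
    )
    for attempt in range(max_attempts):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only a 429 is retried, and not after the final attempt.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

The `opener` and `sleep` parameters exist only so the retry logic can be exercised without network access or real delays.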
Use only pay-as-you-go
This is also referred to as using on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This might be useful for experiments or applications that are in development.
This curl example demonstrates how you can use the REST API to bypass Provisioned Throughput, and use only pay-as-you-go.
Set the X-Vertex-AI-LLM-Request-Type header to shared.
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Vertex-AI-LLM-Request-Type: shared" \
$URL \
-d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
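The three options in this section differ only in whether the X-Vertex-AI-LLM-Request-Type header is sent and which value it carries. A minimal Python sketch of that choice follows; the function name and the mode label `"default"` are hypothetical, while the header name and the values `dedicated` and `shared` come from the curl examples.

```python
REQUEST_TYPE_HEADER = "X-Vertex-AI-LLM-Request-Type"

def request_headers(access_token, mode="default"):
    """Build HTTP headers for one of the three consumption modes."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    if mode == "dedicated":
        # Provisioned Throughput only; overages return an error 429.
        headers[REQUEST_TYPE_HEADER] = "dedicated"
    elif mode == "shared":
        # Pay-as-you-go only; the request bypasses the order.
        headers[REQUEST_TYPE_HEADER] = "shared"
    elif mode != "default":
        # Default behavior sends no header; anything else is a caller error.
        raise ValueError(f"unknown mode: {mode!r}")
    return headers

print(request_headers("TOKEN", "shared")[REQUEST_TYPE_HEADER])  # shared
print(REQUEST_TYPE_HEADER in request_headers("TOKEN"))          # False
```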
Monitor Provisioned Throughput
You can monitor your Provisioned Throughput usage through monitoring metrics and on a per-request basis.
Response headers
If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. This line of code applies only to the generateContent API call.
{"X-Vertex-AI-LLM-Request-Type": "dedicated"}
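A client can check this header on each response to see how the request was served. The helper below is a minimal Python sketch that takes the response headers as a plain dictionary; the function name is hypothetical, and how you obtain the headers depends on your HTTP library.

```python
def served_by_provisioned_throughput(response_headers):
    """Return True if the response reports the 'dedicated' request type."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value for name, value in response_headers.items()}
    return normalized.get("x-vertex-ai-llm-request-type") == "dedicated"

print(served_by_provisioned_throughput({"X-Vertex-AI-LLM-Request-Type": "dedicated"}))  # True
print(served_by_provisioned_throughput({"Content-Type": "application/json"}))           # False
```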
Metrics
Provisioned Throughput can be monitored using a set of metrics that are measured on the aiplatform.googleapis.com/PublisherModel resource type. Each metric is filterable along the following dimensions:
- type: input, output
- request_type: dedicated, shared
To filter a metric to view the Provisioned Throughput usage, use the dedicated request type. The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving.
For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.
The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource:
Metric | Display name | Description | Filter for Provisioned Throughput usage |
---|---|---|---|
/characters | Characters | Input and output character count distribution. | |
/character_count | Character count | Accumulated input and output character count. | |
/consumed_throughput | Character Throughput | Throughput consumed (accounts for the burndown rate) in characters. | |
/model_invocation_count | Model invocation count | Number of model invocations (prediction requests). | |
/model_invocation_latencies | Model invocation latencies | Model invocation latencies (prediction latencies). | |
/first_token_latencies | First token latencies | Duration from request received to first token returned. | |
/tokens | Tokens | Input and output token count distribution. | |
/token_count | Token count | Accumulated input and output token count. | |
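The path prefix, resource type, and request type described in this section can be combined into a Cloud Monitoring filter string. The following Python sketch only builds that string. It assumes that the request_type dimension is exposed as a metric label (`metric.labels.request_type`), which this page doesn't state explicitly, and the function name is hypothetical.

```python
METRIC_PREFIX = "aiplatform.googleapis.com/publisher/online_serving"
RESOURCE_TYPE = "aiplatform.googleapis.com/PublisherModel"

def provisioned_throughput_filter(metric_name, request_type="dedicated"):
    """Build a Cloud Monitoring filter string for one publisher metric."""
    name = metric_name.lstrip("/")  # accept "/consumed_throughput" or "consumed_throughput"
    return (
        f'metric.type = "{METRIC_PREFIX}/{name}" '
        f'AND resource.type = "{RESOURCE_TYPE}" '
        # Assumption: the request_type dimension is a metric label.
        f'AND metric.labels.request_type = "{request_type}"'
    )

print(provisioned_throughput_filter("/consumed_throughput"))
```

The resulting string can be pasted into a tool that accepts Cloud Monitoring filters; verify the label name against the metric's descriptor before relying on it.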
What's next
- Contact your Google Cloud account representative to place a Provisioned Throughput order or to increase the number of GSUs on an existing order.
- For more information about troubleshooting error 429 when using dynamic shared quota or Provisioned Throughput, see Error code 429.
- To learn more about dynamic shared quota, see Dynamic shared quota.