Provisioned Throughput

Provisioned Throughput is a fixed-cost monthly subscription service that reserves throughput for supported generative AI models on Vertex AI. To reserve your throughput, you must specify the model and available locations in which the model runs.

This page explains when to use Provisioned Throughput, how it works, and how to subscribe.

Supported models

The following tables show the models that support Provisioned Throughput, the throughput for each generative AI scale unit (GSU), and the burndown rates for each model.

Google models

This table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. Throughput for Google models is measured in characters per second, which is the sum of your prompt input characters and generated text output characters across all requests in one second.

Model | Throughput per GSU (chars/sec) | Minimum GSU purchase increment | Burndown rates
gemini-1.5-flash | 8,000 | 5 | Less than or equal to 128,000 context: 1 input char = 1 char; 1 output char = 3 chars; 1 image = 1,052 chars; 1 video per second = 1,052 chars; 1 audio per second = 100 chars. Greater than 128,000 context: 1 input char = 2 chars; 1 output char = 6 chars; 1 image = 2,104 chars; 1 video per second = 2,104 chars; 1 audio per second = 200 chars.
gemini-1.5-pro | 800 | 50 | Less than or equal to 128,000 context: 1 input char = 1 char; 1 output char = 3 chars; 1 image = 1,052 chars; 1 video per second = 1,052 chars; 1 audio per second = 100 chars. Greater than 128,000 context: 1 input char = 2 chars; 1 output char = 6 chars; 1 image = 2,104 chars; 1 video per second = 2,104 chars; 1 audio per second = 200 chars.
gemini-1.0-pro | 8,000 | 5 | 1 input char = 1 char; 1 output char = 3 chars; 1 image = 20,000 chars; 1 video per second = 16,000 chars
MedLM-medium | 2,000 | 5 | 1 input char = 1 char; 1 output char = 2 chars
MedLM-large | 200 | 50 | 1 input char = 1 char; 1 output char = 3 chars

For more information about supported locations, see Available locations.

You can upgrade to new models as they are made available. For information about availability and discontinuation dates, see Google models.

Google legacy models

See Legacy models that support Provisioned Throughput.

Partner models

This table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, which is defined as a total of input and output tokens across all requests per second.

Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase increment | Burndown rates
Anthropic Claude 3.5 Sonnet | 350 | 25 | 1 input token = 1 token; 1 output token = 5 tokens
Anthropic Claude 3 Opus | 70 | 35 | 1 input token = 1 token; 1 output token = 5 tokens
Anthropic Claude 3 Haiku | 4,200 | 5 | 1 input token = 1 token; 1 output token = 5 tokens
Anthropic Claude 3 Sonnet | 350 | 25 | 1 input token = 1 token; 1 output token = 5 tokens

For more information about supported locations, see Available locations.

When to use Provisioned Throughput

If any of the following considerations apply to your use case, consider using Provisioned Throughput:

  • Your critical workloads consistently require high throughput. Throughput measurement depends on the model.
  • You are building real-time generative AI production applications, such as chatbots and agents.
  • Your throughput needs exceed 20,000 characters per second.
  • You want to provide a consistent and predictable experience for users of your applications.
  • You want deterministic generative AI costs by paying a fixed monthly price with control of overages.

Provisioned Throughput is one of two ways to consume your generative AI models. The second way is pay-as-you-go, which is also referred to as on-demand.

How Provisioned Throughput is measured

This section explains the concepts of generative AI scale unit (GSU) and burndown rates. Provisioned Throughput is calculated and priced using GSUs and burndown rates.

A generative AI scale unit (GSU) is a measure of throughput for your prompts and responses. The number of GSUs that you purchase determines how much throughput is provisioned for a model.

To produce a standard unit across models, all inputs and outputs are converted to input characters per second (throughput) using model-specific ratios called burndown rates.

Different models use different amounts of throughput. For information about the minimum GSU purchase amount and increments for each model, see Supported models and burndown rates in this document.

This equation demonstrates how throughput is calculated:

inputs_per_query = inputs_across_modalities_converted_using_burndown_rates
outputs_per_query = outputs_across_modalities_converted_using_burndown_rates

throughput_per_second = (inputs_per_query + outputs_per_query) * queries_per_second

The calculated throughput per second determines how many GSUs you need for your use case.
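The equation above can be sketched in Python. This is a minimal illustration, not an official tool: the function name, the dictionary keys, and the sample workload are assumptions made for this example, and the rates are the gemini-1.5-flash rates for contexts of 128,000 or less from the Google models table.

```python
# Burndown rates for gemini-1.5-flash at <= 128,000 context, copied from the
# Google models table in this document (converted characters per unit).
# The dictionary keys are hypothetical names chosen for this sketch.
BURNDOWN_RATES = {
    "input_char": 1,
    "output_char": 3,
    "image": 1052,
    "video_second": 1052,
    "audio_second": 100,
}


def throughput_per_second(units_per_query, queries_per_second, rates=BURNDOWN_RATES):
    """Convert one query's inputs and outputs to characters, then scale by QPS."""
    chars_per_query = sum(rates[kind] * count for kind, count in units_per_query.items())
    return chars_per_query * queries_per_second


# Hypothetical workload: 1,000 input characters, 10 seconds of audio,
# and 200 output characters per query, at 5 queries per second.
query = {"input_char": 1000, "audio_second": 10, "output_char": 200}
print(throughput_per_second(query, 5))  # (1000 + 1000 + 600) * 5 = 13000
```

For a different model or context size, pass a different rates dictionary built from the corresponding table row.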

Example of estimating your Provisioned Throughput needs

To estimate your Provisioned Throughput needs, use the estimation tool in the Google Cloud console. The following example illustrates the process of estimating the amount of Provisioned Throughput for your model. The region isn't considered in the estimation calculations.

  1. Gather your requirements.

    1. In this example, your requirement is to ensure that you can send 2,000 characters with 2 images and receive 300 characters of output for 10 queries per second using gemini-1.5-flash.

      This step implies that you understand your use case, because you have identified the size of your inputs and outputs, the number of queries per second (QPS), and your model.

    2. To estimate your throughput, specify your model. In this example, your model is gemini-1.5-flash.

    3. Specify the type of input, and identify the burndown rate. Use the burndown rates table to identify the burndown rate based on your type of input.

      An image's burndown rate for the gemini-1.5-flash model is 1,052 characters.

  2. Calculate your throughput.

    1. Multiply the number of images by the burndown rate for the input type for your specific model.

      2 images * 1,052 characters = 2,104 input characters

    2. Your output is 300 characters. Return to the burndown rates table, and find the burndown rate for output characters (3 characters) for your specific model (gemini-1.5-flash).

      300 output characters * 3 characters = 900 input characters

    3. Add your totals together.

      2,000 input characters + 2,104 converted input characters for the images + 900 converted input characters for the output = 5,004 input characters per query

    4. Multiply the characters per query by your expected queries per second to get the total throughput per second.

      5,004 input characters per query * 10 QPS = 50,040 input characters per second

  3. Calculate your GSUs.

    1. The GSUs are the total throughput per second divided by throughput per GSU from the burndown table.

      50,040 input chars per second ÷ 8,000 throughput per GSU = 6.255 GSUs

    2. The minimum GSU purchase increment for gemini-1.5-flash is 5. The next multiple of 5 from 6.255 is 10. Therefore, you need 10 GSUs to meet your requirement.
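The rounding in step 3 can be sketched as follows. The function name and parameters are hypothetical and chosen for this example; the numbers come from the worked example above.

```python
import math


def required_gsus(throughput_per_second, throughput_per_gsu, purchase_increment):
    """Round the raw GSU requirement up to the next multiple of the purchase increment."""
    raw_gsus = throughput_per_second / throughput_per_gsu
    return math.ceil(raw_gsus / purchase_increment) * purchase_increment


# 50,040 chars/sec on gemini-1.5-flash (8,000 chars/sec per GSU, increment of 5).
print(required_gsus(50_040, 8_000, 5))  # 6.255 raw GSUs rounds up to 10
```

A requirement that lands exactly on a multiple of the increment isn't rounded up further: 40,000 characters per second on the same model yields 5 GSUs.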

What to consider before subscribing

To help you decide whether you want to subscribe to Provisioned Throughput, review this list of details about the subscription:

  • You can't cancel your order.

    Your Provisioned Throughput purchase is a commitment, which means that you can't cancel the order. However, you can increase the number of purchased GSUs. If you accidentally purchase a commitment or there's a problem with your configuration, contact your Google Cloud account representative for assistance.

  • You can auto-renew your subscription.

    You can choose to auto-renew your subscription at the end of its term, or let the subscription expire.

  • You can change your model version or region with notice.

    Provisioned Throughput is enabled after you've chosen your project, region, model, and version. To change your region, or to change your model version within the same model publisher, contact your Google Cloud account representative with 10 business days' notice. For example, you can switch between Google's models, or between the models of a single partner. You can't switch from a Google model to a partner model, or between the models of different partners.

  • There is no downtime when you switch to Provisioned Throughput from pay-as-you-go.

    There is no downtime when you switch between models for a Provisioned Throughput order. However, lead time is required to acquire the throughput.

  • By default, the overage is billed as pay-as-you-go.

    If your throughput exceeds your Provisioned Throughput order amount, overages are processed and billed as pay-as-you-go. You can control overages on a per-request basis. For more information, see Control overages or bypass Provisioned Throughput.

  • Requests are prioritized.

    Requests from Provisioned Throughput customers are prioritized and serviced before on-demand requests.

  • You must commit to a minimum usage and payment.

    Minimum usage is dependent on the generative AI model that you select. Any usage beyond the purchased throughput rate isn't assured and is serviced on a reasonable-efforts basis.

  • Throughput doesn't accumulate.

    Any unused throughput doesn't accumulate or carry over to the next month.

  • Provisioned Throughput is measured in characters or tokens per second.

    Provisioned Throughput is measured in characters or tokens per second, not in queries per minute (QPM). As a result, the throughput that you need depends on both the query size and the QPM of your use case.

Purchase Provisioned Throughput

This section provides the permissions you must have to place or to view a Provisioned Throughput order, and the instructions for placing and viewing your orders.

Permissions

To subscribe to Provisioned Throughput, you must have one of the following roles assigned in your project. Either role lets you list orders and place new orders.

  • aiplatform.googleapis.com/provisionedThroughputAdmin: Specific to Provisioned Throughput.
  • aiplatform.googleapis.com/admin: Gives administrative rights to every resource in Vertex AI.

The following role only lets you list your orders:

  • aiplatform.googleapis.com/viewer

Place a Provisioned Throughput order

Follow these steps to purchase a Provisioned Throughput subscription:

Console

  1. In the Google Cloud console, go to the Provisioned Throughput page.

  2. To start a new order, click Create.
  3. Enter an Order name.
  4. Select the Model.
  5. Select the Region.
  6. Enter the Number of generative AI scale units (GSUs) that you want to purchase. To estimate the number of GSUs that you need, click Estimation tool, and then do the following:
    1. Select your Model.
    2. Enter the number of Queries per second.
    3. Enter the number of Input characters per query.
    4. Enter the number of Input images per query.
    5. Enter the number of Video seconds per query.
    6. Enter the number of Audio seconds per query.
    7. Enter the number of Output characters per query.
    8. If you want to use the values that you entered into the estimation tool, click Use calculated.
  7. Select your Term.
  8. Select your Renewal option.
  9. Click Continue.
  10. In the Summary section, review the price and throughput estimates for your order. Read the terms listed and linked in the form.
  11. To finalize your order, click Confirm.

Check order status

After you submit your Provisioned Throughput order, the order status might appear as one of the following:

  • Pending review: You placed your order. Because approval depends on available capacity to provision your order, your order is waiting for review and approval. For more information about the status of your pending order, contact your Google Cloud account representative.
  • Active: Google has approved and provisioned your order, and billing has started.

View Provisioned Throughput orders

Follow these steps to view your Provisioned Throughput orders:

Console

  1. In the Google Cloud console, go to the Provisioned Throughput page.

  2. Select the Region. Your list of orders appears.

Use Provisioned Throughput

This section explains how to control overages or bypass Provisioned Throughput and how to monitor the usage of Provisioned Throughput.

Control overages or bypass Provisioned Throughput

Use the REST API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.

Review each option to determine which one fits your use case.

Default behavior

If you exceed your purchased amount of throughput, the overages go to on-demand and are billed at the pay-as-you-go rate. After your Provisioned Throughput order is active, the default behavior takes place automatically. You don't have to change your code to begin consuming your order.

This curl example demonstrates the default behavior.

# Set URL to the request endpoint for your model before you run this command.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "$URL" \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Use only Provisioned Throughput

If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests that exceed the Provisioned Throughput order amount return a 429 error.

This curl example demonstrates how you can use the REST API to use only your Provisioned Throughput subscription, with overages returning a 429 error.

Set the X-Vertex-AI-LLM-Request-Type header to dedicated.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: dedicated" \
  "$URL" \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Use only pay-as-you-go

This option is also referred to as on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This option might be useful for experiments or for applications that are in development.

This curl example demonstrates how you can use the REST API to bypass Provisioned Throughput, and use only pay-as-you-go.

Set the X-Vertex-AI-LLM-Request-Type header to shared.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: shared" \
  "$URL" \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
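If you send requests from application code rather than curl, the three options differ only in one header. The following sketch builds the request headers for each option. The function name and the mode names are assumptions made for this example. The header name and its values come from the curl examples above.

```python
REQUEST_TYPE_HEADER = "X-Vertex-AI-LLM-Request-Type"
VALID_MODES = ("default", "dedicated", "shared")


def request_headers(access_token, mode="default"):
    """Return HTTP headers for one of the three overage options.

    mode: "default" sends no request-type header (overages go to pay-as-you-go),
          "dedicated" uses only Provisioned Throughput (overages return 429),
          "shared" bypasses Provisioned Throughput and uses only pay-as-you-go.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    if mode != "default":
        headers[REQUEST_TYPE_HEADER] = mode
    return headers


# "TOKEN" is a placeholder; use a real access token in practice.
print(request_headers("TOKEN", "dedicated")[REQUEST_TYPE_HEADER])  # dedicated
```

You can pass the returned dictionary to the HTTP client of your choice together with the same JSON body that the curl examples send.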

Monitor Provisioned Throughput

You can monitor your Provisioned Throughput usage through monitoring metrics and on a per-request basis.

Response headers

If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. This header is returned only for generateContent API calls.

  {"X-Vertex-AI-LLM-Request-Type": "dedicated"}
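To check this header per request in application code, a small helper along the following lines might be enough. The function name is hypothetical, and the helper assumes that your HTTP client exposes response headers as a dictionary-like mapping.

```python
def served_by_provisioned_throughput(response_headers):
    """Return True if the response headers mark the request as dedicated."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value for name, value in response_headers.items()}
    return normalized.get("x-vertex-ai-llm-request-type") == "dedicated"


print(served_by_provisioned_throughput({"X-Vertex-AI-LLM-Request-Type": "dedicated"}))  # True
print(served_by_provisioned_throughput({"Content-Type": "application/json"}))  # False
```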

Metrics

Provisioned Throughput can be monitored using a set of metrics that are measured on the aiplatform.googleapis.com/PublisherModel resource type. Each metric is filterable along the following dimensions:

  • type: input, output
  • request_type: dedicated, shared

To filter a metric to view the Provisioned Throughput usage, use the dedicated request type. The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving. For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.

The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource:

Metric | Description | Filter for Provisioned Throughput usage
/characters | Input and output character count distribution | Yes
/character_count | Accumulated input and output character count | Yes
/consumed_throughput | Throughput consumed (accounts for the burndown rate) in characters | Yes
/model_invocation_count | Number of model invocations (prediction requests) |
/model_invocation_latencies | Model invocation latencies (prediction latencies) |
/first_token_latencies | Duration from request received to first token returned |
/tokens | Input and output token count distribution |
/token_count | Accumulated input and output token count |
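As an illustration of how the path prefix, resource type, and request_type dimension fit together, the following sketch assembles a Cloud Monitoring filter string. The function name is hypothetical, and the sketch assumes that request_type is exposed as a metric label. Verify the label name in Metrics Explorer before you rely on it.

```python
METRIC_PREFIX = "aiplatform.googleapis.com/publisher/online_serving"
RESOURCE_TYPE = "aiplatform.googleapis.com/PublisherModel"


def metric_filter(metric_name, request_type=None):
    """Build a Cloud Monitoring filter string for one of the metrics above."""
    clauses = [
        f'metric.type = "{METRIC_PREFIX}{metric_name}"',
        f'resource.type = "{RESOURCE_TYPE}"',
    ]
    if request_type is not None:
        # Assumption: request_type is a metric label on these metrics.
        clauses.append(f'metric.labels.request_type = "{request_type}"')
    return " AND ".join(clauses)


# Filter the consumed throughput metric to Provisioned Throughput usage.
print(metric_filter("/consumed_throughput", "dedicated"))
```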

What's next