Use Provisioned Throughput

This page explains how to control overages or bypass Provisioned Throughput and how to monitor usage.

Control overages or bypass Provisioned Throughput

Use the API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.

Review each option to determine which approach best fits your use case.

Default behavior

If you exceed your purchased amount of throughput, the overage is processed on-demand and billed at the pay-as-you-go rate. This default behavior takes effect automatically after your Provisioned Throughput order is active; you don't have to change your code to begin consuming your order.
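
For example, the following minimal sketch with the Gen AI SDK for Python sends a request with no X-Vertex-AI-LLM-Request-Type header, so the default behavior applies. It assumes the environment variables shown in the example later on this page are set.

from google import genai

# No X-Vertex-AI-LLM-Request-Type header is set, so this request consumes
# the active Provisioned Throughput order first; any overage is billed at
# the pay-as-you-go rate.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="How does AI work?",
)
print(response.text)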

Use only Provisioned Throughput

If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests that exceed the Provisioned Throughput order amount return a 429 error.

When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to dedicated.
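
As an illustration, the following sketch sets the header with the Gen AI SDK for Python and handles the 429 case. It assumes the SDK surfaces HTTP errors as google.genai.errors.APIError with a code attribute; adjust the error handling to your SDK version.

from google import genai
from google.genai import errors
from google.genai.types import HttpOptions

client = genai.Client(
    http_options=HttpOptions(
        api_version="v1",
        # "dedicated" restricts requests from this client to Provisioned Throughput.
        headers={"X-Vertex-AI-LLM-Request-Type": "dedicated"},
    )
)

try:
    response = client.models.generate_content(
        model="gemini-2.0-flash-001",
        contents="How does AI work?",
    )
    print(response.text)
except errors.APIError as e:
    if e.code == 429:
        # Traffic beyond the order amount is rejected rather than spilling
        # over to pay-as-you-go; retry later or queue the request.
        print("Provisioned Throughput exhausted; retry later.")
    else:
        raise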

Use only pay-as-you-go

This is also referred to as using on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This might be useful for experiments or applications that are in development.

When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to shared.

Example

Gen AI SDK for Python

Learn how to install or update the Gen AI SDK for Python.

To learn more, see the SDK reference documentation.
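
If needed, the SDK is typically installed or updated with pip:

pip install --upgrade google-genai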

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True


from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    http_options=HttpOptions(
        api_version="v1",
        headers={
            # Options:
            # - "dedicated": Use Provisioned Throughput
            # - "shared": Use pay-as-you-go
            # https://cloud.google.com/vertex-ai/generative-ai/docs/use-provisioned-throughput
            "X-Vertex-AI-LLM-Request-Type": "shared"
        },
    )
)
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="How does AI work?",
)
print(response.text)

# Example response:
# Okay, let's break down how AI works. It's a broad field, so I'll focus on the ...
#
# Here's a simplified overview:
# ...

Vertex AI SDK for Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.
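
As with the Gen AI SDK, a pip install typically suffices:

pip install --upgrade google-cloud-aiplatform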

import vertexai
from vertexai.generative_models import GenerativeModel

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(
    project=PROJECT_ID,
    location="us-central1",
    # Options:
    # - "dedicated": Use Provisioned Throughput
    # - "shared": Use pay-as-you-go
    # https://cloud.google.com/vertex-ai/generative-ai/docs/use-provisioned-throughput
    request_metadata=[("x-vertex-ai-llm-request-type", "shared")],
)

model = GenerativeModel("gemini-1.5-flash-002")

response = model.generate_content(
    "What's a good name for a flower shop that specializes in selling bouquets of dried flowers?"
)

print(response.text)
# Example response:
# **Emphasizing the Dried Aspect:**
# * Everlasting Blooms
# * Dried & Delightful
# * The Petal Preserve
# ...

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

# Set URL to your publisher model endpoint, for example:
# URL="https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/gemini-2.0-flash-001:generateContent"
# The X-Vertex-AI-LLM-Request-Type header accepts dedicated or shared.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: dedicated" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Monitor Provisioned Throughput

You can monitor your Provisioned Throughput usage through monitoring metrics and on a per-request basis.

Response headers

If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. This header applies only to the generateContent API call.

  {"X-Vertex-AI-LLM-Request-Type": "dedicated"}

Metrics

Provisioned Throughput can be monitored using a set of metrics that are measured on the aiplatform.googleapis.com/PublisherModel resource type. Each metric is filterable along the following dimensions:

  • type: input, output
  • request_type: dedicated, shared

To view Provisioned Throughput usage for a metric, filter on the dedicated request type. The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving.

For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.
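
For example, the following sketch reads the last hour of dedicated consumed throughput with the Cloud Monitoring API client library. The google-cloud-monitoring package usage and the your-project-id value are assumptions; adapt them to your environment.

import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # assumption: replace with your project ID

client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # the last hour
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        # Filtering on request_type = "dedicated" isolates Provisioned
        # Throughput usage from pay-as-you-go (shared) traffic.
        "filter": (
            'metric.type = '
            '"aiplatform.googleapis.com/publisher/online_serving/consumed_throughput"'
            ' AND metric.labels.request_type = "dedicated"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    print(series.metric.labels, "->", len(series.points), "points")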

The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource for Gemini models and have a filter for Provisioned Throughput usage. Each entry lists the metric path suffix, its display name, and a description:

  • /characters (Characters): Input and output character count distribution.
  • /character_count (Character count): Accumulated input and output character count.
  • /consumed_throughput (Character Throughput): Throughput consumed, which accounts for the burndown rate in characters. For token-based models, this is equivalent to the throughput consumed in tokens * 4.
  • /model_invocation_count (Model invocation count): Number of model invocations (prediction requests).
  • /model_invocation_latencies (Model invocation latencies): Model invocation latencies (prediction latencies).
  • /first_token_latencies (First token latencies): Duration from request received to first token returned.
  • /tokens (Tokens): Input and output token count distribution.
  • /token_count (Token count): Accumulated input and output token count.

Anthropic models also have a filter for Provisioned Throughput, but only for the /tokens and /token_count metrics.
