This page explains how to control overages or bypass Provisioned Throughput and how to monitor usage.
Control overages or bypass Provisioned Throughput
Use the API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.
Read through each option to determine which approach fits your use case.
Default behavior
If you exceed your purchased amount of throughput, the overages go to on-demand and are billed at the pay-as-you-go rate. After your Provisioned Throughput order is active, the default behavior takes place automatically. You don't have to change your code to begin consuming your order.
Use only Provisioned Throughput
If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests that exceed the Provisioned Throughput order amount return a 429 error.
When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to dedicated.
Use only pay-as-you-go
This is also referred to as using on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This might be useful for experiments or applications that are in development.
When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to shared.
Example
Gen AI SDK for Python
Learn how to install or update the Gen AI SDK for Python.
To learn more, see the SDK reference documentation. Set environment variables to use the Gen AI SDK with Vertex AI:
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True
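With the environment variables set, a minimal sketch of sending a request that targets Provisioned Throughput could look like the following. This assumes the google-genai SDK's HttpOptions accepts a headers mapping; the model ID is a placeholder, and the header value can be dedicated or shared.
from google import genai
from google.genai.types import HttpOptions

# Assumption: HttpOptions accepts a custom headers mapping.
# Use "dedicated" for Provisioned Throughput only, or "shared" for pay-as-you-go only.
client = genai.Client(
    http_options=HttpOptions(
        api_version="v1",
        headers={"X-Vertex-AI-LLM-Request-Type": "dedicated"},
    )
)

response = client.models.generate_content(
    model="gemini-2.0-flash-001",  # placeholder model ID
    contents="Hello.",
)
print(response.text)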
Vertex AI SDK for Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
# Set the X-Vertex-AI-LLM-Request-Type header to dedicated or shared.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: dedicated" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
Monitor Provisioned Throughput
You can monitor your Provisioned Throughput usage through monitoring metrics and on a per-request basis.
Response headers
If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. This header applies only to the generateContent API call.
{"X-Vertex-AI-LLM-Request-Type": "dedicated"}
Metrics
Provisioned Throughput can be monitored using a set of metrics that are measured on the aiplatform.googleapis.com/PublisherModel resource type.
Each metric is filterable along the following dimensions:
- type: input, output
- request_type: dedicated, shared
To filter a metric to view the Provisioned Throughput usage, use the dedicated request type. The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving. For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.
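As an illustration, a query for that metric filtered to the dedicated request type could look like the following sketch. It assumes the google-cloud-monitoring client library and a placeholder project ID; the label names follow the dimensions listed above.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/your-project-id"  # placeholder project ID

# Query the last hour of consumed_throughput, filtered to Provisioned Throughput usage.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = '
            '"aiplatform.googleapis.com/publisher/online_serving/consumed_throughput" '
            'AND metric.labels.request_type = "dedicated"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(series.metric.labels.get("request_type"), point.value)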
The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource in the Gemini models and have a filter for Provisioned Throughput usage:
Metric | Display name | Description |
---|---|---|
/characters | Characters | Input and output character count distribution. |
/character_count | Character count | Accumulated input and output character count. |
/consumed_throughput | Character Throughput | Throughput consumed, which accounts for the burndown rate in characters. For token-based models, this is equivalent to the throughput consumed in tokens * 4. |
/model_invocation_count | Model invocation count | Number of model invocations (prediction requests). |
/model_invocation_latencies | Model invocation latencies | Model invocation latencies (prediction latencies). |
/first_token_latencies | First token latencies | Duration from request received to first token returned. |
/tokens | Tokens | Input and output token count distribution. |
/token_count | Token count | Accumulated input and output token count. |
Anthropic models also have a filter for Provisioned Throughput, but only for the /tokens and /token_count metrics.
What's next
- Troubleshoot Error code 429.