This page explains how to control overages, bypass Provisioned Throughput, and monitor Provisioned Throughput usage.
Control overages or bypass Provisioned Throughput
Use the REST API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.
Review each option to determine which one fits your use case.
Default behavior
If you exceed your purchased amount of throughput, the overages go to on-demand and are billed at the pay-as-you-go rate. After your Provisioned Throughput order is active, the default behavior takes place automatically. You don't have to change your code to begin consuming your order.
This curl example demonstrates the default behavior.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
Use only Provisioned Throughput
If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests that exceed the Provisioned Throughput order amount return a 429 error.
This curl example demonstrates how you can use the REST API to use your Provisioned Throughput subscription only, with overages returning a 429 error. Set the X-Vertex-AI-LLM-Request-Type header to dedicated.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: dedicated" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
Use only pay-as-you-go
This is also referred to as using on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This might be useful for experiments or applications that are in development.
This curl example demonstrates how you can use the REST API to bypass Provisioned Throughput and use only pay-as-you-go. Set the X-Vertex-AI-LLM-Request-Type header to shared.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: shared" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
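The three routing modes above differ only in whether the X-Vertex-AI-LLM-Request-Type header is set, and to what value. As a minimal sketch (not official client code), a Python helper could build the headers for each mode; the helper name is hypothetical, while the header name and the dedicated and shared values come from the examples above.

```python
def build_headers(access_token, mode=None):
    """Build request headers for a generateContent call.

    mode=None        -> default behavior (overages spill to pay-as-you-go)
    mode="dedicated" -> use only Provisioned Throughput (429 on overage)
    mode="shared"    -> bypass Provisioned Throughput (pay-as-you-go only)
    """
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    if mode is not None:
        if mode not in ("dedicated", "shared"):
            raise ValueError(f"unknown request type: {mode}")
        # Only set the routing header when overriding the default behavior.
        headers["X-Vertex-AI-LLM-Request-Type"] = mode
    return headers
```

For example, `build_headers(token, "dedicated")` produces the same headers as the second curl example, and omitting `mode` reproduces the default behavior.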
Monitor Provisioned Throughput
You can monitor your Provisioned Throughput usage through monitoring metrics and on a per-request basis.
Response headers
If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. This header applies only to the generateContent API call.
{"X-Vertex-AI-LLM-Request-Type": "dedicated"}
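To check this on a per-request basis in code, you can inspect the response headers. The following is a small sketch under that assumption; the helper name is hypothetical, and the header name and dedicated value come from this page.

```python
def used_provisioned_throughput(response_headers):
    """Return True if the response headers indicate the request was
    processed using Provisioned Throughput (dedicated capacity).
    Header name lookup is case-insensitive, since HTTP header casing
    can vary between clients."""
    normalized = {k.lower(): v for k, v in response_headers.items()}
    return normalized.get("x-vertex-ai-llm-request-type") == "dedicated"
```

A response without the header (or with a different value) indicates the request was not served from Provisioned Throughput capacity.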
Metrics
Provisioned Throughput can be monitored using a set of metrics that
are measured on the aiplatform.googleapis.com/PublisherModel
resource type.
Each metric is filterable along the following dimensions:

- type: input, output
- request_type: dedicated, shared
To filter a metric to view the Provisioned Throughput usage, use the dedicated request type. The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving. For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.
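The path convention above can be sketched as a small helper. This is an assumption-laden illustration: the prefix and the dedicated label value come from this page, while the `metric.labels.request_type` filter form follows general Cloud Monitoring filter syntax and the helper names are hypothetical.

```python
# Prefix stated on this page for publisher model serving metrics.
PREFIX = "aiplatform.googleapis.com/publisher/online_serving"

def metric_path(metric):
    """Join the online_serving prefix with a metric name
    such as '/consumed_throughput'."""
    return PREFIX + metric

def dedicated_filter(metric):
    """Build a Cloud Monitoring filter string that restricts the
    metric to Provisioned Throughput (dedicated) traffic."""
    return (f'metric.type = "{metric_path(metric)}" '
            'AND metric.labels.request_type = "dedicated"')
```

For example, `dedicated_filter("/consumed_throughput")` yields a filter selecting only the dedicated samples of the consumed-throughput metric.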
The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource for the Gemini models and have a filter for Provisioned Throughput usage:
| Metric | Display name | Description |
|---|---|---|
| /characters | Characters | Input and output character count distribution. |
| /character_count | Character count | Accumulated input and output character count. |
| /consumed_throughput | Character Throughput | Throughput consumed (accounts for the burndown rate) in characters. |
| /model_invocation_count | Model invocation count | Number of model invocations (prediction requests). |
| /model_invocation_latencies | Model invocation latencies | Model invocation latencies (prediction latencies). |
| /first_token_latencies | First token latencies | Duration from request received to first token returned. |
| /tokens | Tokens | Input and output token count distribution. |
| /token_count | Token count | Accumulated input and output token count. |
Anthropic models also have a filter for Provisioned Throughput, but only for tokens/token_count.
What's next
- Troubleshoot error code 429.