Generative AI on Vertex AI rate limits

Google Cloud uses quotas to help ensure fairness and reduce spikes in resource use and availability. A quota restricts how much of a Google Cloud resource your Google Cloud project can use. Quotas apply to a range of resource types, including hardware, software, and network components. For example, quotas can restrict the number of API calls to a service, the number of load balancers used concurrently by your project, or the number of projects that you can create. Quotas protect the community of Google Cloud users by preventing the overloading of services. Quotas also help you to manage your own Google Cloud resources.

The Cloud Quotas system does the following:

  • Monitors your consumption of Google Cloud products and services
  • Restricts your consumption of those resources
  • Provides a way to request changes to the quota value

In most cases, when you attempt to consume more of a resource than its quota allows, the system blocks access to the resource, and the task that you're trying to perform fails.

Quotas generally apply at the Google Cloud project level. Your use of a resource in one project doesn't affect your available quota in another project. Within a Google Cloud project, quotas are shared across all applications and IP addresses.

Quotas by region and model

The requests per minute (RPM) quota applies to a base model and all versions, identifiers, and tuned versions of that model. The following examples show how the RPM quota is applied:

  • A request to the base model, gemini-1.0-pro, and a request to its stable version, gemini-1.0-pro-001, are counted as two requests toward the RPM quota of the base model, gemini-1.0-pro.

  • Requests to two versions of a base model, gemini-1.0-pro-001 and gemini-1.0-pro-002, are counted as two requests toward the RPM quota of the base model, gemini-1.0-pro.

  • Requests to a version of a base model, gemini-1.0-pro-001, and to a tuned version named my-tuned-chat-model are counted as two requests toward the RPM quota of the base model, gemini-1.0-pro.

The quotas apply to Generative AI on Vertex AI requests for a given Google Cloud project and supported region.
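
To make the counting concrete, here is a minimal sketch, assuming the Vertex AI SDK for Python (google-cloud-aiplatform) and a placeholder PROJECT_ID; the two calls below consume two units of the same base-model RPM quota:

```python
# Minimal sketch: two requests to different identifiers of the same base model.
# Assumes the Vertex AI SDK for Python; PROJECT_ID is a placeholder.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="PROJECT_ID", location="us-central1")

base = GenerativeModel("gemini-1.0-pro")        # base model
stable = GenerativeModel("gemini-1.0-pro-001")  # stable version of the same base model

base.generate_content("Hello")    # request 1 toward the gemini-1.0-pro RPM quota
stable.generate_content("Hello")  # request 2 toward the same quota
```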

View the quotas in the Google Cloud console

To view the quotas in the Google Cloud console, do the following:

  1. In the Google Cloud console, go to the IAM & Admin Quotas page.

  2. In the Filter field, specify the dimension or metric.
Dimension (model identifier): base_model: gemini-1.5-flash or base_model: gemini-1.5-pro
Metric (quota identifier for Gemini models): you can request adjustments to the following quotas:
  • aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model
  • aiplatform.googleapis.com/generate_content_input_tokens_per_minute_per_base_model

Dimension (model identifier): all other models
Metric (quota identifier for Gemini models): you can adjust only one quota:
  • aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model
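
If you prefer to inspect quota values programmatically rather than in the console, the Cloud Quotas API exposes the same information. A hedged sketch, assuming the google-cloud-cloudquotas client library and a placeholder PROJECT_ID (field names can vary slightly between library versions):

```python
# Hedged sketch: list Vertex AI quota metrics through the Cloud Quotas API.
# Assumes `pip install google-cloud-cloudquotas`; PROJECT_ID is a placeholder.
from google.cloud import cloudquotas_v1

client = cloudquotas_v1.CloudQuotasClient()
parent = "projects/PROJECT_ID/locations/global/services/aiplatform.googleapis.com"

for info in client.list_quota_infos(parent=parent):
    print(info.name, info.metric)  # includes the generate_content_* quota metrics
```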

The quota limits for each available model vary by region. To view them, choose a region in the Google Cloud console.

Rate limits

The following rate limits apply to the listed models across all regions for the metric generate_content_input_tokens_per_minute_per_base_model:

Base model Tokens per minute
base_model: gemini-1.5-flash 4M (4,000,000)
base_model: gemini-1.5-pro 4M (4,000,000)
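
To pace traffic against a tokens-per-minute (TPM) budget, you can measure a prompt before sending it. A minimal sketch, assuming the Vertex AI SDK for Python and a placeholder PROJECT_ID:

```python
# Minimal sketch: count tokens before sending a request so you can pace
# traffic against the 4,000,000 TPM limit. Assumes the Vertex AI SDK.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

prompt = "A long prompt whose token cost you want to know."
response = model.count_tokens(prompt)
print(response.total_tokens)  # compare against your remaining TPM budget
```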

Batch requests

The quotas and limits for batch requests are the same across all regions.

Concurrent batch requests

The following table lists the quotas for the number of concurrent batch requests:

Quota Value
aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs 4
aiplatform.googleapis.com/model_garden_oss_concurrent_batch_prediction_jobs 1
aiplatform.googleapis.com/gemini_pro_concurrent_batch_prediction_jobs 1

If the number of tasks submitted exceeds the allocated quota, the tasks are placed in a queue and processed when the quota capacity becomes available.
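
Because queuing is handled server side, a client only needs to submit the job and poll. A hedged sketch, assuming the SDK's vertexai.batch_prediction module and placeholder Cloud Storage URIs:

```python
# Hedged sketch: submit a batch prediction job and poll until it finishes.
# Jobs beyond the concurrency quota wait in the queue automatically.
# Bucket URIs and PROJECT_ID are placeholders.
import time

import vertexai
from vertexai.batch_prediction import BatchPredictionJob

vertexai.init(project="PROJECT_ID", location="us-central1")

job = BatchPredictionJob.submit(
    source_model="gemini-1.5-flash-002",
    input_dataset="gs://BUCKET/input.jsonl",
    output_uri_prefix="gs://BUCKET/output/",
)

while not job.has_ended:  # stays True while the job is queued or running
    time.sleep(60)
    job.refresh()

print(job.state)
```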

Batch request limits

The following table lists the size limit of each batch text generation request.

Model Limit
gemini-1.5-pro 50k records
gemini-1.5-flash 150k records
gemini-1.0-pro 150k records
gemini-1.0-pro-vision 50k records
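
If your input exceeds a model's per-request record limit, split it client side. A minimal, self-contained sketch, using the 50,000-record limit for gemini-1.5-pro as the example:

```python
# Minimal sketch: split records into batches that respect a per-request limit.
def chunk_records(records, limit=50_000):
    """Yield successive slices of `records`, each at most `limit` long."""
    for start in range(0, len(records), limit):
        yield records[start:start + limit]

# Example: 120,000 records become three requests of 50k, 50k, and 20k.
batches = list(chunk_records(list(range(120_000))))
print([len(b) for b in batches])  # [50000, 50000, 20000]
```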

Custom-trained model quotas

The following quotas apply to Generative AI on Vertex AI tuned models for a given project and region:

Quota Value
Restricted image training TPU v3 pod cores per region (supported region: europe-west4) 64
Restricted image training NVIDIA A100 80GB GPUs per region (supported region: us-central1) 8
Restricted image training NVIDIA A100 80GB GPUs per region (supported region: us-east4) 2

* Tuning scenarios reserve accelerators in specific regions. Tuning quotas are supported only in those regions, and quota increases must be requested there.

Text embedding limits

Each text embedding model request can have up to 250 input texts (generating one embedding per input text) and up to 20,000 tokens per request. Only the first 2,048 tokens in each input text are used to compute the embeddings.
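
A minimal sketch of batching under the 250-input limit, assuming the Vertex AI SDK for Python and a placeholder PROJECT_ID; the 20,000-token request limit and the 2,048-token truncation per input are enforced server side:

```python
# Minimal sketch: embed a corpus in batches of at most 250 inputs per request.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

texts = [f"document {i}" for i in range(1000)]  # placeholder corpus
embeddings = []
for start in range(0, len(texts), 250):
    embeddings.extend(model.get_embeddings(texts[start:start + 250]))

print(len(embeddings))  # 1000
```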

Gen AI Evaluation Service quotas

The Gen AI Evaluation Service uses gemini-1.5-pro as a judge model, along with mechanisms that help ensure consistent and objective evaluation for model-based metrics.

A single evaluation request for a model-based metric might result in multiple underlying requests to the Gen AI Evaluation Service. Each model's quota is calculated per project, which means that any requests directed to gemini-1.5-pro, whether for model inference or for model-based evaluation, contribute to the same quota. Quotas are set differently for different models. The quota for the evaluation service and the quota for the underlying autorater model are shown in the following table.

Request quota Default quota
Gen AI Evaluation Service requests per minute 1,000 requests per project per region
Online prediction requests per minute for base_model: gemini-1.5-pro See Quotas by region and model.

If you receive an error related to quotas while using the Gen AI Evaluation Service, you might need to file a quota increase request. See View and Manage Quotas for more information.

The following limit applies:

Limit Value
Gen AI Evaluation Service request timeout 60 seconds

First-time users of the Gen AI Evaluation Service within a new project might experience an initial setup delay of up to two minutes. This is a one-time process. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.

The maximum input and output tokens for model-based metrics are limited by the model used as the autorater. See Model information for the limits of the relevant models.

RAG Engine quotas

For each service that performs retrieval-augmented generation (RAG) using RAG Engine, the following quotas apply, measured in requests per minute (RPM):

Service Quota Metric
RAG Engine data management APIs 60 RPM VertexRagDataService requests per minute per region
RetrievalContexts API 1,500 RPM VertexRagService retrieve requests per minute per region
base_model: textembedding-gecko 1,500 RPM Online prediction requests per base model per minute per region; specify the additional filter base_model: textembedding-gecko
The following limits apply:

Service Limit Metric
Concurrent ImportRagFiles requests 3 VertexRagService concurrent import requests per region
Maximum number of files per ImportRagFiles request 10,000 VertexRagService import rag files requests per region
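
To import more than 10,000 files, split the paths across several requests, keeping in mind the limit of 3 concurrent ImportRagFiles requests per region. A hedged sketch, assuming the preview vertexai.preview.rag module (whose signature may vary by SDK version) and placeholder corpus and path values:

```python
# Hedged sketch: batch file paths so no single ImportRagFiles request exceeds
# the 10,000-file limit. Corpus name and paths are placeholders; the
# rag.import_files signature may differ between SDK versions.
from vertexai.preview import rag

def import_in_batches(corpus_name, paths, limit=10_000):
    for start in range(0, len(paths), limit):
        # Each call is one ImportRagFiles request. Issue them sequentially to
        # stay under the limit of 3 concurrent import requests per region.
        rag.import_files(corpus_name=corpus_name, paths=paths[start:start + limit])
```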


Pipeline evaluation quotas

If you receive an error related to quotas while using the evaluation pipelines service, you might need to file a quota increase request. See View and Manage Quotas for more information.

The evaluation pipelines service uses Vertex AI Pipelines to run PipelineJobs. See relevant quotas for Vertex AI Pipelines. The following are general quota recommendations:

Service Quota Recommendation
Vertex AI API Concurrent LLM batch prediction jobs per region Pointwise: 1 * num_concurrent_pipelines; Pairwise: 2 * num_concurrent_pipelines
Vertex AI API Evaluation requests per minute per region 1,000 * num_concurrent_pipelines

Additionally, when calculating model-based evaluation metrics, the autorater might exhaust its quota. The relevant quota depends on which autorater is used:

Tasks Quota Base model Recommendation
summarization, question_answering Online prediction requests per base model per minute per region per base_model text-bison 60 * num_concurrent_pipelines

Vertex AI Pipelines

Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.

Vertex AI Reasoning Engine

The following quotas and limits apply to Vertex AI Reasoning Engine for a given project in each region.

Quota Value
Create/Delete/Update Reasoning Engine per minute 10
Query Reasoning Engine per minute 60
Maximum number of Reasoning Engine resources 100

Error code 429

If the number of your requests exceeds the capacity allocated to process requests, then error code 429 is returned. The following table displays the error message generated by each type of quota framework:

Quota framework Message
Pay-as-you-go Resource exhausted, please try again later.
Provisioned Throughput Too many requests. Exceeded the Provisioned Throughput.

With a Provisioned Throughput subscription, you can reserve an amount of throughput for specific generative AI models. If you don't have a Provisioned Throughput subscription and resources aren't available to your application, then error code 429 is returned. Although you don't have reserved capacity, you can retry the request. However, the error isn't counted against the error rate described in your service level agreement (SLA).

For projects that have purchased Provisioned Throughput, Vertex AI measures a project's throughput and reserves that amount of throughput so that it's available. When you're using less than your purchased throughput amount, errors that might otherwise return as 429 are returned as 5XX and are counted as part of the error rate that is described in the SLA.

Pay-as-you-go

On the pay-as-you-go quota framework, you have the following options for resolving 429 errors:

  • Retry the request after a short delay, using truncated exponential backoff (see the sketch after this list).
  • Reserve capacity for your application by purchasing a Provisioned Throughput subscription.
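
A minimal retry sketch, assuming the Vertex AI SDK for Python and a placeholder PROJECT_ID; a 429 surfaces in the client library as google.api_core.exceptions.ResourceExhausted:

```python
# Minimal sketch: retry generate_content with truncated exponential backoff.
import random
import time

import vertexai
from google.api_core.exceptions import ResourceExhausted
from vertexai.generative_models import GenerativeModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

def generate_with_backoff(prompt, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return model.generate_content(prompt)
        except ResourceExhausted:  # HTTP 429: quota or capacity exhausted
            if attempt == max_attempts - 1:
                raise
            # Truncated exponential backoff with jitter: ~1s, 2s, 4s, capped at 32s.
            time.sleep(min(2 ** attempt, 32) + random.random())

print(generate_with_backoff("Hello").text)
```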

Provisioned Throughput

To correct the error generated by Provisioned Throughput, do the following:

  • Use the default behavior by not setting a request-type header in prediction requests; any overage is then processed on-demand and billed as pay-as-you-go (see the sketch after this list).
  • Increase the number of GSUs in your Provisioned Throughput subscription.
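
The request-type behavior is controlled per request by an HTTP header, so the sketch below calls the REST API directly. The X-Vertex-AI-LLM-Request-Type header and its values are taken from the Provisioned Throughput documentation, but treat the details as assumptions to verify against the current docs:

```python
# Hedged sketch: call generateContent over REST and control how the request is
# counted. Omitting X-Vertex-AI-LLM-Request-Type gives the default behavior
# (Provisioned Throughput first, overage billed as pay-as-you-go).
import google.auth
import google.auth.transport.requests
import requests

credentials, project = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())

url = (
    "https://us-central1-aiplatform.googleapis.com/v1/projects/"
    f"{project}/locations/us-central1/publishers/google/models/"
    "gemini-1.5-flash:generateContent"
)
headers = {
    "Authorization": f"Bearer {credentials.token}",
    # "dedicated" restricts the request to Provisioned Throughput only;
    # omit the header entirely for the default spillover behavior.
    "X-Vertex-AI-LLM-Request-Type": "dedicated",
}
body = {"contents": [{"role": "user", "parts": [{"text": "Hello"}]}]}

resp = requests.post(url, headers=headers, json=body)
print(resp.status_code)
```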

Quota increases

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.

What's next