A quota restricts how much of a shared Google Cloud resource, such as a hardware, software, or network component, your Google Cloud project can use. Quotas are part of a system that does the following:
- Monitors your use or consumption of Google Cloud products and services.
- Restricts your consumption of those resources, for reasons that include ensuring fairness and reducing spikes in usage.
- Maintains configurations that automatically enforce prescribed restrictions.
- Provides a means to request or make changes to the quota.
When a quota is exceeded, the system usually blocks access to the relevant Google resource immediately, and the task that you're trying to perform fails. Quotas generally apply to each Google Cloud project and are shared across all applications and IP addresses that use that project.
Quotas by region and model
The queries per minute (QPM) quota applies to a base model and all versions, identifiers, and tuned versions of that model. For example, a request to text-bison and a request to text-bison@001 are counted as two requests toward the QPM quota of the base model, text-bison. Similarly, a request to text-bison@001 and a request to text-bison@002 are counted as two requests toward the QPM quota of the base model, text-bison. The same applies to tuned models, so a request to chat-bison@002 and a request to a tuned model based on chat-bison@002 named my-tuned-chat-model are counted as two requests toward the QPM quota of the base model, chat-bison.
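The counting rule above can be illustrated with a minimal sketch. It assumes that a version suffix follows an `@` sign, as in the examples, and that you keep your own record of which model each tuned model was based on. The `TUNED_MODEL_BASES` mapping and both function names are hypothetical and are not part of any Google Cloud API.

```python
from collections import Counter

# Hypothetical registry mapping tuned model names to the model they were
# tuned from. In practice this would come from your own records.
TUNED_MODEL_BASES = {"my-tuned-chat-model": "chat-bison@002"}


def base_model_of(model_id: str) -> str:
    """Return the base model whose QPM quota a request to model_id counts toward."""
    # Resolve a tuned model to the versioned model it was tuned from.
    model_id = TUNED_MODEL_BASES.get(model_id, model_id)
    # Strip the version suffix, for example "text-bison@001" -> "text-bison".
    return model_id.split("@", 1)[0]


def requests_per_base_model(model_ids):
    """Count how many requests in model_ids accrue to each base model's quota."""
    return Counter(base_model_of(m) for m in model_ids)


usage = requests_per_base_model(
    ["text-bison", "text-bison@001", "chat-bison@002", "my-tuned-chat-model"]
)
print(usage)
```

In this example, two requests accrue to text-bison and two to chat-bison, which matches the examples in the preceding paragraph.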
The quotas apply to Generative AI on Vertex AI requests for a given Google Cloud project and supported region.
To view the quotas in the Google Cloud console, do the following:
- In the Google Cloud console, go to the IAM & Admin Quotas page.
- In the Filter field, specify the dimension or metric.
  - Dimension: The model identifier. For example, base_model:gemini-1.0-pro or base_model:text-bison.
  - Metric: The quota identifier.
    - For Gemini models: aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model
    - For PaLM 2 models: aiplatform.googleapis.com/online_prediction_requests_per_base_model
Batch quotas
The following quotas and limits are the same across the regions for Generative AI on Vertex AI batch prediction jobs:
Quota | Value |
---|---|
text_bison_concurrent_batch_prediction_jobs | 4 |
code_bison_concurrent_batch_prediction_jobs | 4 |
textembedding_gecko_concurrent_batch_prediction_jobs | 4 |
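One way to stay under a concurrent-job quota is to cap the number of jobs in flight on the client side. The following is a minimal sketch that assumes each job is submitted and awaited by a blocking call. `run_batch_job` is a hypothetical stand-in that only sleeps briefly; it is not a Vertex AI SDK function.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_BATCH_JOBS = 4  # concurrent batch prediction job quota from the table

_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of jobs observed in flight at once


def run_batch_job(job_name: str) -> str:
    """Hypothetical stand-in for submitting a batch job and waiting for it."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    try:
        # A real implementation would submit the job and block until it finishes.
        time.sleep(0.01)
        return f"{job_name}: done"
    finally:
        with _lock:
            _running -= 1


def run_all(job_names):
    """Run every job while keeping at most MAX_CONCURRENT_BATCH_JOBS in flight."""
    # The pool size caps how many jobs run at the same time.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_BATCH_JOBS) as pool:
        return list(pool.map(run_batch_job, job_names))


results = run_all([f"job-{i}" for i in range(10)])
```

Because the pool never has more than four workers, `peak_running` cannot exceed the quota value, regardless of how many jobs are queued.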
Custom-trained model quotas
The following quotas apply to Generative AI on Vertex AI tuned models for a given project and region:
Quota | Value |
---|---|
Restricted image training TPU V3 pod cores per region (* supported region: europe-west4) | 64 |
Restricted image training Nvidia A100 80GB GPUs per region (* supported region: us-central1) | 8 |
Restricted image training Nvidia A100 80GB GPUs per region (* supported region: us-east4) | 2 |
* Tuning scenarios have accelerator reservations in specific regions. Quotas for tuning are supported and must be requested in specific regions.
Online evaluation quotas
The online evaluation service uses the text-bison model as an autorater with Google IP prompts and mechanisms to ensure consistent and objective evaluation for model-based metrics.
A single evaluation request for a model-based metric might result in multiple underlying requests to the online prediction service. Each model's quota is calculated on a per-project basis, which means that any requests directed to the text-bison model, whether for model inference or for model-based evaluation, count toward the same quota. Quotas differ from model to model. The quota for the evaluation service and the quota for the underlying autorater model are shown in the following table.
Request quota | Default quota |
---|---|
Online evaluation service requests per minute | 1,000 requests per project per region |
Online prediction requests per minute for base_model, base_model: text-bison | 1,600 requests per project per region |
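Because evaluation requests and direct inference requests draw on the same text-bison quota, it can help to estimate the combined load before you run a workload. The following is a rough sketch of that arithmetic. The fan-out factor `autorater_calls_per_eval` is an assumption for illustration, since the number of underlying requests per evaluation request is not specified here.

```python
EVAL_SERVICE_QPM = 1000  # online evaluation service requests per minute (default)
TEXT_BISON_QPM = 1600    # text-bison online prediction requests per minute (default)


def check_headroom(eval_rpm, autorater_calls_per_eval, direct_text_bison_rpm):
    """Estimate text-bison load and report whether both default quotas hold.

    autorater_calls_per_eval is an assumed fan-out factor, not a documented value.
    """
    # Evaluation traffic fans out into autorater calls and adds to direct traffic.
    text_bison_load = eval_rpm * autorater_calls_per_eval + direct_text_bison_rpm
    return {
        "text_bison_load": text_bison_load,
        "within_eval_quota": eval_rpm <= EVAL_SERVICE_QPM,
        "within_text_bison_quota": text_bison_load <= TEXT_BISON_QPM,
    }


report = check_headroom(
    eval_rpm=300, autorater_calls_per_eval=4, direct_text_bison_rpm=500
)
print(report)
```

With these illustrative inputs, the evaluation service stays within its own quota, but the combined text-bison load (300 * 4 + 500 = 1,700 requests per minute) exceeds the 1,600 default, so a quota increase would be needed.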
If you receive an error related to quotas while using the online evaluation service, you might need to file a quota increase request. See View and Manage Quotas for more information.
Limit | Value |
---|---|
Online evaluation service request timeout | 60 seconds |
First-time users of the online evaluation service within a new project might experience an initial setup delay, generally of up to two minutes. This is a one-time process. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.
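The wait-and-retry advice above can be sketched as a small helper. `send_request` is a hypothetical zero-argument callable that issues one evaluation request; the attempt count and wait time are illustrative choices, not documented values.

```python
import time


def call_with_retry(send_request, max_attempts=3, wait_seconds=120.0, sleep=time.sleep):
    """Call send_request, waiting and retrying if it raises.

    send_request is a hypothetical callable that issues one evaluation request.
    A real client should catch only the error types that it knows are transient.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(wait_seconds)  # give the one-time setup a chance to finish


# Demonstration with a fake request that fails once and then succeeds.
attempts = []


def flaky_request():
    attempts.append(1)
    if len(attempts) == 1:
        raise RuntimeError("setup still in progress")
    return "ok"


# The sleep function is replaced so that the demonstration does not actually wait.
result = call_with_retry(flaky_request, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use, the default `time.sleep` applies the full wait between attempts.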
For model-based metrics, the maximum numbers of input and output tokens are limited by the model used as the autorater. See Model information for the limits of the relevant models.
LlamaIndex on Vertex AI quotas
The following quotas are for performing retrieval-augmented generation (RAG) by using LlamaIndex on Vertex AI:
Service | Quota |
---|---|
LlamaIndex on Vertex AI data management APIs | 60 requests per minute (RPM) |
RetrievalContexts API | 1,500 RPM |
Data ingestion | 1,000 files |
The textembedding-gecko@003 text embedding API quota is used for document indexing. Consider increasing this quota for the best indexing performance.
Pipeline evaluation quotas
If you receive an error related to quotas while using the evaluation pipelines service, you might need to file a quota increase request. See View and Manage Quotas for more information.
The evaluation pipelines service uses Vertex AI Pipelines to run PipelineJobs. See the relevant quotas for Vertex AI Pipelines. The following are general quota recommendations:
Service | Quota | Recommendation |
---|---|---|
Vertex AI API | Concurrent LLM batch prediction jobs per region | Pointwise: 1 * num_concurrent_pipelines; Pairwise: 2 * num_concurrent_pipelines |
Vertex AI API | Evaluation requests per minute per region | 1000 * num_concurrent_pipelines |
Additionally, when calculating model-based evaluation metrics, the autorater might exceed its quota. The relevant quota depends on which autorater is used:
Tasks | Quota | Base model | Recommendation |
---|---|---|---|
summarization, question_answering | Online prediction requests per base model per minute per region per base_model | text-bison | 60 * num_concurrent_pipelines |
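The recommendations in the two preceding tables scale linearly with the number of concurrent pipelines, so they can be computed with a short helper. This is a sketch only: the function name and the dictionary keys are illustrative labels, not real quota identifiers.

```python
def recommended_quotas(num_concurrent_pipelines: int, pairwise: bool = False) -> dict:
    """Compute the recommended quota values for a number of concurrent pipelines.

    The keys are illustrative labels, not official quota identifiers.
    """
    # Pairwise evaluation needs two batch prediction jobs per pipeline,
    # pointwise evaluation needs one.
    batch_jobs_per_pipeline = 2 if pairwise else 1
    return {
        "concurrent_llm_batch_prediction_jobs": batch_jobs_per_pipeline
        * num_concurrent_pipelines,
        "evaluation_requests_per_minute": 1000 * num_concurrent_pipelines,
        "text_bison_online_prediction_requests_per_minute": 60
        * num_concurrent_pipelines,
    }


plan = recommended_quotas(num_concurrent_pipelines=3, pairwise=True)
print(plan)
```

For three concurrent pairwise pipelines, this gives 6 concurrent batch prediction jobs, 3,000 evaluation requests per minute, and 180 text-bison online prediction requests per minute.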
Vertex AI Pipelines
Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.
Quota increases
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.
What's next
- Learn more about Vertex AI quotas and limits.