# Generative AI on Vertex AI quotas and system limits

This page introduces two ways to consume generative AI services, provides a list
of quotas by region and model, and shows you how to view and edit your quotas in
the Google Cloud console.

Overview
--------

There are two ways to consume generative AI services. You can choose
*pay-as-you-go (PayGo)*, or you can pay in advance using
*Provisioned Throughput*.

If you're using PayGo, your usage of generative AI features is subject to one of
the following quota systems, depending on which model you're using:

- Models earlier than Gemini 2.0 use a standard quota system for each
  generative AI model to help ensure fairness and to reduce spikes in resource
  use and availability. Quotas apply to Generative AI on Vertex AI requests for
  a given Google Cloud project and supported region.
- Newer models use [Dynamic shared quota
  (DSQ)](/vertex-ai/generative-ai/docs/dynamic-shared-quota), which dynamically
  distributes available PayGo capacity among all customers for a specific model
  and region, removing the need to set quotas and to submit quota increase
  requests. **There are no quotas with DSQ**.

To help ensure high availability for your application and to get predictable
service levels for your production workloads, see
[Provisioned Throughput](/vertex-ai/generative-ai/docs/provisioned-throughput).

Quota system by model
---------------------

The following models support
[Dynamic shared quota (DSQ)](/vertex-ai/generative-ai/docs/dynamic-shared-quota):

- [Gemini 2.5 Flash Image Preview](/vertex-ai/generative-ai/docs/models/gemini/2-5-flash#image) (Preview)
- [Gemini 2.5 Flash-Lite](/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite)
- [Gemini 2.0 Flash with Live API](/vertex-ai/generative-ai/docs/models/gemini/2-0-flash#live-api) (Preview)
- [Gemini 2.0 Flash with image generation](/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) (Preview)
- [Gemini 2.5 Pro](/vertex-ai/generative-ai/docs/models/gemini/2-5-pro)
- [Gemini 2.5 Flash](/vertex-ai/generative-ai/docs/models/gemini/2-5-flash)
- [Gemini 2.0 Flash](/vertex-ai/generative-ai/docs/models/gemini/2-0-flash)
- [Gemini 2.0 Flash-Lite](/vertex-ai/generative-ai/docs/models/gemini/2-0-flash-lite)

The following legacy Gemini models support DSQ:

- Gemini 1.5 Pro
- Gemini 1.5 Flash

Non-Gemini and earlier Gemini models use the standard quota system. For more
information, see [Vertex AI quotas and limits](/vertex-ai/docs/quotas).

Tuned model quotas
------------------

Tuned model inference shares the same quota as the base model. There is no
separate quota for tuned model inference.

Text embedding limits
---------------------

Each request can include up to 250 input texts (generating one embedding per
input text) and a total of 20,000 tokens. Only the first 2,048 tokens in each
input text are used to compute the embeddings.
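If you have more than 250 texts to embed, you can split them across multiple
requests. The following is a minimal sketch, assuming the Vertex AI Python SDK
(`google-cloud-aiplatform`); the model name, project, and location values are
placeholder assumptions, and the helper only enforces the 250-input limit, not
the separate 20,000-token per-request limit:

```python
# A sketch of chunked embedding calls, assuming the Vertex AI Python SDK
# (pip install google-cloud-aiplatform). Project, location, and model name
# are placeholder assumptions; adjust them for your environment.
import vertexai
from vertexai.language_models import TextEmbeddingModel

MAX_TEXTS_PER_REQUEST = 250  # per-request input-text limit described above


def embed_in_chunks(texts: list[str], model_name: str = "text-embedding-005"):
    """Embeds a long list of texts, sending at most 250 inputs per request.

    This helper doesn't check the 20,000-token per-request limit, and the
    service itself uses only the first 2,048 tokens of each input text.
    """
    model = TextEmbeddingModel.from_pretrained(model_name)
    vectors = []
    for start in range(0, len(texts), MAX_TEXTS_PER_REQUEST):
        chunk = texts[start : start + MAX_TEXTS_PER_REQUEST]
        vectors.extend(model.get_embeddings(chunk))  # one API call per chunk
    return vectors


if __name__ == "__main__":
    vertexai.init(project="your-project-id", location="us-central1")
    embeddings = embed_in_chunks(["hello world"] * 600)  # issues 3 requests
    print(len(embeddings), len(embeddings[0].values))
```

Keep in mind that each chunked call counts separately against the per-minute
quotas described below.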
For `gemini-embedding-001`, the
[quota](/vertex-ai/docs/quotas#model-region-quotas) is listed under the name
`gemini-embedding`.

### Embed content input tokens per minute per base model

Unlike previous embedding models, which were primarily limited by RPM quotas,
the quota for the Gemini Embedding model limits the number of tokens that can
be sent per minute per project.

Vertex AI Agent Engine limits
-----------------------------

The following limits apply to
[Vertex AI Agent Engine](/vertex-ai/generative-ai/docs/agent-engine/overview)
for a given project in each region:

Batch prediction
----------------

The quotas and limits for batch inference jobs are the same across all regions.

### Concurrent batch inference job limits for Gemini models

There are no predefined quota limits on batch inference for Gemini models.
Instead, the batch service provides access to a large, shared pool of
resources, dynamically allocated based on the model's real-time availability
and demand across all customers for that model. When more customers are active
and saturate the model's capacity, your batch requests might be queued until
capacity becomes available.

### Concurrent batch inference job quotas for non-Gemini models

The following table lists the quotas for the number of concurrent batch
inference jobs; these quotas don't apply to Gemini models:

If the number of tasks submitted exceeds the allocated quota, the tasks are
placed in a queue and processed when quota capacity becomes available.

### View and edit the quotas in the Google Cloud console

To view and edit the quotas in the Google Cloud console, do the following:

1. Go to the [Quotas and System Limits](https://console.cloud.google.com/iam-admin/quotas) page.
2. To adjust the quota, copy and paste the property `aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs` in the **Filter**, and press **Enter**.
3. Click the three dots at the end of the row, and select **Edit quota**.
4. Enter a new quota value in the pane, and click **Submit request**.

Vertex AI RAG Engine
--------------------

**Note:** Vertex AI RAG Engine supports the
[VPC-SC security controls](/vertex-ai/generative-ai/docs/security-controls) and
CMEK. Data residency and AXT security controls aren't supported.

For each service that performs retrieval-augmented generation (RAG) using
RAG Engine, the following quotas apply, with each quota measured as requests
per minute (RPM).

The following limits apply:

For more rate limits and quotas, see [Generative AI on Vertex AI
rate limits](/vertex-ai/generative-ai/docs/quotas).

Gen AI evaluation service
-------------------------

The Gen AI evaluation service uses `gemini-2.0-flash` as the default judge
model for model-based metrics. A single evaluation request for a model-based
metric might result in multiple underlying requests to the Gen AI evaluation
service. Each model's quota is calculated on a per-project basis, which means
that any requests directed to `gemini-2.0-flash` for model inference and
model-based evaluation count against the quota. Quotas for the Gen AI
evaluation service and the underlying judge model are shown in the following
table:

If you receive an error related to quotas while using the Gen AI evaluation
service, you might need to file a quota increase request. See [View and manage
quotas](/docs/quotas/view-manage) for more information.

When you use the Gen AI evaluation service for the first time in a new project,
you might experience an initial setup delay of up to two minutes. If your first
request fails, wait a few minutes and then retry. Subsequent evaluation
requests typically complete within 60 seconds.
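To tolerate that one-time setup delay in client code, a simple retry wrapper is
enough. The following is a minimal sketch; `run_evaluation` is a hypothetical
placeholder for whatever call issues your evaluation request (for example,
`EvalTask.evaluate()` from the Vertex AI SDK):

```python
# A minimal retry sketch for first-time evaluation requests. `run_evaluation`
# is a hypothetical placeholder for the call that issues your evaluation
# request; the delays mirror the "wait a few minutes" guidance above.
import time


def call_with_retries(fn, max_attempts: int = 3, initial_delay_s: float = 60.0):
    """Calls fn(), retrying on failure with exponential backoff."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # narrow this to your client's error types
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2


# Hypothetical usage:
# result = call_with_retries(lambda: run_evaluation(dataset, metrics))
```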
The maximum input and output tokens for model-based metrics depend on the model
used as the judge model. See
[Google models](/vertex-ai/generative-ai/docs/learn/models) for a list of
models.

### Vertex AI Pipelines quotas

Each tuning job uses Vertex AI Pipelines. For more information, see
[Vertex AI Pipelines quotas and limits](/vertex-ai/docs/quotas#vertex-ai-pipelines).

What's next
-----------

- To learn more about dynamic shared quota, see [Dynamic shared
  quota](/vertex-ai/generative-ai/docs/dsq).
- To learn about quotas and limits for Vertex AI, see
  [Vertex AI quotas and limits](/vertex-ai/docs/quotas).
- To learn more about Google Cloud quotas and system limits, see the
  [Cloud Quotas documentation](/docs/quotas/overview).