This page explains dynamic shared quota (DSQ) and how DSQ is different from Provisioned Throughput. An example is also presented to explain how DSQ works.
DSQ distributes available on-demand capacity among all queries being processed by Google Cloud services for specific models. This capability eliminates the need to set quota limits or to submit quota increase requests (QIRs).
DSQ processes requests made by projects in a group of regions. Quotas are removed, and available capacity is distributed to each project. DSQ helps to ensure that continuous service is provided to both small and large projects.
With the existing Cloud Quotas system, quota is reallocated every minute. You might exhaust your quota for that minute in the first 10 seconds, and your project is then unable to do anything for the remaining 50 seconds until the quota resets. If your traffic exceeds the quota that has been set, the excess is throttled (rejected). With DSQ, your capacity distribution is reevaluated each second. If there's available capacity, your project might get more traffic (queries) processed.
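The per-minute behavior described above can be illustrated with a small simulation. This is only a sketch: the function name `fixed_window_accepts`, the request rate of 6 per second, and the quota of 60 per minute are illustrative assumptions, not values or APIs from Cloud Quotas.

```python
def fixed_window_accepts(requests_per_second, quota_per_minute, seconds=60):
    """Return, for each second, how many requests a per-minute quota lets through.

    Hypothetical model of a fixed one-minute window; not a Cloud Quotas API.
    """
    accepted = []
    used = 0
    for second in range(seconds):
        if second % 60 == 0:
            used = 0  # the quota resets at the start of every minute
        allowed = min(requests_per_second, quota_per_minute - used)
        used += allowed
        accepted.append(allowed)
    return accepted


# Illustrative numbers: 6 requests per second against a quota of 60 per minute.
timeline = fixed_window_accepts(requests_per_second=6, quota_per_minute=60)
first_blocked_second = next(i for i, n in enumerate(timeline) if n == 0)
print(first_blocked_second)  # second at which requests start being rejected
print(timeline.count(0))     # number of seconds with no accepted requests
```

With these assumed numbers, the quota is used up after the first 10 seconds and nothing is accepted for the remaining 50 seconds of the window.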
Provisioned Throughput is the only way to ensure high availability for your application and to get predictable service levels for your production workloads. For more information about Provisioned Throughput, see Provisioned Throughput.
Supported models
This section lists models that support dynamic shared quota (DSQ), which is enabled by default in these models.
DSQ is processed as pay-as-you-go. If you go over your allocated capacity, a 429 error is generated. For more information on troubleshooting the error, see Error code 429.
Google models
The following table lists the Google models (and versions) that support DSQ:
Model | DSQ release date | Status |
---|---|---|
Gemini 1.5 Flash (gemini-1.5-flash-002) | September 24, 2024 | Live |
Gemini 1.5 Pro (gemini-1.5-pro-002) | September 24, 2024 | Live |
Partner models
The following table lists the Claude models that support DSQ. For more information about Claude models, see Use the Claude models from Anthropic.
How dynamic shared quota works
This section explains fundamental terms that are core to understanding how dynamic shared quota (DSQ) works, followed by an analogy and examples.
Limit, quota, and capacity
Limit, quota, and capacity are different. For example, quota isn't the same as capacity.
A limit is a maximum amount that's set to restrict the number of requests that a project can make on a model. That value can't be changed. Google protects its systems by using limits.
A quota is also a limit that Google imposes to restrict the number of requests that projects make on specific models, but unlike a limit, a quota can be changed. A quota specifies the number of requests that can be made to a model, but it doesn't guarantee that capacity is allocated to that project. Quotas were created to protect the system from overloading and to prevent misuse of Google Cloud services.
The capacity is how many resources are available to your project to process your requests. Capacity is limited by your quota, but the quota doesn't guarantee that the capacity is available.
Capacity allocation for DSQ is at the project level.
How quota and capacity work in DSQ
The river-and-the-cup analogy clearly explains how quota and capacity work in DSQ.
Imagine that your community lives by a river, and each person in your community is given a 12-oz drinking cup to get water from that river. The river is filled with water, but each person's cup can only hold 12 ounces of water.
As long as the river has enough water, each person can refill their cup based on their needs up to the 12-oz limit. However, if that river starts to go dry, each person must receive a lesser amount, for example two or four ounces of water.
The amount that the river holds is the capacity. The amount that the cup can hold is the quota.
Each person sees only what's in their own cup and not the river. You can see your quota (also referred to as query limits) using the Quotas & System Limits page in the Google Cloud console.
With DSQ, you hold a magic cup that holds unlimited water (capacity), because quotas no longer exist. DSQ doesn't depend on how much your cup can hold but focuses on the distribution of the river's water depending on the number of cups and the needed capacity of each cup that must share that capacity.
Example of how DSQ works
The following table shows four projects that share a total capacity of 100 QPS. The rows in the table include the following:
- Current demand: How much each project wants to use. In this example, the combined demand of 317 QPS exceeds the total capacity of 100 QPS.
- Current proportional allocation: The result of dividing the capacity in proportion to each project's demand. Project A gets the greatest quota because it requested the most, which leaves the other projects without enough quota.
- DSQ allocation: The capacity that DSQ allocates across the projects.
Metric (QPS) | Project A | Project B | Project C | Project D |
---|---|---|---|---|
Current demand | 250 | 32 | 25 | 10 |
Current proportional allocation | 79 | 10 | 8 | 3 |
DSQ allocation | 33 | 32 | 25 | 10 |
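The proportional row in the table can be reproduced with a few lines of Python. This is a minimal sketch for checking the arithmetic. The function name `proportional_allocation` is hypothetical and isn't part of any Google Cloud API.

```python
def proportional_allocation(demand, capacity):
    """Split capacity in proportion to each project's demand (illustrative only)."""
    total = sum(demand.values())
    if total <= capacity:
        # Everyone fits, so every project gets what it asked for.
        return dict(demand)
    return {name: capacity * qps / total for name, qps in demand.items()}


# Demand values from the table, in QPS.
demand = {"A": 250, "B": 32, "C": 25, "D": 10}
shares = proportional_allocation(demand, capacity=100)
print({name: round(qps) for name, qps in shares.items()})
# {'A': 79, 'B': 10, 'C': 8, 'D': 3}
```

Under this scheme every project is cut by the same ratio, so the smallest projects receive far less than they need even though their demand is modest.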
These steps show you how to calculate the DSQ allocation:
1. Each project starts with an equal share of the total capacity: 100 QPS divided among four projects is 25 QPS each.
2. Project D uses only 10 QPS of its 25 QPS. Therefore, the unused capacity of 15 QPS is redistributed.
3. Project C gets enough quota to continue receiving 25 QPS.
4. Projects A and B remain in need of more quota. Therefore, the extra quota from project D (15 QPS) is divided and distributed equally to projects A and B (7.5 QPS each).
5. Project B receives 7.5 QPS from project D to reach 32.5 QPS, and project A is restricted to 32.5 QPS. Because project B needs only 32 QPS, the table shows 32 QPS for project B and 33 QPS for project A. Project A receives a 429 error for the requests that exceed its allocated capacity.
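The redistribution described in these steps resembles a max-min fair-share calculation. The following sketch assumes that model; it isn't Google's actual allocation algorithm, and the function name `fair_share_allocation` is hypothetical. It also assumes that capacity a project can't use (such as anything project B receives beyond its 32 QPS demand) flows on to the projects that still need more, which reproduces the whole numbers in the DSQ allocation row.

```python
def fair_share_allocation(demand, capacity):
    """Max-min fair sharing: small demands are met in full, the rest is split equally.

    Illustrative model only; not the algorithm that DSQ uses internally.
    """
    allocation = {}
    remaining = float(capacity)
    # Visit projects from the smallest demand to the largest.
    pending = sorted(demand, key=demand.get)
    while pending:
        equal_share = remaining / len(pending)
        name = pending.pop(0)
        # A project never receives more than it asks for. Whatever it
        # leaves unused stays in `remaining` for the larger projects.
        allocation[name] = min(demand[name], equal_share)
        remaining -= allocation[name]
    return allocation


# Demand values from the table, in QPS.
demand = {"A": 250, "B": 32, "C": 25, "D": 10}
result = fair_share_allocation(demand, capacity=100)
print({name: result[name] for name in sorted(result)})
```

With the table's demand values, projects B, C, and D are fully served (32, 25, and 10 QPS) and project A is capped at 33 QPS. Requests from project A beyond that cap would be the ones that receive a 429 error.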
Example of capacity in a specific region
Google Cloud looks at the available capacity in a specific region, such as North America, and then looks at how many projects are sending requests.
Consider project A, which sends 25 queries per minute (QPM), and project B, which sends 25 QPM. The service can support 100 QPM. If project A increases the rate of its queries to 75 QPM, then DSQ supports the increase. If project A increases the rate of its queries to 100 QPM, then DSQ decreases project A to 75 QPM in order to continue to serve project B at 25 QPM.
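The two-project scenario can be checked with a short sketch. It assumes that each project is guaranteed the smaller of its demand and an equal share of the capacity, and that a project can borrow whatever the other leaves unused. The function name `served_rates` is hypothetical, and the model is an illustration rather than the documented DSQ implementation.

```python
def served_rates(demand_a, demand_b, capacity):
    """Return the (A, B) rates served when two projects share one capacity pool."""
    fair_share = capacity / 2
    # Each project is guaranteed the smaller of its demand and an equal share.
    guaranteed_a = min(demand_a, fair_share)
    guaranteed_b = min(demand_b, fair_share)
    # A project can borrow whatever the other project doesn't claim.
    served_a = min(demand_a, capacity - guaranteed_b)
    served_b = min(demand_b, capacity - guaranteed_a)
    return served_a, served_b


# Rates in queries per minute (QPM), taken from the example above.
print(served_rates(25, 25, capacity=100))   # both projects fully served
print(served_rates(75, 25, capacity=100))   # A's increase fits in the spare capacity
print(served_rates(100, 25, capacity=100))  # A is held to 75 so that B keeps 25
```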
Considerations
Before making a decision to purchase a model that supports DSQ, review the following considerations:
Consideration | Solution |
---|---|
Control cost and prevent budget overruns. | Configure a self-imposed quota called a consumer quota override. For more information, see Creating a consumer quota override. |
Prioritize traffic. | Use Provisioned Throughput. |
Monitor your usage. | View the metrics in the aiplatform section in the Cloud Monitoring documentation. |
Monitor your QPS usage
To monitor your Gemini QPS usage, see the Quotas & System Limits page.
Troubleshoot DSQ errors
When the shared capacity by region is exhausted, your query might receive a 429 error. To troubleshoot errors that might occur, see Error code 429.
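A common way for a client to handle a 429 error is to retry with exponential backoff and jitter. The following is a minimal sketch under stated assumptions: `TooManyRequestsError` is a hypothetical stand-in for whatever exception your client library raises on HTTP 429, `send_request` is any zero-argument callable that you supply, and the delay values are illustrative defaults rather than recommended settings.

```python
import random
import time


class TooManyRequestsError(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""


def call_with_backoff(send_request, max_attempts=5, base_delay=1.0,
                      max_delay=32.0, sleep=time.sleep):
    """Call send_request(), retrying on 429 with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TooManyRequestsError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay ceiling on each attempt, up to max_delay,
            # and pick a random wait below it so that clients don't retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so that the wait can be replaced in tests. In normal use, call the function as `call_with_backoff(my_request_function)`, where `my_request_function` is your own code.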
What's next
- To learn more about Gemini models that support DSQ, see Gemini models.
- To learn more about Generative AI quotas and limits, see Generative AI on Vertex AI rate limits.
- To learn more about quotas and limits for Vertex AI, see Vertex AI quotas and limits.
- To learn more about Google Cloud quotas and limits, see Understand quota values and system limits.