This document explains why rate limiting is used, describes strategies and techniques for rate limiting, and explains where rate limiting is relevant for Google Cloud products. Much of this information applies to several layers in technology stacks, but this document focuses on rate limiting at the application level.
Rate limiting refers to preventing the frequency of an operation from exceeding some constraint. In large-scale systems, rate limiting is commonly used to protect underlying services and resources.
Why rate limiting is used
Rate limiting is generally put in place as a defensive measure for services. Shared services need to protect themselves from excessive use—whether intended or unintended—to maintain service availability. Even highly scalable systems should have limits on consumption at some level. For the system to perform well, clients must also be designed with rate limiting in mind to reduce the chances of cascading failure. Rate limiting on both the client side and the server side is crucial for maximizing throughput and minimizing end-to-end latency across large distributed systems.
Preventing resource starvation
The most common reason for rate limiting is to improve the availability of API-based services by avoiding resource starvation. Many load-based denial-of-service incidents in large systems are unintentional—caused by errors in software or configurations in some other part of the system—not malicious attacks (such as network-based distributed denial of service attacks). Resource starvation that isn't caused by a malicious attack is sometimes referred to as friendly-fire denial of service (DoS).
Generally, a service applies rate limiting at a step before the constrained resource, with some advanced-warning safety margin. Margin is required because there can be some lag in loads, and the protection of rate limiting needs to be in place before critical contention for a resource happens. For example, a RESTful API might apply rate limiting to protect an underlying database; without rate limiting, a scalable API service could make large numbers of calls to the database concurrently, and the database might not be able to send clear rate-limiting signals.
Managing policies and quotas
When the capacity of a service is shared among many users or consumers, it can apply rate limiting per user to provide fair and reasonable use, without affecting other users. These limits might be applied over longer time periods, or they might be applied to resources that are not measured by rate but by quantity allocated. These rate and allocation limits are collectively referred to as quotas. Quotas can also apply to API monetization packages or free tier limits. For details, see the Quotas and caps section of this document.
It is important to understand how these quotas are shared by applications and projects in your organization. For example, a rogue canary release in a production project can consume quota for a resource used by production-serving infrastructure.
In complex linked systems that process large volumes of data and messages, you can use rate limiting to control these flows—whether merging many streams into a single service or distributing a single work stream to many workers.
For example, you can distribute work more evenly between workers by limiting the flow into each worker, preventing a single worker from accumulating a queue of unprocessed items while other workers are idle. Flow control balances having prefetched data ready to process locally with making sure that each node in a system has an equal opportunity to get work done. For more information, see the Pub/Sub: Flow control section.
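As a rough illustration of limiting the flow into each worker, the following sketch caps how many unfinished items a single worker may hold. The names `FlowController` and `pull_batch` are hypothetical and are not part of any Pub/Sub client library; the backlog is modeled as a plain list.

```python
from typing import Any, List


class FlowController:
    """Caps the number of items a worker holds but has not finished."""

    def __init__(self, max_outstanding: int) -> None:
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    def try_acquire(self) -> bool:
        """Reserve one slot; return False when the worker is full."""
        if self.outstanding >= self.max_outstanding:
            return False
        self.outstanding += 1
        return True

    def release(self) -> None:
        """Free one slot after an item has been fully processed."""
        self.outstanding = max(0, self.outstanding - 1)


def pull_batch(backlog: List[Any], flow: FlowController) -> List[Any]:
    """Take items from the shared backlog until the worker's limit is hit."""
    taken = []
    while backlog and flow.try_acquire():
        taken.append(backlog.pop(0))
    return taken
```

With a small `max_outstanding`, a worker prefetches only a little work, so the remaining backlog stays available to idle workers.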
Avoiding excess costs
You can use rate limiting to control costs—for example, if an underlying resource is capable of auto-scaling to meet demand, but the budget for that resource usage is limited. An organization might use rate limiting to prevent experiments from running out of control and accumulating large bills. This concern is, in part, why many Google Cloud quotas are set with initial values that can be increased on request. Other cost-motivated rate limiting might be applied by organizations that are offering fixed-cost SaaS (software as a service) solutions, who need to model their cost, price, and margin per customer.
In a chain or mesh of services, many nodes of the system are both clients and servers. Each part of the system might apply no rate limiting strategy at all, or might combine one or more strategies in different ways, so a view of the whole system is required to ensure that everything is working optimally. Even in the cases where the rate limiting is implemented entirely on the server side, the client should be engineered to react appropriately.
For most situations, if the tool or infrastructure that is implementing your rate-limiting strategy itself is failing or unreachable, your service should fail open and try to serve all requests. Clients typically do not operate beyond quotas, and failing open is less disruptive to large-scale systems than failing closed. Failing closed leads to a complete outage, versus failing open, which leads to a degraded condition. Decisions about failing open or failing closed are mostly relevant on the server side, but knowledge of what retry techniques the clients use on a failed request might influence your decisions made about server behavior.
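A minimal sketch of failing open might look like the following. It assumes a hypothetical `check_limit` callable that queries the rate-limiting backend and raises a hypothetical `LimiterUnavailable` error when that backend cannot be reached.

```python
import logging
from typing import Callable

logger = logging.getLogger("ratelimit")


class LimiterUnavailable(Exception):
    """Raised when the rate-limiting backend cannot be reached (hypothetical)."""


def allow_request(key: str, check_limit: Callable[[str], bool]) -> bool:
    """Return True if the request identified by `key` should be served."""
    try:
        return check_limit(key)
    except LimiterUnavailable:
        # Fail open: the limiter itself is down, so serve the request with
        # degraded protection instead of rejecting all traffic.
        logger.warning("rate limiter unavailable; failing open for %s", key)
        return True
```

Only the "limiter is unreachable" error is treated as a reason to fail open; a normal "over limit" answer from the backend is still honored.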
No rate limiting
It is important to consider the option of performing no rate limiting as a floor in your design—that is, as a worst-case situation that your system must be able to accommodate. Build your system with robust error handling in case some part of your rate-limiting strategy fails, and understand what users of your service will receive in those situations. Ensure that you provide useful error codes and that you don't leak any sensitive data in error codes. Using timeouts, deadlines, and circuit-breaking patterns helps your service to be more robust in the absence of rate limiting.
If your service calls other services to fulfill requests, you can choose how you pass any rate-limiting signals from those services back to the original caller.
The simplest option, pass-through, is to only forward the rate-limiting response from the downstream service to the caller. An alternative is to enforce the rate limits on behalf of the downstream service and block the caller.
Enforce rate limits
The most common rate-limiting strategy is for a service to apply one or more techniques for enforcing rate limits. This rate limiting might be put in place to protect the service directly, or it might be put in place to protect a downstream resource when it is known that the downstream service has no ability to protect itself. For example, if you are running an API service that connects to a legacy backend system that is not resilient under heavy loads, the API service should not use the pass-through strategy assuming that the legacy service will provide its own rate-limiting signals.
To enforce rate limiting, first understand why it is being applied in this case, and then determine which attributes of the request are best suited to be used as the limiting key (for example, source IP address, user, API key). After you choose a limiting key, a limiting implementation can use it to track usage. When limits are reached, the service returns a limiting signal (usually an HTTP 429 status code).
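The steps above (choose a limiting key, track usage, return a limiting signal) could be sketched as follows. This is an illustration under stated assumptions, not a production implementation: the `X-Api-Key` header name, the `KeyedLimiter` class, and the choice of a simple per-key fixed window are all hypothetical.

```python
import math
import time
from typing import Callable, Dict, Tuple


def limiting_key(headers: Dict[str, str], source_ip: str) -> str:
    """Prefer the API key as the limiting key; fall back to the source IP."""
    api_key = headers.get("X-Api-Key")  # hypothetical header name
    return f"key:{api_key}" if api_key else f"ip:{source_ip}"


class KeyedLimiter:
    """Tracks usage per limiting key with a fixed window (illustrative)."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._counts: Dict[str, Tuple[float, int]] = {}  # key -> (start, count)

    def check(self, key: str) -> Tuple[int, Dict[str, str]]:
        """Return (HTTP status, response headers) for one request."""
        now = self.clock()
        start, count = self._counts.get(key, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # the previous window has expired
        if count >= self.limit:
            # Limit reached: signal the caller and say when to retry.
            retry_after = max(1, math.ceil(start + self.window_s - now))
            return 429, {"Retry-After": str(retry_after)}
        self._counts[key] = (start, count + 1)
        return 200, {}
```

Returning a `Retry-After` header alongside the 429 status gives well-behaved clients a hint about how long to wait before retrying.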
If computing a response is expensive or time-consuming, a system might be unable to provide a prompt response to a request, which makes it harder for a service to handle high rates of requests. An alternative to rate limiting in these cases is to shunt requests into a queue and return some form of job ID. This allows the service to maintain higher availability, and it reduces the compute effort for clients that otherwise might be doing long blocking calls while waiting for a response. How the result of the deferred response is returned to the caller is another set of choices, but it generally involves polling on the state of a job ID or through a fully event-based system in which the caller can register a callback or subscribe to an event channel. Such a system is outside the scope of this document.
The deferred response pattern is easiest to apply when the immediate response to a request holds no real information. If this pattern is overused, then it can increase the complexity and failure modes of your system.
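As a sketch of the deferred response pattern, the following in-memory service accepts a request, returns a job ID immediately, and lets the caller poll for the result. The class and method names are hypothetical, and a real system would use a durable queue and separate worker processes.

```python
import queue
import uuid
from typing import Any, Callable, Dict


class DeferredService:
    """Accepts work immediately, returns a job ID, and computes later."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        self._pending = queue.Queue()  # holds (job_id, payload) pairs
        self._jobs: Dict[str, Dict[str, Any]] = {}

    def submit(self, payload: Any) -> str:
        """Queue the request and return a job ID without doing the work."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"state": "PENDING", "result": None}
        self._pending.put((job_id, payload))
        return job_id

    def status(self, job_id: str) -> Dict[str, Any]:
        """Polling endpoint: report the current state of a job."""
        return dict(self._jobs[job_id])

    def run_one(self) -> bool:
        """Process one queued job; a real worker would loop on this."""
        try:
            job_id, payload = self._pending.get_nowait()
        except queue.Empty:
            return False
        self._jobs[job_id] = {"state": "DONE", "result": self._handler(payload)}
        return True
```

Because `submit` does no expensive work, the service can keep accepting requests at a high rate while workers drain the queue at their own pace.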
The strategies described so far apply to rate limiting on the server side. However, these strategies can inform the design of clients, especially when you consider that many components in a distributed system are both client and server.
Just as a service's primary purpose in using rate limiting is to protect itself and maintain availability, a client's primary purpose is to fulfill the request it is making to a service. A service might be unable to fulfill a request from a client for a variety of reasons, including the following:
- The service is unreachable because of network conditions.
- The service returned a non-specific error.
- The service denies the request because of an authentication or authorization failure.
- The client request is invalid or malformed.
- The service rate-limits the caller and sends a backpressure signal.
We recommend designing clients to be resilient to these types of problems. Google-provided client libraries have many built-in features that recognize the above scenarios.
In response to rate-limiting, intermittent, or non-specific errors, a client should generally retry the request after a delay. It is a best practice for this delay to increase exponentially after each failed request, which is referred to as exponential backoff. When many clients might be making schedule-based requests (such as fetching results every hour), additional random time (jitter) should be applied to the request timing, the backoff period, or both to ensure that these multiple client instances don't become a periodic thundering herd and themselves cause a form of DDoS.
Imagine a mobile app with many users that checks in with an API at exactly noon every day, and applies the same deterministic back-off logic. At noon, many clients call the service, which starts rate limiting and returning responses with a 429 status code. The clients then dutifully back off and wait a set amount of time (deterministic delay) of exactly 60 seconds, and then at 12:01 the service receives another large set of requests. By adding a random offset (jitter) to the time of the initial request or to the delay time, the requests and retries can be more evenly distributed, giving the service a better chance of fulfilling the requests.
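One common way to combine exponential backoff with jitter is to pick a random delay between zero and an exponentially growing ceiling (often called "full jitter"). The sketch below assumes a hypothetical `RetryableError` that stands in for 429 responses and intermittent or non-specific failures; the base delay, cap, and attempt count are arbitrary example values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical error for 429, intermittent, or non-specific failures."""


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay below min(cap_s, base_s * 2**attempt)."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_retries(op: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `op`, retrying retryable failures with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt))
    raise AssertionError("not reached")
```

Because each client draws its own random delay, clients that failed at the same moment spread their retries across the backoff interval instead of returning together.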
Ideally, non-idempotent requests can be made in the context of a strongly consistent transaction, but not all service requests can offer such guarantees, and so retries that mutate data need to consider the consequences of duplicate action.
For situations in which the client developer knows that the system that they are calling is not resilient to stressful loads and does not support rate-limiting signals (back-pressure), the client library developer or client application developer can choose to apply self-imposed throttling, using the same techniques for enforcing rate limits that can be used on the server side.
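A client-side throttle can be as simple as spacing calls a minimum interval apart. The following sketch assumes a hypothetical `Throttle` helper that the client calls before each request; the clock and sleep functions are injectable only to make the behavior easy to test.

```python
import time
from typing import Callable


class Throttle:
    """Spaces out calls so that at most `rate_per_s` are sent per second."""

    def __init__(self, rate_per_s: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._interval = 1.0 / rate_per_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self) -> float:
        """Block until the next call is allowed; return the time slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following call one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

A client would call `wait()` immediately before each request to the fragile downstream system.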
For clients of APIs that defer the response with an asynchronous long-running operation ID, the client can choose to enter a blocking loop polling the status of the deferred response, removing this complexity from the user of the client library.
Techniques for enforcing rate limits
In general, a rate is a simple count of occurrences over time. However, there are several different techniques for measuring and limiting rates, each with its own uses and implications.
- Token bucket: A token bucket maintains a rolling and accumulating budget of usage as a balance of tokens. This technique recognizes that not all inputs to a service correspond 1:1 with requests. A token bucket adds tokens at some rate. When a service request is made, the service attempts to withdraw a token (decrementing the token count) to fulfill the request. If there are no tokens in the bucket, the service has reached its limit and responds with backpressure. For example, in a GraphQL service, a single request might result in multiple operations that are composed into a result. These operations may each take one token. This way, the service can keep track of the capacity that it needs to limit the use of, rather than tie the rate-limiting technique directly to requests.
- Leaky bucket: A leaky bucket is similar to a token bucket, but the rate is limited by the amount that can drip or leak out of the bucket. This technique recognizes that the system has some degree of finite capacity to hold a request until the service can act on it; any extra simply spills over the edge and is discarded. This notion of buffer capacity (but not necessarily the use of leaky buckets) also applies to components adjacent to your service, such as load balancers and disk I/O buffers.
- Fixed window: Fixed-window limits—such as 3,000 requests per hour or 10 requests per day—are easy to state, but they are subject to spikes at the edges of the window, as available quota resets. Consider, for example, a limit of 3,000 requests per hour, which still allows for a spike of all 3,000 requests to be made in the first minute of the hour, which might overwhelm the service.
- Sliding window: Sliding windows have the benefits of a fixed window, but the rolling window of time smooths out bursts. Systems such as Redis facilitate this technique with expiring keys.
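To make two of these techniques concrete, the following single-process sketch shows a token bucket that supports a per-request cost and a sliding window based on a log of timestamps. The class names and parameters are illustrative assumptions; other sliding-window variants (such as weighted counters) exist and trade memory for precision.

```python
import time
from collections import deque
from typing import Callable, Deque


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Withdraw `cost` tokens; a request may cost more than one."""
        now = self.clock()
        # Add tokens for the elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller should respond with backpressure


class SlidingWindowLog:
    """Allows at most `limit` events in any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.events: Deque[float] = deque()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have slid out of the window.
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

The `cost` argument on the token bucket reflects the point made above that one request (for example, a GraphQL query) can consume several units of capacity.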
When you have many independently running instances of a service (such as Cloud Functions) in a distributed system, and the service needs to be limited as a whole, you need to use a fast, logically-global (global to all running functions, not necessarily geographically global) key-value store like Redis to synchronize the various limit counters.
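The idea of synchronizing counters through a shared store can be sketched without a real Redis server. In the example below, `FakeSharedStore` is a hypothetical in-memory stand-in that mimics two Redis commands, INCR and EXPIRE; in a real deployment every instance would issue those commands against the same Redis endpoint instead.

```python
import time
from typing import Callable, Dict, Tuple


class FakeSharedStore:
    """In-memory stand-in for a shared key-value store (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[int, float]] = {}  # key -> (value, expiry)

    def incr(self, key: str) -> int:
        """Like INCR: increment and return, creating the key at 1."""
        value, expiry = self._data.get(key, (0, float("inf")))
        if self._clock() >= expiry:
            value, expiry = 0, float("inf")  # the key has expired
        self._data[key] = (value + 1, expiry)
        return value + 1

    def expire(self, key: str, ttl_s: float) -> None:
        """Like EXPIRE: remove the key after `ttl_s` seconds."""
        if key in self._data:
            self._data[key] = (self._data[key][0], self._clock() + ttl_s)


def allow(store: FakeSharedStore, key: str, limit: int, window_s: int,
          clock: Callable[[], float] = time.time) -> bool:
    """Fixed-window check shared by every instance that uses `store`."""
    window = int(clock() // window_s)
    counter = f"rl:{key}:{window}"  # one counter per key per window
    count = store.incr(counter)
    if count == 1:
        store.expire(counter, window_s * 2)  # let old windows clean up
    return count <= limit
```

Against a real store, the increment and the expiry are separate commands, so an implementation would typically need to make them atomic (for example, with a transaction or a server-side script).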
Rate-limiting features in Google Cloud
Every Google API (internal and external) enforces some degree of rate limiting or quota. This is a fundamental principle of service design at Google. This section addresses how these services also expose rate limiting as a feature that you can use when building on Google Cloud.
Quotas and caps
Google Cloud enforces quotas that constrain how much of a particular Google Cloud resource your project can use. Rate quotas specify how much of a resource can be used in a given time, such as API requests per day. You can also set your own constraints on how much a resource can be used in a given time; such custom constraints are called caps.
For details about quotas and caps, including information about how to set caps and request quota increases, see Working with quotas.
Each Google Cloud product has a page that lists limits of the services (such as maximum message sizes) and rate-based quotas (such as the maximum number of queries per second for a certain API). These pages also note whether you can request increases. To find these pages, search the product documentation for quotas and limits.
Generally, Google Cloud quotas are per project, and the windows are per second or per minute. When you have multiple parts of your solution running in a project, it is important to note that they share these quotas.
You can monitor how your quota is being consumed, and even set alerts on changes to the way you are using quota, or when a usage exceeds a certain amount. You can set your own cap on API usage and use budget alerts to control the costs associated with API usage.
Cloud Tasks is a fully managed service that you can use to manage the execution, dispatch, and delivery of a large number of distributed tasks. Using Cloud Tasks, you can asynchronously perform work outside of a user request. Cloud Tasks lets you set both rate and concurrency limits. Cloud Tasks uses the token bucket technique to allow for a degree of burstiness in how messages are delivered within those limits.
Cloud Functions is a lightweight compute solution for developers to create single-purpose, stand-alone functions that respond to cloud events without the need to manage a server or runtime environment. Cloud Functions are stateless and highly scalable by default: Google's managed infrastructure automatically creates function instances to handle incoming request load. Because of this scaling behavior, functions can become targets of high rates of requests, and if these functions call downstream services, the functions can become a source of unintended DoS on those downstream services.
One type of DoS is connection exhaustion on databases. If each function instance establishes a database connection to a backend, a traffic spike might result in automatically scaling up many instances and, consequently, exhausting available connection capacity on database servers. To prevent functions from scaling beyond a certain number of instances, the service provides a per-function max instances setting.
For background functions, Google Cloud invokes your function with the event payload and context. You can specify whether you want the system to retry event delivery if your function fails or is unable to process the event—perhaps, because it's being rate-limited by something downstream.
Though the max instances setting can help you to limit concurrency, it does not give direct control over how many times per second your function can be called. See the What's next section for tutorials demonstrating how to use Redis to globally coordinate rate limiting across function invocations.
Pub/Sub: Flow control
Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications. When moving high numbers of messages through Pub/Sub topics, you might need to tune the rates of how messages are processed at consuming clients so that consumers working in parallel are effective and not holding too many outstanding messages, impacting overall processing latency. To tune this behavior, Pub/Sub clients expose several flow-control settings.
Cloud Run is a managed compute platform that enables you to run stateless containers that are invocable through HTTP requests. Unlike Cloud Functions, a single container instance can process multiple requests concurrently if supported by the serving stack in the container.
Istio is an open source independent service mesh that provides the fundamentals that you need to successfully run a distributed microservice architecture. A resilient microservice architecture requires services to be defensive against rogue peer services, so Istio provides for rate limiting directly in the service mesh.
Cloud Endpoints is an API management system that helps you to secure, monitor, analyze, and set quotas on your APIs using the same infrastructure that Google uses for its own APIs. As a service designed to help you to expose services to the external world, it provides the ability to configure your own quotas, including rate-based policies.
Apigee is a platform for developing and managing API proxies. An API proxy is your interface to developers that want to use your backend services. Rather than having them consume those services directly, they access an Apigee API proxy that you create. It is common to put Apigee in front of backend services that might not have their own rate-limiting capabilities, so rate limiting is a built-in feature of Apigee.
Google Cloud Armor
Google Cloud Armor uses Google's global infrastructure and security systems to deliver defense against distributed denial of service (DDoS) attacks against infrastructure and applications. This includes built-in logic regarding malicious spikes and high rate loads to your protected services.
Though not a Google Cloud service, Project Shield uses Google infrastructure to protect qualifying sites from DDoS attacks.
Additional techniques for greater resilience
Rate limiting at the application level can provide services with increased resilience, but resilience can be further improved by combining application-level rate limiting with other techniques:
- Caching: Storing results that are slow to compute makes it possible for a service to process a higher rate of requests, which might cause rate-limiting backpressure to be applied less frequently to clients.
- Circuit breaking: You can make service networks more resilient to problems resulting from propagation of recurring errors by making parts of the system latch temporarily to a quiet state. For an example implementation, see the circuit breaking section of the Istio documentation.
- Prioritization: Not all users of a system are of equal priority. Consider additional factors in designing rate-limiting keys to ensure that higher-priority clients are served. You can use load shedding to remove the burden of lower-priority traffic from systems.
- Rate limiting at multiple layers: If your machine's network interface or OS kernel is being overwhelmed, then application-layer rate limiting might never even have a chance to begin. You can apply rate limits at layer 3 in iptables, or on-premises appliances can limit at layer 4. You might also be exposed to tuneable rate limits applied to your system's I/O for things like disk and network buffers.
- Monitoring: It's crucial for operations systems personnel to know when throttling is occurring. Monitoring for rates that exceed quotas is critical for incident management and catching regressions in software. We recommend implementing such monitoring for both the client and server perspectives of services. Not all occurrences of rate limiting should cause alerts that demand immediate attention by operations personnel. In non-extreme cases, you can respond to rate-limiting signals later, as part of routine evaluation and maintenance of your system. You can use the logs regarding when rate limiting occurs as a signal that you need to make changes, such as increasing the capacity of a component, requesting an increase in a quota, or modifying a policy.
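Of the techniques listed above, circuit breaking is the one most often implemented in application code. The sketch below is a simplified, single-threaded illustration with hypothetical names and thresholds: after a number of consecutive failures, it stops calling the dependency for a quiet period and then allows one trial call.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, op: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency is in its quiet period")
            # Quiet period is over: allow one trial call (half-open).
        try:
            result = op()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open or re-open the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError`, which keeps recurring errors from propagating load onto a dependency that is already struggling.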
What's next
The SRE books have great points for designing complex systems.
These blog posts from other organizations further explore rate limiting:
- Scaling your API with rate limiters
- Announcing Ratelimit: Go/gRPC service for generic rate limiting
- How we built rate limiting capable of scaling to millions of domains
- An alternative approach to rate limiting
- High-performance rate limiting
- How to Design a Scalable Rate Limiting Algorithm
These tutorials provide step-by-step instructions for rate-limiting techniques:
- Rate-limiting serverless functions with Redis and VPC Connector
- Deferring requests: Asynchronous patterns for Cloud Functions
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.