About instance autoscaling in Cloud Run services

In Cloud Run, each revision is automatically scaled to the number of instances needed to handle all incoming requests, events, or CPU utilization.

When a revision does not receive any traffic, by default it is scaled in to zero instances. However, if desired, you can change this default to specify an instance to be kept idle or "warm" using the minimum instances setting. If you are using CPU outside of requests, you should set minimum instances equal to 1.

In addition to the rate of incoming requests, events, or CPU utilization, the number of instances scheduled is impacted by:

The average CPU utilization of existing instances over a one minute window, targeting to keep scheduled instances to a 60% CPU utilization.
The current request concurrency, compared to the maximum concurrency over a one minute window.
The maximum number of instances setting
The minimum number of instances setting

The Cloud Run autoscaler evaluates these every 5 seconds.

CPU always allocated and autoscaling

If you configure your Cloud Run service to have CPU always allocated, you should be aware of scaling to and from zero behavior.

CPU always allocated scaling from zero. Scaling from zero can only be triggered by a request, so a service that is not processing requests cannot scale from zero. For these workloads, you can either set minimum instances > 0, or include a "wake-up request" in your design to restart processing after scaling to zero.

CPU always allocated scaling to zero. Given that no instance is ever at 0% CPU, looking at all CPU usage would result in never scaling to zero. This means the decision to scale from one to zero can only be made by checking to see if the instance is processing a request.

About maximum instances

In some cases you may want to limit the total number of instances that can be started, for cost control reasons, or for better compatibility with other resources used by your service. For example, your Cloud Run service might interact with a database that can only handle a certain number of concurrent open connections.

You can use the maximum instances setting to limit the total number of instances that can be started in parallel, as documented in Setting a maximum number of instances.

Exceeding maximum instances

Under normal circumstances, your revision scales out by creating new instances to handle incoming traffic load. But when you set a maximum instances limit, in some scenarios there will be insufficient instances to meet that traffic load. In that case, incoming requests are queued (pending) as follows:

If new instances are starting up, such as during a scale-out, requests will pend for at least the average startup time of container instances of this service. This includes when the request initiates a scale-out, such as when scaling from zero.
If the startup time is less than 10 seconds, requests will pend for up to 10 seconds.
If there are no instances in the process of starting, and the request does not initiate a scale-out, requests will pend for up to 10 seconds.

During this time window, if an instance finishes processing requests, it becomes available to process the queued pending requests. If no instances become available during the window, the request fails with a 429 error code.

Scaling guarantees

The maximum instances limit is an upper limit per revision and it means that the number of instances for this revision shouldn't exceed the maximum.

Under normal circumstances, Cloud Run is able to scale out to the maximum instances limit very fast to handle all incoming requests or events. However, setting a high limit does not mean that your revision will be able scale out to the specified number of instances at any given moment. In exceptional circumstances, Cloud Run can throttle scaling to ensure good service for all customers.

Exceeding maximum instances due to traffic spikes

In some cases, such as rapid traffic surges or system maintenance, Cloud Run might, for a short period of time, create more instances than are specified in the maximum instances setting. New instances can be started in excess of the maximum instances setting to replace existing instances and to provide a grace period for inflight requests to finish processing.

The maximum instance limit can be exceeded under normal operation a few times per week. The grace period usually lasts up to 15 minutes, or up to the value specified in the request timeout setting. These extra instances are destroyed within 15 minutes after they become idle.

If many replacements are needed, the updates are usually spread out over many minutes or hours, but each replacement has an excess instance for just the grace period. Instances in excess of the maximum instance value are normally less than twice the configured maximum instances limit, but can be much larger for sudden large traffic spikes.

Load tests experience more instances exceeding the maximum instances setting because the system may change where traffic spikes are served to preserve capacity for existing workloads that have sustained load patterns.

If your service cannot tolerate this temporary behavior, you may want to factor in a safety margin and set a lower maximum instances value.

Traffic splits

Because the maximum instances limit is a limit for each revision, if the service splits traffic across multiple revisions, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.

Deployments

When you deploy a new revision to serve 100% of the traffic, Cloud Run starts enough instances of the new revision before directing traffic to it. This reduces the impact of new revision deployments on request latencies, notably when serving high levels of traffic. Because the maximum instances limit is a limit for each revision, during a deployment, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.

Idle instances and minimizing cold starts

Cloud Run does not immediately shut down instances once they have handled all requests. To minimize the impact of cold starts, Cloud Run may keep some instances idle for a maximum of 15 minutes. These instances are ready to handle requests in case of a sudden traffic spike.

For example, when an instance has finished handling requests, it may remain idle for a period of time in case another request needs to be handled. An idle instance may persist resources, such as open database connections. Note that CPU is only allocated during request processing unless you explicitly configure your service to have CPU always allocated.

To keep idle instances permanently available, use the min-instance setting. Note that using this feature will incur cost even when the service is not actively serving requests.

Autoscaling and pending requests

If new instances are starting up, such as during a scale-out, requests will pend for at least the average startup time of container instances of this service. This includes when the request initiates a scale-out, such as when scaling from zero.
If the startup time is less than 10 seconds, requests will pend for up to 10 seconds.
If there are no instances in the process of starting, and the request does not initiate a scale-out, requests will pend for up to 10 seconds.

Autoscaling impact on backing services

As the number of instances automatically increases, your Cloud Run service might encounter limits with its backing services. For example, Cloud SQL has an API quota limit. Make sure these backing services have enough quota and can handle connections from all instances of your Cloud Run service. Consider setting a maximum number of instances to avoid overloading backing services.

Autoscaling and Pub/Sub

Google recommends using push subscriptions to consume messages from a Pub/Sub topic on Cloud Run. Pushed messages are received like HTTP requests by the container, thus triggering the same autoscaling behavior.

Autoscaling and multiple containers (sidecars)

Cloud Run considers the CPU utilization of instances for autoscaling, where the CPU utilization of an instance is the percentage of allocated CPU in use.

Note that you allocate CPU when you set CPU limits at the container level. If you use multiple containers per instance, the actual CPU allocation for that instance is the sum of the CPU limits you set on each container.

What's next

To manage the maximum number of instances of your Cloud Run services, see Setting a maximum number of instances.
To manage the maximum number of simultaneous requests handled by each instance, see Setting concurrency.
To optimize your concurrency setting, see development tips for tuning concurrency.
To specify an idle instance to keep running to minimize latency or cold starts on first requests, see Using min-instance to enable idle instances.