In Cloud Run, each revision is automatically scaled to the number of container instances needed to handle all incoming requests. When a revision does not receive any traffic, by default it is scaled in to zero container instances. However, if desired, you can change this default to specify an instance to be kept idle or "warm" using the minimum instances setting.
In addition to the rate of incoming requests, the number of instances scheduled is impacted by:
- The CPU utilization of existing instances (Cloud Run aims to keep scheduled instances at 60% CPU utilization)
- The maximum concurrency setting
- The maximum number of container instances setting
- The minimum number of container instances setting
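The interaction between request rate, concurrency, and the minimum and maximum settings can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not Cloud Run's actual autoscaling algorithm: it ignores the CPU utilization target entirely, and the function name, parameters, and default values are assumptions made for illustration.

```python
import math


def estimate_instances(requests_per_second, avg_latency_seconds,
                       max_concurrency, min_instances=0, max_instances=100):
    """Rough estimate of the instances needed for a steady load.

    Hypothetical helper: a simplified model that ignores CPU utilization.
    """
    # Little's law: average in-flight requests = arrival rate x time in system.
    in_flight = requests_per_second * avg_latency_seconds
    # Each instance handles at most max_concurrency requests at once.
    needed = math.ceil(in_flight / max_concurrency)
    # Clamp the result to the configured minimum and maximum settings.
    return max(min_instances, min(needed, max_instances))


# 400 requests/s at 0.5 s each is 200 requests in flight;
# with a concurrency of 80, that rounds up to 3 instances.
print(estimate_instances(400, 0.5, 80))
```

With no traffic the estimate falls to the minimum instances value, and with very high traffic it is capped at the maximum instances value.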
About maximum container instances
In some cases you may want to limit the total number of container instances that can be started, for cost control reasons, or for better compatibility with other resources used by your service. For example, your Cloud Run service might interact with a database that can only handle a certain number of concurrent open connections.
You can use the maximum container instances setting to limit the total number of instances that can be started in parallel, as documented in Setting a maximum number of container instances.
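For the database case, one way to pick a maximum instances value is to divide the database's connection limit by the number of connections each instance may open. This is a minimal sketch under assumed numbers; the function name, the per-instance pool size, and the reserved connection count are all hypothetical.

```python
def max_instances_for_database(db_max_connections, connections_per_instance,
                               reserved_connections=0):
    """Largest instance count that keeps total connections within the limit.

    Hypothetical helper: assumes every instance opens at most
    connections_per_instance connections (for example, its pool size).
    """
    if connections_per_instance <= 0:
        raise ValueError("connections_per_instance must be positive")
    # Keep some connections free for other clients, such as admin tools.
    usable = db_max_connections - reserved_connections
    # Integer division rounds down, so the limit is never exceeded.
    return max(usable // connections_per_instance, 0)


# 100 connections, 10 reserved, 5 per instance: (100 - 10) // 5 = 18 instances.
print(max_instances_for_database(100, 5, reserved_connections=10))
```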
Exceeding maximum instances
Under normal circumstances, your revision scales out by creating new instances to handle incoming traffic load. But when you set a maximum instances limit, in some scenarios there will be insufficient instances to meet that traffic load. In that case, incoming requests can be queued for a limited time window. During this window, if an instance finishes processing requests, it becomes available to process queued requests. If no instances become available during the window, the request fails with a 429 error code.
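A client that calls a service with a maximum instances limit may want to retry requests that fail with a 429. The sketch below shows one common approach, exponential backoff with jitter. It is an illustration, not a prescribed Cloud Run client: `send_request` is a hypothetical callable that returns a `(status_code, body)` tuple, and the attempt count and delays are assumed values.

```python
import random
import time


def call_with_backoff(send_request, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Retry a request while the service answers with HTTP 429.

    send_request is a hypothetical callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send_request()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Double the delay ceiling on each attempt, up to max_delay,
            # and pick a random delay below it to spread out retries.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    # Every attempt returned 429; hand the last response to the caller.
    return status, body
```

The `sleep` parameter exists only so the delay can be replaced in tests; callers would normally leave it at its default.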
The maximum instances limit is an upper limit per revision. Setting a high limit does not mean that your revision will scale out to the specified number of container instances. It only means that the number of container instances for this revision should not exceed the maximum.
Exceeding max instances
In some cases, such as rapid traffic surges or system maintenance, Cloud Run might, for a short period of time, create more container instances than are specified in the maximum instances setting. New instances can be started in excess of the maximum instances setting to replace existing instances and to provide a grace period for inflight requests to finish processing.
The maximum instance limit can be exceeded under normal operation a few times per week. The grace period usually lasts up to 15 minutes, or up to the value specified in the request timeout setting. These extra instances are destroyed within 15 minutes after they become idle.
If many replacements are needed, the updates are usually spread out over many minutes or hours, and each replacement adds an excess instance only for the duration of the grace period. When the limit is exceeded, the instance count normally stays below twice the configured maximum instances limit, but it can be much larger during sudden large traffic spikes.
Load tests tend to see more instances in excess of the maximum instances setting, because the system may change where traffic spikes are served in order to preserve capacity for existing workloads that have sustained load patterns.
If your service cannot tolerate this temporary behavior, you may want to factor in a safety margin and set a lower maximum instances value.
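One way to reason about such a safety margin is to start from the hard limit your downstream resources can tolerate and divide by the overshoot you want to absorb. The sketch below assumes an overshoot factor of 2, based on the "normally less than twice" guidance above. That factor is an assumption, not a guarantee, since the excess can be much larger during sudden spikes, and the function name is hypothetical.

```python
import math


def max_instances_with_margin(hard_limit, overshoot_factor=2.0):
    """Maximum instances value that leaves room for temporary overshoot.

    Hypothetical helper: overshoot_factor is an assumed planning value.
    """
    if overshoot_factor < 1:
        raise ValueError("overshoot_factor must be at least 1")
    # Round down so that limit * overshoot_factor stays within hard_limit.
    # The setting cannot go below 1, so very small hard limits get no margin.
    return max(1, math.floor(hard_limit / overshoot_factor))


# A resource that tolerates 100 instances, planning for up to 2x overshoot.
print(max_instances_with_margin(100))  # 100 / 2.0 = 50
```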
Because the maximum instances limit is a limit for each revision, if the service splits traffic across multiple revisions, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.
When you deploy a new revision to serve 100% of the traffic, Cloud Run gradually migrates traffic from the revision previously serving 100% of the traffic to the new one. Because the maximum instances limit is a limit for each revision, during a deployment, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.
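Because the limit applies per revision, a simple upper bound for the whole service is the sum of the limits of all revisions that receive traffic. The sketch below illustrates that arithmetic; the revision names and the function are hypothetical, and the bound does not account for the temporary excess instances described earlier.

```python
def worst_case_service_instances(revision_max_instances, traffic_percent):
    """Sum the per-revision limits of every revision that receives traffic.

    Hypothetical helper: both arguments are dicts keyed by revision name.
    """
    return sum(
        limit
        for revision, limit in revision_max_instances.items()
        if traffic_percent.get(revision, 0) > 0
    )


# Two revisions, each limited to 10 instances, split 90/10 during a rollout.
limits = {"rev-a": 10, "rev-b": 10}
split = {"rev-a": 90, "rev-b": 10}
print(worst_case_service_instances(limits, split))  # 10 + 10 = 20
```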
Idle instances and minimizing cold starts
Cloud Run does not immediately shut down instances once they have handled all requests. To minimize the impact of cold starts, Cloud Run may keep some instances idle for a maximum of 15 minutes. These instances are ready to handle requests in case of a sudden traffic spike.
For example, when a container instance has finished handling requests, it may remain idle for a period of time in case another request needs to be handled. An idle container instance may persist resources, such as open database connections. Note that CPU is only allocated during request processing unless you explicitly configure your service to have CPU always allocated.
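Because an idle instance can persist resources, a common pattern is to create expensive objects once per instance and reuse them across requests. The sketch below uses an in-memory `sqlite3` database purely as a stand-in for a real database client; `get_connection` and `handle_request` are hypothetical names, not part of any Cloud Run API.

```python
import sqlite3

# Module-level state lives for as long as the container instance does,
# so it survives between requests while the instance stays warm.
_connection = None


def get_connection():
    """Open the connection on first use, then reuse it for later requests."""
    global _connection
    if _connection is None:
        # Stand-in for a real database client; an assumption for this sketch.
        _connection = sqlite3.connect(":memory:", check_same_thread=False)
    return _connection


def handle_request():
    """Hypothetical request handler that reuses the shared connection."""
    conn = get_connection()
    return conn.execute("SELECT 1").fetchone()[0]


print(handle_request())
```

Since CPU is only allocated during request processing by default, this pattern should not rely on background work, such as keep-alive pings, running while the instance is idle.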
- To manage the maximum number of instances of your Cloud Run services, see Setting a maximum number of container instances.
- To manage the maximum number of simultaneous requests handled by each container instance, see Setting concurrency.
- To optimize your concurrency setting, see development tips for tuning concurrency.
- To keep an idle instance running to minimize latency or cold starts on first requests, see how to use min-instance to enable idle instances.