How Instances are Managed

Instances are the basic building blocks of App Engine, providing all the resources needed to successfully host your application. This includes the language runtime, the App Engine APIs, and your application's code and memory. Each instance includes a security layer to ensure that instances cannot inadvertently affect each other.

Instances are the computing units that App Engine uses to automatically scale your application. At any given time, your application can be running on one instance or many instances, with requests being spread across all of them.

Introduction to instances

Instances are resident or dynamic. A dynamic instance starts up and shuts down automatically based on the current needs. A resident instance runs all the time, which can improve your application's performance. Both dynamic and resident instances instantiate the code included in an App Engine service version.

If you use manual scaling for an app, the instances it runs on are resident instances. If you use either basic or automatic scaling, your app runs on dynamic instances.

Configuring your app includes specifying how its services scale, including:

  • The initial number of instances for a service.
  • How new instances are created or stopped in response to traffic.
  • The allotted amount of time in which an instance is allowed to handle a request.
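These choices are expressed in the service's app.yaml file. A minimal sketch, assuming a Go service in the standard environment (the runtime and values are illustrative, not recommendations):

```yaml
# Illustrative app.yaml fragment.
runtime: go111          # language runtime for this version
instance_class: F2      # compute resources (memory and CPU) per instance

automatic_scaling:
  min_instances: 1      # baseline number of instances to keep running
  max_instances: 10     # upper bound on scale-out
```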

The scaling type you assign to a service determines whether its instances are resident or dynamic:

  • Auto scaling services use dynamic instances.
  • Manual scaling services use resident instances.
  • Basic scaling services use dynamic instances.

App Engine charges for instance usage on an hourly basis. You can track your instance usage on the Google Cloud Platform Console Instances page. If you want to set a limit on incurred instance costs, you can do so by setting a spending limit. Each service that you deploy to App Engine behaves like a microservice that independently scales based on how you configured it.

Scaling dynamic instances

App Engine applications are powered by any number of dynamic instances at a given time, depending on the volume of incoming requests. As requests for your application increase, so does the number of dynamic instances.

The App Engine scheduler decides whether to serve each new request with an existing instance (either an idle instance or one that accepts concurrent requests), put the request in a pending request queue, or start a new instance for that request. The decision takes into account the number of available instances, how quickly your application has been serving requests (its latency), and how long it takes to start a new instance.

If you use automatic scaling, you can optimize the scheduler behavior to obtain your desired performance versus cost trade-off by setting the values for target_cpu_utilization, target_throughput_utilization, and max_concurrent_requests.

The automatic scaling parameters are:

  • Target CPU Utilization (target_cpu_utilization): Sets the CPU utilization threshold at which more instances are started to handle traffic.
  • Target Throughput Utilization (target_throughput_utilization): Sets the throughput threshold, in terms of the number of concurrent requests, after which more instances are started to handle traffic.
  • Max Concurrent Requests (max_concurrent_requests): Sets the maximum number of concurrent requests an instance can accept before the scheduler spawns a new instance.
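Assuming the standard environment's app.yaml syntax, these three parameters are set under automatic_scaling; the values below are illustrative:

```yaml
automatic_scaling:
  target_cpu_utilization: 0.65          # start new instances above 65% CPU usage
  target_throughput_utilization: 0.6    # start new instances above 60% of max_concurrent_requests
  max_concurrent_requests: 20           # concurrent requests one instance may accept
```

Lower thresholds favor latency, because instances are added sooner; higher thresholds favor cost, because existing instances are driven harder before new ones start.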

Watch the App Engine New Scheduler Settings video to see the effects of these settings.

Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.

App Engine scales up very quickly, so if you send batches of requests to your services (for example, to a task queue for processing), a large number of instances can be created in a short time. We recommend controlling this by rate limiting the number of requests sent per second, if possible. For example, in an App Engine task queue, you can control the rate at which tasks are pushed.
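For example, a queue.yaml entry can cap the dispatch rate of a push queue (the queue name and values here are illustrative):

```yaml
queue:
- name: batch-processing       # hypothetical queue name
  rate: 20/s                   # dispatch at most 20 tasks per second
  max_concurrent_requests: 10  # and allow at most 10 tasks in flight at once
```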

App Engine also scales the number of instances back down when request volume decreases. This scaling helps ensure that all of your application's current instances are used with optimal efficiency and cost effectiveness.

When an application is not being used at all, App Engine turns off its associated dynamic instances, but readily reloads them as soon as they are needed. Reloading instances can result in loading requests and additional latency for users.

You can specify a minimum number of idle instances. Setting an appropriate number of idle instances for your application based on request volume allows your application to serve every request with little latency, unless you are experiencing abnormally high request volume.
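Idle instances are configured under automatic_scaling in app.yaml; a sketch with an illustrative value:

```yaml
automatic_scaling:
  min_idle_instances: 2   # keep two warm instances ready to absorb traffic spikes
```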

Instance scaling

When you upload a version of a service, the app.yaml specifies a scaling type and instance class that apply to every instance of that version. The scaling type controls how instances are created. The instance class determines compute resources (memory size and CPU speed) and pricing. There are three scaling types: manual, basic, and automatic. The available instance classes depend on the scaling type.

Manual scaling
A service with manual scaling uses resident instances, continuously running the specified number of instances regardless of load. This allows for tasks such as complex initialization, and for applications that rely on in-memory state over time.
Automatic scaling
Auto scaling services use dynamic instances that are created based on request rate, response latencies, and other application metrics. However, if you specify a minimum number of idle instances, that specified number of instances run as resident instances while any additional instances are dynamic.
Basic scaling
A service with basic scaling uses dynamic instances. Each instance is created when the application receives a request, and turned down when the app becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.
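As a sketch, the three types are selected in app.yaml roughly as follows. Only one scaling block may appear in a given version's configuration; the values are illustrative:

```yaml
# Manual scaling: a fixed number of resident instances.
manual_scaling:
  instances: 5

# -- or --

# Basic scaling: dynamic instances, capped and evicted when idle.
basic_scaling:
  max_instances: 10
  idle_timeout: 10m

# -- or --

# Automatic scaling: dynamic instances driven by load metrics.
automatic_scaling:
  max_concurrent_requests: 20
```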

The following compares the performance features of the three scaling types:

Deadlines
  • Automatic scaling: 60-second deadline for HTTP requests; 10-minute deadline for task queue tasks.
  • Manual scaling: Requests can run for up to 24 hours. A manually scaled instance can choose to handle /_ah/start and execute a program or script for many hours without returning an HTTP response code. Task queue tasks can run for up to 24 hours.
  • Basic scaling: Same as manual scaling.

Residence
  • Automatic scaling: Instances are evicted from memory based on usage patterns.
  • Manual scaling: Instances remain in memory, and state is preserved across requests. When instances are restarted, an /_ah/stop request appears in the logs. If there is a registered shutdown hook, it has 30 seconds to complete before shutdown occurs.
  • Basic scaling: Instances are evicted based on the idle_timeout parameter. If an instance has not received a request for more than idle_timeout, it is evicted.

Startup and shutdown
  • Automatic scaling: Instances are created on demand to handle requests and automatically turned down when idle.
  • Manual scaling: Instances are sent a start request automatically by App Engine in the form of an empty GET request to /_ah/start. An instance that is manually stopped has 30 seconds to finish handling requests before it is forcibly terminated.
  • Basic scaling: Instances are created on demand to handle requests and automatically turned down when idle, based on the idle_timeout configuration parameter. As with manual scaling, a manually stopped instance has 30 seconds to finish handling requests before it is forcibly terminated.

Instance addressability
  • Automatic scaling: Instances are anonymous.
  • Manual scaling: Instance "i" of version "v" of service "s" is addressable at its own URL. If you have set up a wildcard subdomain mapping for a custom domain, you can also address a service or any of its instances by URL. You can reliably cache state in each instance and retrieve it in subsequent requests.
  • Basic scaling: Same as manual scaling.

Scaling
  • Automatic scaling: App Engine scales the number of instances automatically in response to processing volume, factoring in the automatic_scaling settings that are provided on a per-version basis in the configuration file.
  • Manual scaling: You configure the number of instances of each version in that service's configuration file. The number of instances usually corresponds to the size of a dataset being held in memory or the desired throughput for offline work.
  • Basic scaling: You configure the maximum number of instances in the max_instances parameter of the basic_scaling setting. The number of live instances scales with the processing volume.

Free daily usage quota
  • Automatic scaling: 28 instance-hours
  • Manual scaling: 8 instance-hours
  • Basic scaling: 8 instance-hours

Instance life cycle

Instance states

An instance of an auto-scaled service is always running. However, an instance of a manually or basic scaled service can be either running or stopped. All instances of the same service and version share the same state. You can change the state of your instances by starting or stopping your versions, either by using the Versions page in the GCP Console, the gcloud app versions start and gcloud app versions stop commands, or the Modules package.
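For example, assuming a service named default with a version named v1, the state of its instances can be toggled from the command line:

```shell
# Stop all instances of a manual or basic scaling version
gcloud app versions stop v1 --service=default

# Start them again
gcloud app versions start v1 --service=default
```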


Startup

Each service instance is created in response to a start request, which is an empty HTTP GET request to /_ah/start. App Engine sends this request to bring an instance into existence; users cannot send a request to /_ah/start. Manual and basic scaling instances must respond to the start request before they can handle another request. The start request can be used for two purposes:

  • To start a program that runs indefinitely, without accepting further requests.
  • To initialize an instance before it receives additional traffic.

Manual, basic, and automatically scaled instances start up differently. When you start a manual scaling instance, App Engine immediately sends a /_ah/start request to each instance. When you start an instance of a basic scaling service, App Engine allows it to accept traffic, but the /_ah/start request is not sent to an instance until it receives its first user request. Multiple basic scaling instances are started only as necessary to handle increased traffic. Automatically scaled instances do not receive any /_ah/start request.

When an instance responds to the /_ah/start request with an HTTP status code of 200–299 or 404, it is considered to have successfully started and can handle additional requests. Otherwise, App Engine terminates the instance. Manual scaling instances are restarted immediately, while basic scaling instances are restarted only when needed for serving traffic.


Shutdown

The shutdown process might be triggered by a variety of planned and unplanned events, such as:

  • You manually stop an instance.
  • You deploy an updated version to the service.
  • The instance exceeds the maximum memory for its configured instance_class.
  • Your application runs out of Instance Hours quota.
  • Your instance is moved to a different machine, either because the current machine that is running the instance is restarted, or App Engine moved your instance to improve load distribution.

Loading requests

When App Engine creates a new instance for your application, the instance must first load any libraries and resources required to handle the request. This happens during the first request to the instance, called a loading request. During a loading request, your application is initialized, which causes the request to take longer.

The following best practices allow you to reduce the duration of loading requests:

  • Load only the code needed for startup.
  • Access the disk as little as possible.
  • In some cases, loading code from a zip or jar file is faster than loading from many separate files.

Warmup requests

Warmup requests are a specific type of loading request that load application code into an instance ahead of time, before any live requests are made. Manual or basic scaling instances do not receive an /_ah/warmup request.
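Warmup requests are enabled in app.yaml by listing the warmup inbound service; your app then handles GET requests to /_ah/warmup:

```yaml
inbound_services:
- warmup
```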

To learn more about how to use warmup requests, see Configuring warmup requests.

Instance uptime

App Engine attempts to keep manual and basic scaling instances running indefinitely. However, at this time there is no guaranteed uptime for manual and basic scaling instances. Hardware and software failures that cause early termination or frequent restarts can occur without prior warning and can take considerable time to resolve; thus, you should construct your application in a way that tolerates these failures.

Here are some good strategies for avoiding downtime due to instance restarts:

  • Reduce the amount of time it takes for your instances to restart or for new ones to start.
  • For long-running computations, periodically create checkpoints so that you can resume from that state.
  • Your app should be "stateless" so that nothing is stored on the instance.
  • Use queues for performing asynchronous task execution.
  • If you use manual scaling:
    • Use load balancing across multiple instances.
    • Configure more instances than required to handle normal traffic.
    • Write fall-back logic that uses cached results when a manual scaling instance is unavailable.

Instance billing

In general, instances are charged per minute for their uptime, in addition to a 15-minute startup fee (see Pricing for details). You are billed for idle instances only up to the number of maximum idle instances that you set for each service. Runtime overhead is counted against the instance memory; this overhead is higher for Java applications than for Python.

Billing is slightly different in resident and dynamic instances:

  • For resident instances, billing ends fifteen minutes after the instance is shut down.
  • For dynamic instances, billing ends fifteen minutes after the last request has finished processing.