Updated: February 2017
App Engine is a scalable system that automatically adds capacity as workloads increase.
Here are some best practices to ensure that your app will scale to high load.
Run load tests
If you are expecting high traffic, we strongly recommend that you run load tests to reduce the risk of hitting a bottleneck in your code or in App Engine.
Your tests should be designed to simulate real world traffic as closely as possible. You should test at the maximum load that you expect to encounter. In order to accurately assess demand, you can run a 1% test to determine how much traffic will flow to your new service.
In addition, some applications will get a sudden increase in load, and you will need to predict the rate of increase. If you are expecting spiky load, you should also test how your application performs when traffic suddenly increases.
We also recommend that you test under various scenarios. For example, what happens if your application moves to a new datacenter, causing a Memcache flush and all instances to be stopped and restarted? In the Google Cloud Platform Console, you can flush Memcache and stop all instances to simulate this. See Netflix Chaos Monkey for a description of this type of testing.
In addition to measuring performance during load tests, you should also measure the increase in costs as the number of users of your service increases.
You should allow sufficient time in your schedule to resolve problems that might arise in load testing.
Do not set a spending limit that could be exceeded
You can configure a daily spending limit for your application if you are paying online. You can view the spending limit in the GCP Console. Your application will serve errors if your spending limit is exceeded, so ensure that your limit is sufficient to handle the maximum possible daily usage. Do not wait until the last minute to increase your spending limit, because Google needs approval from your credit card issuer. If there are problems getting this approval, your spending limit increase will be delayed.
Ensure that you will not hit quota limits on API calls
Some API calls have per-minute and per-day quota limits in order to prevent a single application from using up more than its share of available resources. In the GCP Console, you can view your quota details for all API calls. You will get a quota denied error if you exceed quota limits.
APIs that are not yet generally available are usually subject to strict quota limits. You can visit the App Engine features page to find out which APIs are generally available, and which are still in preview or experimental stages.
In addition, some APIs have relatively strict quotas despite being generally available. APIs in this category include URL Fetch and Sockets. If you are using these APIs, you should pay particular attention to quota limits.
The per-minute quota limits are not shown in the GCP Console. Your application will not hit a per-minute quota unless you have an unexpected usage pattern. Load testing can be used to determine whether you would hit a per-minute quota limit.
Shard task queues if high throughput is needed
You can shard task queues if you want higher throughput than is possible with a single queue. You can use the same principles described in the sharding counters article to do this. Your application should not depend on tasks executing immediately after creation; that way, it can absorb a short-term backlog in the task queues without affecting the experience of your end users.
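As a minimal sketch of the idea, the helpers below choose one of several queue shards for each new task. The queue names are hypothetical and would each need to be defined in your queue configuration; the returned name would then be used as the queue name when the task is added. Random selection spreads load evenly, while key-based selection keeps related tasks on one queue.

```python
import random
import zlib

# Hypothetical shard names; each one must exist as a real queue.
QUEUE_SHARDS = ["work-0", "work-1", "work-2", "work-3"]


def pick_queue_randomly(shards=QUEUE_SHARDS):
    """Spread tasks evenly when it does not matter which queue runs them."""
    return random.choice(shards)


def pick_queue_for_key(key, shards=QUEUE_SHARDS):
    """Map a key to a stable shard so related tasks share one queue."""
    # crc32 is stable across processes, unlike the built-in hash() for str.
    index = zlib.crc32(key.encode("utf-8")) % len(shards)
    return shards[index]
```

Adding shards raises the total throughput available, but tasks on different shards run independently, so this only suits work that does not rely on ordering across queues.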
Use the default performance settings unless you have tested the impact of changes
We recommend that you use the default automatic scaling settings for max idle instances and min/max pending latency unless you have done load testing with other settings to verify their effects. The default performance settings will, in most cases, enable the lowest possible latency. The usual trade-off for low latency is higher cost, because additional idle instances are kept available to handle temporary spikes in load.
You should set min_idle_instances if you want to minimize latency, particularly if you expect sudden spikes in traffic. The number of idle instances that are needed will depend on your traffic and it is best to do load tests to determine the optimal number.
You should use the default value for max_concurrent_requests. Increasing this value might cause a performance penalty that can manifest as requests waiting longer for API calls to return. You should run load tests to determine impact before making changes.
Use traffic migration or splitting when switching to a new default version
A high-traffic application might get errors or higher latency when you switch to a new version in either of the following ways:
- Completing the update of a new default version
- Setting the default version
Once the update is complete, App Engine will send requests to the new version. However, the new version can take some time to spin up enough instances to handle all traffic. During this period, requests can sit in the pending queue and might time out.
An application can serve requests from both versions while you are moving traffic to the new version. In most cases, this will not cause any problems. However, if you have an incompatibility in the cached objects used by an application then you will need to ensure that users go to the same version of an application during their session. You will need to code this into your application logic.
Do not exceed Memcache rated operations per second
You can use Dedicated Memcache in order to get guaranteed capacity and more consistent performance. You must ensure that you do not exceed the rated operations per second.
The rated operations per second of Dedicated Memcache is disproportionately lower for items larger than 1 KB. One strategy for reducing item size is to compress large values before storing in Memcache.
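The compression strategy can be sketched as a pair of helpers that wrap whatever cache client you use. This is an illustration only: the 1 KB threshold, the one-byte flag prefix, and the use of JSON for serialization are assumptions, not part of the Memcache API.

```python
import json
import zlib

# Assumption: only compress payloads larger than 1 KB, since small items
# gain little and compression costs CPU time.
COMPRESS_THRESHOLD_BYTES = 1024


def encode_for_cache(value):
    """Serialize a JSON-compatible value, compressing it when it is large."""
    raw = json.dumps(value).encode("utf-8")
    if len(raw) > COMPRESS_THRESHOLD_BYTES:
        return b"z" + zlib.compress(raw)  # "z" marks a compressed payload
    return b"r" + raw  # "r" marks a raw payload


def decode_from_cache(blob):
    """Reverse encode_for_cache using the one-byte flag prefix."""
    flag, payload = blob[:1], blob[1:]
    if flag == b"z":
        payload = zlib.decompress(payload)
    return json.loads(payload.decode("utf-8"))
```

The encoded bytes would be stored under the Memcache key and decoded after each read. Compression trades instance CPU for smaller items, so it is worth measuring both sides in a load test.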
Note that the Memcache graphs in the App Engine Admin Console dashboard show average operations per second aggregated over a time period. As a result, these graphs might not show very short-term spikes in usage, which could impact performance.
If your load is unevenly distributed across the Memcache keyspace then you might not see the expected performance from Dedicated Memcache.
Avoid Memcache hot keys
Hot keys are a common anti-pattern that can cause Memcache capacity to be exceeded.
For Dedicated Memcache, we recommend that the peak access rate on a single key be 1 to 2 orders of magnitude lower than the per-GB rating. For example, the rating for 1 KB items is 10,000 operations per second per GB of Dedicated Memcache. Therefore, the load on a single key should not exceed 100 to 1,000 operations per second for items that are 1 KB in size.
Memcache hashes keys so that keys that are lexicographically close will not cause hotspots.
In load tests, you might see better performance. However, you should design your code so that it complies with the published ratings because the performance of Dedicated Memcache can change.
The GCP Console displays Memcache hot keys.
Here are some strategies for reducing the operations per second on frequently-used keys:
- Organize data per user so that a single HTTP request only hits that user's data. Avoid storing global data which must be accessed from all HTTP requests.
- Shard your frequently-used keys to keep under the per-key guideline. You should create enough keys so that none appear in the Memcache Viewer's list of top keys. Even if all of these keys are on the same Memcache backend, it will generally not cause a hot key issue.
- Cache frequently used keys in your instances' global memory instead of in Memcache. Note that you lose the consistency of Memcache when you store data in instance memory, because one instance might have different data than another.
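The last two strategies can be sketched as follows. The helper names, the `base:index` key format, and the default per-key limit of 100 operations per second (the conservative end of the guideline above) are all assumptions made for illustration. With sharded keys, the value is written to every shard key and each read picks one at random.

```python
import math
import random
import time


def shards_needed(peak_ops_per_sec, per_key_limit=100):
    """Number of key copies so that each copy stays within per_key_limit."""
    return max(1, int(math.ceil(peak_ops_per_sec / float(per_key_limit))))


def shard_keys(base_key, num_shards):
    """All shard keys for a value; a write must update every one of them."""
    return ["%s:%d" % (base_key, i) for i in range(num_shards)]


def read_key(base_key, num_shards):
    """Pick one shard at random for each read to spread the load."""
    return "%s:%d" % (base_key, random.randrange(num_shards))


class InstanceCache(object):
    """Tiny per-instance cache whose entries expire after a time-to-live."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)
```

A short time-to-live bounds how long one instance can serve data that differs from another instance, which is the consistency cost described in the last bullet.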
Test third-party dependencies
If you depend on a system outside App Engine for handling requests, you should ensure that this system has been tested to handle high load. For example, if you are using URL Fetch to get data from a third-party web server, you can determine the impact of various load testing scenarios on the third-party web server's throughput and latency.
Implement backoff on retry
Your code can retry on failure, whether calling a service such as Cloud Datastore or an external service using URL Fetch or the Socket API. However, your client code should protect against situations in which the remote service is overloaded. A high rate of retries can cause the service to recover more slowly.
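One common way to keep retries from overwhelming a struggling service is exponential backoff with random jitter. The sketch below is illustrative: `TransientError` is a hypothetical exception standing in for whatever retryable errors your client raises, and the delay values are arbitrary defaults rather than recommended settings.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on TransientError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait a random fraction of the ceiling so that many
            # clients do not all retry at the same moment.
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting; production callers would leave them at their defaults.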
Google recommends an approach in which the client caps the amount of outgoing traffic if it detects that a significant proportion of requests are failing with server-side errors. This approach, the Adaptive Throttling algorithm, is fully described in the SRE book. Google's gRPC client libraries implement a variation of the Adaptive Throttling algorithm.
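The sketch below illustrates client-side throttling using the rejection probability given in the SRE book, max(0, (requests - K * accepts) / (requests + 1)), computed over a recent time window. The class name, the default K of 2, and the two-minute window are assumptions for illustration; this is not the gRPC implementation.

```python
import collections
import random
import time


class AdaptiveThrottle(object):
    """Drops requests locally when the backend is rejecting too many."""

    def __init__(self, k=2.0, window_seconds=120.0,
                 clock=time.time, rng=random.random):
        self._k = k
        self._window = window_seconds
        self._clock = clock
        self._rng = rng
        self._events = collections.deque()  # (timestamp, accepted) pairs

    def _trim(self):
        """Forget events that are older than the window."""
        cutoff = self._clock() - self._window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def reject_probability(self):
        self._trim()
        requests = len(self._events)
        accepts = sum(1 for _, accepted in self._events if accepted)
        return max(0.0, (requests - self._k * accepts) / (requests + 1.0))

    def record(self, accepted):
        """Record the outcome of one request attempt."""
        self._events.append((self._clock(), accepted))

    def should_send(self):
        """Return False when the request should be dropped locally."""
        if self._rng() < self.reject_probability():
            # Local rejections still count as attempted requests.
            self.record(accepted=False)
            return False
        return True
```

A caller would check `should_send()` before each request and call `record(True)` or `record(False)` once the backend responds. A larger K makes the client less aggressive about dropping requests.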
You should also implement backoff on retry on automated clients that call your App Engine application.
Understand how costs change as usage of your service increases
During the initial design of a new service, developers can choose to trade off the efficiency of their code for reduced complexity in order to build features more quickly. However, this trade-off might lead to significantly higher costs if the service gets a significant increase in usage.
Below are some examples of ways in which you can reduce costs by making changes in your application's design. This list is by no means exhaustive, but it reflects the experiences of some App Engine customers who have had significant increases in traffic.
Inefficient use of APIs
App Engine's backend services, such as Memcache and Cloud Datastore, will generally scale even if you use these APIs inefficiently. However, inefficient use reduces performance and increases costs, and the additional costs can be significant for some applications. The tips below can also apply when calling a third-party API from your application.
- Use asynchronous API calls, when available, so that your application can do other work while waiting for the call to return.
- Set an API call deadline, when possible, so that your application is not blocked indefinitely if there is a spike in latency.
- Avoid no-op writes to storage systems, such as Memcache or Datastore. These writes increase latency and, in the case of Datastore, also increase costs. Refactor your code if it performs them.
- Consider using batch API calls, when available, that can offer improvements in performance. You should ensure that you eliminate duplicate calls from a batch, in order to avoid unnecessary work.
- Set a meaningful task name to avoid creating duplicate tasks. Task names are unique per queue.
- Avoid unnecessary indexed fields for your Datastore kinds, because each one adds cost. It might be more cost-efficient to run a MapReduce job to create your own index than to update the indexed fields on each write.
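Two of the tips above, removing duplicates from a batch and using meaningful task names, can be sketched in plain Python. The function names and the name format are hypothetical; the sketch assumes that the `prefix` argument contains only letters, digits, hyphens, or underscores, and it hashes the remaining parts so that the result stays within that character set.

```python
import hashlib


def dedupe_preserving_order(keys):
    """Drop repeated keys from a batch while keeping first-seen order."""
    seen = set()
    unique = []
    for key in keys:
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def task_name_for(prefix, entity_id, period):
    """Build a deterministic task name for one unit of work.

    The same (prefix, entity_id, period) always yields the same name, so a
    second attempt to enqueue the same work reuses the name of the first.
    """
    material = "%s|%s|%s" % (prefix, entity_id, period)
    digest = hashlib.sha1(material.encode("utf-8")).hexdigest()
    return "%s-%s" % (prefix, digest[:16])
```

For example, `task_name_for("digest", 42, "2017-02-01")` produces the same name every time it is called for that user and day, so a queue that enforces unique names would reject the duplicate instead of running the work twice.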
You can implement polling in various scenarios. For example:
- Mobile clients poll your application for updates.
- Your application makes API calls to third-party sites to get updated information on each user in your system.
In many cases, the polling requests consume resources unnecessarily, because there has been no change to the user's data. You can poll less frequently in order to reduce costs, but that can lead to reduced freshness. Instead, you can use a different method to push updates to clients, such as WebSockets. However, WebSockets cannot be implemented on App Engine, so this approach would require some additional complexity in your implementation.
Using backend services outside App Engine
Suppose that you have a small corpus that fits in a single machine's memory. You can use Cloud Datastore for this use case: it is quick to get started with App Engine and Datastore, and it will scale to high IOPS. However, if you have very high IOPS, it might be less expensive to run your own replicated cache on a quorum of machines.
Adding caching layers will improve your application's performance, thereby reducing costs. An App Engine application can cache data in an instance's global memory and in Memcache.
In some cases, you will benefit from more aggressive caching strategies. For example, you can cache data on your mobile clients, which then update your application asynchronously.
The design of an application can depend on caching working correctly in order to scale. Your load tests should include a scenario in which you simulate a failure of the caching layer to verify that your application will continue to scale in this situation.
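As a rough sketch of a read path that keeps working when a caching layer fails, the function below consults an ordered list of caches, treats any cache error as a miss, and falls back to the source of truth. The cache objects, `DictCache`, and `BrokenCache` are hypothetical stand-ins, not App Engine classes; `BrokenCache` shows one way to simulate a cache failure in a test.

```python
def read_through(key, caches, load_from_source):
    """Return the value for key, trying each cache before the source.

    caches is an ordered list of objects with get(key) and set(key, value),
    for example instance memory first and a shared cache second.
    """
    missed = []
    for cache in caches:
        try:
            value = cache.get(key)
        except Exception:  # sketch only: real code should catch narrower errors
            value = None  # a failing cache is treated as a miss
        if value is not None:
            break
        missed.append(cache)
    else:
        # No cache had the value, so load it from the source of truth.
        value = load_from_source(key)
    for cache in missed:
        try:
            cache.set(key, value)  # backfill the layers that missed
        except Exception:
            pass  # a broken cache must not fail the request
    return value


class DictCache(object):
    """In-memory stand-in for a working cache."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class BrokenCache(object):
    """Stand-in that simulates a failed caching layer."""

    def get(self, key):
        raise RuntimeError("cache unavailable")

    def set(self, key, value):
        raise RuntimeError("cache unavailable")
```

Running a load test with every layer replaced by something like `BrokenCache` shows how much traffic reaches the source when no cache is available.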
Follow best practices for scaling Cloud Datastore
See the Datastore best practices article for tips on ensuring that your use of Cloud Datastore will scale.
Separate services for user-facing and batch traffic
You should use separate services for user-facing and batch traffic. App Engine autoscales based on recent traffic patterns. A sudden burst of traffic from a batch job may cause user-facing requests to be queued while waiting for the autoscaler to kick in. Task queue operations have lower priority in the pending request queue, but it is still good practice to ensure user-facing traffic is not mixed with batch traffic in the same service.