Optimizing for long-term cost management in fully managed applications

A couple of weeks ago, we discussed practical steps you can take to keep costs in check for applications running on Google Cloud’s fully managed serverless compute platforms: App Engine, Cloud Run, and Cloud Functions. But beyond immediate actions like setting max instances and budget alerts, there are several general optimizations that increase the overall efficiency of your fully managed serverless applications and, as a result, reduce your overall bill: optimizing your CPU and RAM usage, improving cold-start times, and implementing local caching. Read on for more.

Optimize CPU and RAM

CPU usage is typically the most significant contributor to the number of instances required by your application. The more CPU each request consumes, the more instances you need and the higher your bill. Additionally, in runtimes that use garbage collection, excessive RAM (heap) usage can lead to frequent garbage collection, which can also consume large amounts of CPU.

Another important aspect of CPU usage is "CPU stranding." This happens when an instance cannot accept any more requests because it has reached its concurrency limit (possibly as a result of a low concurrency setting), so more instances are created even though the CPU on existing instances is underutilized. It can also happen as a result of I/O bottlenecks, which cause back pressure on requests. CPU stranding is usually characterized by low overall CPU utilization, and there are a few techniques you can use to mitigate it:

  1. Increase concurrency if it has been manually set. By default, fully managed services in GCP try to optimize concurrency; however, if you've set it to a low number, you may want to review that setting to see if it's still optimal. At this time, Cloud Functions does not provide a multi-concurrency mode, so if the other techniques listed here are not sufficient, consider moving functions with low CPU utilization to Cloud Run. To help with this, we recently published a range of open source frameworks and buildpacks to build and deploy functions on Cloud Run.

  2. Offload slow I/O operations. In some cases it may be possible to process slower I/O operations asynchronously using a queue system like Pub/Sub or Cloud Tasks. This can improve the overall throughput of your service, at the cost of higher latency for the offloaded tasks. A good example is sending an email. If that operation is slow, the request does not need to block on it; instead, you can send the email asynchronously with a slight delay.

  3. Avoid slow I/O operations. As discussed below, implementing caching can also remove slow I/O operations from the critical path of a request.

  4. Improve I/O operation performance. It may not be obvious that a slow I/O operation is contributing to low CPU utilization, so adding instrumentation around these I/O calls using custom metrics can show which calls are worth optimizing.
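
As a rough illustration of the instrumentation idea in the last item, the sketch below times an I/O call and keeps the samples in memory. The names timed, ioTimings, fetchProfile, and handleRequest are hypothetical, the slow dependency is simulated with a timer, and exporting the samples to Cloud Monitoring as custom metrics is deliberately left out.

```javascript
// In-memory store of observed durations (milliseconds), keyed by operation name.
// Exporting these samples as custom metrics is not shown in this sketch.
const ioTimings = new Map();

// Wraps an async I/O call and records how long it took, even if it throws.
async function timed(name, operation) {
  const start = process.hrtime.bigint();
  try {
    return await operation();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    const samples = ioTimings.get(name) || [];
    samples.push(elapsedMs);
    ioTimings.set(name, samples);
  }
}

// Hypothetical slow dependency, simulated with a 20 ms timer.
const fetchProfile = (id) =>
  new Promise((resolve) => setTimeout(() => resolve({ id }), 20));

// Wrap the call site so every request contributes a timing sample.
async function handleRequest(id) {
  return timed('fetchProfile', () => fetchProfile(id));
}
```

Comparing these timings with the instance's CPU utilization can help show whether requests are mostly waiting on I/O rather than computing.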

Optimize cold starts

A "cold start" occurs when a request to your application cannot be served by an existing instance and a new instance must be created. This happens if there are zero instances of your service (known as 0-to-1 scaling) but also during traffic spikes where more instances are needed to serve the load (known as 1-to-N scaling).

Reducing the overall time spent on a cold start not only reduces the cost associated with the cold start itself, but can also reduce the overall number of cold starts you need to perform. That’s because new instances become viable more quickly, so you need fewer instances overall, particularly during periods of rapid increases in traffic. The first step to improving cold-start time is to observe cold-start times when running locally (in your development environment). Although the absolute times may not be identical, the portion of overall runtime startup that your code consumes (as distinct from the language runtime itself) is a good indicator of cold-start overhead.

Here are some common cold-start pitfalls when designing an application for a serverless environment.

Excessive CPU usage, or slow operations at startup
Any code that executes in "global scope" (i.e., when the runtime is loaded) affects the cold-start time. If you have any CPU-intensive or long-running tasks in global scope when the application starts up (for example heavyweight calculations related to loading global state, or loading state from a slow database), this can negatively impact your overall app performance and cost. There are a few strategies that can be taken to mitigate this:

  • Where possible, reduce the CPU-intensive or slow tasks that run in global scope (this is not always possible).

  • Try to break up the task into smaller operations that can then be "lazy loaded" in the context of a request.

Note that when code is executing in global scope, it will often delay overall instance startup. Even for operations which are I/O bound, where the CPU is largely idle, the instance will not report as being "available" until this operation is complete. This means it will not be able to handle any requests, even if it is configured for concurrency. When there are spikes in traffic, this can lead to a larger than necessary number of new instances being created.

// Hypothetical placeholder for expensive startup work
const initializeHeavyThings = () => { /* heavy computation or slow I/O */ };

// This is in "global scope" and will execute for every cold start
initializeHeavyThings();

exports.helloHeavyWorld = (data, context) => {
  if (data.something === true) {
    // This might be the only place you need "heavy things"
    // but also may happen infrequently
    // consider initializeHeavyThings() here instead
    console.log(`I need heavy things :(`);
  } else {
    console.log(`I don't need heavy things :)`);
  }
};

Process crashes like out-of-memory errors
An unexpected failure in your application that causes the whole runtime (the root process) to exit abnormally (crash) typically results in a companion cold-start, replacing the failed instance with a new one. While automatically recreating instances improves the overall reliability of your application, it can also obscure the failures themselves (you may not realize instances are crashing). This can often surface as intermittent or unexpected spikes in latency. We recommend you monitor Cloud Logging for messages related to process crashes and configure alerts for logs that indicate a crash. You can do this by creating logs-based metrics, then using Cloud Monitoring to create alerts based on those metrics.

Excessive or unused dependencies
While it's idiomatic in many languages to declare library dependencies in global scope (e.g. require or import statements in Node.js), this can sometimes have an adverse impact on cold-start times. In particular, loading library dependencies in interpreted languages, or loading native libraries, can consume large amounts of CPU and cause large numbers of file system operations, both of which can negatively impact startup performance.

Check that the dependencies you're loading are all being used. Then, if it’s appropriate, move specific dependency initialization into "request scope" (that is, loaded during a request rather than at startup) for relevant code paths rather than loading the dependencies globally for all paths. A heavyweight dependency loaded in global scope slows every cold start, even if it is only used in a small subset of requests.
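
As a minimal sketch of moving a dependency into request scope in Node.js, the example below calls require inside the one code path that needs it. The built-in zlib module is only a stand-in for a genuinely heavyweight library, and handleRequest, compressReport, and the /export path are hypothetical.

```javascript
// 'zlib' is a Node.js built-in used here only as a stand-in for a heavyweight
// dependency that a small subset of requests needs.
function compressReport(text) {
  // require() caches modules, so the load cost is paid on the first call only.
  const zlib = require('zlib');
  return zlib.gzipSync(Buffer.from(text, 'utf8'));
}

// Hypothetical handler: only the "/export" path touches the dependency.
function handleRequest(request) {
  if (request.path === '/export') {
    return compressReport(request.body);
  }
  return 'ok';
}
```

Instances that never serve the export path never load the module. The first export request on each instance pays the load time instead of the cold start.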

Implement local caching

Implementing basic local caching for certain pieces of data can produce dramatic improvements in cost and performance. Even high-speed solutions like Memcache and Redis can introduce substantial overhead if accessed frequently. In these cases, keeping a local in-memory copy (perhaps with a simple time-to-live expiry policy) of frequently generated or loaded data can sometimes deliver impressive improvements. 
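
A minimal sketch of such an in-memory copy with a time-to-live policy might look like the following. createTtlCache, leaderboardCache, and getLeaderboard are hypothetical names, and the injectable clock is an assumption made so that expiry can be tested without waiting.

```javascript
// Minimal in-memory cache with a time-to-live. The `now` parameter is
// injectable so expiry can be exercised without waiting.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { value, expiresAt }

  return {
    // Returns the cached value, or calls load() and caches its result.
    get(key, load) {
      const entry = entries.get(key);
      if (entry !== undefined && entry.expiresAt > now()) {
        return entry.value;
      }
      const value = load();
      entries.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
  };
}

// Hypothetical usage: keep a leaderboard for 10 seconds per instance.
const leaderboardCache = createTtlCache(10000);
function getLeaderboard(loadFromDatabase) {
  return leaderboardCache.get('leaderboard', loadFromDatabase);
}
```

Each instance holds its own copy, so different instances may briefly serve different values. If load returns a promise, the promise itself is cached, so a more complete version would likely evict the entry when the promise rejects.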

To find good candidates for local caching, you generally want to investigate situations where the same operation happens frequently. Pieces of data with the following characteristics are commonly associated with a caching opportunity:

  • Frequently loaded from an external system or database

  • Rarely or never change 

  • Don’t need to be up-to-the-second

  • Idempotent outbound operations that are frequently repeated (e.g., updates to a single database entry)

Here are some examples of operations that can be cached:

  • Fetching image URLs that go on every page of a web app

  • Game leaderboards where a 10-second delay is acceptable

  • Computing the average score of a set of players on each page load

  • Computing the average price of all the items in inventory at a grocery store

  • A fleet of workers generates batch entries and a (duplicate) Cloud Tasks task to signal launching the batch; caching the task name can cut down on the number of duplicate requests sent.

Measuring the effectiveness of your changes

It may be difficult to know whether a change you've made results in an overall improvement, or whether it made things worse. Google Cloud's serverless products are instrumented out of the box with Cloud Monitoring, allowing you to evaluate the impact of your changes. Additionally, custom Cloud Monitoring metrics can provide deeper insight into the behavior of your application. 

You may not always want to roll out a change to all your users before knowing whether it is an overall improvement. To make rollouts more reliable, we recommend splitting incoming traffic between a control and an experiment group. In Cloud Run, deploy your changes to a new revision (your experiment) and use traffic management to send a portion (e.g. 50%) of the traffic to it, keeping the remainder on the control group. Then monitor these two revisions in Cloud Monitoring's metrics explorer to determine if your change has a real positive impact. Once you confirm that the change improves things without an adverse impact on other measures, you can migrate 100% of the traffic to the revision that contains your change. You can also do a similar traffic split between App Engine versions.

To give an example, consider the case of an in-game leaderboard. As a user-facing feature, delays in updating the leaderboard on the order of a minute are acceptable. In this case, implementing a cache with a timeout reduces request latency and possibly CPU. This would make it possible for one instance to serve more requests, resulting in a lower bill. A custom metric for cache hits/misses would allow you to understand the behavior, and the standard metrics will give you visibility into the effect of the local cache. Deploying the cache-enabled version of the leaderboard with 50% traffic and comparing to the previous version gives you an apples-to-apples comparison, allowing you to evaluate the effectiveness of the cache change. 
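
The hit/miss counting in this example could be sketched roughly as follows. createCacheStats, stats, and getLeaderboard are hypothetical names, the one-minute timeout mirrors the delay described above, and the step that exports the snapshot (for example, as a Cloud Monitoring custom metric) is left out.

```javascript
// Counts cache lookups so a hit ratio can be reported per revision.
// Exporting the snapshot as a custom metric is not shown in this sketch.
function createCacheStats() {
  let hits = 0;
  let misses = 0;
  return {
    recordHit() { hits += 1; },
    recordMiss() { misses += 1; },
    snapshot() {
      const total = hits + misses;
      return { hits, misses, hitRatio: total === 0 ? 0 : hits / total };
    },
  };
}

// Hypothetical leaderboard lookup with a one-minute timeout.
const stats = createCacheStats();
let cached = null; // { value, expiresAt }
function getLeaderboard(loadLeaderboard, now = Date.now()) {
  if (cached !== null && cached.expiresAt > now) {
    stats.recordHit();
    return cached.value;
  }
  stats.recordMiss();
  cached = { value: loadLeaderboard(), expiresAt: now + 60000 };
  return cached.value;
}
```

A low hit ratio on the experiment revision would suggest the timeout is too short, or traffic per instance too low, for the cache to pay off.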

Playing the serverless long game

Google Cloud's serverless compute products choose reasonable default values for configuration, and automatically scale and load balance your applications, but every workload is different. Using the above techniques, you can tune and optimize your application to get the best possible price and performance. For more insight and information about architecting your next application on the fully managed Google Cloud platform, subscribe to the serverless blog channel.

A special thanks to Nicholas Hanssens, Software Engineer; Steren Giannini, Product Manager; Ben Marks, Software Engineer; and Matt Larkin, Product Manager, for their contributions to this blog post.