Farewell to overprovisioning: How to unlock cost-effective elasticity with Spanner

August 12, 2024
Szabolcs Rozsnyai

Senior Staff Solutions Architect, Spanner

Karthi Thyagarajan

Senior Staff Solutions Architect, Spanner

When it comes to databases, many enterprises face the following dilemma: How do you scale cost-effectively, while maintaining service quality, latency, and throughput?

This is especially difficult because enterprise workloads are often subject to highly variable traffic patterns. For instance, traffic to e-commerce or retail banking applications fluctuates, often peaking during the day and slowing down significantly at night, as illustrated in the diagram below.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_DB-OVERPROVISIONING.max-1700x1700.png

This variability requires database infrastructure that can handle peak loads without compromising on performance. The conventional approach to handling these peaks is to overprovision resources, i.e., CPU, RAM, disk I/O, and storage.

The downside to overprovisioning is that most of the resources remain idle during non-peak hours, yet you still pay for them. In contrast to stateless application tiers, databases are by nature stateful, so right-sizing database resources to track actual utilization is typically not an option. In addition, when compute and storage are tightly coupled, resizing typically incurs downtime and can affect availability or, at the very least, service quality (degraded latency, for instance). So to meet their SLAs, database services need to prioritize consistent latency and throughput, despite the substantial waste of resources that this entails.

Going back to our e-commerce example, most online retailers understand that losing traffic to frustrated users is a bigger hit to revenue than database overprovisioning. Still, depending on anticipated peak loads, it’s not uncommon to see average resource utilization that hovers around 20%, meaning that 80% of the allocated resources are wasted!
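The arithmetic behind that figure can be sketched in a few lines of Python. The function name and the hourly load numbers below are illustrative assumptions, not measurements from a real workload; they are simply chosen so that the average load works out to 20% of a capacity sized for the forecast peak.

```python
def idle_fraction(hourly_load, provisioned_capacity):
    """Return the average fraction of provisioned capacity that sits unused."""
    if provisioned_capacity <= 0:
        raise ValueError("provisioned_capacity must be positive")
    average_load = sum(hourly_load) / len(hourly_load)
    utilization = average_load / provisioned_capacity
    return 1.0 - utilization


# Hypothetical day: 18 quiet hours at 10 units of load, 6 busy hours at 50,
# with capacity sized for a forecast peak of 100 units.
load = [10] * 18 + [50] * 6
print(f"idle: {idle_fraction(load, 100):.0%}")
```

Note that in this made-up profile even the busiest hour only reaches half of the provisioned capacity, which is typical when capacity is sized for a forecast rather than for observed demand.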

How Spanner helps reduce waste associated with overprovisioning

Unlike traditional database systems, Spanner decouples compute and storage resources. Spanner’s autoscaler takes advantage of this separation to elastically adjust compute resources, all with zero downtime and a negligible impact on latency. This type of dynamic scaling, which adjusts compute capacity to align with workload demands, helps ensure optimal performance without extra costs.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_SPANNER-PEAK-SMOOTHING.max-1600x1600.png

Spanner’s autoscaler scales the node count out and in to closely track traffic patterns without compromising median latency, and it normalizes p90+ spikes within a very short timeframe.
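To make the idea of tracking traffic concrete, here is a minimal sketch of a generic target-tracking rule, the kind of calculation many autoscalers use. It is not Spanner’s actual autoscaling algorithm: the function name, the 65% CPU target, and the node limits are all assumptions chosen for illustration.

```python
def desired_nodes(current_nodes, cpu_percent, target_percent=65,
                  min_nodes=1, max_nodes=100):
    """Size the fleet so that CPU utilization lands at or below the target.

    All inputs are integers, so the ceiling division below is exact.
    """
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    total_demand = current_nodes * cpu_percent
    # Integer ceiling of total_demand / target_percent.
    raw = (total_demand + target_percent - 1) // target_percent
    # Clamp to the configured (hypothetical) limits.
    return max(min_nodes, min(max_nodes, raw))


print(desired_nodes(4, 90))   # daytime peak: scale out
print(desired_nodes(10, 20))  # overnight lull: scale in
```

With these assumed values, four nodes running at 90% CPU would grow to six, and ten nodes idling at 20% would shrink to four. A production autoscaler would typically add cooldown periods and step limits on top of a rule like this to avoid oscillation.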

In addition to the compute resources required to serve database traffic, storage is another resource that demands careful consideration in traditional database systems. Spanner makes things simpler with a “pay as you go” storage model that removes the need to pre-provision and pay for disk storage and I/O in advance of demand. In addition, Spanner’s time-to-live (TTL) feature can help you purge data that you no longer need to keep in your “hot” OLTP database, helping to ensure that storage utilization, and in turn cost, closely tracks your business needs.
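As a rough illustration of what configuring TTL looks like, the sketch below builds a GoogleSQL row deletion policy statement in Python. The helper name, the `Orders` table, and the `CreatedAt` column are hypothetical; the statement follows the `ROW DELETION POLICY` form used for Spanner TTL, but check it against the current Spanner documentation before applying it to a real schema.

```python
import re

# Accept only simple identifiers so the generated DDL cannot be malformed.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def ttl_policy_ddl(table, timestamp_column, days):
    """Build a statement that attaches a row deletion policy to a table."""
    for name in (table, timestamp_column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    if not isinstance(days, int) or days < 0:
        raise ValueError("days must be a non-negative integer")
    return (
        f"ALTER TABLE {table} ADD ROW DELETION POLICY "
        f"(OLDER_THAN({timestamp_column}, INTERVAL {days} DAY))"
    )


# Hypothetical table and column: delete orders older than 90 days.
print(ttl_policy_ddl("Orders", "CreatedAt", 90))
```

The resulting string would be submitted as a schema update through whichever Spanner client or console you already use; the helper itself makes no database calls.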

In short, Spanner removes the need for overprovisioning by elastically expanding and contracting compute and storage resources. But you may be wondering: what about sudden spikes in demand — some of which may be an order of magnitude greater than the baseline? In the next section, we describe how Spanner’s scale-out architecture plus pre-warming can provide guardrails against sudden high-amplitude spikes in demand.

Let’s talk about peak events

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_PEAK-EVENTS.max-1100x1100.png

Once again, we turn to the retail space for some relatable examples of peak events: promotional periods such as Cyber Monday and Black Friday. These situations call for meticulous planning and coordination to ensure that exceptional spikes in demand can be handled without compromising service delivery or overspending on idle resources.

Even with meticulous planning, peak events can bring demand spikes that go beyond the levels that already overprovisioned capacity can support. Of course, overprovisioning resources year-round for peak events would be prohibitively expensive. As a result, organizations facing predictable peak events have processes in place to deploy capacity ahead of time.

What does that process entail in a traditional database system? The diagram below illustrates the high-level steps required.

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_PEAK-PLANNING.max-2000x2000.png

Steps leading up to peak events:

  • Step 1: Plan and coordinate expected events. This requires close collaboration with business teams (e.g., marketing) to ensure accurate forecasting of demand, allowing IT teams to prepare accordingly.

  • Step 2: Temporarily suspend services. Some guardrails (e.g., enabling read replicas) may be put in place to allow for partial operations. 

  • Step 3: Provision more CPU and RAM. This step involves increasing computing power and memory to handle the expected surge.

Post-peak steps:

  • Step 4: Take service offline again.

  • Step 5: Deprovision excess CPU and RAM to revert to the daily business capacity.

  • Step 6: Restart the system.

Of course, there are ways to minimize downtime: some organizations opt to provision mirrored database instances and re-route traffic before and after the peak event in question.

However, with or without downtime, the overall approach outlined above is a lot of work, and incurs additional expenses and risks:

  • High cost due to significant overprovisioning of resources in anticipation of peaks that may never arrive.

  • On the flip side, inaccurate forecasts that can lead to underprovisioning and subsequent performance degradation, which in turn can impact revenue and/or frustrate customers.

  • Unanticipated spikes, for instance those caused by viral events, that aren’t even in scope and could potentially lead to full-blown service outages.

Spanner: the insurance policy

https://storage.googleapis.com/gweb-cloudblog-publish/images/5_SPANNER-INSURANCE.max-1800x1800.png

In contrast, Spanner, with its elastic compute model combined with autoscaling, can gracefully absorb significant spikes in traffic without compromising service health or user experience. In most cases, Spanner instances don’t require large-scale launch coordination or capacity planning. And even when those are warranted, users don’t have to execute intrusive upgrade procedures.

Spanner has a built-in mechanism to automatically shard your data and appropriately scale underlying resources so that service quality is not degraded. However, if your Spanner instance has been in a steady state for a significant period without seeing traffic spikes, a warm-up procedure might be required to avoid performance degradation. This is especially relevant for new service launches that expect significant traffic spikes. To learn more about pre-warming Spanner, please see: How Spanner makes application launches easier with warmup and benchmarking tools.

Conclusion

By separating compute from storage and utilizing autoscaling, Spanner dynamically adjusts to fluctuating traffic patterns. It scales out to meet high-demand peaks and scales back in to prevent waste, ensuring consistent performance while optimizing costs. This makes Spanner well-suited to handling unexpected surges and planned events, helping businesses deliver reliable services without overspending on idle resources.

Want to learn more about what makes Spanner unique and how it’s being used today? Try it yourself for free for 90 days or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive re-architecture.
