
Cloud cost optimization: principles for lasting success

May 14, 2020

Justin Lerma

Professional Services, Technical Account Manager

Pathik Sharma

Cloud FinOps Lead, delta, Google Cloud Consulting

We’ve been working side by side with some complex customers as they usher in the next generation of applications and services on Google Cloud. When it comes to optimizing costs, there are lots of tools and techniques that organizations can use. But tools can only take you so far. In our experience, there are several high-level principles that organizations, no matter the size, can follow to make sure they’re getting the most out of the cloud.

In this blog post, we’ll take a look at some of these concepts, so you can effectively right-size your deployments. We’ll also consider the three kinds of cloud cost optimization tools, and provide a framework for how to prioritize cost optimization projects. Finally, if you want more, including prescriptive advice about optimizing compute, networking, storage and data analytics costs on Google Cloud, we’ve regrouped some of our most popular blog posts on the topic into an all-in-one downloadable ebook, “Understanding the principles of cost optimization.”

Cost optimization with people and processes

As with most things in technology, the greatest standards are only as good as how well they are followed. The limiting factor, more often than not, isn’t the capability of the technology, but the people and processes involved. Executive teams, project leads, finance, and site reliability engineers (SREs) all come into play when it comes to cost optimization. As a first step, these key stakeholders should meet to design a set of standards for the company that outline desired service-level profitability, reliability, and performance. We highly recommend establishing a tiger team to kickstart this initiative.

Using cloud’s enhanced cost visibility
A key benefit of a cloud environment is the enhanced visibility into your utilization data. Each cloud service is tracked and can be measured independently. This can be a double-edged sword: now you have tens of thousands of SKUs and if you don't know who is buying what services and why, then it becomes difficult to understand the total cost of ownership (TCO) for the application(s) or service(s) deployed in the cloud.

This is a common problem when customers make the initial shift from an on-premises capital expenditures (CapEx) model to cloud-based operational expenditures (OpEx). In the old days, a central finance team set a static budget and then procured the needed resources. Forecasting was based on a metric such as historic growth to determine the needs for the next month, quarter, year, or even multiple years. No purchase was made until everyone had the opportunity to meet and weigh in across the company on whether or not it was needed.

Now, in an OpEx environment, an engineering team can spin up resources as desired to optimally run their services. We see that for many cloud customers, it’s often something of a Wild West, where engineering spins up resources without standardized guardrails such as budgets and alerts, appropriate resource labeling, and a regular cadence for reviewing costs from both an engineering and a finance perspective. While that empowers velocity, it’s not a good starting position from which to design a cost-to-value equation for a service (essentially, what the service costs relative to the value it generates), much less optimize spending. We see customers struggling to identify the cost of development vs. production projects in their environments due to a lack of standardized labeling practices. In other cases, we see engineers over-provisioning instances to avoid performance issues, only to see considerable overhead during non-peak times. This leads to wasted resources in the long run. Creating company-wide standards for what types of resources are available and when to deploy them is paramount to optimizing your cloud costs.

We’ve seen this dynamic many times, and it's unfortunate that one of the most desirable features of the cloud—elasticity—is sometimes perceived as an issue. When there is an unexpected spike in a bill, some customers might see the increase in cost as worrisome. Unless you attribute the cost to business metrics such as transactions processed or number of users served, you really are missing context to interpret your cloud bill. For many customers, it's easier to see that costs are rising and attribute that increase to a specific business owner or group, but they don’t have enough context to give a specific recommendation to the project owner. The team could be spending more money because they are serving more customers—a good thing. Conversely, costs may be rising because someone forgot to shut down an unneeded high-CPU VM running over the weekend—and it’s pushing unnecessary traffic to Australia.
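To make that distinction concrete, here is a minimal Python sketch that compares the change in total cost with the change in cost per unit of business output. The function names, the monthly figures, and the choice of transactions as the unit are all illustrative assumptions, not data from a real bill.

```python
def unit_cost(total_cost: float, units: int) -> float:
    """Cost per unit of business output (for example, per transaction)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


# Hypothetical figures: (total cost in USD, transactions processed).
last_month = (10_000.0, 2_000_000)
this_month = (14_000.0, 3_500_000)

cost_change = pct_change(last_month[0], this_month[0])
unit_change = pct_change(unit_cost(*last_month), unit_cost(*this_month))

print(f"Total cost change: {cost_change:+.1f}%")
print(f"Cost per transaction change: {unit_change:+.1f}%")
```

In this made-up case the bill rises by roughly 40% while the cost per transaction falls by roughly 20%, which points to growth rather than waste. A rising unit cost would be the signal to go looking for something like that forgotten VM.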

One way to fix this problem is to organize and structure your costs in relation to your business needs. Then, you can drill down into the services using Cloud Billing reports to get an at-a-glance view of your costs. You can also get more granular cost views of your environment by attributing costs back to departments or teams using labels, and by building your own custom dashboards. This approach allows you to label a resource based on a predefined business metric, then track its spend over time. Longer term, the goal isn’t to understand that you spent “$X on Compute Engine last month,” but that “it costs $X to serve customers who bring in $Y revenue.” This is the type of analysis you should strive to create.
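As a rough illustration of label-based attribution, the following Python sketch sums hypothetical billing line items by a label key and sets each team’s cost next to the revenue it supports. The line items, label keys, and revenue figures are invented for the example; in practice the input would come from your own billing export.

```python
from collections import defaultdict

# Hypothetical billing line items: (service, labels, cost in USD).
line_items = [
    ("Compute Engine", {"team": "checkout", "env": "prod"}, 1200.0),
    ("Cloud Storage", {"team": "checkout", "env": "prod"}, 300.0),
    ("Compute Engine", {"team": "search", "env": "dev"}, 450.0),
    ("BigQuery", {}, 250.0),  # unlabeled spend
]


def cost_by_label(items, key):
    """Sum cost per value of one label key; unlabeled spend stays visible."""
    totals = defaultdict(float)
    for _service, labels, cost in items:
        totals[labels.get(key, "(unlabeled)")] += cost
    return dict(totals)


team_costs = cost_by_label(line_items, "team")

# Hypothetical revenue attributed to each team's service, in USD.
revenue = {"checkout": 30_000.0, "search": 9_000.0}

for team, cost in sorted(team_costs.items()):
    if team in revenue:
        print(f"{team}: ${cost:,.2f} to support ${revenue[team]:,.2f} revenue")
    else:
        print(f"{team}: ${cost:,.2f} (no revenue mapping)")
```

Keeping an explicit "(unlabeled)" bucket is a deliberate choice here: it shows how much spend still cannot be attributed to anyone.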

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Billing_Reports_in_the_Google_Cloud.max-2000x2000.jpg

Billing Reports in the Google Cloud console let you explore granular cost details

One of the main features of the cloud is that it allows you to expedite feature velocity for faster time to market, and this elasticity is what lets you deploy workloads in a matter of minutes as opposed to waiting months in the traditional on-premises environment. You may not know how fast your business will actually grow, so establishing a cost visibility model up front is essential. And once you go beyond simple cost-per-service metrics, you can start to measure new business metrics like profitability as a performance metric per project.

Understanding value vs. cost
The goal of building a complex cloud system isn’t merely to cut costs. Take your fitness goals as an analogy. When attempting to become more fit, many people fixate on losing weight. But losing weight isn’t always a great key indicator in and of itself. You can lose weight as an outcome of being sick or dehydrated. When we aim for an indicator like weight loss, what we actually care about is our overall fitness or how we look and feel when being active, like the ability to play with your kids, live a long life, dance—that sort of thing. Similarly, in the world of cost optimization, it's not about just cutting costs. It's about identifying waste and ensuring you are maximizing the value of every dollar spent.

Our most sophisticated customers aren’t fixated on a specific cost-cutting number. Instead, they ask a variety of questions to get at their overall operational fitness:

What are we actually providing for our customers (unit)?
How much does it cost us to provide that thing and only that thing?
How can we optimize all correlated spend per unit created?

In short, they have gone ahead and created their own unit economics model. They ask these questions up front, and then work to build a system that enables them to answer these key questions as well as audit their behavior. This is not something we typically see in customers at the earliest (“crawl”) stage of cloud maturity, but many of those at the next (“walk”) stage are employing some of these concepts as they design their system for the future.

Implementing standardized processes from the get-go
Ensuring that you are implementing these recommendations consistently is something that must be designed and enforced systematically. Automation tools like Terraform and Cloud Deployment Manager can help create guardrails before you deploy a cloud resource. It is much more difficult to implement a standard retroactively. We have seen everything from IT Ops shutting off or threatening to shut off untagged resources to establishing “walls of shame” for people who didn’t adhere to standards. (We’re fans of positive reinforcement, such as a pizza, or a trophy, or even a pizza trophy.)

What’s an example of an optimization process that you might want to standardize early on? Deploying resources, for one. Should every engineer really be able to deploy any amount of any resource? Probably not. We see this as an area where creating a standard up front can make a big difference.

Structuring your resources for effective cost management is important too. It’s best to adopt the simplest structure that satisfies your initial requirements, then adjust your resource hierarchy as your requirements evolve. You can use the setup wizard to guide you through recommendations and steps to create your optimal environment. Within this resource hierarchy, you can use projects, folders, and labels to help create logical groupings of resources that support your management and cost attribution requirements.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_resource_hierarchy_for_cloud.max-1900x1900.jpg

Example of a resource hierarchy for cloud

In your resource hierarchy, labeling resources is a top priority for organizations interested in managing costs. This is essentially your ability to attribute costs back to a specific business, service, unit, leader, etc. Without labeling resources, it’s incredibly difficult to decipher how much it costs you to do any specific thing. Rather than saying you spent $36,000 on Compute Engine, it’s preferable to be able to say you spent $36,000 to deliver memes to 400,000 users last month. The second statement is much more insightful than the first. We highly recommend creating standardized labels together with the engineering and finance teams, and using labels for as many resources as you can.
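One way to keep a labeling standard honest is to check resources against it automatically. The Python sketch below is a minimal example of such a check. The required label keys, the sample resources, and the simplified value pattern are all assumptions made for illustration, so verify the actual label rules in your provider’s documentation before relying on anything like it.

```python
import re

# Hypothetical company standard: every resource carries these label keys.
REQUIRED_LABELS = {"team", "env", "cost-center"}

# Simplified, assumed pattern: starts with a lowercase letter, then lowercase
# letters, digits, hyphens, or underscores, up to 63 characters in total.
LABEL_VALUE = re.compile(r"^[a-z][a-z0-9_-]{0,62}$")


def label_problems(labels: dict) -> list:
    """Return human-readable problems; an empty list means compliant."""
    present = set(labels)
    problems = [f"missing label: {key}" for key in sorted(REQUIRED_LABELS - present)]
    for key in sorted(REQUIRED_LABELS & present):
        if not LABEL_VALUE.match(labels[key]):
            problems.append(f"invalid value for {key}: {labels[key]!r}")
    return problems


# Hypothetical resources.
good = {"team": "memes", "env": "prod", "cost-center": "cc-1042"}
bad = {"team": "Memes Team", "env": "prod"}

for name, labels in (("good", good), ("bad", bad)):
    print(name, label_problems(labels) or "compliant")
```

A check like this could run in a deployment pipeline so that non-compliant resources are caught before they are created, rather than chased down afterward.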

Review and repeat for best results
As a general practice, you should meet regularly with the appropriate teams to review usage trends and also adjust forecasting as necessary. The Cloud Billing console makes it easy to review and audit your cloud spend on a regular basis, while custom dashboards provide more granular cost views. Without regular reviews and appropriate unit economics, as well as visibility into your spend, it’s hard to move beyond being reactive when you observe a spike in your bill.

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Google_Data_Studio.max-2000x2000.jpg

If you’re a stable customer, you can review your spending less frequently, as the opportunities to tweak your strategies will be reliant on items like new Google Cloud features vs. a business change on your product roadmap. But if you’re deploying many new applications and spending millions of dollars per month, a small investment in conducting more frequent cost reviews can lead to big savings in a short amount of time. In some cases, our more advanced customers meet and adjust forecasts as often as every day. When you’re spending millions of dollars a month, even a small percentage shift in your overall bill can take money away from things like experimenting with new technologies or hiring additional engineers.

Operating efficiently and maximizing the value of the cloud takes multiple teams with various backgrounds working together to design a system catered to your specific business needs. One best practice is to establish a review cadence based on how fast you are building and spending in the cloud. The Iron Triangle is a commonly used framework that weighs cost vs. speed vs. quality. You can work with your teams to set up an agreed-upon framework that works for your business. From there, you can either tighten your belt or invest more.

The tools of the cost optimization trade

Once you have a firm grasp on how to approach cost optimization in the cloud, it’s time to think about the various tools at your disposal. At a high level, cost management on Google Cloud relies on three broad kinds of tools.

Cost visibility: this includes knowing what you spend in detail, how specific services are billed, and the ability to display how (or why) you spent a specific amount to achieve a business outcome. Here, keep in mind key capabilities such as the ability to create shared accountability, hold frequent cost reviews, analyze trends, and visualize the impact of your actions in near real time. Using a standardized strategy for organizing your resources, you can accurately map your costs to your organization's operational structure to create a showback/chargeback model. You can also use cost controls like budget alerts and quotas to keep your costs in check over time.
Resource usage optimization: this means reducing waste in your environment by optimizing usage. The goal is to implement a specific set of standards that strikes an appropriate balance between cost and performance within an environment. This is the lens to look through when reviewing whether there are idle resources, better services on which to deploy an app, or even whether launching a custom VM shape might be more appropriate. Most companies that are successful at avoiding waste are optimizing resource usage in a decentralized fashion, as individual application owners are usually the best equipped to shut down or resize resources due to their intimate familiarity with the workloads. In addition, you can use Recommender to help detect issues like under- or over-provisioned VM instances or idle resources. Enabling your team to surface these recommendations automatically is the aim of any great optimization effort.
Pricing efficiency: this includes capabilities such as sustained use discounts, committed use discounts, flat-rate pricing, per-second billing or other volume discounting features that allow you to optimize rates for a specific service. These capabilities are best leveraged by more centralized teams within your company, such as a Cloud Center of Excellence (CCoE) or FinOps team that can lower the potential for waste while optimizing coverage across all business units. This is something to review both before you migrate to the cloud and regularly once you go live.
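To illustrate the budget-alert idea from the cost visibility category, here is a small Python sketch of threshold-based alerting. It only models the logic and does not call any Google Cloud API. The function name, the threshold fractions, and the dollar amounts are illustrative assumptions.

```python
def crossed_thresholds(budget: float, spend: float, thresholds=(0.5, 0.9, 1.0)):
    """Return the fractions of the budget that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in sorted(thresholds) if spend >= budget * t]


# Hypothetical monthly budget and month-to-date spend, in USD.
alerts = crossed_thresholds(budget=20_000.0, spend=18_500.0)
for t in alerts:
    print(f"Alert: spend has reached {t:.0%} of budget")
```

Several thresholds are more useful than a single one at 100%, because the earlier alerts leave time to react before the budget is actually exceeded.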

Considering both people and processes will go a long way toward making sure your standards are useful and aligned to what your business needs. Similarly, understanding Google Cloud’s cost visibility, resource usage optimization, and pricing efficiency features will give you the tools you need to optimize costs across all your technologies and teams.

How to prioritize recommendations
With lots of competing initiatives, it can be difficult to prioritize cost optimization recommendations and ensure your organization is making the time to review these efforts consistently. Having visibility into the amount of engineering effort as well as potential cost savings can help your team establish its priorities. Some customers focus solely on innovation and speed of migration for years on end, and over time their bad optimization habits compound, leading to substantial waste. These funds could have gone towards developing new features, purchasing additional infrastructure, or hiring more engineers to improve their feature development velocity. It’s important to find a balance between cost and velocity and understand the ramifications of leaning too far in one direction over another.

To help you prioritize one cost optimization recommendation over another, it’s a good idea to tag recommendations with an estimate of two characteristics:

Effort: Estimated level of work (in weeks) required to coordinate the resources and implement a cost optimization recommendation.
Savings: Amount of estimated potential savings (in percentage per service) that you may realize by implementing a cost optimization recommendation.

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_How_to_prioritize_recommendations.max-700x700.jpg

While it's not always possible to estimate with pinpoint accuracy how much a cost savings measure will save you before testing, it's important to try to make an educated guess for each effort. For instance, knowing that a certain change could potentially save you 60% on your Cloud Storage for project X should be enough to place it in the prioritization matrix and establish engineering priorities with your team. Sometimes you can estimate actual savings. Especially with purchasing options, a FinOps team can estimate the potential savings from features like committed use discounts applied to a specific amount of their infrastructure. The point of this exercise is to let the team make informed decisions about where engineering effort goes, and to make that kind of focus part of the team's culture.
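The effort and savings tags can be turned into a simple ranking. The Python sketch below orders a hypothetical backlog by estimated monthly savings per week of effort. Converting the percentage savings into dollars by multiplying by each service’s monthly spend is our own assumption for making recommendations comparable across services, and every name and number in the backlog is an illustrative guess.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    name: str
    effort_weeks: float           # estimated effort, in weeks
    savings_pct: float            # estimated savings, as % of the service's spend
    monthly_service_spend: float  # current monthly spend on that service, in USD

    @property
    def monthly_savings(self) -> float:
        return self.monthly_service_spend * self.savings_pct / 100.0

    @property
    def savings_per_week_of_effort(self) -> float:
        return self.monthly_savings / self.effort_weeks


# Hypothetical backlog; every number here is an illustrative guess.
backlog = [
    Recommendation("Move cold data to a cheaper storage class", 2, 60, 5_000),
    Recommendation("Right-size over-provisioned VMs", 4, 25, 40_000),
    Recommendation("Delete idle dev instances", 1, 10, 8_000),
]

ranked = sorted(backlog, key=lambda r: r.savings_per_week_of_effort, reverse=True)
for r in ranked:
    print(f"{r.name}: ~${r.monthly_savings:,.0f}/month for {r.effort_weeks} week(s)")
```

Note that the 60% storage saving does not come out on top here: a smaller percentage on a much larger bill can be worth more, which is why it helps to look at effort and savings together rather than at either one alone.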

From principles to practice

Optimizing cloud costs isn’t a checklist, it’s a mindset; you’ll have the best results if you think strategically and establish strong processes to help you stay on track. But there are also lots of service-specific steps you can take to get your bill under control. For more tactical advice, check out these posts on how to save on your Google Cloud compute, storage, networking, data analytics, and serverless applications. Or, for a handy reference, download our “Understanding the principles of cost optimization” ebook, which regroups several of these topics in one place.
