Google Cloud Platform

Developing supportability for a public cloud

Supportability-01.jpg

The Google Cloud technical support team resolves customer support cases. We also spend a portion of our time improving the supportability of Google Cloud services so that we can solve your cases faster and also so that you have fewer cases in the first place. The challenges of improving supportability for the large, complex, fast-changing distributed system that underpins Google Cloud products have led us to develop several tools and best practices. Many challenges remain to be solved, of course, but we’ll share some of our progress in this post.

Defining supportability. The term “supportability” is defined by Wikipedia as a synonym for serviceability: it’s the speed with which a problem in a product can be fixed. But we wanted to go further and redefine supportability in a way that encompasses the whole of the customer technical support experience, not just how quickly support cases can be resolved. 

Measuring supportability. As we set out, we wanted an objective way to measure supportability in order to evaluate our performance, like the SLOs used by our colleagues in site reliability engineering (SRE), to measure reliability. 

To do this, we initially relied on transactional surveys of customer satisfaction. These can give us good signals in cases where we're exceeding customer expectations, or failing. But these surveys do not give us a good overall picture of our support quality. We have recently started making more use of customer effort score, a metric gleaned from customer surveys that helps show the effort required by customers to fix their problems. Research shows that effort score correlates well with what customers actually want from support: a low-friction way of getting their problems resolved.

Inline-01.jpg

But this only considers customer effort, so it would incentivize us to just throw people or other resources at the problem, or even to push effort onto the Google Cloud product engineering teams. So we needed to include overall effort, leading to this way to measure supportability:

Effort by customer, support and product teams to resolve customer support cases.

One thing to note is that higher effort means lower supportability, but we find it more intuitive to measure effort than lack of effort.

We currently use various metrics to measure the total effort, the main ones being:

  • Customer effort score: customer perception of effort required to fix their problems

  • Total resolution time: time from case open to case close

  • Contact rate: cases created per user of the product

  • Bug rate: Proportion of cases escalated to the product engineering team

  • Consult rate: Proportion of cases escalated to a product specialist on the support team

With some assumptions, we can normalize these metrics to make them comparable between products, then set targets.

Supportability challenges. Troubleshooting problems in a large distributed system is considerably more challenging than it is for monolithic systems, for some key reasons:

  • The production environment is constantly changing. Each product has many components, all of which have regular rollouts of new releases. In addition, each of these components may have multiple dependencies with their own rollout schedules.
  • Customers are developers who may be running their code on our platform, if they are using a product like App Engine. We do not have visibility into the customer's code and the scope of failure scenarios is much larger than it is for a product that presents a well-defined API.
  • The host and network are both virtualized, so traditional troubleshooting tools like ping and traceroute are not effective.
  • If you are supporting a monolithic system, you may be able to look up an error message in a knowledge base, then find potential solutions. Error messages in a distributed system may not be easy to find due to an architecture that uses high RPC (remote procedure call) fanout. In addition, the high scale in a large public cloud involving millions of operations per second for some APIs can make it hard to find relevant errors in the logs.

Building a supportability practice
As our team has evolved, we’ve created some practices that help lead to better support outcomes and would like to share some of them with you.

Launch reviews. We have launched more than 100 products in the past few years, and each product has a steady stream of feature releases, resulting in multiple feature launches per day. Over these years, we’ve developed a system of communications among the teams involved. For each product, we assign a supportability program manager and a support engineer, known as the product engagement lead (PEL), to interface with each product engineering team and approve every launch of a significant customer-facing feature. Like SREs with their product readiness reviews, we follow a launch checklist that verifies we have the right support resources and processes in place for each stage of a product's lifecycle: alpha, beta and generally available. Some critical checklist items include: internal knowledge base, training for support engineers, ensuring that bug triage processes meet our internal SLAs, access to troubleshooting tools, and configuring our case tracking tools to collect relevant reporting data. We also review deprecations to ensure that customers have an acceptable migration path and we have a plan to ensure that they are properly notified.

Added educational tools. Our supportability efforts also focus on helping customers avoid the need to create a case. With one suite of products, more than 75% of support cases were "how to" questions. Engineers on our technical support team designed a system to point customers to relevant documentation as they were creating a case. This helped customers self-solve their issues, which is much less effort than creating a case. The same system helps us identify gaps in the documentation. We used A/B testing to measure the amount of case deflection and carefully monitored customer satisfaction to ensure that we did not cause frustration by making it harder for customers to create cases.

Some cases can be solved without human intervention. For example, we found that customers creating P1 cases for one particular product often were experiencing outages caused by exceeding quotas. We built an automated process for checking incoming cases, and then handling them without human intervention for this and other types of known issues. Our robot case handler scores among the highest in the team in terms of satisfaction in transactional surveys.

To help customers write more reliable applications, members of the support team helped found the Customer Reliability Engineering (CRE) team, which teaches customers the principles used by Google SREs. CREs provides "shared fate," in which Google pagers go off when a customer's application experiences an incident.

Supportability at scale. One way to deal with complexity is for support engineers to specialize in handling as small a set of products as possible so that they can quickly ramp up their expertise. Sharding by product is a trade-off between coverage and expertise. Our support engineers may specialize in one or two products with high case volume, and multiple products with lower volume. As our case volume grows, we expect to be able to have narrower specializations. 

We maintain architecture diagrams for each product, so that our support engineers understand how the product is implemented. This knowledge helps them to identify the specific component that has failed and contact the SRE team responsible for that part of the product. We also maintain a set of playbooks for each product. Prescriptive playbooks provide steps to follow in a well-known process, such as a quota increase. These playbooks are potential candidates for automation. Diagnostic playbooks are troubleshooting steps for a category of problem, for example, if a customer's App Engine application is slow. We try to have coverage for the most commonly occurring set of customer issues in our diagnostic playbooks. The Checklist Manifesto does a great job of describing the benefits of this type of playbook. 

We have found it particularly useful to focus on cases that take a long time to resolve. We hold weekly meetings for each product to review long-running cases. We are able to identify patterns that cause cases to take a long time, and we then try to come up with improvements in processes, training or documentation to prevent these problems. 

The future of supportability. Our supportability practices in Google Cloud were initially started by our program management team in an effort to introduce more rigor and measurement when evaluating the quality, cost and scalability of our support. As this practice evolves, we are now working on defining engineering principles and best practices. We see parallels with the SRE role, which emerged at Google because our systems were too large and complex to be managed reliably and cost-effectively with traditional system administration techniques. So SREs developed a new set of engineering practices around reliability. Similarly, our technical solutions engineers on the support team use their case-handling experience to drive supportability improvements. We continually look for ways to use our engineering skills and operational experience to build tools and systems to improve supportability. 

The growth in the cloud keeps us on our toes with new challenges. We know that we need to find innovative ways to deliver high-quality support at scale. It is an exciting time to be working on supportability and there are huge opportunities for us to have meaningful impact on our customers' experience. We are currently expanding our team.

Lilli Mulvaney, head of supportability programs, Google Cloud Platform, also contributed to this blog post.