Google Cloud

Troubleshooting tips: How to talk so your cloud provider will listen (and understand)

May 31, 2018

Luke Stone

Technical Solutions Engineer, Google Cloud Platform

Jian Ma

Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and as a teaser, this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part one of two. Then, check out the second installment specifically on troubleshooting cloud provider communications.

Effective technology troubleshooting requires a systematic approach, as opposed to luck or experience. Troubleshooting can be a learned skill, as discussed in our site reliability engineering (SRE) troubleshooting primer.

But how does that change when you and your operations team are running systems and services on cloud infrastructure? Regardless of where your websites or apps live, you’re the one getting paged when the site goes down and you are the one under pressure to solve the problems and answer questions.

Cloud presents a new way of working for IT teams shifting away from legacy systems. You had full visibility and control of all aspects of your system when it was on-premises, but now you depend on off-site, cloud-based infrastructure, into which you may have limited visibility or understanding. That’s true no matter how top-notch your systems and processes are and how great a troubleshooter you are. You simply can’t see much into many cloud provider systems. The troubleshooting challenge is no longer limited to pure debugging; you also need to effectively communicate with your cloud provider. You’ll need to engage each provider’s support process and engineers to find where problems originated and fix them as soon as possible. This is an area of opportunity for IT teams as they gain new skills and adapt to the cloud-based technology model.

It’s definitely possible to do cloud troubleshooting right. When you’re choosing a provider, make sure to understand how they’ll work with you during any issues. Once you’re working with a provider, you have more control than you think over how you communicate issues and speed along their resolution. We propose this model, inspired by the SRE model of actionable improvements, for working with your cloud provider to troubleshoot more effectively and efficiently. (For more on cloud provider communications throughout the troubleshooting process, see the companion post.)

Understand your cloud provider's support workflow

Your goal here is to figure out the best way to provide the right information to your provider when an issue inevitably arises. This is a useful place to start, especially since you may depend on multiple cloud providers to run your infrastructure. Your interaction with cloud support will typically begin with questions related to migration. Your relationship then progresses into the domain of production integration, and finally, joint troubleshooting of production problems.

Keep in mind that different cloud providers have different philosophies when it comes to customer interaction. Some provide a large degree of freedom and little direct support. They expect you to find answers to most of your questions from online forums such as Stack Overflow. Other providers emphasize tight customer integration and joint troubleshooting. You have some homework to do before you begin serving real customer traffic from your cloud deployment. Talk to your cloud provider's salespeople, project managers and support engineers, to get a sense of how they approach support. Ask them the following questions:

What does the lifecycle of a typical support issue report look like?
What is your internal escalation process if an issue becomes complex or critical?
Do you have an internal SLO for <service name>? If so, what is that SLO?
What types of premium support are available?

This step is critical in reducing your frustration when you have to troubleshoot an issue that involves a cloud provider’s giant black box.

Communicate with cloud provider support efficiently

Once you have a sense of the provider’s support workflow, you can figure out the best way to get your information across. There are some best practices for filing the perfect issue report with cloud provider support teams, including what to say in your issue report and why. These guidelines follow the 80/20 rule: we try to give 20% of the details that will be useful in 80% of your issue reports. The same principles apply if you're filing bug reports in issue trackers or posting to user groups and forums.

Your guiding principle when communicating to cloud providers should be clarity: specify the appropriate level of technical detail and communicate expectations explicitly.

Provide basic information

It may seem like common sense, but these basics are essential to include in an issue report. Failing to provide any of these details leads to delays and a poor experience.

Include four critical details

Effective troubleshooting starts with time, product, location and specific identifiers.
1. Time
Here are a few examples of including time in an issue report:

Starting at 2017-09-08 15:13 PDT and ending 5 minutes later, we observed...
Observed intermittently, starting no earlier than 2017-09-10 and observed 2-5 times...
Ongoing since 2017-09-08 15:13 PDT...
From 2017-09-08 15:13 PDT to 2017-09-08 22:22 PDT...

Including the onset time and duration allow support teams to focus their time-series monitoring on the relevant period. Be explicit about whether the issue is ongoing or whether it was observed only in the past. If the issue is not ongoing, be explicit about that fact and provide an end time, if possible.

Remember to always include the time zone. ISO 8601 format is a good choice because it is unambiguous and easy to sort. If you instead specify time in a relative way, whomever you're working with must convert local time into an absolute format they can input into time-series monitoring tools. This conversion is error-prone and costly: for example, sending an email about something that happened "earlier yesterday" means that your counterpart has to search for an email header and start doing mental math. This introduces cognitive load, which decreases the person’s mental energy available for solving the technical problem.

If an issue was intermittent over some period of time, state when it was first observed, or perhaps note a time in the past when it was surely not happening. Include the frequency of observations and note one or two specific examples.

Meanwhile, here are some antipatterns to avoid:

Earlier today: Not specific enough
Yesterday: Requires the recipient to figure out the implied date; can be confusing especially when work crosses the International Date Line
9/8: Ambiguous, as the date might be interpreted as September in United States, or August in other locales. Use ISO 8601 format for clarity.

2. Product
Be as specific as possible in your issue report about the product you're using, including version information where applicable. These, for example, aren’t specific enough to locate the components or logs that can help with diagnosis:

REST API returned errors...
The data mining query interface is hanging...

Ideally, you should refer to specific APIs or URLs, or include screenshots. If the issue originates in a specific component or tool (for example, the CLI or Terraform), clearly identify that tool. If multiple products are involved, be specific about each one.Describe the behavior you're observing, and the behavior you expected to occur.

Antipatterns:

Can't create virtual machine: It's not clear how you're attempting to create those machine, nor does it say what the failure mode is.
The CLI command is giving an error:

Instead, provide the specific error, and the command syntax so others can run the command themselves.
Better: I ran 'mktool create my-instance --zone us-central1' and got the following error message...

3. Location
It's important to specify the region and zone because cloud providers often roll out changes to one region or zone at a time. Therefore, region or zone is a proxy for a cloud-internal software version number.
These are examples of how you might include location information in an issue report:

In us-east1-a...
I tried regions eu-west-1 and eu-west-3...

Given this information, the support team can see if there's a rollout underway in a given location, or map your issue to an internal release ID for use in an internal bug report.

4. Specific identifiers
Project identifiers are included in many troubleshooting tools. Specify whether you observed the error in multiple projects, or in one project but not another.

These are examples of specific identifiers:

In project 123412341234 or my-project-id...
Across multiple projects (including 123412341234)...
Connecting to cloud external IP 218.239.8.9 from our corporate gateway 56.56.56.56...

IP addresses are another form of unambiguous identifiers, though they also require additional details when used in an issue report. When specifying an IP, try to describe the context of how it's used. For example, specify whether the IP is connected to a VM instance, a load balancer or a custom route, or if it's an API endpoint. If the IP address isn't part of the cloud platform (for example, your home internet, a VPN endpoint or an external monitoring system), specify that information. Remember, 192.168.0.1 occurs many times in the world.

Antipatterns:

One of our instances is unreachable…: Overly vague
We can't connect from the Internet...: Overly vague

Note that other models, such as the Five Ws (popular in fields like journalism) can provide structure to your report.

Specify impact and response expectations

Basic information to include in your issue report to a cloud provider should also include how it’s affecting your business and when it needs to be resolved.

Priority expectations
Cloud provider support commonly uses the priority you specify to initially route the issue report and to determine the urgency of the issue. The priority rating drives the speed of response, potentially paging oncall personnel.

In addition to selecting the appropriate priority, it's useful to add a sentence describing the impact. Help avoid incorrect assumptions by being explicit about why you selected P1.

Think of the priority in terms of the impact to your business. Your cloud provider may have what appear to be strict definitions of priority (e.g., P1 signified a total outage). Don't let these definitions slow progress on issues that are business critical; extrapolate impact if your issue isn't addressed, or describe the worst case scenario of an exposure related to the issue. For example, the following two descriptions essentially describe impeding P1 issues:

Our backlog is increasing. Current impact is minor, but if this issue is not fixed in 12 hrs, we're effectively down.
A key monitoring component has failed. While there is no current impact, this means we have a blind spot that will cause the next failure to become a user visible outage.

Response time expectations
If you have specific needs related to response time, indicate them clearly. For example, you might specify "I need a response by 5pm because that's when my shift ends." If you have internal outage communication SLOs, make sure you request a response time from your provider that is within that interval so you can meet those SLOs. Cloud providers likely have 24/7 support, but if this isn't the case, or if your relevant personnel are in a particular time zone, communicate your time zone-specific needs to your provider.

Including these details in your issue report will ideally save you time later and speed up your overall provider resolution process. Check out part two for tips on communicating with your cloud provider throughout the actual troubleshooting process.