This article helps project managers and technical leadership create execution plans for expected periods of peak user traffic to an application. For example, in the United States, the last Friday of November, known as Black Friday, can be one of the busiest shopping days of the year. Many other countries have similar days or weeks when retailers can expect extremely high traffic. This article outlines areas where you can increase organizational readiness, system reliability, and Google-customer engagement for Black Friday–type events.
This article outlines a system to:
- Manage three distinct stages for handling an event: planning, preparation, and execution.
- Engage technical, operational, and leadership stakeholders in improving process and collaboration.
- Establish architectural patterns that help handle Black Friday–type events.
- Promote best practices from Google Site Reliability Engineering (SRE).
Commerce systems in the retail industry face a significant technical challenge: how to handle short-term peak requests that are 5x or greater than the average load. In the United States, these events occur in November after Thanksgiving, during Black Friday and Cyber Monday. Other countries have similar peak scale events at different times in the year.
One approach to this challenge is to embrace a cloud-native systems strategy. Compared to traditional deployments, this strategy can help improve your peak planning and execution, and it offers new avenues to consider.
Although Black Friday is a major commercial event in the United States, it exhibits characteristics that occur in other contexts—for example, election nights or holidays such as Christmas and New Year's Eve. Peak events generally exhibit the following traits:
- Increase in traffic from 5x to 20x or greater, with generally higher conversion rates and greater loads on backend systems.
- Rapid scaling over a short period of time as the event opens for traffic.
- Some trailing decrease in traffic back to normal levels. This decrease is usually slower than the scale to peak.
In commerce systems where traffic skews toward transactions, the frontend increase might not be as dramatic. For example, while the frontend web traffic might increase 2x to 4x, the load on the backend systems measured in orders/min might increase 10x or more.
The following figure shows a typical increase in user traffic. The traffic is characterized by a sudden, variable spike to peak traffic followed by a gradual return to normal.
We recommend that you follow these three stages when conducting your Black Friday event: planning, preparation, and execution. In the following diagram, note how each stage emphasizes different roles, skills, and processes.
Operations and administration staff lead the organization through the tactical aspects of a peak event. By reviewing the three stages and their activities, the staff can prepare the team to successfully handle a peak event. If unexpected issues arise during the event, the preparations help provide faster, more accurate triage and a path to resolution that reduces the impact of those issues.
- Black Friday. Any type of peak-traffic event for an application.
- QPS. Queries per second. A common metric associated with volume of traffic. QPS is a general technical volume term, as compared to orders/min.
- Transactional indicator. This term includes other indicators such as cart adds/sec. These indicators show transactional volume in more business-relevant terms than QPS.
- SLI. Service Level Indicator. A metric used to determine if a service is within acceptable bounds. For example, an SLI for page load time might be the time to load a web page plus all its underlying API calls, measured at the 95th percentile.
- SLO. Service Level Objective. Defined using an SLI and a target objective. An example SLO might be that the latency measured by the preceding SLI must be below 400 ms.
- SLA. Service Level Agreement. The reliability targets, conditions, and (often financial) consequences for a service that are defined in the contract between a customer and supplier. An example SLA might be that if the SLI latency stays above 1 s for 10 consecutive minutes, the customer is entitled to a 50% discount on the invoiced usage of traffic above 100 QPS within the hour in which the SLA violation occurred.
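The relationship between an SLI and an SLO can be illustrated with a small calculation. The following is a minimal sketch, not a production implementation: it assumes latency samples are already collected in milliseconds, uses the nearest-rank percentile method, and reuses the 95th-percentile and 400 ms figures from the examples above. The function names and the sample data are hypothetical.

```python
import math


def percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_met(latencies_ms, pct=95, target_ms=400):
    """SLI: latency at the given percentile. SLO: that latency is below target_ms."""
    sli = percentile(latencies_ms, pct)
    return sli, sli < target_ms


# Hypothetical page-load samples in milliseconds.
samples = [120, 180, 210, 250, 260, 300, 310, 330, 350, 390,
           140, 160, 220, 240, 280, 290, 320, 340, 360, 950]

sli, ok = slo_met(samples)
print(f"p95 latency = {sli} ms, SLO met: {ok}")
```

In this sample set, a single slow outlier does not violate the SLO because the SLI is measured at the 95th percentile rather than at the maximum.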
The adage "Hope is not a strategy" is often quoted as a strong reason for rigorous planning and preparation. Strong cross-functional planning and coordination across an organization and with third-party vendors increases the probability of success. Engineering, architecture, operations, and business stakeholders must agree on what is measured, and must know what impact the peak event will have on production systems. For planning and operational systems support, data-driven decision making drives the choices and tradeoffs these teams make.
The following table summarizes the main steps in the planning stage.
| Step | Activities |
|---|---|
| Collecting data | Know your measurement data. Review prior peak events. Prepare business projections. |
| Establishing program management | Set up communication channels. Review the shared state. Set up monitoring and probing. Review system monitoring. Create capacity plans. Share risk analysis and rank priorities. |
| Planning the architecture | Review the design architecture. Review Black Friday architecture patterns. Create a failover strategy. |
The first step in the planning stage is to collect data.
Know your measurement data
When you start with existing measurement data (requests/sec, orders/min, cart adds/sec), you build a foundation for understanding and planning for a peak event. You can begin this process by answering the following questions:
- What data is available for past normal and peak operations?
- Does the data measure the right system and business metrics?
- What data do you need to collect in order to better understand how the system is operating?
Your goal is to determine what data is available and what value that data has provided in the past. If you identify any metrics or SLIs that are not being collected, prioritize that data so that during the preparation stage you can modify the system to collect this data.
Beyond just collecting data, it's important to know what insights the metrics are providing. When you know how the metrics relate to user satisfaction and business success indicators, you learn how to interpret the metrics when the scale of the system is dramatically larger.
Review prior peak events
No matter the process, learning from prior events is important, especially if it highlights known constraints to the system under stress. In the technology world, postmortems are a useful technique for capturing data and understanding incidents that occurred.
Prepare business projections
By predicting the impact of Black Friday, your business can plan how to acquire goods and prepare for selling. Factors such as inventory management, supply chain management, and staffing depend on precise business projections. Preparing your system for a scale spike also depends on these predictions. The more you know about prior forecasts, the more accurate your new business sales forecasts will be. Those new forecasts are critical inputs into projecting expected demand on your system.
The following key business driver metrics are common across most e-commerce sites:
- Number of visits
- Number of conversions
- Average order value
Taking these drivers as a foundation, you can convert the metrics to a form that the system can analyze. Such metrics might include the following:
- Queries per second
- Cart adds/sec
Then you can convert these metrics into SLIs and SLOs that measure system behavior and indicate system health.
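As a rough illustration of the conversion described above, the following sketch turns business drivers into system-facing rates. It assumes traffic is spread evenly across the hour, which understates short spikes. The parameter names, the requests-per-visit and cart-adds-per-visit factors, and all example values are hypothetical and would come from your own measurement data.

```python
def projected_rates(visits_per_hour, conversion_rate, average_order_value,
                    requests_per_visit, cart_adds_per_visit):
    """Convert business drivers into system metrics (assumes evenly spread traffic)."""
    orders_per_min = visits_per_hour * conversion_rate / 60
    return {
        "qps": visits_per_hour * requests_per_visit / 3600,
        "cart_adds_per_sec": visits_per_hour * cart_adds_per_visit / 3600,
        "orders_per_min": orders_per_min,
        "revenue_per_min": orders_per_min * average_order_value,
    }


# Hypothetical peak-hour projection.
rates = projected_rates(
    visits_per_hour=360_000,
    conversion_rate=0.05,        # 5% of visits end in an order
    average_order_value=80.0,
    requests_per_visit=20,       # backend requests generated by one visit
    cart_adds_per_visit=0.3,
)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

Running the same calculation with normal-day inputs gives a baseline, and the ratio between the two projections is a starting point for the peak multiplier used in capacity planning.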
Establishing program management
Coordination is key for Black Friday success. The planning, preparation, and execution stages must be aligned and agreed to by engineering, architects, operations, leadership, other internal stakeholders, and third-party vendors. Shared objectives require stakeholders to communicate and collaborate transparently.
A good way to start is by setting up a cross-functional team and a steering committee whose job is to rally the stakeholders. Get these teams set up early so that program management can set the timeline. Program management also needs to secure budget and resources for additional infrastructure, testing support, and training.
Set up communication channels
Communication is critical for all stages of a peak-event execution. During planning, create the following:
Peak-Event Planning document, which includes:
- List of assumptions
- List of unknowns
- Teams, team leads, stakeholders, and responsible parties involved. Use a responsibility assignment matrix (RACI), or similar format, so that ownership is clearly communicated.
- A clear, unambiguous timeline precisely calling out critical tasks and the teams that own them.
Clear, regularly scheduled meetings with notes to keep all teams informed on progress. Call out areas of concern early and swarm issues before they become blockers. Peak events are usually fixed in time and are hard to adjust for if the schedule falls behind.
Executive Summary Document. Condense the most critical information for the peak event onto a single page (or two). This document is easy to distribute, helps people absorb planning information, and resolves simple questions. The document quickly brings new people up to speed on peak-event planning.
Lightweight architecture documents that allow operational teams to quickly comprehend parts of the system. When there is time to comprehend the entire system, architecture documents with hundreds of component boxes and dozens of connecting lines are useful. During operational processes, however, an understandable diagram of how major components are connected can help isolate issues faster and expedite resolution.
Operations support documents and escalation procedures. These materials clearly state for the operations team how to identify and report an issue, address handling of an issue, and evaluate engagement with other teams such as developers or vendor support. Make sure you have common incident management processes and templates for expediting classes of incidents.
Review the shared state
Shared state means collecting different perspectives to reach consensus across teams. Instead of a single perspective, this agreement provides a broader view of the system. Shared state shows how collective decisions impact the system beyond the scope of an individual's decision. For example, a small technical configuration change might have a large impact on business metrics.
Each shared state review is conducted between the owning team and all related teams. Engaging with the Google Cloud team on these reviews builds context for the Google teams involved with Black Friday and allows Google teams to provide feedback around best practices from working with many other customers.
Set up monitoring and probing
Monitoring helps you make informed decisions about maintaining operational efficiency of the system. Monitoring also helps you track the health of the system, proactively alerts on upcoming issues, and provides insight into the four golden signals for your applications:
- Latency. Time required to serve a request to the system.
- Traffic. Demand on the system, usually measured in requests per second.
- Errors. Rate of failing requests, either as an absolute rate or as a proportion of all requests.
- Saturation. Metric(s) that indicate the utilization of critical system resources. These metrics might also proactively indicate that resources, such as database storage, are running out.
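To make the four signals concrete, the following sketch derives them from a batch of request records collected over one time window. It is only an illustration: the `Request` record, the choice of the 95th percentile for latency, the use of HTTP 5xx responses as errors, and CPU cores as the saturation resource are all assumptions, not a prescribed method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cores_in_use, cores_total):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(95 * len(latencies) / 100)  # nearest-rank p95
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "saturation": cores_in_use / cores_total,
    }


# Hypothetical five-second window with one failed request.
window = [Request(latency_ms=10 * i, status=200) for i in range(1, 10)]
window.append(Request(latency_ms=100, status=503))
signals = golden_signals(window, window_seconds=5, cores_in_use=6, cores_total=8)
print(signals)
```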
Understanding the critical values for these four signals is essential for detecting peak-event issues. Specific, actionable monitoring steps include the following:
Monitor key business metrics, such as revenue or number of orders taken. These metrics can indicate problems in overall system coordination, even if all individual components are functioning properly. In addition, monitor any email marketing campaign schedules and track when and how many emails are sent and when traffic arrives on the site. A last minute email blast can surprise you and drive more traffic than you expect, or drive traffic to a page that is not well cached.
Monitor metrics that are symptomatic of user satisfaction. Curate a small number of dashboards that give a high-level overview of service health, and then drill into specific subsystems or locations.
Create defense in depth by combining black-box monitoring (such as probes) and white-box monitoring (such as exported application metrics). For example:
Set up probes to measure latency, packet loss, and throughput on network connections that span from Google Cloud to on-premises networks—for example, VPNs and direct connections.
Set up black-box probes to measure the availability and latency of third-party services you depend on. Additionally, ask the third party for access to their dashboard and alerts.
Given sudden traffic increases, scaling peak events might resemble a distributed denial of service (DDoS) attack against the system. Evaluate how to adjust DDoS defenses to help ensure that the protection system is not blocking legitimate customers who are trying to use your commerce system.
Web bots scrape sites for content and pricing, and they are busiest during the peak-event season because of aggressive competition. Determine the percentage (and type) of bot traffic versus real traffic, how it changes during the peak event, and how to protect the system from unwanted traffic.
Review system monitoring
As noted in the terminology section, SLIs are used to measure SLOs. You can use monitoring to capture SLIs and other critical system values. This approach helps you determine the following:
- Which system metrics are being captured with monitoring
- The latency in monitor reporting
- Alert levels of interest
Ideally, you want monitoring to capture information about all relevant infrastructure and application metrics that are important to review performance and SLOs.
You can use monitoring in many areas:
- Virtual machine statistics. Instances, CPU, memory, utilization versus counts.
- Container-based statistics. Cluster utilization, cluster capacity, pod level utilization, and counts.
- Networking statistics. Ingress/egress, bandwidth between components, latency versus throughput.
- Application metrics. Application-specific behavior such as QPS, writes/sec, and messages sent/sec.
- Managed services statistics. QPS, throughput, latency, utilization, and so on, for Google-managed services such as APIs or products such as BigQuery, App Engine, and Cloud Bigtable.
- Connectivity statistics. VPN/interconnect-related statistics about connecting to on-premises systems or systems that are external to Google Cloud.
- SLIs. Metrics associated with the overall health of the system. These metrics are generally broader than specific infrastructure components and might be represented as something like API QPS rates, orders/min, or cart adds/second.
Consider the response time for an alert as well as the alerting threshold. For example, alerting on a 10-percent order drop might occur several minutes after a problem started, whereas if you alert on "1 sec without orders," it might indicate an issue more quickly.
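The difference in detection time between the two alert styles can be explored with a small simulation. The following sketch assumes a perfectly steady order rate that drops to zero at the start of an outage, and it evaluates both rules once per second. The five-minute trailing window for the percentage rule is an assumption; the actual delay depends on the window your monitoring system uses. The function name and all parameters are hypothetical.

```python
def detection_delays(orders_per_sec, outage_at, horizon,
                     window_sec=300, drop_percent=10, gap_sec=1):
    """Return (drop_rule_delay, gap_rule_delay) in seconds after the outage starts.

    A delay is None if the rule never fires within the simulated horizon.
    """
    counts = [orders_per_sec if t < outage_at else 0 for t in range(horizon)]
    baseline = orders_per_sec * window_sec
    drop_delay = None
    gap_delay = None
    silent = 0  # consecutive seconds without any order
    for t, count in enumerate(counts):
        elapsed = t + 1 - outage_at  # seconds since the outage began, at the end of second t
        silent = silent + 1 if count == 0 else 0
        if gap_delay is None and silent >= gap_sec:
            gap_delay = elapsed
        if drop_delay is None and t + 1 >= window_sec:
            window_total = sum(counts[t + 1 - window_sec:t + 1])
            # Integer comparison avoids floating-point rounding at the threshold.
            if window_total * 100 <= baseline * (100 - drop_percent):
                drop_delay = elapsed
    return drop_delay, gap_delay


drop_delay, gap_delay = detection_delays(orders_per_sec=10, outage_at=600, horizon=900)
print(f"10% drop rule fires after {drop_delay} s; no-orders rule fires after {gap_delay} s")
```

A gap-based rule only works when normal traffic is dense enough that a short silence is abnormal; on a low-volume system the same rule would fire constantly.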
After you select SLOs, get broad agreement from all stakeholders so that there is a single set of core metrics that represent the overall system health. Then narrow the data needed to configure alerting and notification of abnormal system behavior.
Create capacity plans
Capacity planning is a critical activity for retail commerce systems. Cloud-native systems, with their dynamic capacity allocation, provide architectural and technical advantages over existing systems. Under average load, cloud-native systems often use thousands of compute cores, and capacity bursts can reach tens of thousands of cores. In this case, the cloud provider and the cloud infrastructure team share responsibility for capacity planning, and the operations team must create capacity plans with confidence intervals. These intervals help you avoid unexpected incidents, such as the inability to further scale with additional VMs.
These dynamic capacity systems work well for limited and sporadic capacity spikes. However, we recommend that you review capacity plans and evaluate pre-provisioning capacity instead of relying on autoscaling and just-in-time provisioning. Think about how you and your cloud provider can use prescaling, reservations, and other capabilities to enable your system availability to grow. You can provision this infrastructure shortly before the peak event, and spin it down shortly afterward to avoid unnecessary cloud costs.
To improve the capacity planning experience, Google and its customers have developed the following best practices:
- Map metrics to peak usage estimates. Create plans for assessing whether the estimates were wrong. Plans reduce uncertainty and help you quickly adjust when estimates are exceeded.
- Evaluate capacity needs for peak + confidence interval.
- Work with the cloud provider and the infrastructure team to review and confirm capacity.
- Remember that quota and capacity are different. Quota is the option to have a resource, whereas capacity is how much of that resource is needed.
- Evaluate quota and capacity on each available resource: compute, memory, storage, network, API consumption, and so on. During load testing, this evaluation helps you discover unknown quota and capacity issues.
- Set up quota and capacity to handle a full region failure.
- Evaluate quota and capacity during the planning stage, and adjust them during the preparation stage.
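Two of the practices above, sizing for peak plus a confidence interval and surviving a full region failure, can be combined into one back-of-the-envelope calculation. The following sketch assumes load is spread evenly across regions and that the remaining regions must carry the whole padded peak when one region is lost. The function names and the example figures are hypothetical.

```python
def cores_per_region(peak_cores, margin_percent, regions):
    """Cores each region needs so the other regions can carry peak + margin if one fails."""
    if regions < 2:
        raise ValueError("need at least two regions to survive a region failure")
    required_x100 = peak_cores * (100 + margin_percent)  # padded peak, scaled by 100
    surviving = regions - 1
    return -(-required_x100 // (100 * surviving))  # ceiling division on integers


def quota_shortfall(quota_cores, needed_cores):
    """How many more cores of quota to request; 0 if the quota already covers the need."""
    return max(0, needed_cores - quota_cores)


# Hypothetical plan: 20,000-core peak, 25% margin, three regions.
needed = cores_per_region(peak_cores=20_000, margin_percent=25, regions=3)
print(f"cores needed per region: {needed}")
print(f"quota to request per region: {quota_shortfall(10_000, needed)}")
```

The result is a quota and capacity target to review with the cloud provider, not a guarantee that the capacity exists. The same calculation applies to memory, storage, and other resources.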
Events like Black Friday occur periodically: there is always next year's Black Friday, another promotional campaign to execute, or a future large-scale product launch. Each event has unique information, but the assets are reusable and expandable. With each successive event, you can improve the process by creating, curating, and improving checklists to reduce the administrative burden. Checklists are like airplane pilot pre-flight checks; they help prevent errors and result in consistent execution.
We recommend the following best practices for checklists:
- Develop a strong practice in Reliable Product Launches.
- Use existing launch or peak-event checklists to avoid repeating previous issues.
- Assign an authorized party to evaluate the checklist to use with the current event and to follow that list through the execution stage.
Review vendor operational checklists such as the launch checklist for Google Cloud, and review the product-specific checklists for the services you use.
Share risk analysis and rank priorities
It's important to share risk analysis with stakeholders and to communicate your priorities. Your analysis should identify which risks are owned by the system, by your internal teams, by the cloud provider, and by third-party vendors. Also share risks with your cloud provider and vendors. Some risks might be known and acceptable; others might require the cloud provider or vendor to assist in creating a mitigation plan.
Planning the architecture
Planning your architecture for Black Friday is a distinct problem domain with its own best practices. Peak system constraints influence your risk analysis and are critical to successful peak-load testing.
Review the design architecture
We recommend that you review your high-level design architecture so that all parties understand the solution. This review informs teams of reliability concerns, such as single points of failure (SPOF) or the use of alpha or beta components from Google. Although you might not be able to mitigate all issues, awareness that they exist benefits all teams.
Supporting a system for production readiness requires two distinct levels of architecture, each aimed at a specific audience:
System architecture helps the operations team learn the critical system components. A clear diagram that shows how major components interconnect can help isolate and resolve issues faster.
Component architecture explains significant I/O points, dependencies, and primary workflows through the component to architects and developers. These components are important for complex user journeys and help identify where an issue might originate within components and data flows, such as inventory, pricing, master product data, orders, and invoices. Data flows also help drive SLOs and monitoring and probing decisions.
Review Black Friday architecture patterns
Architecture patterns help you to optimize cloud-native systems and to manage some of the challenges of cloud-distributed systems. The following best practices help improve reliability of cloud systems during peak events:
Use cache systems, but know how they behave during scaling operations. During scaling, cache systems do not maintain the same cache hit ratios. This difference can lead to cascading issues where cache misses cause more load to occur during the scaling process and overwhelm the system more quickly than expected.
Use a Content Delivery Network (CDN) such as Cloud CDN to alleviate major traffic spikes and to improve reliability and availability. CDNs can absorb large traffic spikes before that traffic reaches your origin servers.
Use Cloud Interconnect instances along with internet-based VPNs configured in Active-Active configuration. Active-Active refers to both the primary and secondary connectivity paths being used for handling traffic at all times, as opposed to Active-Passive where the secondary takes over if the primary fails. Interconnects support dedicated bandwidth between a cloud provider and a data center. We recommend provisioning multiple interconnects in different metro areas and with different providers. This approach increases the SLA and reliability as described in Establishing 99.99% Availability for Dedicated Interconnect. The VPNs are used in cases of interconnect failure or as short-term increased capacity.
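The caching practice above notes that hit ratios drop during scaling. A short calculation shows why a modest drop matters: the backend only sees cache misses, so its load is proportional to one minus the hit ratio. The following sketch uses hypothetical traffic and hit-ratio values.

```python
def origin_qps(total_qps, hit_ratio):
    """Requests per second that miss the cache and reach the backend."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return total_qps * (1 - hit_ratio)


# Hypothetical peak of 20,000 QPS at the edge.
steady = origin_qps(20_000, 0.95)   # warm cache
scaling = origin_qps(20_000, 0.80)  # cold cache nodes added during scale-out
print(f"backend load: {steady:.0f} QPS warm, {scaling:.0f} QPS while scaling "
      f"({scaling / steady:.1f}x)")
```

With these assumed numbers, a 15-point drop in hit ratio roughly quadruples backend load, which is the kind of effect to look for in load tests that include cache scaling.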
Create a failover strategy
During critical events such as Black Friday, unprepared teams struggle with outages and poor customer experiences. Having a failover strategy helps maintain system uptime while teams investigate the root cause of the failure. All things fail, and evaluating the risks associated with failures and knowing the SLOs for the system create tradeoffs between reliability and cost.
Cloud systems use the following failover patterns:
Handle regional failures. Develop patterns that keep persistent data synchronized between regions and that address a regional failure. Handling stateless application servers is far easier than synchronizing data. Review and use managed services to increase reliability and decrease the amount of management overhead. Synchronization patterns are tied to the data platform capabilities.
Deploy the system to multiple locations. Deploy across at least 2 and preferably 3 regions to mitigate risk around particular zone or region issues with the cloud provider.
Use multiple zones in each region for regional failover/resiliency in order to continue operating without needing a full region failover. Deploying to multiple regions reduces latency to customers and supports addressing a regional outage issue.
Use the type of load balancer that is most appropriate for your scenario. Google Cloud has different types of load balancers that provide a variety of capabilities. To learn more, see the Cloud Load Balancing documentation.
Ensure the system can meet peak demand without the cache. Test the system up front for the case where the caching system fails or yields low cache hit rates. A cache system failure or cache flush could trigger a cascading failure scenario that causes other components to overload and start failing.
Plan for CDN failure. These failures are similar to cache system failures. Work with the CDN provider to learn how to handle CDN system failures that could unexpectedly increase load during Black Friday.
Automate all failover scenarios to avoid manual steps, which are time consuming and error-prone, especially during the stress of dealing with an outage.
Build resilience for any external service dependency. Build in resilience patterns such as a "circuit breaker" to prevent a cascading failure should that service fail.
Limit latency for third-party products. In the client-side code of your website, limit or eliminate blocking HTTP calls to third-party web services. Latency or availability issues with the third party could cause your website to be unresponsive for your users.
Protect against network attacks. Help protect your site against DDoS attacks and web bots. Use web security tools such as Google Cloud Armor and your CDN's security platform.
The following planning-stage deliverables are used during the preparation and execution stages:
- Risk analysis. Identifies the probability and impact of various risks to the project, including operational, technical, business, and time factors.
- Architectural change plan. Includes where the system could be modified to support higher scale than during normal operations.
- Testing plan. Evaluates the system at levels above expected traffic and evaluates common failure test scenarios.
- Communications plan. Keeps shared state with all stakeholders consistent. This plan also includes escalation procedures for communications and for cases where consensus or responsiveness is important.
- SLO/SLI agreement. Shows consensus on monitoring metrics and SLIs to collect. Also includes agreement on SLOs that demonstrate system availability from the customer's perspective. This agreement includes plans for how to implement metrics not currently collected and a monitoring dashboard for easy review of SLO compliance. The agreement outlines how third-party vendor SLAs affect your SLO metrics.
- List of checklists. Ensures consistent review and execution of common procedures.
- List of operational playbooks. Lists playbooks that address incident scenarios that might occur during Black Friday. Playbooks optimize for a quicker return to operational status.
The objective of the preparation stage is to test the system's ability to scale for peak user traffic and to document the results. Completing the preparation stage results in architecture refinement to handle peak traffic more efficiently and increase system reliability. This stage also yields procedures for operations and support that help streamline processes for handling the peak event and any issues that might occur. Consider this stage as practice for the peak event from a system and operations perspective.
The following table summarizes the main steps in the preparation stage.
| Step | Activities |
|---|---|
| Helping to ensure reliability | Set up load and performance testing. Evaluate and test scenarios. |
| Changing the architecture for scale and reliability | Introduce a queuing and message brokering intermediary layer. Implement caching to improve performance. Implement load shedding. Implement the circuit breaker pattern. |
| Setting up monitoring and logging | Configure monitoring. Use a dashboard to distribute monitoring. |
| Using feedback to revise operations and support | Build checklists and playbooks. Conduct training on support and escalation procedures. Set up the operations team for success. |
| Coordinating capacity requests | Review and share with Google Cloud your expected resource capacity needs. |
| Establishing a change freeze | Allocate time before the event to deploy new software and infrastructure. During promotional activities that support the event, avoid making changes. Coordinate bug fixes and timing with the development team. |
Helping to ensure reliability
Google Site Reliability Engineers focus on the reliability and velocity of Google services. SREs are responsible for the health of the platform and help ensure the highest possible reliability of Google services. A team of Google SREs published two books that capture many best practices and experience-based advice on how to build, launch, and operate systems at scale. Although much of the information is applicable, two key aspects are particularly relevant to a successful Black Friday event: load testing and failure testing.
Set up load and performance testing
Load testing is the process of deploying a test version of the system and creating requests to simulate high use of the system. Load testing normally focuses on testing for sustainable user-perceived behavior at some percentile below the absolute peak. Testing for peak requires hitting that top percentile with consistent good performance.
When testing load and performance, follow these best practices:
Load and performance test at 100 percent or more of peak traffic. Recalibrate the percentiles, and make sure to set the peak to above 100 percent. Recalibrating leaves room for a small amount of poor system behavior while the peak traffic receives adequate performance.
How far above peak traffic should you test? The answer depends on the confidence intervals for the projected peak-event traffic. Lower confidence requires testing at a higher load to improve the chance of peak-event success. If your confidence intervals are narrow and have proven accurate in past events, you can test at a lower percentage above peak.
Test the velocity from normal traffic to peak traffic. Depending on the peak event, changes to traffic could occur quite suddenly, possibly in seconds, as for a Black Friday store opening. Test the ability to scale quickly from normal traffic to peak. Parts of the system might need time to react to load signals.
Test zonal/regional failover, redeployment, and cache-flush scenarios under load.
Test customer journeys under load. Test fully through the system to help ensure that the customer experience is valid under peak. Testing individual components or technical execution paths improves system reliability but doesn't measure the customer's experience.
Consider load shedding to avoid cascading failures. Implement load shedding to keep your most critical customer journeys from failing, sacrificing less critical experiences. An example is disabling catalog browsing or searching user reviews in favor of ensuring that the checkout experience maintains its SLO.
Evaluate and share testing results widely. Use a blameless postmortem process to encourage open, honest evaluation and constructive feedback to address challenges. Identify database hot spots, failed quotas, unexpected dependencies under load, and other difficult-to-debug situations as a team.
Test for different load mixes: Test mobile versus desktop, geographical distribution of origins, transaction types, and so on, because they can stress different parts of the system.
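The load-shedding practice above can be sketched as an admission check that runs before a request is served. This is a minimal illustration under stated assumptions: the journey names, their criticality ranking, and the 80% and 90% utilization thresholds are hypothetical, and a real system would derive utilization from its own saturation metrics.

```python
# Lower number = more critical. Hypothetical ranking of customer journeys.
CRITICALITY = {"checkout": 0, "cart": 1, "search": 2, "reviews": 3}
LEAST_CRITICAL = max(CRITICALITY.values())


def admit(journey, utilization):
    """Return True if a request for this journey should be served.

    utilization is the current fraction (0.0 to 1.0) of the constrained resource in use.
    Unknown journeys are treated as least critical.
    """
    level = CRITICALITY.get(journey, LEAST_CRITICAL)
    if utilization < 0.80:
        return True        # normal operation: serve everything
    if utilization < 0.90:
        return level <= 1  # shed browsing features, keep cart and checkout
    return level == 0      # near saturation: protect checkout only


for load in (0.50, 0.85, 0.95):
    served = [name for name in CRITICALITY if admit(name, load)]
    print(f"utilization {load:.0%}: serving {served}")
```

Shed requests should fail fast with a clear response so that clients do not retry aggressively and add to the load.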
Evaluate and test scenarios
To increase confidence in the success of the peak event, it's critical to evaluate and test scenarios that might occur during the event. Google DiRT, Netflix Chaos Monkey, and many other processes and tools are designed to test whether a system can adapt to disaster situations.
During the preparation stage, testing the highest impact failure scenarios can increase the probability of success. So even if the situation occurs during peak, there is a process or technical solution to address the issue. Follow these best practices:
Test most likely failure scenarios. Use a probability/risk factor weighting to determine the most impactful issues that could occur during a peak event. Design and simulate that failure mode under load to see how the system will react. Consider intentional failure scenarios before chaotic failures.
Consider using the circuit breaker pattern. Identify places where the architecture can implement the circuit breaker pattern and continue to operate with some components/functionality offline. This approach is especially important with third-party dependencies outside the system's direct control. Disabling third-party dependencies can help preserve the majority of user experience with a small loss of functionality.
Test operational responses. How operational teams respond to failures is more important than how the system responds. Ensuring that teams know how to deal with unexpected events pays off in far more situations than the intentional failure tests you execute.
Simulate conditions as close to an actual incident as possible. Do not provide inside information to the operations and support teams. Test interactions with internal and external support to validate response times and cooperation, and to see how the operational team executes during an issue.
Execute a blameless postmortem process to document the situation, the response to the situation, and areas where improvements can be made to ensure successful recovery from a future issue.
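One way to apply the probability/risk weighting mentioned above is to score each scenario as probability multiplied by impact and test the highest scores first. The following is a minimal sketch; the scenario names, probabilities, and impact scale are made-up estimates for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    name: str
    probability: float  # estimated chance of occurring during peak, 0.0-1.0
    impact: int         # estimated business impact, 1 (minor) to 10 (severe)

    @property
    def risk_score(self) -> float:
        return self.probability * self.impact


def rank_scenarios(scenarios):
    """Sort scenarios so that the highest-risk ones are tested first."""
    return sorted(scenarios, key=lambda s: s.risk_score, reverse=True)


# Illustrative estimates only.
scenarios = [
    FailureScenario("CDN cache loss", probability=0.10, impact=6),
    FailureScenario("Payment provider timeout", probability=0.30, impact=9),
    FailureScenario("Region failure", probability=0.02, impact=10),
]
ranked = rank_scenarios(scenarios)
```

A simple product is only one possible weighting. Teams that want to keep rare but catastrophic scenarios visible might weight impact more heavily.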
Changing the architecture for scale and reliability
Load testing and failure testing, along with architecture reviews, encourage limited-scope architectural changes that can enhance the scale and reliability of the system. However, introducing changes adds risk, so confine these changes to a conservative time window.
Here are some example changes to consider:
- Introduce a queuing or message-brokering intermediary layer to manage traffic to load-sensitive, limited-scale systems such as inventory or order processing. A queue can help reduce spikes against the backend system and avoid some types of overload behavior.
- Implement caching to improve performance. There are three basic categories of cache:
  - Frontend: By introducing a content delivery network such as Cloud CDN, you remove load from your system by distributing assets closer to customers.
  - Application level: Have the consumer of a backend service cache slow-changing data, such as product information, in Memorystore at the consumer level.
  - API level: Have the API layer itself cache data in Memorystore.
  Don't inadvertently create a cascading failure problem by over-relying on cache. Caching to achieve scale can be disastrous if the cache is unavailable and the underlying system cannot handle the traffic.
- Implement load shedding to avoid overall system failures. This approach keeps a highly stressed system available to the most important user journeys, while intentionally limiting other user journeys.
- Implement the circuit breaker pattern. This approach gracefully handles component failures and controls cascading failures. Implementing this pattern can help prevent caller components from failing when the components they depend on are overloaded.
Ideally, many of these architectural changes were implemented during feature development. Where there is appropriate risk and sufficient resources, architectural changes can help improve system availability.
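To make the circuit breaker pattern concrete, here is a minimal, single-threaded sketch in Python. The class name, parameters, and defaults are hypothetical and chosen for illustration. A production system would more likely use a maintained library or a service mesh feature.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, then
    allows a trial call once a cool-down period has elapsed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: half-open, so allow one trial call below.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a third-party call as `breaker.call(fetch_reviews, lambda: [])` (where `fetch_reviews` is a hypothetical function) would let a page render without reviews while that dependency is failing, rather than failing the whole request.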
Setting up monitoring and logging
To see how the system is behaving, use monitoring and logging as your key introspection tools. Choose the granularity and depth of the information you collect deliberately, because that choice determines how much compute and storage monitoring and logging consume.
Monitoring requires you to identify the right metrics from past events, review good candidate metrics for current and future events, and implement systems to enable metric collection, aggregation, and reporting. Consider the following elements as part of your monitoring configuration:
- Provide proper system coverage by capturing new metrics or SLIs identified during the planning stage.
- Audit monitoring configuration to ensure that resources are properly monitored. A common error is to have a resource-creation process that doesn't properly enroll the resource in monitoring.
- Use monitoring system metrics to inform automation systems such as autoscalers or self-healing systems.
- Use monitoring as a critical component of operational playbooks.
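The audit step above can be as simple as comparing the resources that exist with the resources that monitoring knows about. The following sketch assumes you can export both lists. The resource names are hypothetical.

```python
def find_unmonitored(provisioned, monitored):
    """Return resource IDs that exist but are not enrolled in monitoring."""
    return sorted(set(provisioned) - set(monitored))


# Hypothetical inventories. In practice these would come from your
# resource inventory and from your monitoring system's configuration.
provisioned = ["web-1", "web-2", "cart-1", "checkout-1"]
monitored = ["web-1", "cart-1", "checkout-1"]

unmonitored = find_unmonitored(provisioned, monitored)
```

Running a check like this on a schedule, and after each resource-creation run, helps catch enrollment gaps before the peak event rather than during it.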
Use a dashboard to distribute monitoring
To help preserve a shared state between stakeholders, a monitoring dashboard can keep all parties focused on SLOs instead of more fine-grained metrics. A dashboard offers the following advantages:
- Single pane of glass. A single system with shared dashboards for metrics improves operational responsiveness and decreases the likelihood of interpretation errors.
- Proper granularity of reporting. Aggregating can make trends more visible, but can also obfuscate individual resource issues.
- Appropriate alerting. Provide alerting to notify the proper teams in standard ways. For example, configure alert notifications such as email, SMS, or pager messages based on playbooks.
Using feedback to revise operations and support
The known documentable items were presented in the planning stage. For the preparation stage, you need to update those documents with feedback from architectural changes, load testing, and failure testing.
Build checklists and playbooks
Follow these best practices to build valuable checklists:
- Allow the checklist owner to verify that the checklist executes properly. Empower the owner to create changes that adhere to the checklist's goals.
- Plan to resolve or mitigate issues based on acceptable risks. Get cross-functional agreement on mitigation plans if the checklist item cannot be achieved. Make sure all teams understand the impact of the mitigation.
Develop playbooks for known issues and for common issues that you discovered during failure testing.
Conduct training on support and escalation procedures
Make sure that all operations team members are aware of their responsibilities and those of other teams:
- Make sure that your on-call team members know their response time requirements and how they can use the existing documentation and playbooks to expedite resolving issues.
- Develop a tabletop incident-management exercise. Create a war room simulating the Black Friday virtual war room, as discussed later in this document. Test various incidents, and test how the team will respond. To ensure proper coordination during incidents, include raising events to Google as part of the process. However, don't create a fake P1 ticket with Google without first clearing it with Customer Engineering (CE) or Support. Clearly labeled test or fake P4 tickets are okay.
- Conduct Wheel of Misfortune exercises. These exercises evaluate operations team readiness, robustness of procedures and playbooks, and proper coordination with other teams.
- Create different situations, such as losing CDN caching, a run on a hot SKU with bad pricing (good for your customers!), or a region failure, and work through how the operations team should proceed through these events.
- Execute the tabletop exercise multiple times to decrease response times and improve coordination on successive simulations. Use postmortems to capture results of each exercise and incrementally improve the next exercise.
Coordinate vendor support
Confirm how to contact and convey details to other team members or vendor support. Vendors implement support procedures in different ways, so it is important to know how to best interact with their support offerings. Review vendor SLAs for support response and for availability and latency metrics.
Engage Google Cloud Support
Google Cloud provides information on how to optimize engagement with Google Cloud Support. Public documentation is available for all customers. Customers who engage directly with Google Cloud Customer Engineering or who are enrolled in Platinum Support can work directly with their Google representative to review and understand support coverage. For more details, see Google Cloud Support later in the execution-stage section of this document.
Set the operations team up for success
Peak events involve a lot of stress, emotion, and intensity. To avoid burning out the operations team, institute a rotation process. Keeping the operations team well rested means fewer errors and provides additional reserve support in case a major incident occurs. Make sure of the following:
- The operations team is taking on appropriate workloads during the peak event.
- Downtime is scheduled.
- Everyone is getting a good night's sleep.
- The operations team has defined an escalation path or backup and knows who to contact if they need help.
Most operations teams have stories of doing "all-nighters" to resolve an issue. While we applaud the commitment, it is better to rotate staff than to ask for superhuman efforts. You can reduce tension and error rates by cross-training team members so that you have multiple specialists available.
Coordinating capacity requests
During the planning stage, you shared and coordinated capacity planning with Google Cloud and other vendors. Following are some additional capacity planning strategies to consider:
- Review and share with Google Cloud your expected resource capacity needs as specifically as possible. Share numbers for CPUs, cores, disk, API QPS, and so on. Ensure that quotas for the organization and individual projects are appropriately allocated. Adjust as needed.
- Quota level should be significantly higher than expected usage, in case there is more traffic than expected.
- Quota is not equal to capacity and can have different impact on systems. Load testing at scale can help identify any quotas in Google Cloud such as API call limits or quotas with other Google systems such as OAuth2.
- Meet regularly with the Google team to review current capacity requests and to ensure that those requests can be fulfilled as needed.
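The guidance that quota should sit well above expected usage can be checked mechanically. The sketch below flags any resource whose quota is less than the expected peak multiplied by a headroom factor. The 1.5 factor, the resource names, and all numbers are assumptions for illustration. Pick a factor that matches your own risk tolerance.

```python
def quota_shortfalls(expected_peak, quotas, headroom=1.5):
    """Return {resource: required quota} where quota < peak * headroom.

    A resource missing from `quotas` is treated as having a quota of 0.
    """
    shortfalls = {}
    for resource, peak in expected_peak.items():
        required = peak * headroom
        if quotas.get(resource, 0) < required:
            shortfalls[resource] = required
    return shortfalls


# Illustrative numbers only.
expected_peak = {"cpus": 2000, "api_qps": 5000, "disk_gb": 40000}
quotas = {"cpus": 2400, "api_qps": 10000, "disk_gb": 60000}
result = quota_shortfalls(expected_peak, quotas)
```

In this example only the CPU quota falls short of the headroom target. Remember that a passing quota check does not guarantee capacity, which is why the capacity review with Google remains necessary.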
Establishing a change freeze
Black Friday is likely one of the most important events of the year for your business. Not only will it put significant stress on your web infrastructure, it might result in record revenue for your business. The primary cause of failures is change (expected or unexpected). Moreover, architecture changes can affect your infrastructure's ability to scale, which is why we recommend that you freeze changes during and leading up to the peak event.
When establishing a change freeze, follow these best practices:
- Plan ample time prior to the event to deploy new software and infrastructure. Entrance criteria to the change freeze should include successful peak load and failure tests.
- Don't interrupt business promotional activity—for example, pre-Black Friday sales and marketing campaigns. Avoid making changes during these times.
- Align with your development team to agree when the last minor bug fixes can be deployed. You don't need to lock down bug fixes as early as architectural or infrastructure changes, but you shouldn't be pushing such changes close to the peak event.
- Ensure that your third-party services adhere to a change freeze. Know when they start and end their change freeze, and consider updating their contractual SLAs to make the freeze binding.
- Ask your cloud provider about their change-management process and what changes they might execute on your peak days, if any.
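A change freeze is easier to honor when the deployment pipeline enforces it. The following is a minimal sketch of such a gate. The freeze dates are hypothetical, and the function is not part of any Google Cloud tooling.

```python
from datetime import datetime, timezone

# Hypothetical freeze window, expressed in UTC.
FREEZE_START = datetime(2024, 11, 15, 0, 0, tzinfo=timezone.utc)
FREEZE_END = datetime(2024, 12, 3, 0, 0, tzinfo=timezone.utc)


def deployment_allowed(when: datetime) -> bool:
    """Return True if a deployment at `when` falls outside the freeze."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return not (FREEZE_START <= when < FREEZE_END)
```

A pipeline step could call `deployment_allowed(datetime.now(timezone.utc))` and stop the rollout when it returns `False`. Whether to allow an override for emergency fixes is a policy decision that this sketch leaves out.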
The preparation stage includes revising planning-stage documents and new items developed during this stage, for example:
- Review, revise, and distribute deliverables from the planning stage that were modified by the preparation stage process. Load and failure testing will often challenge some assumptions and require revisions to schedules, risks, capacity, and operational plans.
- Distribute results of load and failure testing to the shared state group of stakeholders. Ensure everyone is aware of successes and areas where improvements were made or where new risks were identified.
- Validate that shared single pane of glass monitoring is operational and accessible to all parties. Transparent access to information provides faster decision resolution.
- Check that the operations team is fully engaged and taking ownership of the execution stage.
- Review that staffing plans are appropriate and reasonable, that rested people are making good decisions, and that there are handoff procedures for staff rotation.
- Verify that capacity plans are well documented and that quotas are validated by Google Cloud and implemented in Google's systems.
- Complete a vendor support readiness evaluation for each vendor. Know how to file support cases and how to escalate cases.
With the planning and preparation stages complete, your Black Friday process no longer relies on hope. If everything goes according to plan, the peak event is a non-event thanks to extraordinary operations support. The operations team is well informed and prepared, and there are no unexpected incidents. If an incident does occur, it can be identified and resolved quickly.
The following table summarizes the main steps in the execution stage.
| Step | Tasks |
|---|---|
| Scaling manually to peak capacity | Scale up manually ahead of the event. Plan how to gradually reduce scale after Black Friday. |
| Executing for fast recovery | Set up a virtual war room. Leverage experience with incident management. Work with Google Cloud Support. |
| Improving the data delivery process | Conduct root cause analysis. Analyze predictions versus actuals. Aggregate the postmortem and retrospective. |
Scaling manually to peak capacity
Autoscaling systems are great for unexpected spikes, but in most cases, we don't recommend them as the primary scaling process for a 5x to 10x increase in traffic. Autoscaling systems generally only jump in small increments, usually a VM of additional capacity per evaluation cycle. If your system must quickly scale across dozens or hundreds of VMs of additional capacity, manually scaling the system into the right range provides a better user experience.
Use manual scaling to slowly raise the system capacity so that the system is prepared, caches can be warmed, and slower events like virtual machine starts can happen before the peak event starts. This process is also a good safety check that capacity is available for the system.
While scaling up to peak traffic is the primary goal, you also need a plan for how to gradually reduce scale after Black Friday. If the original scaling strategies for normal traffic are in place but traffic has not yet returned to normal, scaling down too quickly can cause an outage.
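The gradual scale-up and scale-down described above can be planned as a series of target instance counts. The following sketch generates evenly spaced steps. The instance counts and step counts are illustrative assumptions, and applying each target to your infrastructure is left to your own tooling.

```python
def ramp_schedule(start, target, steps):
    """Return evenly spaced instance counts from start to target.

    Works for scale-up (target > start) and scale-down (target < start).
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    delta = target - start
    return [start + round(delta * i / steps) for i in range(1, steps + 1)]


# Raise capacity in four steps before the event, then lower it more
# slowly afterward, because traffic tends to tail off gradually.
scale_up = ramp_schedule(start=20, target=200, steps=4)
scale_down = ramp_schedule(start=200, target=20, steps=6)
```

Pausing between steps gives caches time to warm and confirms that capacity is actually available. Before each scale-down step, verify that traffic has really dropped.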
Executing for fast recovery
The primary goal for the operations team is to keep the system reliable and operational. Engineers tend to look for the root cause of a problem while it's still occurring so that they can collect data about the problem. While root cause analysis is valuable, focus on recovery instead. Find opportunities to collect data and store it for analysis while expending the most energy on system recovery, such as adding capacity, removing a poorly performing component, or restarting systems. You can always analyze the "why" of a problem offline after the incident has been mitigated.
Set up a virtual war room
Operations team members are the leads during execution but are not the only people engaged. Cross-team collaboration mechanisms help the operations team to reach out for support within the company and from vendors.
A virtual war room instills a spirit of teamwork in all parties during the peak event. People can quickly share ideas without the delay of asynchronous chat systems or email communications. Virtual war rooms give people an easy way to get up to speed and participate.
We also recommend creating sub-team spaces specific to the peak event. These subspaces allow the main war room to focus on quick triage and delegation if multiple incidents occur.
Leverage experience with incident management
To ensure that the right people are engaged on incidents, use failure testing, tabletop exercises, and test runs of the incident management process. If the operations team is comfortable with the incident management process, their overall responsiveness and resolution rate will improve.
Ongoing incidents, system alerts, and reported issues require playbooks that help teams quickly triage situations and identify whether remedies exist. Even for unforeseen issues, playbooks can include data that helps teams identify and resolve issues quickly. However, during an incident, adding playbook notes and gathering constructive feedback is not the highest priority. Instead, note the information and schedule a postmortem to review the playbook execution.
Work with Google Cloud Support
When engaging with Google Cloud Support, be as specific as possible and don't presume any knowledge that is not explicitly stated. This approach helps ensure that Support sees the issue from the customer's point of view and can better narrow down investigations. Focus on these four critical details:
Time. State the issue in clear, non-relative terms (never use "earlier today" or "yesterday") and be clear on timezone (reference UTC where possible). Often Google engineers will be looking at logs or other data that are not in the customer's local time zone. Those engineers also might not be located in the customer's time zone and might be unaware of the implied time.
Product. Identify the product and the actions you are taking on it. Saying "It doesn't work" makes it difficult to determine what the customer is seeing compared to what Google can see.
Location. Identify the region and zone. This information can help when location-specific events might be occurring, such as a software rollout or circumstances that might indicate a correlation between the customer issue and another event.
Specific identifiers. Identify the project ID, billing ID, specific VM, and other relevant identifiers. Customers have many different projects and resources, often with customer-encoded naming schemes that Google is not aware of. If Google is looking at the correct project, billing account, or resource, they can expedite resolution.
To begin a case with Support, provide as much data as possible and make sure that you respond to information requests from engineers. To speed resolution, provide screenshots, tcpdump outputs, log file snippets, and any other requested information.
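The four details above (time, product, location, and specific identifiers) lend themselves to a small template that the operations team fills in before filing a case. The following sketch is one possible shape. The field names and example values are hypothetical, and it does not call any Google Cloud Support API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SupportCaseDetails:
    observed_at: datetime  # must be timezone-aware
    product: str
    action: str
    region: str
    zone: str
    project_id: str
    resource: str

    def summary(self) -> str:
        # Normalize to UTC so the time is unambiguous for any reader.
        utc = self.observed_at.astimezone(timezone.utc)
        return (
            f"Time: {utc.strftime('%Y-%m-%d %H:%M')} UTC\n"
            f"Product: {self.product} ({self.action})\n"
            f"Location: {self.region} / {self.zone}\n"
            f"Identifiers: project={self.project_id}, resource={self.resource}"
        )


# Example values are made up. The local time is given with a UTC-5 offset.
local = timezone(timedelta(hours=-5))
case = SupportCaseDetails(
    observed_at=datetime(2024, 11, 29, 9, 15, tzinfo=local),
    product="Compute Engine",
    action="creating VM instances fails",
    region="us-east1",
    zone="us-east1-b",
    project_id="example-project",
    resource="vm-checkout-01",
)
```

Calling `case.summary()` produces a block that starts with the time in UTC, which avoids the "earlier today" ambiguity described above.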
Google approaches troubleshooting as follows:
- Gather and share system observations.
- Create a hypothesis that explains the observations.
- Identify tests that might prove or disprove the hypothesis.
- Execute the tests and agree on the meaning of the result.
- Move on to the next hypothesis.
- Repeat until solved.
By working through these steps together and sharing information on progress, you can dramatically expedite case resolution.
Improving the data delivery process
Given that the first step in a new cycle is data collection, the last step during the execution stage is data delivery. Make sure you preserve metrics and logs, along with other deliverables created during the process. By doing so, you build a solid foundation on which to start the next cycle and begin the planning stage anew.
Conduct root cause analysis
Root cause analysis (RCA) focuses on identifying the lowest-level faults in a system such that, if a fault is removed, the system would function normally. We don't recommend performing RCA immediately during an outage, except for the data collection component. As the system is being brought back up, try to collect and preserve data that was generated while the event was occurring and immediately afterward. When system recovery is complete, perform RCA soon after to avoid data degradation or reduction in effort/emphasis due to time.
RCA means doing the following:
- Describe the problem encountered, leveraging the "five whys" technique.
- Detail a timeline for attribution of data to points in time during the system failure.
- Separate secondary factors from primary factors for the system failure. Try to identify the factors that are the root of the problem. Iterating on the factors helps to identify the lowest-level components involved in the failure.
Look at detection, alerting, and automatic mitigation opportunities:
- How long did it take to detect and diagnose the issue?
- Was all the telemetry available?
- Were there early warnings?
- Were alarms triggered? How soon? Did they contain sufficient information to be immediately understood and acted upon by the operations team?
- Were there too many dependency alarms? Is there a need for correlation and suppression?
- Could these alarms be connected to automatic mitigation actions?
Each incident should have its own investigation. Where incidents overlap, coordinate with Support to determine if there is a causal relationship. When the root cause is known, develop an actionable list of activities to help prevent future incidents. Identify resource, time, and cost efforts to mitigate and prevent the incident from reoccurring. If the incident took a system out of SLO, schedule recurrence prevention as appropriate.
Analyze predictions versus actuals
One of the earliest steps in the cycle is to collect data from prior events and start to use it as a basis for projecting future demand. By identifying differences, and their causes, you can help future peak-event planners. Understanding how business metrics affect SLOs will help identify how to increase system reliability.
Also review capacity plans. Areas where capacity fell below expected values might have required opening an incident to obtain additional capacity.
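A simple way to compare predictions with actuals is to compute the percentage error for each metric. The following is a minimal sketch. The metric names and numbers are invented for illustration.

```python
def forecast_errors(predicted, actual):
    """Return {metric: percent error}.

    A positive value means the actual figure exceeded the forecast.
    Metrics without an actual value, or with a zero forecast, are skipped.
    """
    errors = {}
    for metric, forecast in predicted.items():
        if metric in actual and forecast != 0:
            pct = (actual[metric] - forecast) / forecast * 100
            errors[metric] = round(pct, 1)
    return errors


# Illustrative numbers only.
predicted = {"peak_qps": 10000, "orders_per_min": 800, "cpu_cores": 2000}
actual = {"peak_qps": 12500, "orders_per_min": 760, "cpu_cores": 2000}
errors = forecast_errors(predicted, actual)
```

Large positive errors point to places where the forecast or the capacity plan should be revised for the next cycle. Negative errors can indicate over-provisioning.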
Aggregate the postmortem and retrospective
We recommend that you do a postmortem and retrospective process. In this process, all teams share details on any events that occurred and how the customer and Google systems behaved during the peak event. All teams review those experiences and fold them into future planning.
Gather the following information during the postmortem and retrospective process:
- Analysis of process and which planning stage steps were effective.
- Insight from different perspectives on what performed well during preparation and execution stages, along with feedback on improvements to both stages.
- Opportunities to publish best practices and solutions that can help other customers be successful with Google Cloud.
- Perspectives from engineering, operations, project management, business, and leadership. Where appropriate, reach out to customer constituencies to collect their experiences.
- Information both in written form and as discussed during meetings. Insights gained from meetings sometimes don't initially make the written narrative.
Acknowledge the cross-functional team members who helped manage a successful Black Friday. Celebrate success with team and individual recognition, rewarding those who expended tremendous energy.
When teams reflect on and appreciate success, it starts them on the road to recovering from a long and stressful process. Job satisfaction is a key factor for long-term team success. Enjoy a unique moment to pause, recall, and toast success.
Execution-stage deliverables include successful completion of the Black Friday event—no small measure—and the end of planning, preparing, and executing. The primary deliverables for this stage are data collection and analysis, including the following:
- Reviews of projections and actual data on metrics, SLOs, and capacity plans.
- Root cause analysis of incidents that occurred during the execution stage.
- Postmortem and retrospective, identifying areas that worked well and areas that could be improved (with specific actionable items).
No article of this magnitude is ever complete. To reduce anxiety before and during Black Friday, work within your company and with your Google Cloud team to identify questions and comments that are not addressed in this article. You might also be interested in exploring the following topics:
- Explore additional SRE Practices and create a culture of scalable production support.
- Review a Next 2018 presentation on YouTube that covers this content along with a Google customer's perspective.
- Read about Customer Reliability Engineering, Google's engagement model partnering with customers to support more reliable systems.
- Learn more about Cloud Load Balancing, which is used in several architecture design patterns.
- Share your stories with your colleagues and new team members. Help them learn from your successes (and failures) so the team can do a great job next time. Contribute your knowledge to the Google Cloud community.
- Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.