This document in the Google Cloud Architecture Framework shows you how to plan for peak traffic and launch events to avoid disrupting your business.
Peak events are major business-related events that cause a sharp increase in traffic beyond the application's standard baseline. These peak events require planned scaling.
For example, retail businesses with an online presence can expect peak events during holidays. Black Friday, which occurs the day after Thanksgiving in the United States, is one of the busiest shopping days of the year. For the healthcare industry in the United States, the months of October and November can have peak events due to spikes in online traffic for benefits enrollment.
Launch events are any substantial rollouts or migrations of new capability in production. Examples include a migration from on-premises to the cloud, or the launch of a new product, service, or feature.
If you are launching a new product, you should expect an increased load on your systems during the announcement and potentially after. These events can often cause load increases of 5–20 times (or greater) on frontend systems. The added frontend load raises the load on backend systems as well. Often, both frontend and backend loads scale up rapidly over a short time as the event opens for web traffic. After the peak, traffic trails off to normal levels, and this decrease is usually slower than the climb to peak.
Peak and launch events include three stages:
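To make the 5–20 times range concrete, the following is a minimal sketch of the arithmetic for sizing a frontend tier. The function name, the baseline of 2,000 requests per second, the per-instance capacity of 500 requests per second, and the 30% headroom are all hypothetical values chosen for illustration. Replace them with figures from your own load tests and business projections.

```python
def estimate_peak_instances(baseline_rps, peak_multiplier, rps_per_instance, headroom_pct=30):
    """Estimate how many instances are needed to serve a peak event.

    All inputs are integers supplied by the planner (hypothetical values here):
      baseline_rps      -- normal requests per second
      peak_multiplier   -- expected load increase, for example 5 to 20
      rps_per_instance  -- sustained capacity of one instance, from load tests
      headroom_pct      -- extra safety margin, as a percentage
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    # Use integer arithmetic and ceiling division so that partial
    # instances always round up to a whole instance.
    numerator = baseline_rps * peak_multiplier * (100 + headroom_pct)
    denominator = 100 * rps_per_instance
    return -(-numerator // denominator)


# Compare the low and high ends of the 5-20 times range.
low = estimate_peak_instances(2000, 5, 500)    # 26 instances
high = estimate_peak_instances(2000, 20, 500)  # 104 instances
print(f"5x peak: {low} instances, 20x peak: {high} instances")
```

The spread between the two results shows why the size of the multiplier matters for planning. Remember that backend systems need a similar estimate, because the frontend increase flows through to them.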
- Planning and preparation for the launch or peak traffic event
- Launching the event
- Reviewing event performance
The practices described in this document can help each of these stages run smoothly.
Create a general playbook for launch and peak events
Build a general playbook with a long-term view of current and future peak events. Keep adding lessons learned to the document, so it can be a reference for future peak events.
Plan for your launch and for peak events
Plan ahead. Create business projections for upcoming launches and for expected (and unexpected) peak events. Preparing your system for scale spikes depends on understanding your business projections. The more you know about prior forecasts, the more accurate you can make your new business forecasts. Those new forecasts are critical inputs into projecting expected demand on your system.
Establishing program management and coordinated planning—across your organization and with third-party vendors—is also a key to success. Get these teams set up early so that your program management team can set timelines, secure budgets, and gather resources for additional infrastructure, testing support, and training.
It's important to set up clear communication channels. Communication is critical for all stages of a launch or a peak event. Discuss risks and areas of concern early and swarm on issues before they become blockers. Create event planning documentation. Condense the most critical information about the peak event and distribute it. Doing so helps people absorb planning information and resolves basic questions. The document also helps bring new people up to speed on peak-event planning.
Document your plan for each event. As you document your plan, ensure that you do the following:
- Identify any assumptions, risks, and unknown factors.
- Review past events to determine relevant information for the upcoming launch or peak event. Determine what data is available and what value that data has provided in the past.
- Detail the rollback plan for launch and migration events.
- Perform an architecture review:
  - Document key resources and architectural components.
  - Use the Architecture Framework to review all aspects of the environment for risks and scale concerns.
  - Create a diagram of how major components are connected. A diagram can help you isolate issues faster and expedite resolution.
  - If appropriate, configure the service to use alert actions to auto-restart if there's a failure. When using Compute Engine, consider using autoscaling for handling throughput spikes.
- Identify metrics and alerts to track:
  - Identify business and system metrics to monitor for the event. If any metrics or service level indicators (SLIs) aren't being collected, modify the system to collect this data.
  - Ensure you have sufficient monitoring and alerting capabilities and have reviewed normal and previous peak traffic patterns. Ensure alerts are set appropriately.
  - Ensure system metrics are being captured with monitoring and alert levels of interest.
- Review your increased capacity requirements with your Google Cloud account team and plan for the required quota management. For more details, review Ensure your quotas match your capacity requirements.
- Ensure you have appropriate cloud support levels, your team understands how to open support cases, and you have an escalation path established. For more details, review Establish cloud support and escalation processes.
- Define a communication plan, timeline, and responsibilities:
  - Engage cross-functional stakeholders to coordinate communication and program planning. These stakeholders can include appropriate people from technical, operational, and leadership teams, and third-party vendors.
  - Establish an unambiguous timeline containing critical tasks and the teams that own them.
  - Establish a responsibility assignment matrix (RACI) to communicate ownership for teams, team leads, stakeholders, and responsible parties.
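As one illustration of the quota review mentioned in the list above, the following sketch compares projected peak usage against current quota limits and flags anything that needs attention. It is an assumption-laden example, not a Google Cloud API: the function name, the resource labels, the numbers, and the 80% utilization threshold are all hypothetical, and the limits are assumed to be entered by hand from your quota pages or from your account team.

```python
def find_quota_gaps(projected_peak, quota_limits, max_utilization=0.8):
    """Return resources whose projected peak usage is too close to quota.

    projected_peak  -- dict of resource label -> projected peak usage
    quota_limits    -- dict of resource label -> current quota limit
    max_utilization -- fraction of quota you are willing to consume at peak

    Each result is a tuple (resource, projected, limit). The limit is None
    when no quota value was recorded, which is also worth investigating.
    """
    gaps = []
    for resource, projected in projected_peak.items():
        limit = quota_limits.get(resource)
        if limit is None:
            gaps.append((resource, projected, None))
        elif projected > limit * max_utilization:
            gaps.append((resource, projected, limit))
    return gaps


# Hypothetical labels and numbers, for illustration only.
projected = {"vm_cpus": 900, "external_ips": 40, "ssd_gb": 5000}
limits = {"vm_cpus": 1000, "external_ips": 100}

for resource, needed, limit in find_quota_gaps(projected, limits):
    status = "no limit recorded" if limit is None else f"limit {limit}"
    print(f"{resource}: projected {needed}, {status}")
```

Running a check like this well before the event leaves time to request quota increases, which is the point of raising capacity requirements early.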
Establish review processes
When the peak traffic event or launch event is over, review the event to document the lessons you learned. Then, update your playbook with those lessons. Finally, apply what you learned to the next major event. Learning from prior events is important, especially when they highlight constraints to the system while under stress.
Retrospective reviews, also called postmortems, are a useful technique for capturing data and understanding the incidents that occurred during peak traffic events or launch events. Do this review for events that went as expected as well as for any incidents that caused problems. As you do this review, foster a blameless culture.
For more information about postmortems, see Postmortem Culture: Learning from Failure.
- Create a culture of automation (next document in this series)
- Explore other categories in the Architecture Framework, such as system design; security, privacy, and compliance; reliability; cost optimization; and performance optimization.