Peak performance: How retailers used Google Cloud during Black Friday/Cyber Monday
Sr. Director, Cloud Support
At Google Cloud, we work with businesses in a range of industries, and we’ve seen nearly every one of them experience peak events when their online traffic skyrockets. For retailers, the peak events are Black Friday and Cyber Monday (BFCM), the period right after Thanksgiving in the U.S. The weekend kicks off the all-important holiday shopping season of November and December, when an estimated 20% of all annual retail sales occur.
During an average day, online retail sales in the U.S. total about $1.4 billion, CNET reports. In contrast, on Black Friday 2018, U.S. online sales totaled $6.22 billion (up 24% from 2017). Cyber Monday 2018 sales surged to $7.9 billion (up 19% from 2017)—the biggest online sales day ever in the U.S., according to Adobe Analytics.
Traffic to retailers’ mobile and shopping apps surges to levels unmatched during the rest of the year, and availability or scalability issues can result in millions of dollars of lost sales. Every year, there are well-publicized retail website crashes, so avoiding downtime—along with the accompanying reputation damage, unhappy customers and stressed, overworked IT teams—is particularly important for retailers.
We know that a solid technology infrastructure is the foundation for retailers to stay ahead of demand and succeed during this busy season. Beyond that, though, support for that infrastructure is essential. Support isn’t just activated if something goes wrong. Support for an event like Black Friday and Cyber Monday involves preparation well ahead of time, and includes testing, architecture reviews, capacity planning, operational drills, and war rooms during the event itself. We took a prescriptive approach to BFCM support, setting expectations and ownership early (more than six months ahead), to understand what each retail customer needed, both on their side and from our team.
We’ll go through the steps that helped our retail customers have a fruitful and disaster-free season. These steps can generally help you prepare for your own peak event. We’ll also describe how one large-scale retail platform in particular—Shopify—had a successful BFCM using Google Cloud.
Preparing to support retailers on Black Friday/Cyber Monday
We started planning for Black Friday and Cyber Monday for our retail customers in the spring of 2018 to align with their typical preparation timeline. We formed a task force composed of representatives from Google Cloud’s Professional Services, Customer Engineering, Support, Customer Reliability Engineering (CRE), and Product and Engineering teams. We met regularly to strategize, develop tactics, and execute on those tactics with the goal of making sure Google team members and our GCP retail customers were well-prepared.
We focused on a few key technology areas where planning could help prevent any issues.
1. Early capacity planning
As early as May 2018, our account teams began reaching out to GCP retail customers. We discussed high-level planning, such as their particular holiday shopping objectives and the infrastructure capacity they might need to meet those goals.
We worked closely with retailers to review their architectures and advise on techniques to forecast and plan for increases in capacity before Black Friday, since scalability is essential when planning for traffic spikes. We conducted tests across teams and services, and stress-tested systems to uncover any constraints or weaknesses and remediate as needed. Those tailored preparations paid off across the board. With GCP capacity status firmly green—available—throughout Black Friday and Cyber Monday, shoppers visiting our retail customers’ sites could make their purchases without running into a slow or unresponsive site.
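As a rough illustration of the capacity forecasting described above, here is a minimal sketch that turns last year's peak, an expected growth factor, and a safety buffer into a provisioning target. The function names and every number in it are hypothetical, chosen only for illustration; they are not figures from any retailer and not part of any GCP API.

```python
import math


def required_capacity(last_peak_rps: float, growth_factor: float, headroom: float) -> float:
    """Return the requests per second to provision for.

    last_peak_rps: highest sustained load observed during the previous peak event.
    growth_factor: expected year-over-year multiplier (an assumption you must forecast).
    headroom: extra safety buffer as a fraction, e.g. 0.25 for 25%.
    """
    if last_peak_rps < 0 or growth_factor <= 0 or headroom < 0:
        raise ValueError("inputs must be non-negative, and growth_factor must be positive")
    return last_peak_rps * growth_factor * (1 + headroom)


def instances_needed(target_rps: float, rps_per_instance: float) -> int:
    """Round up to whole instances, given the load one instance handled in stress tests."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(target_rps / rps_per_instance)


if __name__ == "__main__":
    # Hypothetical inputs: 50,000 rps last year, 1.8x growth, 25% buffer.
    target = required_capacity(50_000, 1.8, 0.25)
    print(f"provision for about {target:,.0f} rps")
    print(f"instances at 400 rps each: {instances_needed(target, 400)}")
```

The per-instance throughput should come from your own stress tests rather than a vendor figure, since that is where constraints and weaknesses tend to show up.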
2. Reliability testing
Identifying potential reliability issues in a “pre-mortem” (an important component of CRE) was another preemptive step we took. Early on, our CRE team partnered with our retail customers to analyze the reliability of their infrastructures and run through tabletop exercises to see how well prepared each customer was in the face of a failure. In some cases, the Professional Services team helped perform load testing to make sure retailers’ platforms could handle expected levels of peak traffic, and in others we encouraged regular load testing and evaluation. Given how important mobile commerce has become, we tested the performance and reliability of customers’ mobile apps. We also employed Apigee’s API monitoring tools to ensure API stability. We’ve seen APIs become more important in retail technology, since they allow more flexible, microservice-based e-commerce sites.
3. Operational war rooms
“What could possibly go wrong?”
That’s the million-dollar question to ask before a big IT event. We got together with our retail customers’ IT and engineering teams to explore and test for possible worst-case scenarios, like an entire site crash. We created a central Black Friday/Cyber Monday war room staffed with senior-level, experienced Googlers from the Professional Services, Support, and Site Reliability Engineering (SRE) teams. This team of first responders was prepared to use real-time communications to stay connected and address any problems as soon as they arose. This was in addition to understanding customer and vendor integrations and making sure escalation paths were defined ahead of time, so that customer expectations were clear for various channels.
During that weekend, we doubled the number of on-call support staff available to retail customers. In some cases, we placed account teams on-site at GCP and Apigee retail customer locations to help as needed. We monitored whether any retail customers were starting to have reliability or latency problems. If something needed to be triaged, the war room team kicked into action, tackling issues and advising on next steps. The Google war room team also had direct, open access to Google engineers and executives for additional support.
Apigee team members kept a close eye on API traffic during the Black Friday period. The number of API calls for Apigee’s customers (excluding those who host the platform on-premises) grew 95% compared to the same span of time in 2017. Peak API traffic running through Apigee more than doubled, from 48,000 transactions per second (TPS) to 108,000 TPS this year, and the platform remained 99.999% available.
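To put those figures in perspective, the short sketch below works through the arithmetic: the percentage increase in peak TPS, and how little downtime a 99.999% availability figure allows. The four-day window (Friday through Monday) is an assumption made here for illustration; the measurement period behind the availability figure is not stated above.

```python
def growth_pct(before: float, after: float) -> float:
    """Percentage increase from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100


def allowed_downtime_seconds(availability_pct: float, window_seconds: float) -> float:
    """Downtime permitted by an availability percentage over a time window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return window_seconds * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Peak TPS figures quoted in the text: 48,000 rising to 108,000.
    print(f"peak TPS growth: {growth_pct(48_000, 108_000):.0f}%")

    # Assumed window: four full days, Friday through Monday.
    four_days = 4 * 24 * 60 * 60
    downtime = allowed_downtime_seconds(99.999, four_days)
    print(f"downtime allowed at 99.999% over four days: {downtime:.2f} s")
```

Under that assumed window, "five nines" leaves only a few seconds of unavailability for the entire weekend.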
How retailers sailed through Black Friday and Cyber Monday
One of our retail partners, Shopify, is an e-commerce platform supporting more than 600,000 independent retailers. The complexity of managing all those storefronts makes predicting holiday site traffic and sales spikes even more challenging. Shopify provides a platform with 99.98% uptime, and calls BFCM their annual “World Cup” event.
Shopify’s platform is made up of many internal services and interaction points with third-party providers, such as payment gateways and shipping carriers. Each of those dependencies has to be reliable and perform well for BFCM to go off without a hitch.
In 2017, on Black Friday and Cyber Monday, only about 10% of Shopify’s stores ran on GCP. The rest were hosted in Shopify’s own data center. In 2018, Shopify went all-in on GCP as its infrastructure provider, with 100% of its retailers running on our platform.
Shopify Production Engineers began working side-by-side with Google’s BFCM team months before the holiday shopping season. We collaborated on capacity planning so Shopify would have the right capacity buffer needed to accommodate an even bigger peak load than they had in 2017, and helped diagnose and fix potential performance problems, such as network latency.
During the rest of the year, our Shopify account team stayed highly engaged with Shopify engineers on Slack, Google Hangouts Chat, and other real-time communications tools. For Black Friday and Cyber Monday, we increased our communication further and dispatched Googlers to Shopify’s own war room in Toronto.
“As we went into BFCM 2018, we no longer had data center capacity to fall back on,” says Camilo Lopez, Director of Production Engineering at Shopify. “But we were confident that with Google Cloud, we had the extra support and strong technology foundation needed for a successful Black Friday and Cyber Monday. The big event came and went without incident. Our merchants collectively sold over $1.5 billion USD in merchandise that weekend, up from $1 billion in 2017.”
This BFCM weekend was a record breaker for Shopify, with a peak of nearly 11,000 orders created per minute and around 100,000 requests per second being served for extended periods during the weekend. Overall, most system metrics followed a pattern of 1.8 times what they were in 2017.
Cloud planning and support make for stress-free events
By following the above strategies, you can be ready for whatever comes your way, whether it’s a huge, unanticipated traffic spike or a major uptick in sales you count on every year. And that brings benefits for customers and your IT teams. After this year’s successful BFCM, a staff member from one of our newer retailers sent us a note of thanks and remarked that 2018 was the first time in years that he was able to enjoy Thanksgiving dinner with his family.
To achieve your own low-stress peak events, plan and prepare before the event. Consider how your service might fail, how you’d detect these failures, and how you’d react to them. Perform tests to find potential weaknesses. Choose good measures of your customers’ experience, and closely monitor your infrastructure during the event. Do a post-mortem immediately afterwards to make the next big event even smoother. Find out more here on adopting these strategies for your organization.
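One possible way to make "good measures of your customers’ experience" concrete is a request success ratio tracked against a target. The sketch below is only an illustration under that assumption: the `WindowCounts` type, the function names, the 99.99% target, and the sample counts are all hypothetical, and the error-budget framing is a common reliability practice rather than something prescribed above.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts observed in one monitoring window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(counts: WindowCounts) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if counts.total == 0:
        return 1.0
    return (counts.total - counts.failed) / counts.total


def error_budget_remaining(counts: WindowCounts, slo_target: float) -> float:
    """Fraction of the allowed failures still unused; negative means the target was missed."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if counts.total == 0:
        return 1.0
    allowed_failures = (1 - slo_target) * counts.total
    return 1 - counts.failed / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: one million requests, 50 of which failed.
    window = WindowCounts(total=1_000_000, failed=50)
    print(f"availability: {availability_sli(window):.5%}")
    print(f"error budget left: {error_budget_remaining(window, 0.9999):.0%}")
```

A shrinking budget during the event is a signal to escalate early, before customers notice a problem.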
And of course, our GCP support team is here to help during these events, both planned and unplanned. If you have a large event where we can help, get in touch with your Technical Account Manager, or your Google Cloud account team.