DevOps Awards winner Sabre on nurturing team culture

July 14, 2023

Diwakar Pandrangi

Director Site Reliability Engineering

Dominic Briggs

SVP Site Reliability Engineering

Sabre is a leading software and technology company that powers the global travel industry. With decades of revolutionary firsts, its team of experts drives innovation and ingenuity across the travel ecosystem. Sabre partners with airlines, hoteliers, agencies, and other travel partners to retail, distribute, and fulfill travel. In this blog post, we’re highlighting Sabre for the DevOps achievements that earned them the ‘Nurturing team culture’ award in the 2022 DevOps Awards. If you want to learn more about the winners and how they used DORA metrics and practices to grow their businesses, start here.

As a leader in the travel technology industry, Sabre is constantly innovating and growing, but legacy systems were keeping us from realizing the full benefits of the cloud. Our largest product, Air Offer, is a robust platform that handles millions of flight calculations and shopping requests per second, generating trillions of flight solutions monthly and integrating 7.5 petabytes of data into big data platforms. Air Offer was initially built in the 90s, and while we had shifted it to the cloud, the inherent complexity in the system made it unwieldy and potentially unreliable. This system was too big to fail, but that also meant it was almost too big to change.

The biggest challenge this presented was scalability. Traffic for services like Air Shopping was doubling every year, and as shopping algorithms became more complex, so did the computational needs of adapting to changing business dynamics and airline algorithms. Keeping up with these changes was a team effort, but to make lasting improvements, we needed to couple that effort with better infrastructure, tools, and best practices.

Multi-optimization problem

The company decided we needed a solution that could improve reliability, optimize cost, and increase agility while keeping performance and quality consistent and competitive. Our customers expected quick results when searching for travel deals, and we wanted to improve stability to the point where we could introduce marketing campaigns like flash sales. This would require both cloud mastery and DevOps adoption.

The three main high-level goals the company looked to accomplish with this digital transformation were to:

Improve the velocity and quality of the software and services we provide to our customers
Become more innovative by partnering with Google Cloud from a technology and travel perspective
Improve reliability and security while reducing operational costs

Solution

To realize these goals, we worked closely with Google Cloud to transform our system and company culture to make better use of the cloud. With the help of Google Cloud, we worked through the core network and security designs to find options that gave our product teams a flexible base to build from. We then looked at specific designs and implementations to match key products like Air Shopping with the Google Cloud services that meet its scalability requirements.

The solution we built involved a mix of public clouds and the data center where we host our applications. This approach took into account the application dependencies and latencies that come from cross-application communications, ensuring a better customer experience. For example, we moved Air Offer from two data centers that were overprovisioned for redundancy to a model where it is spread across four regions, using data insights and the flexibility of the cloud to manage capacity and distribution. This allowed the company to optimize performance and cost in real time while increasing reliability.
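To illustrate why spreading a workload across more sites can reduce overprovisioning, here is a minimal Python sketch. It assumes the redundancy goal is to survive the loss of one site at peak load, which is a simplification for illustration rather than Sabre's stated design. The function names and the peak load of 100 units are hypothetical.

```python
def capacity_per_site(peak_load: float, sites: int, tolerated_failures: int = 1) -> float:
    """Capacity each site needs so the surviving sites can still carry peak_load."""
    surviving = sites - tolerated_failures
    if surviving <= 0:
        raise ValueError("at least one site must survive")
    return peak_load / surviving


def total_provisioned(peak_load: float, sites: int, tolerated_failures: int = 1) -> float:
    """Total capacity provisioned across all sites (assumes equally sized sites)."""
    return sites * capacity_per_site(peak_load, sites, tolerated_failures)


if __name__ == "__main__":
    peak = 100.0  # hypothetical peak load, in arbitrary units
    for sites in (2, 4):
        total = total_provisioned(peak, sites)
        print(f"{sites} sites: {total:.1f} units provisioned for {peak:.0f} units of peak load")
```

Under these assumptions, two sites must provision twice the peak load in total, while four sites need roughly one third more than peak.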

Since scale was a major issue going into this digital transformation, we decided the best solution would be to focus on autoscaling. With standard autoscaling, managed instance groups (MIGs) are monitored for autoscaling signals, in this case CPU utilization, to determine when a group is experiencing excess demand and needs to start up more servers to handle it. But this wasn’t a one-size-fits-all solution. For more complicated applications like Air Shopping, where servers take several minutes to start up due to data caching requirements, we turned to the predictive autoscaling feature in Compute Engine. This feature keeps extra compute ready to handle periods of high traffic and optimizes compute usage without sacrificing customer experience. Solutions like this are invaluable during Black Friday and Cyber Monday, when the company must be ready for the highest demand of the year but doesn’t want to waste money on overprovisioning.
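The difference between the two approaches can be sketched in a few lines of Python. This is a simplified model and not Compute Engine's actual algorithm: the reactive function uses a generic target-tracking rule on CPU utilization, and the predictive function sizes the group for the load forecast at the moment newly started servers would become ready. All names, the forecast, and the numbers are hypothetical.

```python
import math
from typing import Callable


def reactive_size(current_instances: int, cpu_percent: float, target_percent: float) -> int:
    """Target tracking: resize so average CPU returns to the target level."""
    return max(1, math.ceil(current_instances * cpu_percent / target_percent))


def predictive_size(forecast: Callable[[int], float], now_min: int,
                    startup_min: int, load_per_instance: float) -> int:
    """Size for the load expected once servers started now are ready to serve."""
    expected_load = forecast(now_min + startup_min)
    return max(1, math.ceil(expected_load / load_per_instance))


def example_forecast(minute: int) -> float:
    """Hypothetical forecast: load triples at minute 60 (for example, a flash sale)."""
    return 1000.0 if minute < 60 else 3000.0


if __name__ == "__main__":
    # Reactive: 10 servers running at 90% CPU against a 60% target.
    print(reactive_size(current_instances=10, cpu_percent=90, target_percent=60))
    # Predictive: at minute 55, with a 10-minute startup, size for minute 65.
    print(predictive_size(example_forecast, now_min=55, startup_min=10, load_per_instance=100.0))
```

The reactive rule only responds after utilization has already risen, so slow-starting servers arrive late. The predictive rule looks ahead by the startup time, which is why it suits applications with long warm-up periods.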

Our company also drove optimization with Compute Engine using spot and preemptible VMs. These gave us the ability to adjust our blend of compute and flexible instances across regions to get the optimal pricing.

The migration of the first workloads took roughly 15 months, starting with Air Shopping. During this time, we learned new ways to work, including implementing a secure CI/CD pipeline and adopting the cloud-native concept of Infrastructure-as-Code (IaC) alongside autoscaling and cost management. Once the team was comfortable with the process and started establishing best practices, bringing in other regions to handle compute became much quicker and smoother.

A cross-functional team, including Site Reliability Engineers and platform and software engineers, was established to migrate workloads into Google Cloud. In keeping with Westrum’s definition of a good culture, the team worked toward a shared goal: they removed workflow bottlenecks, addressed blockers directly in standups, and reduced miscommunication. This cooperative approach fostered team pride and efficiency. Ultimately, we built a team of experts across the organization to assist other teams with similar migrations. Success begets success!

Results

This combination of multi-region deployment for high reliability, predictive autoscaling for more efficient resource consumption, and spot and preemptible VMs to optimize cost gave the company the flexibility and continuity we needed to meet customer needs. And, in meeting these needs, we also saw quantifiable improvements to our own business.

Predictive autoscaling specifically has yielded about 10% greater benefit than basic autoscaling, equaling a projected savings of $3M in 2023 based on projected shipping costs. By knowing how long it takes for a server to start up, the predictive autoscaling logic starts the servers slightly in advance of when they would be needed. This saved the business from having to run around 50% more servers than necessary all day just to cover peak traffic periods, as we would without autoscaling, or the extra 10% we needed with basic autoscaling. Additionally, by moving all workloads to Google Cloud with one CI/CD pipeline to deliver changes, Sabre greatly reduced the time to deploy while improving cycle times for new feature releases.
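As a rough worked example of these percentages, the Python sketch below compares the three modes against a hypothetical baseline of 1,000 servers actually needed. It assumes predictive autoscaling runs close to that baseline, which is our reading of the figures above rather than a measured result. The function names are hypothetical.

```python
def servers_run(needed: float, overhead_percent: float) -> float:
    """Servers actually run when provisioning overhead_percent above what is needed."""
    return needed * (100 + overhead_percent) / 100


def saving_percent(before: float, after: float) -> float:
    """Relative reduction in servers when moving from one mode to another."""
    return (before - after) * 100 / before


if __name__ == "__main__":
    needed = 1000.0  # hypothetical baseline, not a Sabre figure
    static = servers_run(needed, 50)      # no autoscaling: about 50% extra all day
    basic = servers_run(needed, 10)       # basic autoscaling: about 10% extra
    predictive = servers_run(needed, 0)   # assumption: predictive tracks demand closely
    print(f"static -> predictive: {saving_percent(static, predictive):.1f}% fewer servers")
    print(f"basic  -> predictive: {saving_percent(basic, predictive):.1f}% fewer servers")
```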

Before the workload moved to Google Cloud, it took approximately 8-10 hours per release and per location to deploy one of the big applications. We were operating across four locations, and after adopting a single CI/CD pipeline we have been able to reduce this by 50%, to 4 hours per release per region.
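The arithmetic behind these deployment figures can be checked with a short Python sketch. The per-location hours and the count of four locations come from the text above. The total across locations assumes they are deployed one after another, which is our assumption and may not match how releases were actually rolled out.

```python
LOCATIONS = 4  # from the text above


def reduction_percent(before_hours: float, after_hours: float) -> float:
    """Percentage reduction in deployment time per location."""
    return (before_hours - after_hours) * 100 / before_hours


def total_hours(hours_per_location: float, locations: int = LOCATIONS) -> float:
    """Total hours per release if locations are deployed sequentially (an assumption)."""
    return hours_per_location * locations


if __name__ == "__main__":
    for before in (8, 10):
        print(f"{before}h -> 4h: {reduction_percent(before, 4):.0f}% faster per location, "
              f"{total_hours(before)}h -> {total_hours(4)}h in total")
```

The 50% figure matches the low end of the 8-10 hour range. Measured from 10 hours, the same 4-hour deployment is a 60% reduction.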

Between the technology and DORA research principles Google Cloud helped us introduce, we saw notable improvements in the five essential characteristics of cloud computing:

Rapid elasticity: To handle spikes in traffic, managed instance groups (MIGs) can scale to several hundred servers in minutes
On-demand self-service: Sabre’s teams can reserve capacity and adjust quotas at a project or org level to accommodate changing demand at the click of a button
Broad network access: Google Cloud simplifies infrastructure by giving teams easy access to computing resources, reports, and alerting across a variety of devices
Measured service: With Google Cloud operations and other tools, Sabre can meter and scale resources to optimize user experience while tracking cost
Resource pooling: By using cloud-native capabilities like autoscaling and separating resource pools by client type, Sabre gives customers the ability to scale and use resources within their contracted volumes without human intervention

Stay tuned for the rest of the series highlighting the DevOps Award Winners and read the 2022 State of DevOps report to dive deeper into the DORA research.
