DevOps Awards winner Priceline on leveraging a loosely coupled architecture
Michael Spoonauer
Senior Software Engineer
Priceline has been a leader in online travel for twenty-five years, offering hotels, flights, and more. Priceline's proprietary deals technology pairs negotiation with innovation, analyzing billions of data points to generate deep discounts that customers can't find anywhere else. In fact, Priceline has saved customers over $15B on travel. In this blog post, we're highlighting Priceline for the DevOps achievements that earned them the 'Leveraging Loosely Coupled Architecture' award in the 2022 DevOps Awards. If you want to learn more about the winners and how they used DORA metrics and practices to grow their businesses, start here.
Priceline was created to help customers find the best travel deals exactly when they need them, so any delay or disruption in deal search not only potentially impacts sales but also leaves travelers with fewer options. This is why our company realized that performing periodic regional failovers (for maintenance, troubleshooting, or simply stress-testing a region) was causing delays: our auto-scalable compute resources could not scale up quickly enough in response to their own changing performance metrics. These delays resulted in the following issues:
Latency and even intermittent search failures
Hesitancy to perform regional failovers unless necessary
Overhead in manually scaling up and then back down Google resources to compensate
Increased costs from over-provisioning resources to avoid customer issues
Solving for the future
To reduce any delays in regional failovers, Priceline’s technology leadership realized that we needed a DevOps transformation with participation from both executive leadership and individual practitioners. Specifically, we needed a way to make sure that there would be enough capacity to facilitate large amounts of platform traffic shifting from region to region while also ensuring that we only paid for the capacity being used.
This goal required a solution that was:
Dynamic
Responsive
Configurable
API-driven
Resource-aware
Efficient cluster management
To address these challenges and improve platform stability, we partnered with Google Cloud to find, implement, and validate a solution based on DevOps Research and Assessment (DORA) research. DORA is the largest and longest-running research program of its kind, seeking to understand the capabilities that drive software delivery and operations performance. Beyond the research itself, Google Cloud helped our teams apply DORA capabilities, which led to improved organizational performance.
In order to compensate for auto-scalable compute resources that couldn’t scale up quickly enough based on metrics alone, we implemented a mechanism that leverages two separate maximizer components working in tandem through a loosely coupled architecture.
The first component is a Python-based Bigtable maximizer that can optimize clusters before platform traffic becomes an issue. Using the Google Bigtable APIs, this maximizer finds the current minimum and maximum node count of each Bigtable cluster in a specified project and region. It can then raise each cluster's minimum node count to temporarily match its maximum, and, with a subsequent command, restore the original minimum.
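The core of that maximize-then-restore cycle can be sketched as a pair of pure functions. This is a hypothetical illustration, not Priceline's actual code: the dictionary shape and function names are assumptions, and in the real tool these node-count bounds would be read and written through the Cloud Bigtable Admin API rather than passed in as plain data.

```python
def maximize(limits):
    """Pin each cluster's autoscaling minimum to its maximum.

    `limits` maps cluster name -> (min_nodes, max_nodes), as read from
    the Bigtable Admin API. Returns the new limits plus the saved
    original minimums so the change can be undone later.
    """
    saved = {name: lo for name, (lo, hi) in limits.items()}
    new = {name: (hi, hi) for name, (lo, hi) in limits.items()}
    return new, saved


def restore(limits, saved):
    """Put each cluster's minimum back to its saved original value."""
    return {name: (saved[name], hi) for name, (_, hi) in limits.items()}
```

Keeping the decision logic separate from the API calls like this is one way to make such a tool easy to test: the maximize/restore round trip can be verified without touching a live project.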
The second component, a Python-based Google Kubernetes Engine (GKE) deployment maximizer, uses the Kubernetes API to find the maximum number of replicas for each Horizontal Pod Autoscaler (HPA) object in a GKE cluster. It then sets each HPA's minimum replica count to match that maximum, so that each HPA is temporarily maximized to handle the influx of traffic from the opposite region. As with the Bigtable maximizer, a subsequent command restores each HPA's minimum replica count to its original value. The GKE deployment maximizer also lets our teams choose what to maximize by specifying individual HPA objects, entire namespaces within an environment and region, or multiple clusters, namespaces, and HPA objects designated within a JSON file.
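The selection-and-patch step described above can be sketched in the same style. Again, this is a hedged illustration rather than Priceline's implementation: the field names and filter arguments are assumptions, and in the real tool the HPA objects would be listed and patched through the Kubernetes autoscaling API.

```python
def build_patches(hpas, namespaces=None, names=None):
    """Compute minReplicas patches that pin each selected HPA to its max.

    `hpas` is a list of dicts with namespace, name, min_replicas, and
    max_replicas (as listed from the Kubernetes API). Optional filters
    select entire namespaces or individual HPA names; with no filters,
    every HPA in the cluster is selected. Returns the patches plus the
    original minimums so a later command can restore them.
    """
    patches, originals = {}, {}
    for hpa in hpas:
        if namespaces and hpa["namespace"] not in namespaces:
            continue
        if names and hpa["name"] not in names:
            continue
        key = (hpa["namespace"], hpa["name"])
        patches[key] = hpa["max_replicas"]
        originals[key] = hpa["min_replicas"]
    return patches, originals
```

The JSON-file mode mentioned above would then just be a matter of loading a list of cluster/namespace/HPA selectors and calling this per cluster.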
With this two-pronged method, we can automatically and seamlessly maximize clusters and deployments before they receive platform traffic, and then, after a brief pause, scale them back down so that they do not incur expenses for capacity they don't need. This lets our teams mitigate issues immediately and reliably, with logs that they can investigate after the fact, virtually eliminating any maintenance burden or risk of misconfiguration that could affect production.
Results
By optimizing for speed without sacrificing stability, Priceline teams could confidently schedule regional failovers without concern that they would disrupt customer searches. This has led to measurable improvements, as shown by significant gains across DORA's four key metrics:
Deployment frequency: With unified, reusable CI/CD pipelines built for both GKE and Compute Engine-based applications, we could deploy applications to production multiple times a day if necessary, with the added protection of built-in rollback capabilities. This stability has led to at least 70% more frequent deployments.
Lead time for changes: We have achieved a 30% reduction in the time needed to perform changes with regional failover.
Change failure rate: Migrating to the cloud enabled a uniform CI/CD process with multiple gates to ensure that teams follow a proper, repeatable process with necessary testing and without configuration drift. Now that issues are discovered and mitigated earlier in the software development lifecycle, teams spend at least 90% less time investigating possible production incidents.
Time to restore service: By taking advantage of templated, customized APM alerts, we could learn about and respond to anomalous behavior, application performance drops, and outright failures in real time. This has reduced the duration of regional failovers and failbacks by at least 43%, with most production failure recoveries done in minutes.
Stay tuned for the rest of the series highlighting the DevOps Award Winners and read the 2022 State of DevOps report to dive deeper into the DORA research.