Wayfair: Accelerating MLOps to power great experiences at scale
Vinay Narayana
Head of ML Engineering at Wayfair
Bas Geerdink
Lead ML Engineer at Wayfair
Machine Learning (ML) is part of everything we do at Wayfair to support each of the 30 million active customers on our website. It enables us to make context-aware, real-time and intelligent decisions across every aspect of our business. We use ML models to forecast product demand across the globe, to ensure our customers can quickly access what they’re looking for. Natural language processing (NLP) models are used to analyze chat messages on our website so customers can be redirected to the appropriate customer support team as quickly as possible, without having to wait for a human assistant to become available.
ML is an integral part of our strategy for remaining competitive as a business and supports a wide range of eCommerce engineering processes at Wayfair. As an online furniture and home goods retailer, the steps we take to make the experience of our customers as smooth, convenient, and pleasant as possible determine how successful we are. This vision inspires our approach to technology and we’re proud of our heritage as a tech company, with more than 3,000 in-house engineers and data scientists working on the development and maintenance of our platform.
We’ve been building ML models for years, as well as other homegrown tools and technologies, to help solve the challenges we’ve faced along the way. We began on-prem but decided to migrate to Google Cloud in 2019, utilizing a lift-and-shift strategy to minimize the number of changes we had to make to move multiple workloads into the cloud. Among other things, that meant deploying Apache Airflow clusters on the Google Cloud infrastructure and retrofitting our homegrown technologies to ensure compatibility.
While some of the challenges we faced with our legacy infrastructure were resolved immediately, such as lack of scalability, others remained for our data scientists. For example, we lacked a central feature store and relied on a shared cluster with a shared environment for workflow orchestration, which caused noisy neighbor problems.
As a Google Cloud customer, however, we can easily access new solutions as they become available. So in 2021, when Google Cloud launched Vertex AI, we didn’t hesitate to try it out as an end-to-end ML platform to support the work of our data scientists.
One AI platform with all the ML tools needed
As big fans of open source, platform-agnostic software, we were impressed by Vertex AI Pipelines and how they work on top of open-source frameworks like Kubeflow. This enables us to build software that runs on any infrastructure. We enjoyed how the tool looks, feels, and operates. Within six months, we moved from configuring our infrastructure manually to conducting a POC, to a first production release.
Next on our priority list was to use Vertex AI Feature Store to serve ML features in real time or in batch with a single line of code. Vertex AI Feature Store fully manages and scales its underlying infrastructure, such as storage and compute resources. That means our data scientists can now focus on feature computation logic, instead of worrying about the challenges of storing features for offline and online usage.
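To make the online versus batch distinction concrete, here is a minimal in-memory sketch of the idea. It is not the Vertex AI Feature Store API: the class `ToyFeatureStore`, its methods, and the feature name `views_7d` are all hypothetical and exist only for illustration.

```python
class ToyFeatureStore:
    """In-memory stand-in for a feature store (hypothetical, not the Vertex AI API)."""

    def __init__(self):
        # (entity_id, feature_name) -> list of (timestamp, value) records
        self._history = {}

    def write(self, entity_id, feature, value, timestamp):
        self._history.setdefault((entity_id, feature), []).append((timestamp, value))

    def _latest(self, entity_id, feature, as_of=None):
        # Most recent value at or before `as_of`; None means "no time limit".
        records = self._history.get((entity_id, feature), [])
        if as_of is not None:
            records = [r for r in records if r[0] <= as_of]
        if not records:
            return None
        return max(records, key=lambda r: r[0])[1]

    def read_online(self, entity_id, features):
        """Online path: freshest values for a single entity at request time."""
        return {name: self._latest(entity_id, name) for name in features}

    def read_batch(self, entity_ids, features, as_of):
        """Batch path: point-in-time values for many entities, e.g. for training."""
        return [
            {"entity_id": eid, **{name: self._latest(eid, name, as_of) for name in features}}
            for eid in entity_ids
        ]


store = ToyFeatureStore()
store.write("sku-1", "views_7d", 120, timestamp=1)
store.write("sku-1", "views_7d", 150, timestamp=2)
store.write("sku-2", "views_7d", 30, timestamp=1)

online = store.read_online("sku-1", ["views_7d"])
batch = store.read_batch(["sku-1", "sku-2"], ["views_7d"], as_of=1)
```

The point of the sketch is that the same stored feature feeds both paths: a real-time lookup returns the freshest value, while a batch read can reconstruct what was known at an earlier point in time. A managed service takes care of the storage and scaling that this toy version ignores.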
While our data scientists are proficient in building and training models, they are less comfortable setting up the infrastructure and bringing the models to production. So, when we embarked on an MLOps transformation, it was important for us to enable data scientists to leverage a platform as seamlessly as possible without having to know all about its underlying infrastructure. To that end, our goal was to build an abstraction on Vertex AI. Our simple Python-based library interacts with Vertex AI Pipelines and Vertex AI Feature Store, so a typical data scientist can leverage this setup without having to know how Vertex AI works in the backend. That’s the vision we’re marching towards, and we’ve already started to notice its benefits.
Reducing hyperparameter tuning from two weeks to under one hour
While we enjoy using open source tools such as Apache Airflow, the way we were using it was creating issues for our data scientists. And we frequently ran into infrastructure challenges, carried over from our legacy technologies, such as support issues and failed jobs. So we built a CI/CD pipeline using Vertex AI Pipelines, based on Kubeflow, to remove the complexity of model maintenance.
Now everything is well arranged, documented, scalable, easy to test, and well organized in terms of best practices. This incentivizes people to adopt a new standardized way of working, which in turn brings its own benefits. One example that illustrates this is hyperparameter tuning, an essential part of controlling the behavior of a machine learning model.
In machine learning, hyperparameter tuning or optimization is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. Every machine learning model has its own set of hyperparameters, whose values are set before the learning process begins. And a good choice of hyperparameters can make an algorithm perform optimally.
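As a small illustration of the idea, the sketch below runs a grid search over two hyperparameters (learning rate and number of epochs) for a toy one-weight model trained by gradient descent. The data, the grid values, and the function names are assumptions made up for this example. It shows the general technique only, not the tuning setup we run on Vertex AI.

```python
import itertools


def train(xs, ys, learning_rate, epochs):
    """Fit y = w * x by gradient descent on mean squared error; return w."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train_set, val_set, grid):
    """Try every hyperparameter combination; keep the lowest validation error."""
    best = None
    for lr, epochs in itertools.product(grid["learning_rate"], grid["epochs"]):
        w = train(*train_set, learning_rate=lr, epochs=epochs)
        score = mse(w, *val_set)
        if best is None or score < best["val_mse"]:
            best = {"learning_rate": lr, "epochs": epochs, "val_mse": score}
    return best


# Toy data following y = 2x (illustrative only).
train_set = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
val_set = ([5.0, 6.0], [10.0, 12.0])
grid = {"learning_rate": [0.001, 0.01, 0.05], "epochs": [10, 100]}

best = grid_search(train_set, val_set, grid)
print(best["learning_rate"], best["epochs"])
```

The hyperparameters are fixed before each training run, and only the weight `w` is learned. Every combination in the grid is an independent training job, which is why this kind of search benefits from a platform that can run many jobs in parallel.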
But while hyperparameter tuning is a very common process in data science, there are no standards in terms of how this should be done. Doing it in Python using a legacy infrastructure would take a data scientist on average two weeks. We have over 100 data scientists at Wayfair, so standardizing this practice and making it more efficient was a priority for us.
With a standardized way of working on Vertex AI, all our data scientists can now leverage our code to access CI/CD, monitoring, and analytics out-of-the-box to conduct hyperparameter tuning in just one day.
Powering great customer experiences with more ML-based functionalities
Next, we’re working on a Docker container template that will enable data scientists to deploy a running ‘hello world’ Vertex AI pipeline. On average, it can take a data science team more than two months to get an ML model fully operational. With Vertex AI, we expect to cut that time down to two weeks. Like most of the things we do, this will have a direct impact on our customer experience.
It’s important to remember that some ML models are more complex than others. Those that have an output that the customer immediately sees while navigating the website, such as when an item will be delivered to their door, are more complicated. This prediction is made by ML models and automated by Vertex AI. It must be accurate, and it must appear on-screen extremely quickly while customers browse the website. That means these models have the highest requirements and are the most difficult to publish to production.
We’re actively working on building and implementing tools to streamline and enable continuous monitoring of our data and models in production, which we want to integrate with Vertex AI. We believe in the power of AutoML to build models faster, so our goal is to evaluate all these services in GCP and then find a way to leverage them internally.
And it’s already clear that the new ways of working enabled by Vertex AI not only make the lives of our data scientists easier, but also have a ripple effect that directly impacts the experience of millions of shoppers who visit our website daily. They’re all experiencing better technology and more functionalities, faster.
For a more detailed dive on how our data scientists are using Vertex AI, look for part two of this blog coming soon.