How Wayfair is modernizing, one database at a time
Associate Director of Engineering
Associate Director of Technical Product Management
Editor's note: With 18 fulfillment centers, 38 delivery centers, and a catalog of more than 22 million items, online retailer Wayfair needed a way to quickly move from their on-premises data centers running SQL Server to Google Cloud without derailing their team of over 3,000 engineers, their tens of millions of customers, or their 16,000 supplier partners. This story recaps their journey, their choices, and the advantages they’re already enjoying.
Cloud, here we come
Our footprint of over 10,000 databases and replicas consists primarily of Microsoft SQL Server hosted in our on-premises data centers. To move to the cloud successfully, we needed to migrate the databases quickly while continuing to serve production traffic at scale. Our SQL Server workloads rely on highly specific, optimized configurations and on database and server startup scripts. We also needed to bring our own licenses. So the goal was to lift and shift our machines as quickly as possible with minimal changes, then use cloud databases to modernize those workloads and reduce our SQL Server footprint after the migration.
To address our database migration challenge, we chose Google Cloud database services. These services provided a clear path for shifting our workloads to the cloud via Cloud SQL for SQL Server, while also giving us the flexibility to be deliberate about which engine and product we want to run our systems on going forward. We could run SQL Server on virtual machines (VMs), for example, but we could also take advantage of database offerings like Cloud SQL and Cloud Spanner.
The challenge? Delivering better database choices
In our database technology organization, we are committed to making sure that the Wayfair engineering community has our support when they develop applications. To that end, we offer a curated catalog of database offerings—each pre-configured for specific use cases—that developers can choose from while creating their applications. As a result, development teams don’t have to invest time setting up secrets, backups, deployments, and geographic availability.
When you have several thousand databases, this adds up to saved development time and faster time to market, because teams interact with abstractions instead of each building their own solution. As part of the migration, we began to evaluate other ways of making it even easier for developers to select a database from our catalog that meets their feature requirements.
Why we chose Google Cloud database services
When adding to our database catalog, we do extensive product discovery by talking to developers, learning about their use cases, and identifying their needs. Armed with this analysis, we’re then able to consider which databases are best suited as a solution we can add to our catalog.
For our initial DBaaS Platform releases, this process identified PostgreSQL on Cloud SQL and Spanner as catalog choices. We like PostgreSQL because of its industry adoption and ecosystem, and because our engineers are familiar with it. We’re also evaluating the need for NoSQL offerings and are beginning to test other solutions like Bigtable and Firestore. Besides the databases themselves, we’re also using Google Kubernetes Engine (GKE) and Compute Engine VMs to host the services our team builds. In addition, we use Pub/Sub and Dataflow for sending operational data to our analytical store in BigQuery.
For database delivery, we decided to provide a database-as-a-service (DBaaS) platform that combines the subject matter expertise of a centralized team of database experts with a suite of tooling and configured database offerings that solve for pervasive Wayfair database use cases. For example, some of our international teams need globally distributed reads and writes, read-heavy teams need read replicas, and some teams want single-region reads and writes. With these requirements in mind, we curated a catalog that takes advantage of PostgreSQL on Cloud SQL and Cloud Spanner, as well as Microsoft SQL Server on Compute Engine.
Giving everyone the freedom to work the way they want
Building our solutions on top of managed Google Cloud database services significantly reduces our operational load and enables us to provide fit-for-purpose solutions to developers and infrastructure engineers beyond what was possible before. With Google Cloud services, we can translate technical requirements from a developer into an infrastructure solution that’s provisioned and up to spec.
Google Cloud offers multiple database options, which is critical because changing just one criterion can affect the kind of database we provision, whether that is Spanner, Cloud SQL, or an unmanaged instance on a VM. We use Terraform by HashiCorp to abstract the provisioning aspect away from the infrastructure engineers. Then we integrate Terraform with command-line interfaces (CLIs) and graphical user interfaces (GUIs), using a combination of homegrown tools and open-source solutions like Backstage.
We’re also using Cloud SQL for PostgreSQL for common needs like tracking the status of asynchronous jobs or hosting our internal service directory. Our goal here is to solve for the 80 percent of scenarios where the user can follow a well-paved path, while acknowledging there will always be the other 20 percent of situations where users need a more customized, bespoke solution.
By taking this approach, we’ve been able to scale with the business without having to significantly increase operational headcount. Rather than submitting a request for a new database and waiting a few days for a DBA to configure and build it, our internal users can request a Wayfair-configured database on demand through our internal platform. As a result, our developers and engineers can work the way they want in the context of the Wayfair application and infrastructure without having to worry about provisioning and abstracting physical infrastructure.
Creating innovative applications faster through self-service
For our database implementation, we present our DBaaS platform users with a declarative Q&A workflow in which they share the features they require. Then a component we call our “Database Service” maps the user’s needs to a database offering that matches those requirements. For example, some of our core e-commerce services serve end-user traffic no matter where the user is geographically. These use cases need to support a high volume of requests, so our Database Service would map this request to Spanner, which supports distributed writes.
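As a rough illustration of this kind of requirements-to-offering mapping, here is a minimal Python sketch. It is not Wayfair's actual Database Service: the `Requirements` fields, the `choose_offering` function, and the order of the rules are all assumptions made for the example, and only the offering names come from the catalog described in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Offering(Enum):
    # Offering names taken from the catalog described in the post.
    SPANNER = "Cloud Spanner"
    CLOUD_SQL_POSTGRES = "PostgreSQL on Cloud SQL"
    CLOUD_SQL_POSTGRES_REPLICAS = "PostgreSQL on Cloud SQL with read replicas"
    SQL_SERVER_VM = "Microsoft SQL Server on Compute Engine"


@dataclass(frozen=True)
class Requirements:
    """Hypothetical answers collected by a declarative Q&A workflow."""
    global_writes: bool = False      # users write from many regions
    read_heavy: bool = False         # reads dominate the workload
    needs_sql_server: bool = False   # existing SQL Server workload


def choose_offering(req: Requirements) -> Offering:
    """Map requirements to an offering (assumed rule order, for illustration)."""
    if req.needs_sql_server:
        return Offering.SQL_SERVER_VM
    if req.global_writes:
        return Offering.SPANNER
    if req.read_heavy:
        return Offering.CLOUD_SQL_POSTGRES_REPLICAS
    return Offering.CLOUD_SQL_POSTGRES


if __name__ == "__main__":
    storefront = Requirements(global_writes=True, read_heavy=True)
    print(choose_offering(storefront).value)
```

Keeping the rules in one small function means that changing a single answer, such as `global_writes`, changes the offering without the requesting team needing to know how each database is provisioned.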
How does this work in practice? Typically, a high-throughput distributed workload would require building a manually sharded database, which has a very long cycle time and requires continued administrative maintenance. The Wayfair DBaaS Platform offers a much faster method by provisioning a Spanner instance through a self-service workflow.
From there, we record the database configuration details in a metadata repository that maintains the state of all of our databases, provision the instance, configure secrets, and offer ready-to-use database deployment pipelines. With this model, users can go from database request to schema design and first deployment within a few hours, which makes a quick proof of concept practical. Administrators save operational time, and users get a much shorter cycle time.
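The sequence above (record the request, provision, configure secrets, set up a deployment pipeline) can be sketched as a small ordered workflow. This is only an in-memory illustration under stated assumptions: `MetadataRepository`, `DatabaseRecord`, `fulfil_request`, and the no-op steps are hypothetical names, and a real platform would call provisioning tooling such as Terraform inside each step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DatabaseRecord:
    name: str
    offering: str
    state: str = "requested"
    completed_steps: List[str] = field(default_factory=list)


class MetadataRepository:
    """In-memory stand-in for a store that tracks the state of every database."""

    def __init__(self) -> None:
        self._records: Dict[str, DatabaseRecord] = {}

    def save(self, record: DatabaseRecord) -> None:
        self._records[record.name] = record

    def get(self, name: str) -> DatabaseRecord:
        return self._records[name]


Step = Callable[[DatabaseRecord], None]


def fulfil_request(repo: MetadataRepository, name: str, offering: str,
                   steps: Dict[str, Step]) -> DatabaseRecord:
    """Record the request first, then run each step in insertion order."""
    record = DatabaseRecord(name=name, offering=offering)
    repo.save(record)
    for step_name, step in steps.items():
        record.state = f"running:{step_name}"
        step(record)
        record.completed_steps.append(step_name)
    record.state = "ready"
    return record


def noop(record: DatabaseRecord) -> None:
    """Placeholder; a real step would call out to provisioning tooling."""


# Hypothetical step names mirroring the sequence described in the text.
STEPS: Dict[str, Step] = {
    "provision_instance": noop,
    "configure_secrets": noop,
    "create_deployment_pipeline": noop,
}

if __name__ == "__main__":
    repo = MetadataRepository()
    print(fulfil_request(repo, "orders", "Cloud Spanner", STEPS))
```

Writing the record before any step runs means the repository always reflects how far a request has progressed, even if a later step fails.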
A significant increase in engineering sentiment
Over the past year, the internal engineering sentiment around our DBaaS platform has gone up significantly. Our support NPS increased 28.52% and our tooling (offering) NPS increased 41.22%. The secret to these metrics is Google-managed databases; they provide the products we need to solve for our use cases.
Now we’re able to spend more time working with users and less time on infrastructure management. Working with Google Cloud as a cloud provider reduces our time to market to support new use cases, reduces our operational overhead, increases developer velocity, and enables us to scale at the speed of our business.
Where we go from here
We’re currently migrating our existing microservices and services to database platforms that use Google Cloud managed resources. Our goal is to support 100-500 transactions per second (tps) per database, on average, for our non-distributed databases (Cloud SQL or Microsoft SQL Server). Currently, we’re running more than 280 primary production databases built through our DBaaS Platform, which drives the capacity requirements up to 10,000-50,000 tps overall, and we are planning to double or quadruple that over the next year or two.
As for distributed databases (Spanner), our goal is to support 500-5,000 tps per database for 10-20 databases over the next one or two years.