
How Glance improves database operations with Spanner

March 4, 2024

Hardik Taluja

SDE III, Glance


Editor’s note: Today we hear from Glance, which offers diverse and engaging content directly on the lock screen of Android phones, ranging from breaking news, gaming, and entertainment to personalized shopping recommendations, empowering individuals to effortlessly discover and engage with what they love, all at a single glance. The company recently migrated from Azure Cosmos DB to Spanner to improve database operations and lay the groundwork for better data science capabilities. Read on to learn how they did it.

At Glance, we’re proud of our robust integration with a multitude of industry-leading Original Equipment Manufacturers (OEMs) across diverse regions (India, Indonesia, Latin America, and Japan), leading to a broad and varied user base of 230+ million. To handle transactions on such a huge scale, we recently migrated our major database operations to Google Cloud Spanner.

A bit of background

We migrated our document database from Azure Cosmos DB to Spanner to gain relational semantics and better manage our database transactions. When we previously onboarded our application to the document-based Azure Cosmos DB, we had to tweak the application to tune transactions, schema design, and read and write performance. This included setting up our own schema migration and evolution process, since fields were frequently added to the JavaScript Object Notation (JSON) documents used in Azure Cosmos DB, and new collections were created as well.

Because the document database did not recognize relationships between data tables, it could not be optimized for JOINs, which led to cumbersome operations and replicated data being stored as documents. At the same time, we were using the Apache Thrift protocol on the wire for the documents, and the team had to ensure that the right pipelines were deployed.

To choose the database that would best fit our operations, we conducted a month-long proof of concept (POC) in which we benchmarked several databases. The benchmark included understanding existing workflows, as well as running several tests against shortlisted databases: endurance tests, tests on production-like data, and latency tests during traffic spikes. We also ran a series of load tests with Yahoo Cloud Serving Benchmark (YCSB), an open-source tool that evaluates the capabilities of various databases, followed by a JMeter test to benchmark the actual use case.

We chose Spanner because it could achieve the required scale of queries per second (QPS) that the company needed, while still keeping the database CPU and latencies within recommended limits. We were able to achieve client-side latencies below 20ms for both read and write.

In addition, Spanner provided us with the best of both worlds: non-relational scale coupled with relational semantics. This meant that we could scale our database horizontally with little effort, without sacrificing the benefits of SQL databases, such as data availability and consistency. As such, we have moved from a NoSQL database to a SQL database, and we use interleaving for some use cases, as Spanner can physically co-locate rows of related tables.
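To make interleaving concrete, here is a minimal sketch in Python that assembles Spanner DDL for a parent table and a child table interleaved in it. The table and column names (`Users`, `UserEvents`, and so on) are hypothetical and chosen only for illustration; they are not Glance's actual schema.

```python
# Hypothetical schema for illustration only; not Glance's actual tables.
PARENT_DDL = """\
CREATE TABLE Users (
  UserId STRING(36) NOT NULL,
  Region STRING(16),
) PRIMARY KEY (UserId)"""

# The child's primary key starts with the parent's key, which is what lets
# Spanner store each user's events physically next to the user row.
CHILD_DDL = """\
CREATE TABLE UserEvents (
  UserId  STRING(36) NOT NULL,
  EventId STRING(36) NOT NULL,
  Payload JSON,
) PRIMARY KEY (UserId, EventId),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE"""


def interleaved_ddl() -> list[str]:
    """Return the DDL statements in the order they must be applied (parent first)."""
    return [PARENT_DDL, CHILD_DDL]
```

With a layout like this, reading a user together with that user's events touches co-located rows rather than separate documents.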

Building features faster with seamless schema updates

When we migrated our database from Azure Cosmos DB to Spanner, updating the schema was simple. Because Azure Cosmos DB was schema-free, the team had no prior experience managing schema updates, so this ease proved particularly valuable. Schema updates such as adding new tables, columns, or secondary indexes run as background tasks in Spanner and do not impact the live database, which results in zero downtime.

That said, schema changes at times also required application-level changes in our Data Access Layer. Spanner provides client libraries for almost all the major languages, which we used to make these changes in our applications. Spanner can also be integrated with Liquibase, an open-source, database-independent library for tracking, managing, and applying database schema changes, which we use to manage and execute our schema changes.

Migrating to Spanner in phases

Throughout the migration journey, the Google Cloud team supported our team with troubleshooting advice, suggestions on best practices, and interactive sessions. The migration took around three months and was made up of several phases:

The first stage was to plan the migration, which included discussions around the database and schema design, as well as the migration strategy. Should the migration be offline or online, or a mix of both? What sort of application changes are expected? How can cutover and rollback be carried out? These questions were discussed thoroughly between the Glance and Google Cloud teams.
In the second stage, the entire process was tested thoroughly, complete with multiple dry runs and production simulations. After each dry run, data validation scripts were prepared for verifying data accuracy. A setup to accommodate changes in schema design was also prepared since the migration was going to span months.
The final phase was the execution. As most of the workloads were critical, we opted for online migration. The process included Change Data Capture (CDC), bulk snapshot, import, data validations, reverse CDC, cutover, and more.
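The data validation step mentioned above can be sketched as a per-row checksum comparison between the source and the target. This is a minimal illustration, assuming rows from both databases have already been exported as Python dictionaries keyed by primary key; the function names and the row shape are hypothetical and do not represent Glance's actual validation scripts.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """Hash a row deterministically (keys are sorted so field order does not matter)."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tables(source: dict, target: dict) -> dict:
    """Compare two {primary_key: row} mappings and report discrepancies."""
    missing = sorted(k for k in source if k not in target)   # not yet migrated
    extra = sorted(k for k in target if k not in source)     # unexpected in target
    mismatched = sorted(
        k for k in source
        if k in target and row_digest(source[k]) != row_digest(target[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}
```

An empty report from `diff_tables` for every table would be one reasonable signal that a dry run copied the data faithfully.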

Since Spanner is a distributed database, we took the opportunity during the migration to remodel our data. On top of correcting any inefficient designs, we also wanted to leverage the benefits of the database’s SQL schema design. With the table interleaving functionality of Spanner, we could optimize for reads that were distributed across different collections.

To check the read/write access patterns of keys across the entire database, we use the Key Visualizer in Spanner, which provides graphs of CPU usage, throughput, latencies, and access patterns. The tool makes it easy to see which key ranges are under heavy load and causing hotspots, which occur when too many requests are sent to the same key range and can cause high latencies. With that insight, we can design our database schema to avoid these hotspots.

In particular, the tool was useful during the POC phase, allowing us to validate our load test harness. For instance, P99 latencies were going above 20ms at times. Through the heatmap in Key Visualizer, we realized that the load generator was not distributing the load evenly. We fixed this issue so that traffic was uniformly distributed across the database splits. As a result, P99 latencies for reads and writes improved significantly, dropping below 13ms.
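The kind of imbalance described here can be illustrated with a small, self-contained sketch. It is not how Key Visualizer works internally, and it is not the load generator used in the POC; it simply assumes that each leading hex character of a key stands in for one contiguous key range, then compares how evenly two hypothetical key sets spread across those ranges.

```python
import hashlib
from collections import Counter


def range_skew(keys, buckets="0123456789abcdef"):
    """Ratio of the busiest key range to the average; 1.0 means perfectly even.

    Each leading hex character stands in for one contiguous key range, a
    simplification of how splits own ranges of the key space.
    """
    counts = Counter(k[0] for k in keys)
    mean = len(keys) / len(buckets)
    return max(counts.values()) / mean


# Monotonically increasing keys: every request lands in the same range.
sequential = [f"{i:08x}" for i in range(10_000)]
# Hashing the same identifiers spreads requests across all ranges.
hashed = [hashlib.sha256(k.encode()).hexdigest() for k in sequential]
```

Here `range_skew(sequential)` is 16.0, because all traffic hits a single range out of sixteen, while `range_skew(hashed)` stays close to 1.0.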

For issues that are not visible in the Key Visualizer, we leveraged the built-in SPANNER_SYS tables to introspect our Spanner instances. This allowed us to investigate issues using query stats, check top-N queries per interval, see the current and oldest running queries, and review lock, read, write, and transaction stats. Access to introspective database metrics via a simple and scalable SQL interface makes performance debugging much easier.
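As an illustration of this kind of introspection, the sketch below builds a query against `SPANNER_SYS.QUERY_STATS_TOP_10MINUTE`, one of Spanner's built-in query statistics tables, to list the slowest statements in the most recent interval. The helper name `top_queries_sql` and the choice of columns are assumptions made for this example, not the queries Glance actually ran.

```python
def top_queries_sql(limit: int = 5) -> str:
    """Build a query for the slowest statements in the latest 10-minute interval."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT text, execution_count, avg_latency_seconds, avg_cpu_seconds\n"
        "FROM SPANNER_SYS.QUERY_STATS_TOP_10MINUTE\n"
        "WHERE interval_end = (\n"
        "  SELECT MAX(interval_end) FROM SPANNER_SYS.QUERY_STATS_TOP_10MINUTE\n"
        ")\n"
        "ORDER BY avg_latency_seconds DESC\n"
        f"LIMIT {int(limit)}"
    )
```

The resulting string can be run like any other read-only SQL statement, for example through `execute_sql` on a snapshot in the Python client library.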

After the migration process, we used change streams to support the migration of backend functions, such as backups or data pipelines for sending reports or powering dashboards. Change streams watch and stream out data changes, such as inserts, updates, and deletes, in near real-time, thus providing a flexible, scalable way to stream data changes.
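A downstream consumer of such a stream essentially replays each change against its own copy of the data. The sketch below shows that idea with a deliberately simplified record shape (`key`, `mod_type`, `new_values`), which is an assumption for this example; real Spanner data change records carry more metadata, such as commit timestamps and old values, and can group several row modifications together.

```python
def apply_change(replica: dict, record: dict) -> None:
    """Apply one simplified change record to an in-memory replica keyed by primary key."""
    key = record["key"]
    mod_type = record["mod_type"]
    if mod_type == "INSERT":
        replica[key] = dict(record["new_values"])
    elif mod_type == "UPDATE":
        # Updates carry only the changed columns, so merge them into the row.
        replica.setdefault(key, {}).update(record["new_values"])
    elif mod_type == "DELETE":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown mod_type: {mod_type!r}")
```

Applying records in commit order keeps a reporting table or dashboard cache in step with the source database.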

Simplifying database operations for greater agility

With Spanner, our day-to-day database operations have improved, with the database providing multiple monitoring panels and insights for queries, transactions, and locks. These are features that are not easily available with most traditional SQL databases.

Making schema updates is also simpler, which is crucial to database management as the database will evolve alongside the company’s growth. Because the Glance application does not suffer from downtime when schemas are being updated in Spanner, we can remain agile and focus on other aspects of our operations without affecting business performance. Change streams also process data changes in near real time, which helps keep data consistent and accurate across our systems.

Finally, we can respond seamlessly to huge traffic spikes in our application, which makes Spanner our database of choice, especially for transactional use cases at large scale.

Since the initial migration a year ago, we have migrated a few more databases to Spanner. We are also exploring the use of BigQuery Federation by Spanner for driving our data science operations.