How Memorystore cut costs in half for Quizlet’s Memcached workloads
Mason Leung
Site Reliability Engineer, Quizlet
Editor’s note: We’re hearing today from Quizlet, an online learning community and provider of study resources to students and teachers. For a more secure and convenient solution to Memcached node scaling, they switched to Memorystore.
With over 50 million monthly users, Quizlet is one of the largest online learning platforms in the world, serving students and teachers with intelligent study tools ranging from digital flashcards and practice questions to interactive diagrams and games. Quizlet began migrating to Google Cloud in 2015, adding data cloud technology solutions including BigQuery and Cloud Storage.
At Quizlet, site reliability engineers (SREs) have a saying: if Memcached (an open source distributed memory-caching system) goes down, Quizlet goes down. Faced with a volatile caching system, we needed a more scalable solution to manage our Memcached nodes. Google Memorystore for Memcached handled that task without disruption at half the cost. The switch to Memorystore has also reduced a lot of pressure on our SRE team and improved overall performance.
Our Memcached use cases
Quizlet uses Memcached in a fairly straightforward way. For example, our users can create “study sets” that capture the subject matter being studied. In addition to terms and definitions, a typical set may include images and a voice recording, elements that are stored in individual databases. The first time a set is loaded, data is pulled from those databases. We then assemble the data into a form that can be presented to the user and store the result in Memcached, where it can be accessed more quickly in the future.
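The read path described above is commonly called the cache-aside pattern. The following is a minimal sketch of it, not Quizlet's actual code: `InMemoryCache` is a hypothetical stand-in for a Memcached client, and `load_study_set`, `fetch_terms`, `fetch_media`, and the `study_set:` key prefix are illustrative names that do not come from the article.

```python
from typing import Callable, Dict, List, Optional


class InMemoryCache:
    """Hypothetical stand-in for a Memcached client; only get/set are modelled."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        return self._data.get(key)

    def set(self, key: str, value: dict) -> None:
        self._data[key] = value


def load_study_set(
    set_id: int,
    cache: InMemoryCache,
    fetch_terms: Callable[[int], List],
    fetch_media: Callable[[int], List],
) -> dict:
    """Return the assembled study set, consulting the cache before the databases."""
    key = f"study_set:{set_id}"  # assumed key scheme, for illustration only

    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: no database work needed

    # Cache miss: pull from each backing store and assemble the result.
    assembled = {
        "id": set_id,
        "terms": fetch_terms(set_id),
        "media": fetch_media(set_id),
    }
    cache.set(key, assembled)  # later loads are served from the cache
    return assembled
```

In this sketch the first call for a given set runs both fetch functions, and every later call for the same set returns the cached copy without touching them. A real deployment would also need expiry or invalidation when a set is edited, which the sketch leaves out.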
Before Memorystore, Quizlet’s data infrastructure ran Memcached on our Google Compute Engine virtual machines. We managed nodes in a Memcached pool via Puppet, and our caching strategy was consistent hashing using Ketama. If we needed additional capacity, bringing up a new Memcached node was complicated. After spinning up a new VM, we would spend 20 to 30 minutes adding configurations, loading the OS, and running the Memcached binary.
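To illustrate why consistent hashing suits a pool whose membership changes, here is a simplified hash-ring sketch in the spirit of Ketama. It is an assumption-laden illustration rather than the libketama algorithm: the `HashRing` class, the node names, and the `points_per_node` value are hypothetical, and the point placement is not compatible with real Ketama clients.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


class HashRing:
    """Simplified consistent-hash ring (illustrative, not libketama-compatible)."""

    def __init__(self, nodes: Iterable[str], points_per_node: int = 160) -> None:
        self._points_per_node = points_per_node  # illustrative value
        self._ring: List[int] = []               # sorted hash points
        self._owner: Dict[int, str] = {}         # hash point -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        # Use the first 4 bytes of an MD5 digest as a 32-bit ring position.
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")

    def add_node(self, node: str) -> None:
        for i in range(self._points_per_node):
            point = self._hash(f"{node}-{i}")
            if point in self._owner:
                continue  # rare collision: keep the existing owner
            self._owner[point] = node
            bisect.insort(self._ring, point)

    def remove_node(self, node: str) -> None:
        self._ring = [p for p in self._ring if self._owner[p] != node]
        self._owner = {p: o for p, o in self._owner.items() if o != node}

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("hash ring has no nodes")
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect_left(self._ring, self._hash(key))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._owner[self._ring[idx]]
```

With a ring like this, adding a node only takes over the keys that now fall on the new node's points; every other key stays on the node it was already mapped to. That property is what makes it possible to add or remove cache nodes one at a time without invalidating most of the cache.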
Too much guesswork in our legacy infrastructure
Scaling was a big challenge. For example, every year we had to calculate our expected traffic for the back-to-school season, which is peak traffic season for us, and determine whether we had enough Memcached resources or needed more. This always involved guesswork. If maintenance was needed, it required manual work. This might include pulling out a Memcached VM and moving it to a different region to ensure we had VMs across various regions. With this legacy setup, scaling, updating, and patching were all very difficult and time-consuming.
This took a toll on our SRE team. Nobody wanted to deal with the Memcached layer because of its unreliable structure. As essentially an ephemeral cache, when it worked, it was fine. But we never wanted to modify it unless absolutely necessary, because if it were to go down, so would quizlet.com. That kind of unreliable infrastructure could be catastrophic.
Testing out Memorystore
We had the same scaling challenges with our web infrastructure at Quizlet, which had prompted us to develop a company-wide strategy to adopt more managed services. For example, we transitioned from running our own VMs to a container system. This included moving our web services into Google Kubernetes Engine (GKE).
As part of this migration, we evaluated Memorystore for Memcached, which promised to automate complex tasks for Memcached like ensuring availability, scalability, patching, and monitoring. We tested Memorystore in three phases. First, we load tested it in a development environment against one of our Memcached VMs and saw a marginal though acceptable decrease in performance. Next, we added a Memorystore node to one of our Memcached pools in our staging environment with no bad results, proving the node was compatible. Finally, we created a Memorystore cluster, put one node into production, and observed its performance against our 24 VMs running in the same pool. Everything ran smoothly.
Memorystore gave us more footprint for a lower cost
Memorystore worked right out of the box to meet Quizlet’s needs. At the time, we thought that in order to replace our Memcached infrastructure we might have to go through a painful migration process. But that proved unnecessary because Memorystore for Memcached had no compatibility issues. That accelerated our effort because all we had to do was roll Memorystore nodes in and then shut down the Memcached VMs on GCE. That gave us a lot of confidence that we could move the whole thing over.
Memorystore is thankfully integrated with the monitoring capabilities of Google Cloud’s Operations suite (formerly Stackdriver), and looking at the metrics made it clear that we didn’t have to run as many nodes as we first thought. Originally, we had 24 Memcached VMs, and each one had 40–50 GB, with a total size of 1 TB of caching. Today we are still running with 1 TB of cache, but with significantly lower operational overhead because we have only six Memorystore nodes deployed, running at only 15–20% utilization. This makes us feel more resilient toward traffic spikes or unpredictable failures.
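As a quick sanity check on those figures, the arithmetic can be sketched as follows. The helper names are hypothetical, 1 TB is taken as 1,024 GB, and the even split across six nodes is an assumption made only for illustration; the article does not state the actual size of each Memorystore node.

```python
GB_PER_TB = 1024  # assumption: binary terabyte


def pool_capacity_gb(node_count: int, gb_per_node: float) -> float:
    """Total cache capacity of a pool of equally sized nodes."""
    return node_count * gb_per_node


def per_node_gb(total_gb: float, node_count: int) -> float:
    """Per-node size if the total were split evenly (an assumption)."""
    return total_gb / node_count


# Legacy pool: 24 VMs at 40-50 GB each brackets the quoted 1 TB total.
legacy_min = pool_capacity_gb(24, 40)
legacy_max = pool_capacity_gb(24, 50)

# Same 1 TB spread evenly over six nodes.
even_split = per_node_gb(GB_PER_TB, 6)

print(f"legacy pool: {legacy_min:.0f}-{legacy_max:.0f} GB")
print(f"six nodes, even split: about {even_split:.1f} GB each")
```

The point of the calculation is that the total cache size stayed roughly the same while the number of nodes to operate dropped from 24 to six.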
For the cutover itself, we added the Memorystore nodes one by one into the pool and let them run for a few days. If Quizlet services continued to perform, we’d pull out the VM nodes. After a few weeks we were down to one VM node, and it ran perfectly. We shut down the last VM node a few months ago.
In addition to using much less memory than before, we cut our costs by half. Right now, we’re using only a quarter of our capacity, and we could cut more costs if we wanted to. But since it’s running well for us, and we have other projects to work on, we’ll keep it and maintain monitoring. With the Memorystore for Memcached layer, we’re at least good for the next “back to school” period, if not longer.
The results are in
The switch to Memorystore has made scaling so much easier, and has eliminated a lot of stress for the SRE team. In addition to the cost reduction, bigger footprint, and reduced Memcached ops time, we also have easier monitoring now, thanks to the integration of Google Cloud’s Operations suite with Memorystore. For one, we have the ability to set up thresholds and alarms. Also, we can quickly build dashboards for easy, more reliable results. With our self-managed environment, Memcached would report false numbers for our usage stats, like 58 GB of memory on a 64 GB machine. It was difficult to determine whether we were truly that close to capacity, and if we were, that could have been disastrous to our business as well. So getting more reliable monitoring was another motivation to switch to Memorystore.
Now that we know that Memorystore works for us, we may in the future look to it to help us move from a monolithic cache deployment to a per-service or per-group structure instead.
At Quizlet, we value the wellbeing of our people. The switch to Memorystore has been huge for our SRE team. With our self-managed Memcached, it felt like disaster was more a question of when, not if. Now, with Memorystore’s reliability and managed services, we can focus our thoughts and energy toward new solutions to benefit our users and our business.
Check out the study resources available to students and teachers at Quizlet. Then explore the features available in Memorystore.