
Beyond malloc efficiency to fleet efficiency

July 15, 2021
Chris Kennelly

Senior Staff Software Engineer

The “memory wall” has been a long-standing challenge in computer hardware design—CPUs are getting faster and faster, but bandwidth and latency to main memory (or worse, to disk) haven’t kept up. The large working sets of data center workloads have exacerbated this problem, causing translation lookaside buffer (TLB) misses to become a large portion of the “data center tax” of warehouse-scale computers. In this post, we explore one technique for reducing TLB misses and improving application performance: huge pages.

TLBs enable a processor core to map a virtual address in a program to the physical location in memory where the data is held. The TLB holds only a limited number of entries; if the mapping for an address is not present, it must be fetched by walking the page table, an expensive operation. On x86 processors, each TLB entry ordinarily provides a virtual-to-physical mapping for a 4KiB region of memory. In contrast, with huge pages, a single TLB entry provides a mapping for a 2MiB memory region. The same number of TLB entries can now map 512 times as much memory, which substantially reduces the number of TLB misses and their associated costs.
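
To make that arithmetic concrete, here is a minimal sketch (Linux-specific, and not TCMalloc's own code) of asking the kernel to back a large anonymous mapping with 2MiB transparent huge pages; it assumes a kernel with transparent huge page support and uses only the standard mmap/madvise calls:

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <cstdio>

int main() {
  constexpr size_t kHugePageSize = 2 * 1024 * 1024;    // 2MiB
  constexpr size_t kRegionSize = 512 * kHugePageSize;  // 1GiB

  void* p = mmap(nullptr, kRegionSize, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, /*fd=*/-1, /*offset=*/0);
  if (p == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  // Ask the kernel to back this range with 2MiB huge pages where it can.
  // Mapping the full 1GiB with 4KiB pages takes 262,144 TLB entries;
  // with 2MiB huge pages it takes only 512.
  if (madvise(p, kRegionSize, MADV_HUGEPAGE) != 0) {
    perror("madvise(MADV_HUGEPAGE)");
  }

  munmap(p, kRegionSize);
  return 0;
}
```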

We’ve seen firsthand the improvements that huge pages can bring. In our OSDI 2021 paper, “Beyond malloc efficiency to fleet efficiency,” we describe Temeraire, our huge page-aware improvements to TCMalloc, our production memory allocator. The code for these changes is available on GitHub.

By managing memory in user space at the huge-page level, we can simultaneously make application code faster and reduce memory overhead in the allocator by returning memory to the operating system sooner. In Google’s data centers, this improvement reduced TLB stalls by 6% and memory fragmentation by 26%.
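
The sketch below illustrates the second half of that statement under simplifying assumptions; it is not TCMalloc's implementation, and the helper name is hypothetical. The idea is that once a 2MiB-aligned huge page contains no live allocations, the whole page can be handed back to the Linux kernel in a single madvise call, while partially used huge pages are kept so that live objects stay densely packed:

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <cstdint>

constexpr size_t kHugePageSize = 2 * 1024 * 1024;  // 2MiB

// Hypothetical helper: return one entirely free, huge-page-aligned region to
// the OS. MADV_DONTNEED tells Linux the contents are no longer needed, so the
// backing physical memory can be reclaimed immediately.
bool ReleaseHugePage(void* hugepage_start) {
  if (reinterpret_cast<uintptr_t>(hugepage_start) % kHugePageSize != 0) {
    return false;  // Only whole, aligned huge pages are released.
  }
  return madvise(hugepage_start, kHugePageSize, MADV_DONTNEED) == 0;
}
```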

This work represents a pivot from minimizing cycles in the allocator code to instead improving fleetwide productivity—how much useful work a particular set of servers can do. Spending more time in malloc to make better allocation decisions (and thus reducing memory stalls) is the right tradeoff if application performance improves. As an example of the benefits of this approach, one service increased its time in TCMalloc from 2.7% to 3.5%, an apparent regression, but reaped improvements of 3.4% more requests-per-second, a 1.7% latency reduction, and a 6.5% reduction in peak memory usage!

The lessons learned from optimizing TCMalloc have also allowed us to improve our optimization process. We present these in our OSDI paper as well:  

  • Adding telemetry to the TCMalloc instances running on our servers and collecting it with Google-Wide Profiling allows us to understand the usage of TCMalloc on the diverse workloads in Google’s data centers (a sketch of this kind of telemetry appears below).

  • With help from Site Reliability Engineering, we developed tools for running A/B experiments on a small fraction of machines, allowing us to safely roll out a new optimization to a fraction of machines and observe its performance impact.  

In both cases, getting feedback about the impact of changes earlier shortens the cycle from observation to optimization. These tools provide important capabilities: they do not directly make software more efficient, but they enable the optimizations that do, making them the motor of optimization progress.
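
To give a flavor of the telemetry point above, the sketch below reads allocator statistics through the MallocExtension interface that ships with the open-source TCMalloc; the specific property name is taken from its documentation but should be treated as an assumption, and the export to Google-Wide Profiling is only indicated in a comment since that system is internal:

```cpp
#include <cstddef>
#include <iostream>
#include <optional>

#include "tcmalloc/malloc_extension.h"

void DumpAllocatorTelemetry() {
  // Human-readable summary of the heap, including huge-page statistics.
  std::cout << tcmalloc::MallocExtension::GetStats() << "\n";

  // Individual counters like this one can be sampled periodically and
  // exported to a fleet-wide collection system such as Google-Wide Profiling.
  std::optional<size_t> allocated =
      tcmalloc::MallocExtension::GetNumericProperty(
          "generic.current_allocated_bytes");
  if (allocated.has_value()) {
    std::cout << "allocated bytes: " << *allocated << "\n";
  }
}
```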

The lessons from designing, implementing, and enabling Temeraire have fed a virtuous cycle of optimization. Following the deployment of Temeraire, we gained insights that let us further improve our huge page allocation decisions. We’ll present this work at ISMM 2021, in our paper “Adaptive Hugepage Subrelease for Non-moving Memory Allocators in Warehouse-Scale Computers.” We hope this work inspires others to look beyond the cycles consumed by the data center tax and toward the application-level improvements that optimizing that tax can enable.
