The “memory wall” has been a long-standing challenge in computer hardware design—CPUs are getting faster and faster, but bandwidth and latency to main memory (or worse, to disk) haven’t kept up. The large working sets of data center workloads have exacerbated this problem, causing translation lookaside buffer (TLB) misses to become a large portion of the “data center tax” of warehouse-scale computers. In this post, we explore one technique for reducing TLB misses and improving application performance: huge pages.
A TLB enables a processor core to map a virtual address in a program to the physical location in memory where the data is held. The TLB caches a limited number of translations; if a mapping is not present in the TLB, it must be fetched by an expensive page-table walk. On x86 processors, each TLB entry provides a virtual-to-physical mapping for a 4KiB region of memory. With huge pages, a single TLB entry instead provides a mapping for a 2MiB region. The same number of TLB entries can now map 512 times as much memory, which substantially reduces the number of TLB misses and their associated costs.
We’ve seen firsthand the improvements that huge pages can bring. In our OSDI 2021 paper, “Beyond malloc efficiency to fleet efficiency,” we describe Temeraire, our huge page-aware improvements to TCMalloc, our production memory allocator. The code for these changes is available on GitHub.
By managing memory in user space at the huge page level, we can simultaneously make application code faster and reduce memory overhead in the allocator by returning memory to the operating system sooner. In Google’s data centers, this improvement reduced TLB stalls by 6% and memory fragmentation by 26%.
This work represents a pivot from minimizing cycles in the allocator code to instead improving fleetwide productivity—how much useful work a particular set of servers can do. Spending more time in malloc to make better allocation decisions (and thus reducing memory stalls) is the right tradeoff if application performance improves. As an example of the benefits of this approach, one service increased its time in TCMalloc from 2.7% to 3.5%, an apparent regression, but reaped improvements of 3.4% more requests-per-second, a 1.7% latency reduction, and a 6.5% reduction in peak memory usage!
The lessons learned from optimizing TCMalloc have also allowed us to improve our optimization process. We present these in our OSDI paper as well:
Adding telemetry to the TCMalloc instances running on our servers and collecting it with Google-Wide Profiling allows us to understand the usage of TCMalloc on the diverse workloads in Google’s data centers.
With help from Site Reliability Engineering, we developed tools for running A/B experiments on a small fraction of machines, allowing us to safely roll out a new optimization to a fraction of machines and observe its performance impact.
In both cases, getting feedback about the impact of changes earlier shortens the cycle from observation to optimization. These tools provide important capabilities: they do not directly make software more efficient, but they enable the optimizations that do, making them the motor of optimization progress.
The lessons from designing, implementing, and deploying Temeraire have created a virtuous cycle of optimization. Following the deployment of Temeraire, we gained insights that let us further improve our huge page allocation decisions. We’ll present this work at ISMM 2021, in our paper “Adaptive Hugepage Subrelease for Non-moving Memory Allocators in Warehouse-Scale Computers.” We hope this work inspires others to look beyond the cycles consumed by the data center tax and toward the application-level improvements that optimizing it can unlock.