linux-mm.kvack.org archive mirror
From: Dev Jain <dev.jain@arm.com>
To: siddhartha@kenip.in
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	mgorman@suse.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
Date: Mon, 30 Jun 2025 10:55:52 +0530	[thread overview]
Message-ID: <d8ffe547-5516-43e5-9f33-56b2698a0b4f@arm.com> (raw)
In-Reply-To: <3ee2e7fea6f263aa884e3e715632b09f@kenip.in>


On 30/06/25 6:13 am, siddhartha@kenip.in wrote:
> On 2025-06-28 09:19, Dev Jain wrote:
>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>> +cc Vlata
>>>
>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@kenip.in wrote:
>>>> Hi all,
>>>>
>>>> I wanted to share validation data from a Hugging Face-based AI 
>>>> inferencing
>>>> workload,
>>>> which was significantly impacted by the THP alignment logic 
>>>> introduced in
>>>> commit efa7df3e3bb5.
>>>>
>>>> Using transformer models with dynamic input lengths on Intel Xeon 
>>>> (Cooper
>>>> Lake),
>>>> we observed up to a 3200% throughput improvement after applying the 
>>>> patch
>>>> from Oct 2024:
>>>>
>>>>    mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>> All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
>>>
>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>
>> I was wondering how the change can get us such a big optimization - the
>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>> something else I am missing?
>>
>> I ask because when I was reading the code I was thinking whether a 
>> similar
>> change can be done for mTHPs.
>>
>>>
>>>> Metrics:
>>>> - Model: BERT-base
>>>> - Inference engine: Transformers + ONNX Runtime
>>>> - Kernel: 6.6 vs patched 6.6.8
>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>> - Metric: inference throughput (samples/sec)
>>>>
>>>> Thanks for the fix -- this change had real impact on a 
>>>> production-relevant
>>>> workload.
>>>>
>>>> Best Regards,
>>>> Siddhartha Sharma
>>>> ISV @ Kenip
>>>> Solution Link: 
>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>
>
> Hi Dev Jain,
>
> Thank you for reviewing and for your thoughtful question.
>
> You're absolutely right that, in isolation, gaining one additional 
> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case 
> (Hugging Face inference workloads with dynamic input sizes and many 
> allocations), the original PMD alignment logic caused a cascade of 
> side effects.
>
> The performance improvement comes from how that logic interacts with 
> the dynamic memory allocation patterns of AI inference workloads, 
> especially those using frameworks like Hugging Face Transformers.
>
> In our specific use case, the workloads were running on Intel 
> Developer Cloud, but I no longer have access to that particular 
> environment or the original profiling output. However, I’d like to 
> highlight why this patch had such an outsized effect:
>
> 🔹 1. Fragmentation Avoidance
> In model shard loading (e.g., large BERT or GPT2 models split into 
> multiple memory segments), many medium-sized anonymous allocations 
> occur in rapid succession. These workloads tend to allocate many 512 
> KB – 1.5 MB buffers dynamically (token buffers, intermediate tensors). 
> Aligning each one to a PMD boundary, even when its length wasn't a 
> PMD multiple, left gaps between them, defeating natural coalescing 
> into a single THP.
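>
> If it helps to visualize the layout effect, here is a minimal 
> userspace sketch (the buffer count and size are invented for 
> illustration, not taken from our workload); it mmaps a few anonymous 
> buffers whose length is not a 2 MB multiple and prints where the 
> kernel placed them, which can also be cross-checked in 
> /proc/self/maps:
>
>     #include <stdio.h>
>     #include <sys/mman.h>
>
>     #define NBUF   4
>     #define BUF_SZ (3UL << 20)  /* 3 MB: deliberately not a multiple of
>                                    the 2 MB PMD size */
>
>     int main(void)
>     {
>             for (int i = 0; i < NBUF; i++) {
>                     void *p = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
>                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
>                     if (p == MAP_FAILED) {
>                             perror("mmap");
>                             return 1;
>                     }
>                     /* A kernel that rounds such requests to a PMD
>                      * boundary prints 2 MB-multiple addresses with
>                      * holes between the buffers; otherwise the buffers
>                      * can sit back to back. */
>                     printf("buffer %d at %p\n", i, p);
>             }
>             return 0;
>     }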
>
> 🔹 2. TLB aliasing and cache index pressure
>
> These fragmented mappings caused frequent TLB misses and poor L1/L2 
> cache reuse.
>
> The result was what looked like “memory thrashing,” with slow memory 
> access dominating total inference time.
> When every mapping is PMD-aligned (even if not PMD-sized), the gaps 
> between them prevent Transparent Huge Pages (THPs) from activating 
> effectively.
>
> This breaks THP coalescence and causes fragmented page tables and 
> higher memory overhead per shard.
>
> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
> This leads to higher TLB miss rates, especially under multi-threaded 
> load, which dramatically slows down token embedding and attention 
> calculations.
>
> When loading model shards, memory initialization becomes 
> cache-unfriendly, with poor reuse across cores.
>
> This affects not only inference latency but also model cold-start time 
> — which is critical in autoscaling deployments.
>
> 🔹 4. Qualitative Observation
> Without this patch: shard loading stuttered, warm-up was slow, and we 
> saw CPU cycles dominated by page_fault and TLB miss handlers.
>
> With this patch: shard loading smoothed out, THPs were correctly 
> applied (based on smaps), and throughput shot up by an order of 
> magnitude.
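>
> For reference, the smaps check boils down to watching the 
> AnonHugePages counters. A minimal sketch of that, assuming the kernel 
> is new enough to expose the per-process /proc/<pid>/smaps_rollup 
> summary, is:
>
>     #include <stdio.h>
>     #include <string.h>
>
>     /* Print the AnonHugePages total for the current process. */
>     int main(void)
>     {
>             FILE *f = fopen("/proc/self/smaps_rollup", "r");
>             char line[256];
>
>             if (!f) {
>                     perror("fopen");
>                     return 1;
>             }
>             while (fgets(line, sizeof(line), f)) {
>                     if (strncmp(line, "AnonHugePages:", 14) == 0)
>                             fputs(line, stdout);
>             }
>             fclose(f);
>             return 0;
>     }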
>
> 🔹 5. Measured Impact
> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on 
> non-aligned sizes showed 11–32× worse performance.
>
> With the patched kernel (which skips alignment unless the length is 
> PMD-aligned), memory layout was contiguous again and THP was 
> consistently utilized.
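>
> For anyone skimming the patch, my reading of the relevant check, 
> paraphrased rather than the verbatim upstream diff (and with 
> default_get_unmapped_area used only as a stand-in name for the normal 
> non-THP placement path), is roughly:
>
>     if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
>         IS_ALIGNED(len, PMD_SIZE)) {
>             /* A PMD-multiple request is worth rounding up to a 2 MB
>              * boundary, since the whole mapping can then be backed by
>              * huge pages. */
>             addr = thp_get_unmapped_area(filp, addr, len, pgoff, flags);
>     } else {
>             /* Everything else keeps the default placement, so
>              * odd-length anonymous mappings can sit back to back
>              * instead of leaving sub-2MB holes between them. */
>             addr = default_get_unmapped_area(filp, addr, len, pgoff, flags);
>     }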
>
> This isn’t about one extra THP — it’s about preventing widespread THP 
> fragmentation and the resulting dramatic cache/TLB degradation. For AI 
> workloads with high concurrency and dynamic shapes, this small patch 
> has a massive effect on layout and locality.
>
> So, it's not just “1 more huge page” — it's avoiding massive 
> fragmentation that leads to:
>
> 1. TLB miss storms
>
> 2. Poor locality
>
> 3. Cache index thrashing
>
> 4. Degraded latency and throughput
>
> This applies across many adjacent, odd-length allocations typical of 
> AI inference workloads.
>
> The original alignment logic created a pattern of broken contiguity — 
> defeating THP benefits altogether.
>
> In AI workloads using Hugging Face Transformers, model shards and 
> intermediate tensors are dynamically allocated during inference. These 
> allocations often fall just below or above the 2MB threshold that THP 
> relies on. Forcing such allocations to PMD boundaries causes 
> fragmentation and disrupts huge page coalescence, hurting performance.
>
> 📊 Memory Allocation Pattern Diagram
>
> Without Patch (PMD Alignment Forced):
>
> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
> | Alloc A |         | Alloc B |         | Alloc C |
>
> Each allocation is PMD-aligned, even if it’s not PMD-sized
>
> Gaps prevent THP coalescence → TLB/cache fragmentation
>
> With Patch (PMD Alignment Conditional):
>
> |<---------6MB Contiguous Region--------->|
> |  Alloc A  | Alloc B | Alloc C | Padding |
>
> Contiguous anonymous memory region
>
> Coalesced into one or more THPs
>
> Improved locality and TLB efficiency
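>
> To put rough, illustrative numbers on the diagrams (sizes chosen to 
> match the picture, not measured): if Alloc A, B and C are about 1.5 MB 
> each, forced alignment spreads 4.5 MB of data across three separate 
> 2 MB slots, leaving three ~0.5 MB holes, preventing the VMAs from 
> merging, and leaving no fully covered 2 MB-aligned extent that the 
> kernel could back with a huge page. Packed back to back, the same 
> 4.5 MB occupies one contiguous region in which the fully covered, 
> 2 MB-aligned extents can be collapsed into THPs.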
>
> While I regret not having the raw perf output at hand, I’d be happy to 
> replicate a similar test locally and share reproducible results if 
> helpful.
>
> Best Regards,
>
> Siddhartha Sharma

Thanks for your detailed explanation! I had misunderstood: I thought the 
optimization you were talking about was due to efa7df3e3bb5, when instead 
it was due to the alignment. Your explanation makes a lot of sense!


For this workload, do you enable mTHPs on your system? My plan is to make 
a similar patch for the mTHP case, and I'd be grateful if you could get me 
some results : )



Thread overview: 28+ messages
2025-06-27 10:39 siddhartha
2025-06-27 10:45 ` siddhartha
2025-06-27 15:30 ` Lorenzo Stoakes
2025-06-28  3:49   ` Dev Jain
2025-06-30  0:43     ` siddhartha
2025-06-30  5:25       ` Dev Jain [this message]
2025-06-30  5:28         ` Dev Jain
2025-06-30 10:54         ` Lorenzo Stoakes
2025-06-30 11:48           ` siddhartha
2025-07-01  5:23           ` Dev Jain
2025-07-01  5:28             ` Lorenzo Stoakes
2025-07-01  5:45               ` Dev Jain
2025-07-01  5:53                 ` Lorenzo Stoakes
2025-07-01  6:30                   ` Dev Jain
2025-07-01  6:50                     ` Lorenzo Stoakes
2025-07-01  6:58                       ` Dev Jain
2025-07-01 12:15                         ` siddhartha
2025-07-01 12:39                           ` Lorenzo Stoakes
2025-07-01 13:23                             ` siddhartha
2025-07-01 13:28                               ` Lorenzo Stoakes
2025-07-01 14:20                                 ` siddhartha
2025-07-01 16:20                             ` Dev Jain
2025-07-01 18:49                               ` Zi Yan
2025-07-07  8:56                                 ` Vlastimil Babka
2025-07-28  5:41                                   ` siddhartha
2025-07-28 11:00                                     ` Vlastimil Babka
2025-07-01 15:40                           ` Yang Shi
2025-08-11 22:14 siddhartha
