From: siddhartha@kenip.in
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Dev Jain <dev.jain@arm.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	mgorman@suse.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
Date: Mon, 30 Jun 2025 17:18:02 +0530	[thread overview]
Message-ID: <8128c0338e5df5476ec9fd6eb3079964@kenip.in> (raw)
In-Reply-To: <ba2c89bd-88de-48f8-abd0-b62d8b1d50b3@lucifer.local>

On 2025-06-30 16:24, Lorenzo Stoakes wrote:
> +cc Vlastimil; please keep him cc'd on discussions here, as he is the
> author of this fix.
> 
> On Mon, Jun 30, 2025 at 10:55:52AM +0530, Dev Jain wrote:
>> For this workload, do you enable mTHPs on your system? My plan is to
>> make a similar patch for the mTHP case and I'd be grateful if you can
>> get me some results : )
> 
> I'd urge caution here.
> 
> The reason there was a big perf improvement is that, for certain
> workloads, the original patch by Rik caused issues with VMA
> fragmentation. So rather than getting adjacent VMAs that might later
> be khugepage'd, you'd get a bunch of VMAs that were auto-aligned and
> thus fragmented from one another.
>
> So while you got speed-ups on some workloads, you got a really bad
> perf impact on some that were subject to this.
>
> The observed speed-up was also on a very specific benchmark. While
> it's a great improvement, it's important to understand the context
> (see the original patch for details [0]).
>
> I do think it's worth considering changing
> thp_get_unmapped_area_vmflags() for mTHP, as it's currently very
> limited (just PMD alignment) and it'd possibly be sensible to change
> this to checking against allowed THP alignments, but I'd not assume
> this is going to get some crazy speed-up as observed here.
>
> Note that any such change would probably require some refactoring in
> THP first to make it not quite so awful.
> 
> I also think mTHP isn't really relevant for Siddhartha's use case, is
> it, as Intel don't support mTHP currently, do they?
> 
> Regards, Lorenzo
> 
> [0]: https://lore.kernel.org/all/20241024151228.101841-2-vbabka@suse.cz/T/#u

Hi Lorenzo, Dev, All,

Thank you for the thoughtful responses and for engaging with the 
performance implications of the patch.

You're absolutely right that the observed speedup came from a specific 
class of workloads — in this case, token-length-variable AI inference 
pipelines based on Hugging Face Transformers and ONNX Runtime. These 
workloads trigger highly dynamic, anonymous memory allocation patterns, 
often in bursts aligned with model shard loading and attention map 
resizing. In such cases, VMA fragmentation due to PMD-aligned, 
non-PMD-sized mappings led to near-complete loss of THP utilization.
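
To make the fragmentation mechanism concrete, here is a minimal
userspace sketch (my own illustration, not code from either patch; it
assumes x86-64 with 2 MiB PMDs and omits error handling):

/* thp-align.c: observe the gap between consecutive anonymous mappings.
 * On a kernel that force-aligns THP-eligible mappings to PMD
 * boundaries regardless of size, two back-to-back mmap()s of a
 * non-PMD-multiple length end up separated by an unusable gap, so the
 * VMAs cannot merge and khugepaged has less contiguous anon memory to
 * collapse.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define SZ ((2UL << 20) + (512UL << 10))  /* 2.5 MiB: not a PMD multiple */

int main(void)
{
        char *a = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *b = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Pre-fix: a non-zero gap, because each mapping was rounded to
         * a 2 MiB boundary. Post-fix: the mappings can be adjacent and
         * merge into one VMA (compare /proc/self/maps).
         */
        printf("a=%p b=%p gap=%ld\n", a, b,
               (long)(a > b ? a - (b + SZ) : b - (a + SZ)));
        return 0;
}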

Once the forced PMD alignment was limited to PMD-sized mappings (via
Vlastimil's fix to Rik's original patch), we observed substantial
restoration of THP behavior, which is where the performance gains came
from. That said, I completely agree that:

  - not all workloads benefit from this, and
  - some may even regress if the underlying VMAs aren't THP-coalescible
    for other reasons.
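
One quick way to check whether THP coverage is actually restored for a
given process is to read AnonHugePages out of /proc/self/smaps_rollup;
a minimal sketch (deliberately simplistic parsing; smaps_rollup needs a
4.14+ kernel):

/* thp-check.c: report how much of this process's anonymous memory is
 * currently backed by transparent huge pages.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/self/smaps_rollup", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "AnonHugePages:", 14))
                        fputs(line, stdout);  /* e.g. "AnonHugePages: 4096 kB" */
        fclose(f);
        return 0;
}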

Still, for high-throughput inference workloads on modern Intel CPUs, 
this behavior isn’t a corner case. The shift toward multi-model 
concurrent serving (e.g., LLM-as-a-Service) means this dynamic 
allocation pattern is becoming common, especially in 
edge/latency-sensitive deployments.

On mTHP: Intel Does Support It

Regarding mTHP: yes, Intel platforms (especially server-grade Xeon
processors from Cascade Lake onward) do support multi-size transparent
huge pages, including via:

  - tmpfs-backed files,
  - madvise(MADV_HUGEPAGE) on file mappings, and
  - shmem, with shmem_enabled set in the kernel.
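
As a concrete illustration of the madvise() route, here is a minimal
sketch (my illustration; the path, file name, and size are placeholders,
and whether huge pages are actually used depends on shmem_enabled and
the kernel's THP settings):

/* shmem-thp.c: request THP for a tmpfs-backed model file. Hypothetical
 * sketch: assumes /dev/shm is tmpfs and shmem THP is available.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 64UL << 20;  /* 64 MiB, e.g. one model shard */
        int fd = open("/dev/shm/model-shard", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Ask for huge pages on this shmem-backed mapping. Whether we
         * get them depends on shmem_enabled and, for mTHP sizes, the
         * per-size knobs under /sys/kernel/mm/transparent_hugepage/.
         */
        if (madvise(p, len, MADV_HUGEPAGE) < 0)
                perror("madvise");

        /* ... load tensors; verify via ShmemPmdMapped in /proc/meminfo ... */
        munmap(p, len);
        close(fd);
        return 0;
}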

So I’d say mTHP is certainly relevant for workloads where model weights 
or tensors are pre-cached or memory-mapped — a pattern we’re also seeing 
as Hugging Face, ONNX, and PyTorch ecosystems move toward zero-copy 
tensor sharing.

Given that, I'd absolutely be interested in testing any mTHP-targeted 
patch — and I’d be happy to help validate it, especially if it avoids 
the VMA fragmentation pitfall you rightly pointed out.

Thanks again for the detailed feedback, and I’ll try to replicate and 
share further traces (from my local testbed) since I currently don’t 
have access to the original Intel Developer Cloud logs.

Best regards,
Siddhartha Sharma

