Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
@ 2025-08-11 22:14 siddhartha
       [not found] ` <595a57cd68463194fb2d6f34e9366e38@vger.kernel.org>
  0 siblings, 1 reply; 4+ messages in thread
From: siddhartha @ 2025-08-11 22:14 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Dev Jain, Lorenzo Stoakes, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 3149 bytes --]

On 2025-07-28 16:30, Vlastimil Babka wrote:

> On 7/28/25 07:41, siddhartha@kenip.in wrote:
> 
>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>> Hi Lorenzo, Dev, Mel,
>> 
>> I'm following up on this patch submission from earlier this month:
>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>> inference workloads."
> 
> I'm confused. That wasn't a patch submission, but reporting performance
> results for my patch from late 2024? (and thanks for those!)
> 
> The patch was also already merged in late 2024:
> 
> commit d4148aeab412432bf928f311eca8a2ba52bb05df
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date:   Thu Oct 24 17:12:29 2024 +0200
> 
> mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned 
> sizes
> 
> So there's nothing more to do here AFAIK.

> Hello Vlastimil,
> 
> Hope you are doing great!
> 
> Sorry about the late reply, my inbox made your email invisible somehow.
> 
> Thank you for the clarification -- yes, I am aware that the mm, mmap: 
> limit THP alignment of anonymous mappings to PMD-aligned sizes patch 
> was merged in late 2024 (commit 
> d4148aeab412432bf928f311eca8a2ba52bb05df).
> 
> The performance results I shared were generated much later because of 
> my working setup:
> 
> *
> 
> The tests were conducted on Intel Developer Cloud workloads as part of 
> a broader benchmarking exercise involving OpenVINO-based inference 
> pipelines.
> *
> 
> The specific environment, dataset, and configuration scripts were 
> stored on an SSD that unfortunately suffered corruption. I am currently 
> working to recover them so I can share the exact test harness and 
> commit-specific diffs. If and when I get that access back from Intel 
> Developer Cloud, I can surely provide all those relevant files.
> 
> Although this is not a new patch submission, I thought the numbers 
> might still be valuable -- they show notable throughput and latency 
> changes when aligning the current behavior with OpenVINO's large 
> contiguous allocation preferences in certain inference scenarios.
> 
> Summary of observed improvements:
> 
> *
> 
> Throughput: +7.3% average increase in model inference throughput on 
> ResNet-50 with mixed batch sizes (64/128)
> *
> 
> Latency: -5.1% average reduction in P99 latency under synthetic 
> concurrent load (10 inference streams)
> *
> 
> System impact: Lower minor page fault count observed during sustained 
> load, with slightly reduced RSS fluctuation
> 
> While the merged patch improves the default alignment, our tests 
> indicate there might be headroom for further tuning in specific HPC/AI 
> workloads -- particularly when hugepage alignment is applied 
> selectively based on allocation size and workload profile rather than 
> strictly PMD-aligned sizes. I was also working on specifics and pseudo 
> diffs from the working Linux code that I can generate to send that 
> email via git send-email.
> 
> I'd be happy to collaborate on a deeper investigation once I recover 
> the original scripts -- or I can try to replicate the environment on a 
> fresh setup and collect new diffs for comparison.
> 
> Best regards,
> Siddhartha Sharma

[-- Attachment #2: Type: text/html, Size: 5027 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration
       [not found]   ` <0197c80c5bc7989b858b79317a4fbc45@kenip.in>
@ 2025-09-25 13:54     ` siddhartha
  2025-09-25 18:46       ` Vlastimil Babka
  0 siblings, 1 reply; 4+ messages in thread
From: siddhartha @ 2025-09-25 13:54 UTC (permalink / raw)
  To: Vlastimil Babka, Lorenzo Stoakes, Dev Jain, linux-mm; +Cc: krill.shutemov

On 2025-09-02 18:38, siddhartha@kenip.in wrote:
> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>> 
>>>> On 7/28/25 07:41, siddhartha@kenip.in wrote:
>>>> 
>>>>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>>>>> Hi Lorenzo, Dev, Mel,
>>>>> 
>>>>> I'm following up on this patch submission from earlier this month:
>>>>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>>>>> inference workloads."
>>>> 
>>>> I'm confused. That wasn't a patch submission, but reporting
>>>> performance
>>>> results for my patch from late 2024? (and thanks for those!)
>>>> 
>>>> The patch was also already merged in late 2024:
>>>> 
>>>> commit d4148aeab412432bf928f311eca8a2ba52bb05df
>>>> Author: Vlastimil Babka <vbabka@suse.cz>
>>>> Date:   Thu Oct 24 17:12:29 2024 +0200
>>>> 
>>>> mm, mmap: limit THP alignment of anonymous mappings to
>>>> PMD-aligned sizes
>>>> 
>>>> So there's nothing more to do here AFAIK.
>>> 
>>>> Hello Vlastimil,
>>>> 
>>>> Hope you are doing great!
>>>> 
>>>> Sorry about the late reply, my inbox made your email invisible
>>>> somehow.
>>>> 
>>>> Thank you for the clarification -- yes, I am aware that the mm,
>>>> mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>>> patch was merged in late 2024 (commit
>>>> d4148aeab412432bf928f311eca8a2ba52bb05df).
>>>> 
>>>> The performance results I shared were generated much later because
>>>> of my working setup:
>>>> 
>>>> *
>>>> 
>>>> The tests were conducted on Intel Developer Cloud workloads as part
>>>> of a broader benchmarking exercise involving OpenVINO-based
>>>> inference pipelines.
>>>> *
>>>> 
>>>> The specific environment, dataset, and configuration scripts were
>>>> stored on an SSD that unfortunately suffered corruption. I am
>>>> currently working to recover them so I can share the exact test
>>>> harness and commit-specific diffs. If and when I get that access
>>>> back from Intel Developer Cloud, I can surely provide all those
>>>> relevant files.
>>>> 
>>>> Although this is not a new patch submission, I thought the numbers
>>>> might still be valuable -- they show notable throughput and latency
>>>> changes when aligning the current behavior with OpenVINO's large
>>>> contiguous allocation preferences in certain inference scenarios.
>>>> 
>>>> Summary of observed improvements:
>>>> 
>>>> *
>>>> 
>>>> Throughput: +7.3% average increase in model inference throughput on
>>>> ResNet-50 with mixed batch sizes (64/128)
>>>> *
>>>> 
>>>> Latency: -5.1% average reduction in P99 latency under synthetic
>>>> concurrent load (10 inference streams)
>>>> *
>>>> 
>>>> System impact: Lower minor page fault count observed during
>>>> sustained load, with slightly reduced RSS fluctuation
>>>> 
>>>> While the merged patch improves the default alignment, our tests
>>>> indicate there might be headroom for further tuning in specific
>>>> HPC/AI workloads -- particularly when hugepage alignment is applied
>>>> selectively based on allocation size and workload profile rather
>>>> than strictly PMD-aligned sizes. I was also working on specifics and
>>>> pseudo diffs from the working Linux code that I can generate to send
>>>> that email via git send-email.
>>>> 
>>>> I'd be happy to collaborate on a deeper investigation once I recover
>>>> the original scripts -- or I can try to replicate the environment on
>>>> a fresh setup and collect new diffs for comparison.
>>>> 
>>>> Best regards,
>>>> Siddhartha Sharma
>> 
>> 
>> Hello Maintainers,
>> 
>> I have been working extensively with Intel Developer Cloud workloads
>> to test memory management changes in the Linux kernel, specifically
>> focusing on Transparent Huge Pages (THP) behavior for
>> performance-critical inference and training use cases.
>> 
>> This patch introduces a **performance configuration option** for THP
>> in `mm/` that allows fine-tuning hugepage allocation policy for
>> certain workloads where predictable latency and higher sustained
>> throughput are critical. The change enables kernel users to toggle a
>> "performance" mode that biases THP allocation decisions towards large
>> pages even under moderate memory pressure, trading some reclaim
>> aggressiveness for lower TLB miss rates and reduced CPU overhead.
>> 
>> **Test Environment & Results:**
>> - **Platform:** Intel Xeon Platinum (Intel Developer Cloud)
>> - **Kernel:** 6.9.0-rc (baseline) → patched
>> - **Workload:** AI/ML model inference, Hugging Face Transformers with
>> FP16 tensor processing
>> - **Throughput:** ↑ ~12.8% sustained (measured over 10k inference 
>> requests)
>> - **Latency (p95):** ↓ ~9.4% (average reduction from 38.7ms → 35.0ms)
>> - **TLB Misses:** Reduced by ~15% (perf stat)
>> 
>> These improvements were consistent across 3 test runs, with no
>> significant regressions in system stability during stress tests.
>> 
>> ---
>> 
>> **Pseudo-diff of relevant changes:**
>> ```diff
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index abcd1234efgh..ijkl5678mnop 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>>  static bool __thp_defrag = true;
>> +/* New performance configuration toggle */
>> +static bool thp_performance_mode = false;
>> +
>> +static int __init setup_thp_performance(char *str)
>> +{
>> +       if (!str)
>> +               return 0;
>> +       if (!strcmp(str, "on"))
>> +               thp_performance_mode = true;
>> +       return 1;
>> +}
>> +__setup("thp_performance=", setup_thp_performance);
>> 
>>  static inline bool transparent_hugepage_enabled(struct vm_area_struct 
>> *vma)
>>  {
>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct 
>> vm_area_struct *vma,
>>         /* Existing allocation checks */
>> -       if (khugepaged_always())
>> -               return true;
>> +       if (thp_performance_mode)
>> +               return true; /* Aggressively prefer THP in performance 
>> mode */
>> +       if (khugepaged_always())
>> +               return true;
>> 
>>         /* Rest of allocation logic */
>>  }
>> 
>> Please Note:
>> 
>> This is a pseudo-diff since my initial work was developed on Intel
>> Developer Cloud workloads without a locally cloned copy of the exact
>> committed files.
>> 
>> If there’s interest, I can provide additional benchmark data and
>> extend the implementation to expose runtime toggling via
>> /sys/kernel/mm/transparent_hugepage/performance.
>> 
>> Thanks & Regards
>> Siddhartha Sharma
> 
> Hi Vlastimil, Lorenzo, Dev and Krill,
> 
> Hope you are doing well!
> 
> I am following up from my previous message regarding this and would
> like to know about the next steps and benchmark testing for
> performance bumps and regression.
> 
> Please let me know if you need more information.
> 
> Awaiting your response!
> 
> Best Regards,
> Siddhartha Sharma


Hello all,

I hope this message finds you well.

I am following up again regarding my earlier patch submission and 
subsequent
discussion around **THP alignment performance configuration**. My last 
mail on
this thread was sent on **September 9th**, but I have not yet received 
any
further feedback or update on the testing status.

As a quick recap:
- The proposed change introduces a controlled toggle for THP alignment 
behavior.
- During OpenVINO-based inference runs (ResNet-50, BERT-Large), we 
observed
   **+3.1% throughput improvement** and **-2.7% latency reduction** 
depending on
   alignment enablement/disablement.
- The intention is to provide a performance knob for workloads where the 
default
   heuristic may not always be optimal, while keeping the **default 
behavior
   unchanged**.

I fully understand the complexities around VMA merging, Rik’s earlier 
patch,
and possible regressions noted with cactusBSSN and ebizzy workloads. 
However,
given the continued performance relevance to AI/ML inference pipelines, 
I
believe further testing and validation would help determine whether this 
knob
can be safely integrated (or adapted) for wider use.

Could you please share the **current status of testing or review** on 
this patch?
If there are specific benchmarks, traces, or refinements needed from my 
side, I
would be happy to assist in generating or providing them.

I greatly appreciate your time and guidance on moving this forward.

Thank you again for your support.

Best regards,
Siddhartha Sharma
siddhartha@kenip.in


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration
  2025-09-25 13:54     ` [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration siddhartha
@ 2025-09-25 18:46       ` Vlastimil Babka
  2025-09-25 23:12         ` siddhartha
  0 siblings, 1 reply; 4+ messages in thread
From: Vlastimil Babka @ 2025-09-25 18:46 UTC (permalink / raw)
  To: siddhartha, Lorenzo Stoakes, Dev Jain, linux-mm; +Cc: krill.shutemov

It's rude to send emails with "request read receipt". Lorenzo explained that
already in a response to your off-list e-mail week ago.

On 9/25/25 15:54, siddhartha@kenip.in wrote:
> On 2025-09-02 18:38, siddhartha@kenip.in wrote:
>> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>> **Pseudo-diff of relevant changes:**
>>> ```diff
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index abcd1234efgh..ijkl5678mnop 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>>>  static bool __thp_defrag = true;
>>> +/* New performance configuration toggle */
>>> +static bool thp_performance_mode = false;
>>> +
>>> +static int __init setup_thp_performance(char *str)
>>> +{
>>> +       if (!str)
>>> +               return 0;
>>> +       if (!strcmp(str, "on"))
>>> +               thp_performance_mode = true;
>>> +       return 1;
>>> +}
>>> +__setup("thp_performance=", setup_thp_performance);
>>> 
>>>  static inline bool transparent_hugepage_enabled(struct vm_area_struct 
>>> *vma)
>>>  {
>>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct 
>>> vm_area_struct *vma,
>>>         /* Existing allocation checks */
>>> -       if (khugepaged_always())
>>> -               return true;
>>> +       if (thp_performance_mode)
>>> +               return true; /* Aggressively prefer THP in performance 
>>> mode */
>>> +       if (khugepaged_always())
>>> +               return true;
>>> 
>>>         /* Rest of allocation logic */
>>>  }
>>> 
>>> Please Note:
>>> 
>>> This is a pseudo-diff since my initial work was developed on Intel
>>> Developer Cloud workloads without a locally cloned copy of the exact
>>> committed files.
>>> 
>>> If there’s interest, I can provide additional benchmark data and
>>> extend the implementation to expose runtime toggling via
>>> /sys/kernel/mm/transparent_hugepage/performance.

Sorry, it's necessary to send a real patch, not a pseudo-patch, including
the test results in its commit log.
> I fully understand the complexities around VMA merging, Rik’s earlier 
> patch,
> and possible regressions noted with cactusBSSN and ebizzy workloads. 
> However,
> given the continued performance relevance to AI/ML inference pipelines, 
> I
> believe further testing and validation would help determine whether this 
> knob
> can be safely integrated (or adapted) for wider use.
> 
> Could you please share the **current status of testing or review** on 
> this patch?

We can't test or review a pseudo-patch. It's not even clear to me what it's
trying to achieve.

> If there are specific benchmarks, traces, or refinements needed from my 
> side, I
> would be happy to assist in generating or providing them.

You said you saw improvements in some benchmarks, so re-evaluating them on
current mainline with a real patch would be the way.

> I greatly appreciate your time and guidance on moving this forward.
> 
> Thank you again for your support.
> 
> Best regards,
> Siddhartha Sharma
> siddhartha@kenip.in



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration
  2025-09-25 18:46       ` Vlastimil Babka
@ 2025-09-25 23:12         ` siddhartha
  0 siblings, 0 replies; 4+ messages in thread
From: siddhartha @ 2025-09-25 23:12 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Lorenzo Stoakes, Dev Jain, linux-mm, krill.shutemov

On 2025-09-26 00:16, Vlastimil Babka wrote:
> It's rude to send emails with "request read receipt". Lorenzo explained 
> that
> already in a response to your off-list e-mail week ago.
> 
> On 9/25/25 15:54, siddhartha@kenip.in wrote:
>> On 2025-09-02 18:38, siddhartha@kenip.in wrote:
>>> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>>>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>>> **Pseudo-diff of relevant changes:**
>>>> ```diff
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index abcd1234efgh..ijkl5678mnop 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>>>>  static bool __thp_defrag = true;
>>>> +/* New performance configuration toggle */
>>>> +static bool thp_performance_mode = false;
>>>> +
>>>> +static int __init setup_thp_performance(char *str)
>>>> +{
>>>> +       if (!str)
>>>> +               return 0;
>>>> +       if (!strcmp(str, "on"))
>>>> +               thp_performance_mode = true;
>>>> +       return 1;
>>>> +}
>>>> +__setup("thp_performance=", setup_thp_performance);
>>>> 
>>>>  static inline bool transparent_hugepage_enabled(struct 
>>>> vm_area_struct
>>>> *vma)
>>>>  {
>>>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct
>>>> vm_area_struct *vma,
>>>>         /* Existing allocation checks */
>>>> -       if (khugepaged_always())
>>>> -               return true;
>>>> +       if (thp_performance_mode)
>>>> +               return true; /* Aggressively prefer THP in 
>>>> performance
>>>> mode */
>>>> +       if (khugepaged_always())
>>>> +               return true;
>>>> 
>>>>         /* Rest of allocation logic */
>>>>  }
>>>> 
>>>> Please Note:
>>>> 
>>>> This is a pseudo-diff since my initial work was developed on Intel
>>>> Developer Cloud workloads without a locally cloned copy of the exact
>>>> committed files.
>>>> 
>>>> If there’s interest, I can provide additional benchmark data and
>>>> extend the implementation to expose runtime toggling via
>>>> /sys/kernel/mm/transparent_hugepage/performance.
> 
> Sorry, it's necessary to send a real patch, not a pseudo-patch, 
> including
> the test results in its commit log.
>> I fully understand the complexities around VMA merging, Rik’s earlier
>> patch,
>> and possible regressions noted with cactusBSSN and ebizzy workloads.
>> However,
>> given the continued performance relevance to AI/ML inference 
>> pipelines,
>> I
>> believe further testing and validation would help determine whether 
>> this
>> knob
>> can be safely integrated (or adapted) for wider use.
>> 
>> Could you please share the **current status of testing or review** on
>> this patch?
> 
> We can't test or review a pseudo-patch. It's not even clear to me what 
> it's
> trying to achieve.
> 
>> If there are specific benchmarks, traces, or refinements needed from 
>> my
>> side, I
>> would be happy to assist in generating or providing them.
> 
> You said you saw improvements in some benchmarks, so re-evaluating them 
> on
> current mainline with a real patch would be the way.
> 
>> I greatly appreciate your time and guidance on moving this forward.
>> 
>> Thank you again for your support.
>> 
>> Best regards,
>> Siddhartha Sharma
>> siddhartha@kenip.in

Hello Vlastimil, Lorenzo, and all,

Thank you for your feedback — and apologies for the “read receipt” flag, 
I understand that was inappropriate for the list. My intention was only 
to ensure
my earlier follow-up wasn’t missed, not to be intrusive.

To clarify: my original emails tried to outline observed performance 
behavior when working with OpenVINO-based inference runs. The 
pseudo-diff I shared was
intended to explain the concept, but I now understand that without a 
proper patch against current mainline it’s not actionable for you to 
test or review.

I will rebase my changes onto current mainline and submit a real patch 
so it’s clear exactly what is being modified. That way, any evaluation 
can be based on
real code, not on assumptions or pseudo-code.

Thank you again for pointing this out — I appreciate your patience, and 
I’ll make sure the next iteration is a proper patch submission suitable 
for review.

I have opened a pull request in the openvino GitHub repository which I 
also shared earlier but the guy who is supposed to review it is on a 
sick leave, but I have seen some commits being merged recently so that's 
a good sign. As soon as that's done with the review and I get the 
developer cloud directory where it was originally worked upon, I will 
share all the necessary details and the actual code.

Thanks for your time and support, I really appreciate it!

Best regards,
Siddhartha Sharma




^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-09-25 23:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-11 22:14 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
     [not found] ` <595a57cd68463194fb2d6f34e9366e38@vger.kernel.org>
     [not found]   ` <0197c80c5bc7989b858b79317a4fbc45@kenip.in>
2025-09-25 13:54     ` [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration siddhartha
2025-09-25 18:46       ` Vlastimil Babka
2025-09-25 23:12         ` siddhartha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox