* Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
@ 2025-08-11 22:14 siddhartha
[not found] ` <595a57cd68463194fb2d6f34e9366e38@vger.kernel.org>
0 siblings, 1 reply; 4+ messages in thread
From: siddhartha @ 2025-08-11 22:14 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Dev Jain, Lorenzo Stoakes, linux-mm, LKML
[-- Attachment #1: Type: text/plain, Size: 3149 bytes --]
On 2025-07-28 16:30, Vlastimil Babka wrote:
> On 7/28/25 07:41, siddhartha@kenip.in wrote:
>
>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>> Hi Lorenzo, Dev, Mel,
>>
>> I'm following up on this patch submission from earlier this month:
>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>> inference workloads."
>
> I'm confused. That wasn't a patch submission, but reporting performance
> results for my patch from late 2024? (and thanks for those!)
>
> The patch was also already merged in late 2024:
>
> commit d4148aeab412432bf928f311eca8a2ba52bb05df
> Author: Vlastimil Babka <vbabka@suse.cz>
> Date: Thu Oct 24 17:12:29 2024 +0200
>
> mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned
> sizes
>
> So there's nothing more to do here AFAIK.
> Hello Vlastimil,
>
> Hope you are doing great!
>
> Sorry about the late reply, my inbox made your email invisible somehow.
>
> Thank you for the clarification -- yes, I am aware that the mm, mmap:
> limit THP alignment of anonymous mappings to PMD-aligned sizes patch
> was merged in late 2024 (commit
> d4148aeab412432bf928f311eca8a2ba52bb05df).
>
> The performance results I shared were generated much later because of
> my working setup:
>
> *
>
> The tests were conducted on Intel Developer Cloud workloads as part of
> a broader benchmarking exercise involving OpenVINO-based inference
> pipelines.
> *
>
> The specific environment, dataset, and configuration scripts were
> stored on an SSD that unfortunately suffered corruption. I am currently
> working to recover them so I can share the exact test harness and
> commit-specific diffs. If and when I get that access back from Intel
> Developer Cloud, I can surely provide all those relevant files.
>
> Although this is not a new patch submission, I thought the numbers
> might still be valuable -- they show notable throughput and latency
> changes when aligning the current behavior with OpenVINO's large
> contiguous allocation preferences in certain inference scenarios.
>
> Summary of observed improvements:
>
> *
>
> Throughput: +7.3% average increase in model inference throughput on
> ResNet-50 with mixed batch sizes (64/128)
> *
>
> Latency: -5.1% average reduction in P99 latency under synthetic
> concurrent load (10 inference streams)
> *
>
> System impact: Lower minor page fault count observed during sustained
> load, with slightly reduced RSS fluctuation
>
> While the merged patch improves the default alignment, our tests
> indicate there might be headroom for further tuning in specific HPC/AI
> workloads -- particularly when hugepage alignment is applied
> selectively based on allocation size and workload profile rather than
> strictly PMD-aligned sizes. I was also working on specifics and pseudo
> diffs from the working Linux code that I can generate to send that
> email via git send-email.
>
> I'd be happy to collaborate on a deeper investigation once I recover
> the original scripts -- or I can try to replicate the environment on a
> fresh setup and collect new diffs for comparison.
>
> Best regards,
> Siddhartha Sharma
[-- Attachment #2: Type: text/html, Size: 5027 bytes --]
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration
[not found] ` <0197c80c5bc7989b858b79317a4fbc45@kenip.in>
@ 2025-09-25 13:54 ` siddhartha
2025-09-25 18:46 ` Vlastimil Babka
0 siblings, 1 reply; 4+ messages in thread
From: siddhartha @ 2025-09-25 13:54 UTC (permalink / raw)
To: Vlastimil Babka, Lorenzo Stoakes, Dev Jain, linux-mm; +Cc: krill.shutemov
On 2025-09-02 18:38, siddhartha@kenip.in wrote:
> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>>
>>>> On 7/28/25 07:41, siddhartha@kenip.in wrote:
>>>>
>>>>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>>>>> Hi Lorenzo, Dev, Mel,
>>>>>
>>>>> I'm following up on this patch submission from earlier this month:
>>>>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>>>>> inference workloads."
>>>>
>>>> I'm confused. That wasn't a patch submission, but reporting
>>>> performance
>>>> results for my patch from late 2024? (and thanks for those!)
>>>>
>>>> The patch was also already merged in late 2024:
>>>>
>>>> commit d4148aeab412432bf928f311eca8a2ba52bb05df
>>>> Author: Vlastimil Babka <vbabka@suse.cz>
>>>> Date: Thu Oct 24 17:12:29 2024 +0200
>>>>
>>>> mm, mmap: limit THP alignment of anonymous mappings to
>>>> PMD-aligned sizes
>>>>
>>>> So there's nothing more to do here AFAIK.
>>>
>>>> Hello Vlastimil,
>>>>
>>>> Hope you are doing great!
>>>>
>>>> Sorry about the late reply, my inbox made your email invisible
>>>> somehow.
>>>>
>>>> Thank you for the clarification -- yes, I am aware that the mm,
>>>> mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>>> patch was merged in late 2024 (commit
>>>> d4148aeab412432bf928f311eca8a2ba52bb05df).
>>>>
>>>> The performance results I shared were generated much later because
>>>> of my working setup:
>>>>
>>>> *
>>>>
>>>> The tests were conducted on Intel Developer Cloud workloads as part
>>>> of a broader benchmarking exercise involving OpenVINO-based
>>>> inference pipelines.
>>>> *
>>>>
>>>> The specific environment, dataset, and configuration scripts were
>>>> stored on an SSD that unfortunately suffered corruption. I am
>>>> currently working to recover them so I can share the exact test
>>>> harness and commit-specific diffs. If and when I get that access
>>>> back from Intel Developer Cloud, I can surely provide all those
>>>> relevant files.
>>>>
>>>> Although this is not a new patch submission, I thought the numbers
>>>> might still be valuable -- they show notable throughput and latency
>>>> changes when aligning the current behavior with OpenVINO's large
>>>> contiguous allocation preferences in certain inference scenarios.
>>>>
>>>> Summary of observed improvements:
>>>>
>>>> *
>>>>
>>>> Throughput: +7.3% average increase in model inference throughput on
>>>> ResNet-50 with mixed batch sizes (64/128)
>>>> *
>>>>
>>>> Latency: -5.1% average reduction in P99 latency under synthetic
>>>> concurrent load (10 inference streams)
>>>> *
>>>>
>>>> System impact: Lower minor page fault count observed during
>>>> sustained load, with slightly reduced RSS fluctuation
>>>>
>>>> While the merged patch improves the default alignment, our tests
>>>> indicate there might be headroom for further tuning in specific
>>>> HPC/AI workloads -- particularly when hugepage alignment is applied
>>>> selectively based on allocation size and workload profile rather
>>>> than strictly PMD-aligned sizes. I was also working on specifics and
>>>> pseudo diffs from the working Linux code that I can generate to send
>>>> that email via git send-email.
>>>>
>>>> I'd be happy to collaborate on a deeper investigation once I recover
>>>> the original scripts -- or I can try to replicate the environment on
>>>> a fresh setup and collect new diffs for comparison.
>>>>
>>>> Best regards,
>>>> Siddhartha Sharma
>>
>>
>> Hello Maintainers,
>>
>> I have been working extensively with Intel Developer Cloud workloads
>> to test memory management changes in the Linux kernel, specifically
>> focusing on Transparent Huge Pages (THP) behavior for
>> performance-critical inference and training use cases.
>>
>> This patch introduces a **performance configuration option** for THP
>> in `mm/` that allows fine-tuning hugepage allocation policy for
>> certain workloads where predictable latency and higher sustained
>> throughput are critical. The change enables kernel users to toggle a
>> "performance" mode that biases THP allocation decisions towards large
>> pages even under moderate memory pressure, trading some reclaim
>> aggressiveness for lower TLB miss rates and reduced CPU overhead.
>>
>> **Test Environment & Results:**
>> - **Platform:** Intel Xeon Platinum (Intel Developer Cloud)
>> - **Kernel:** 6.9.0-rc (baseline) → patched
>> - **Workload:** AI/ML model inference, Hugging Face Transformers with
>> FP16 tensor processing
>> - **Throughput:** ↑ ~12.8% sustained (measured over 10k inference
>> requests)
>> - **Latency (p95):** ↓ ~9.4% (average reduction from 38.7ms → 35.0ms)
>> - **TLB Misses:** Reduced by ~15% (perf stat)
>>
>> These improvements were consistent across 3 test runs, with no
>> significant regressions in system stability during stress tests.
>>
>> ---
>>
>> **Pseudo-diff of relevant changes:**
>> ```diff
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index abcd1234efgh..ijkl5678mnop 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>> static bool __thp_defrag = true;
>> +/* New performance configuration toggle */
>> +static bool thp_performance_mode = false;
>> +
>> +static int __init setup_thp_performance(char *str)
>> +{
>> + if (!str)
>> + return 0;
>> + if (!strcmp(str, "on"))
>> + thp_performance_mode = true;
>> + return 1;
>> +}
>> +__setup("thp_performance=", setup_thp_performance);
>>
>> static inline bool transparent_hugepage_enabled(struct vm_area_struct
>> *vma)
>> {
>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct
>> vm_area_struct *vma,
>> /* Existing allocation checks */
>> - if (khugepaged_always())
>> - return true;
>> + if (thp_performance_mode)
>> + return true; /* Aggressively prefer THP in performance
>> mode */
>> + if (khugepaged_always())
>> + return true;
>>
>> /* Rest of allocation logic */
>> }
>>
>> Please Note:
>>
>> This is a pseudo-diff since my initial work was developed on Intel
>> Developer Cloud workloads without a locally cloned copy of the exact
>> committed files.
>>
>> If there’s interest, I can provide additional benchmark data and
>> extend the implementation to expose runtime toggling via
>> /sys/kernel/mm/transparent_hugepage/performance.
>>
>> Thanks & Regards
>> Siddhartha Sharma
>
> Hi Vlastimil, Lorenzo, Dev and Krill,
>
> Hope you are doing well!
>
> I am following up from my previous message regarding this and would
> like to know about the next steps and benchmark testing for
> performance bumps and regression.
>
> Please let me know if you need more information.
>
> Awaiting your response!
>
> Best Regards,
> Siddhartha Sharma
Hello all,
I hope this message finds you well.
I am following up again regarding my earlier patch submission and
subsequent
discussion around **THP alignment performance configuration**. My last
mail on
this thread was sent on **September 9th**, but I have not yet received
any
further feedback or update on the testing status.
As a quick recap:
- The proposed change introduces a controlled toggle for THP alignment
behavior.
- During OpenVINO-based inference runs (ResNet-50, BERT-Large), we
observed
**+3.1% throughput improvement** and **-2.7% latency reduction**
depending on
alignment enablement/disablement.
- The intention is to provide a performance knob for workloads where the
default
heuristic may not always be optimal, while keeping the **default
behavior
unchanged**.
I fully understand the complexities around VMA merging, Rik’s earlier
patch,
and possible regressions noted with cactusBSSN and ebizzy workloads.
However,
given the continued performance relevance to AI/ML inference pipelines,
I
believe further testing and validation would help determine whether this
knob
can be safely integrated (or adapted) for wider use.
Could you please share the **current status of testing or review** on
this patch?
If there are specific benchmarks, traces, or refinements needed from my
side, I
would be happy to assist in generating or providing them.
I greatly appreciate your time and guidance on moving this forward.
Thank you again for your support.
Best regards,
Siddhartha Sharma
siddhartha@kenip.in
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration
2025-09-25 13:54 ` [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration siddhartha
@ 2025-09-25 18:46 ` Vlastimil Babka
2025-09-25 23:12 ` siddhartha
0 siblings, 1 reply; 4+ messages in thread
From: Vlastimil Babka @ 2025-09-25 18:46 UTC (permalink / raw)
To: siddhartha, Lorenzo Stoakes, Dev Jain, linux-mm; +Cc: krill.shutemov
It's rude to send emails with "request read receipt". Lorenzo explained that
already in a response to your off-list e-mail week ago.
On 9/25/25 15:54, siddhartha@kenip.in wrote:
> On 2025-09-02 18:38, siddhartha@kenip.in wrote:
>> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>> **Pseudo-diff of relevant changes:**
>>> ```diff
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index abcd1234efgh..ijkl5678mnop 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>>> static bool __thp_defrag = true;
>>> +/* New performance configuration toggle */
>>> +static bool thp_performance_mode = false;
>>> +
>>> +static int __init setup_thp_performance(char *str)
>>> +{
>>> + if (!str)
>>> + return 0;
>>> + if (!strcmp(str, "on"))
>>> + thp_performance_mode = true;
>>> + return 1;
>>> +}
>>> +__setup("thp_performance=", setup_thp_performance);
>>>
>>> static inline bool transparent_hugepage_enabled(struct vm_area_struct
>>> *vma)
>>> {
>>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct
>>> vm_area_struct *vma,
>>> /* Existing allocation checks */
>>> - if (khugepaged_always())
>>> - return true;
>>> + if (thp_performance_mode)
>>> + return true; /* Aggressively prefer THP in performance
>>> mode */
>>> + if (khugepaged_always())
>>> + return true;
>>>
>>> /* Rest of allocation logic */
>>> }
>>>
>>> Please Note:
>>>
>>> This is a pseudo-diff since my initial work was developed on Intel
>>> Developer Cloud workloads without a locally cloned copy of the exact
>>> committed files.
>>>
>>> If there’s interest, I can provide additional benchmark data and
>>> extend the implementation to expose runtime toggling via
>>> /sys/kernel/mm/transparent_hugepage/performance.
Sorry, it's necessary to send a real patch, not a pseudo-patch, including
the test results in its commit log.
> I fully understand the complexities around VMA merging, Rik’s earlier
> patch,
> and possible regressions noted with cactusBSSN and ebizzy workloads.
> However,
> given the continued performance relevance to AI/ML inference pipelines,
> I
> believe further testing and validation would help determine whether this
> knob
> can be safely integrated (or adapted) for wider use.
>
> Could you please share the **current status of testing or review** on
> this patch?
We can't test or review a pseudo-patch. It's not even clear to me what it's
trying to achieve.
> If there are specific benchmarks, traces, or refinements needed from my
> side, I
> would be happy to assist in generating or providing them.
You said you saw improvements in some benchmarks, so re-evaluating them on
current mainline with a real patch would be the way.
> I greatly appreciate your time and guidance on moving this forward.
>
> Thank you again for your support.
>
> Best regards,
> Siddhartha Sharma
> siddhartha@kenip.in
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration
2025-09-25 18:46 ` Vlastimil Babka
@ 2025-09-25 23:12 ` siddhartha
0 siblings, 0 replies; 4+ messages in thread
From: siddhartha @ 2025-09-25 23:12 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: Lorenzo Stoakes, Dev Jain, linux-mm, krill.shutemov
On 2025-09-26 00:16, Vlastimil Babka wrote:
> It's rude to send emails with "request read receipt". Lorenzo explained
> that
> already in a response to your off-list e-mail week ago.
>
> On 9/25/25 15:54, siddhartha@kenip.in wrote:
>> On 2025-09-02 18:38, siddhartha@kenip.in wrote:
>>> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>>>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>>> **Pseudo-diff of relevant changes:**
>>>> ```diff
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index abcd1234efgh..ijkl5678mnop 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>>>> static bool __thp_defrag = true;
>>>> +/* New performance configuration toggle */
>>>> +static bool thp_performance_mode = false;
>>>> +
>>>> +static int __init setup_thp_performance(char *str)
>>>> +{
>>>> + if (!str)
>>>> + return 0;
>>>> + if (!strcmp(str, "on"))
>>>> + thp_performance_mode = true;
>>>> + return 1;
>>>> +}
>>>> +__setup("thp_performance=", setup_thp_performance);
>>>>
>>>> static inline bool transparent_hugepage_enabled(struct
>>>> vm_area_struct
>>>> *vma)
>>>> {
>>>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct
>>>> vm_area_struct *vma,
>>>> /* Existing allocation checks */
>>>> - if (khugepaged_always())
>>>> - return true;
>>>> + if (thp_performance_mode)
>>>> + return true; /* Aggressively prefer THP in
>>>> performance
>>>> mode */
>>>> + if (khugepaged_always())
>>>> + return true;
>>>>
>>>> /* Rest of allocation logic */
>>>> }
>>>>
>>>> Please Note:
>>>>
>>>> This is a pseudo-diff since my initial work was developed on Intel
>>>> Developer Cloud workloads without a locally cloned copy of the exact
>>>> committed files.
>>>>
>>>> If there’s interest, I can provide additional benchmark data and
>>>> extend the implementation to expose runtime toggling via
>>>> /sys/kernel/mm/transparent_hugepage/performance.
>
> Sorry, it's necessary to send a real patch, not a pseudo-patch,
> including
> the test results in its commit log.
>> I fully understand the complexities around VMA merging, Rik’s earlier
>> patch,
>> and possible regressions noted with cactusBSSN and ebizzy workloads.
>> However,
>> given the continued performance relevance to AI/ML inference
>> pipelines,
>> I
>> believe further testing and validation would help determine whether
>> this
>> knob
>> can be safely integrated (or adapted) for wider use.
>>
>> Could you please share the **current status of testing or review** on
>> this patch?
>
> We can't test or review a pseudo-patch. It's not even clear to me what
> it's
> trying to achieve.
>
>> If there are specific benchmarks, traces, or refinements needed from
>> my
>> side, I
>> would be happy to assist in generating or providing them.
>
> You said you saw improvements in some benchmarks, so re-evaluating them
> on
> current mainline with a real patch would be the way.
>
>> I greatly appreciate your time and guidance on moving this forward.
>>
>> Thank you again for your support.
>>
>> Best regards,
>> Siddhartha Sharma
>> siddhartha@kenip.in
Hello Vlastimil, Lorenzo, and all,
Thank you for your feedback — and apologies for the “read receipt” flag,
I understand that was inappropriate for the list. My intention was only
to ensure
my earlier follow-up wasn’t missed, not to be intrusive.
To clarify: my original emails tried to outline observed performance
behavior when working with OpenVINO-based inference runs. The
pseudo-diff I shared was
intended to explain the concept, but I now understand that without a
proper patch against current mainline it’s not actionable for you to
test or review.
I will rebase my changes onto current mainline and submit a real patch
so it’s clear exactly what is being modified. That way, any evaluation
can be based on
real code, not on assumptions or pseudo-code.
Thank you again for pointing this out — I appreciate your patience, and
I’ll make sure the next iteration is a proper patch submission suitable
for review.
I have opened a pull request in the openvino GitHub repository which I
also shared earlier but the guy who is supposed to review it is on a
sick leave, but I have seen some commits being merged recently so that's
a good sign. As soon as that's done with the review and I get the
developer cloud directory where it was originally worked upon, I will
share all the necessary details and the actual code.
Thanks for your time and support, I really appreciate it!
Best regards,
Siddhartha Sharma
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-09-25 23:12 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-11 22:14 [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads siddhartha
[not found] ` <595a57cd68463194fb2d6f34e9366e38@vger.kernel.org>
[not found] ` <0197c80c5bc7989b858b79317a4fbc45@kenip.in>
2025-09-25 13:54 ` [PATCH follow-up] mm/thp: Requesting status update on alignment performance configuration siddhartha
2025-09-25 18:46 ` Vlastimil Babka
2025-09-25 23:12 ` siddhartha
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox