* Re: [Question] About memory.c: process_huge_page
2025-09-25 1:32 ` Huang, Ying
@ 2025-09-25 3:38 ` Dev Jain
2025-09-26 12:40 ` Zhu Haoran
2025-09-26 12:27 ` Zhu Haoran
2025-09-26 12:38 ` Zhu Haoran
2 siblings, 1 reply; 9+ messages in thread
From: Dev Jain @ 2025-09-25 3:38 UTC (permalink / raw)
To: Huang, Ying, Zhu Haoran; +Cc: linux-mm
On 25/09/25 7:02 am, Huang, Ying wrote:
> Hi, Haoran,
>
> Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>
>> Hi!
>>
>> I recently noticed the process_huge_page function in memory.c, which is
>> intended to keep the target page cache-hot after processing. I compared
>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>> process_huge_page and sequential processing (code posted below).
>>
>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>> default process_huge_page, the average bandwidth is 1148 MB/s. However,
>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>> only one-third the cache-miss rate of the default.
>>
>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>> for the default and 1230 MB/s for sequential. However, the cache-miss
>> rate went the other way here: sequential processing saw about 3 times
>> more misses than the default.
>>
>> These results seem inconsistent with what is described in your patchset
>> [1]. What factors might explain this behavior?
> One possible difference is cache topology. Can you try to bind the test
> process to the CPUs in one CCX (that is, sharing one LLC)? That makes it
> possible to hit the local cache.
Hi, I have a different question: why is the function sprinkled with
cond_resched() in each loop, especially the last one, where we call it on
every iteration? I suppose this may be one reason for the slowdown too.
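
For reference, the ordering in the current process_huge_page looks roughly
like this (a paraphrased sketch, not a verbatim copy of mm/memory.c); note
that cond_resched() ends up before every process_subpage() call, twice per
iteration of the final converging loop:

static int process_huge_page(
        unsigned long addr_hint, unsigned int nr_pages,
        int (*process_subpage)(unsigned long addr, int idx, void *arg),
        void *arg)
{
        int i, n, base, l, ret;
        unsigned long addr = addr_hint &
                ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);

        /* Process the target subpage last to keep its cache lines hot. */
        might_sleep();
        n = (addr_hint - addr) / PAGE_SIZE;
        if (2 * n <= nr_pages) {
                /* Target in the first half: process the tail in reverse. */
                base = 0;
                l = n;
                for (i = nr_pages - 1; i >= 2 * n; i--) {
                        cond_resched();
                        ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
                        if (ret)
                                return ret;
                }
        } else {
                /* Target in the second half: process the head forwards. */
                base = nr_pages - 2 * (nr_pages - n);
                l = nr_pages - n;
                for (i = 0; i < base; i++) {
                        cond_resched();
                        ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
                        if (ret)
                                return ret;
                }
        }
        /*
         * Process the remaining 2*l subpages from both ends, converging on
         * the target subpage; cond_resched() runs before each call here.
         */
        for (i = 0; i < l; i++) {
                int left_idx = base + i;
                int right_idx = base + 2 * l - 1 - i;

                cond_resched();
                ret = process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
                if (ret)
                        return ret;
                cond_resched();
                ret = process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
                if (ret)
                        return ret;
        }

        return 0;
}
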
>
>> Thanks for your time.
>>
>> [1] https://lkml.org/lkml/2018/5/23/1072
>>
>> ---
>> Sincere,
>> Zhu Haoran
>>
>> ---
>>
>> static int process_huge_page(
>>         unsigned long addr_hint, unsigned int nr_pages,
>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>         void *arg)
>> {
>>         int i, ret;
>>         unsigned long addr = addr_hint &
>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>
>>         might_sleep();
>>         for (i = 0; i < nr_pages; i++) {
>>                 cond_resched();
>>                 ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>                 if (ret)
>>                         return ret;
>>         }
>>
>>         return 0;
>> }
> ---
> Best Regards,
> Huang, Ying
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Question] About memory.c: process_huge_page
2025-09-25 3:38 ` Dev Jain
@ 2025-09-26 12:40 ` Zhu Haoran
0 siblings, 0 replies; 9+ messages in thread
From: Zhu Haoran @ 2025-09-26 12:40 UTC (permalink / raw)
To: dev.jain; +Cc: linux-mm, ying.huang, zhr1502
Dev Jain <dev.jain@arm.com> writes:
>On 25/09/25 7:02 am, Huang, Ying wrote:
>> Hi, Haoran,
>>
>> Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>>
>>> Hi!
>>>
>>> I recently noticed the process_huge_page function in memory.c, which is
>>> intended to keep the target page cache-hot after processing. I compared
>>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>>> process_huge_page and sequential processing (code posted below).
>>>
>>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>>> default process_huge_page, the average bandwidth is 1148 MB/s. However,
>>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>>> only one-third the cache-miss rate of the default.
>>>
>>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>>> for the default and 1230 MB/s for sequential. However, the cache-miss
>>> rate went the other way here: sequential processing saw about 3 times
>>> more misses than the default.
>>>
>>> These results seem inconsistent with what is described in your patchset
>>> [1]. What factors might explain this behavior?
>> One possible difference is cache topology. Can you try to bind the test
>> process to the CPUs in one CCX (that is, sharing one LLC)? That makes it
>> possible to hit the local cache.
>
>Hi, I have a different question: why is the function sprinkled with
>cond_resched() in each loop, especially the last one, where we call it on
>every iteration? I suppose this may be one reason for the slowdown too.
However, whether it is process_huge_page or sequential processing, the
implementation always calls cond_resched() before each process_subpage(),
so there should be no difference there.
>> Thanks for your time.
>>
>> [1] https://lkml.org/lkml/2018/5/23/1072
>>
>> ---
>> Sincere,
>> Zhu Haoran
>>
>> ---
>>
>> static int process_huge_page(
>>         unsigned long addr_hint, unsigned int nr_pages,
>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>         void *arg)
>> {
>>         int i, ret;
>>         unsigned long addr = addr_hint &
>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>
>>         might_sleep();
>>         for (i = 0; i < nr_pages; i++) {
>>                 cond_resched();
>>                 ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>                 if (ret)
>>                         return ret;
>>         }
>>
>>         return 0;
>> }
> ---
> Best Regards,
> Huang, Ying
---
Sincere,
Zhu Haoran
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Question] About memory.c: process_huge_page
2025-09-25 1:32 ` Huang, Ying
2025-09-25 3:38 ` Dev Jain
@ 2025-09-26 12:27 ` Zhu Haoran
2025-09-28 0:48 ` Huang, Ying
2025-09-26 12:38 ` Zhu Haoran
2 siblings, 1 reply; 9+ messages in thread
From: Zhu Haoran @ 2025-09-26 12:27 UTC (permalink / raw)
To: ying.huang; +Cc: linux-mm, zhr1502, dev.jain
"Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>Hi, Haoran,
>
>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>
>> Hi!
>>
>> I recently noticed the process_huge_page function in memory.c, which is
>> intended to keep the target page cache-hot after processing. I compared
>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>> process_huge_page and sequential processing (code posted below).
>>
>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>> default process_huge_page, the average bandwidth is 1148 MB/s. However,
>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>> only one-third the cache-miss rate of the default.
>>
>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>> for the default and 1230 MB/s for sequential. However, the cache-miss
>> rate went the other way here: sequential processing saw about 3 times
>> more misses than the default.
>>
>> These results seem inconsistent with what is described in your patchset
>> [1]. What factors might explain this behavior?
>
>One possible difference is cache topology. Can you try to bind the test
>process to the CPUs in one CCX (that is, sharing one LLC)? That makes it
>possible to hit the local cache.
Thank you for the suggestion.
I reduced the test to 16 vCPUs and bound them to one CCX on the epyc-9654. The
rerun results are:
                 sequential  process_huge_page
BW (MB/s)            523.88        531.60  ( + 1.47%)
user cachemiss       0.318%        0.446%  ( +40.25%)
kernel cachemiss     1.405%       18.406%  ( + 1310%)
usertime              26.72         18.76  ( -29.79%)
systime               35.97         42.64  ( +18.54%)
I was able to reproduce the much lower user time, but the bandwidth gap is
still not as significant as in your patch. The test was bottlenecked by
kernel cache misses and kernel execution time. One possible explanation is
that AMD has a less aggressive cache prefetcher, which fails to predict
the access pattern of the current process_huge_page in the kernel. To
verify that, I ran a microbench that iterates through 4K pages in
sequential or reverse order and accesses each page in seq/rev order
(4 combinations in total).
cachemiss rate
            seq-seq   seq-rev   rev-seq   rev-rev
epyc-9654     0.08%     1.71%     1.98%     0.09%
epyc-7T83     1.07%    13.64%     6.23%     1.12%
i5-13500H    27.08%    28.87%    29.57%    25.35%
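
The core of that microbench is roughly the following (a simplified sketch;
the buffer size is just an example, and the timing and perf counting around
the walks are omitted):

#include <stdlib.h>
#include <string.h>

#define PAGE_SZ         4096L
#define LINE_SZ         64L
#define NR_PAGES        (256L * 1024)           /* ~1GB working set */

/* Touch every cache line of one 4K page, forwards or backwards. */
static void touch_page(volatile char *page, int fwd)
{
        long off;

        if (fwd)
                for (off = 0; off < PAGE_SZ; off += LINE_SZ)
                        page[off]++;
        else
                for (off = PAGE_SZ - LINE_SZ; off >= 0; off -= LINE_SZ)
                        page[off]++;
}

/* Walk the pages forwards or backwards; pg_fwd/ln_fwd pick the combination. */
static void walk(char *buf, int pg_fwd, int ln_fwd)
{
        long i;

        if (pg_fwd)
                for (i = 0; i < NR_PAGES; i++)
                        touch_page(buf + i * PAGE_SZ, ln_fwd);
        else
                for (i = NR_PAGES - 1; i >= 0; i--)
                        touch_page(buf + i * PAGE_SZ, ln_fwd);
}

int main(void)
{
        char *buf = aligned_alloc(PAGE_SZ, NR_PAGES * PAGE_SZ);

        memset(buf, 0, NR_PAGES * PAGE_SZ);     /* fault everything in first */
        walk(buf, 1, 1);        /* seq-seq; use (1,0), (0,1), (0,0) for the rest */
        free(buf);
        return 0;
}
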
I also ran anon-cow-seq on my laptop (i5-13500H) and all metrics aligned
well with your patch. So I guess this could be the root cause of why AMD
does not benefit from the patch?
>> Thanks for your time.
>>
>> [1] https://lkml.org/lkml/2018/5/23/1072
>>
>> ---
>> Sincere,
>> Zhu Haoran
>>
>> ---
>>
>> static int process_huge_page(
>>         unsigned long addr_hint, unsigned int nr_pages,
>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>         void *arg)
>> {
>>         int i, ret;
>>         unsigned long addr = addr_hint &
>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>
>>         might_sleep();
>>         for (i = 0; i < nr_pages; i++) {
>>                 cond_resched();
>>                 ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>                 if (ret)
>>                         return ret;
>>         }
>>
>>         return 0;
>> }
>
>---
>Best Regards,
>Huang, Ying
---
Sincere,
Zhu Haoran
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Question] About memory.c: process_huge_page
2025-09-26 12:27 ` Zhu Haoran
@ 2025-09-28 0:48 ` Huang, Ying
2025-09-28 10:07 ` Zhu Haoran
0 siblings, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2025-09-28 0:48 UTC (permalink / raw)
To: Zhu Haoran; +Cc: linux-mm, dev.jain
Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
> "Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>>Hi, Haoran,
>>
>>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>>
>>> Hi!
>>>
>>> I recently noticed the process_huge_page function in memory.c, which is
>>> intended to keep the target page cache-hot after processing. I compared
>>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>>> process_huge_page and sequential processing (code posted below).
>>>
>>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>>> default process_huge_page, the average bandwidth is 1148 MB/s. However,
>>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>>> only one-third the cache-miss rate of the default.
>>>
>>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>>> for the default and 1230 MB/s for sequential. However, the cache-miss
>>> rate went the other way here: sequential processing saw about 3 times
>>> more misses than the default.
>>>
>>> These results seem inconsistent with what is described in your patchset
>>> [1]. What factors might explain this behavior?
>>
>>One possible difference is cache topology. Can you try to bind the test
>>process to the CPUs in one CCX (that is, sharing one LLC)? That makes it
>>possible to hit the local cache.
>
> Thank you for the suggestion.
>
> I reduced the test to 16 vCPUs and bound them to one CCX on the epyc-9654. The
> rerun results are:
>
>                  sequential  process_huge_page
> BW (MB/s)            523.88        531.60  ( + 1.47%)
> user cachemiss       0.318%        0.446%  ( +40.25%)
> kernel cachemiss     1.405%       18.406%  ( + 1310%)
> usertime              26.72         18.76  ( -29.79%)
> systime               35.97         42.64  ( +18.54%)
>
> I was able to reproduce the much lower user time, but the bandwidth gap is
> still not as significant as in your patch. The test was bottlenecked by
> kernel cache misses and kernel execution time. One possible explanation is
> that AMD has a less aggressive cache prefetcher, which fails to predict
> the access pattern of the current process_huge_page in the kernel. To
> verify that, I ran a microbench that iterates through 4K pages in
> sequential or reverse order and accesses each page in seq/rev order
> (4 combinations in total).
>
> cachemiss rate
>             seq-seq   seq-rev   rev-seq   rev-rev
> epyc-9654     0.08%     1.71%     1.98%     0.09%
> epyc-7T83     1.07%    13.64%     6.23%     1.12%
> i5-13500H    27.08%    28.87%    29.57%    25.35%
>
> I also ran anon-cow-seq on my laptop (i5-13500H) and all metrics aligned
> well with your patch. So I guess this could be the root cause of why AMD
> does not benefit from the patch?
The cache size per process needs to be checked too. The smaller the
cache size per process, the more the benefit.
>>> Thanks for your time.
>>>
>>> [1] https://lkml.org/lkml/2018/5/23/1072
>>>
>>> ---
>>> Sincere,
>>> Zhu Haoran
>>>
>>> ---
>>>
>>> static int process_huge_page(
>>>         unsigned long addr_hint, unsigned int nr_pages,
>>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>         void *arg)
>>> {
>>>         int i, ret;
>>>         unsigned long addr = addr_hint &
>>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>
>>>         might_sleep();
>>>         for (i = 0; i < nr_pages; i++) {
>>>                 cond_resched();
>>>                 ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>>                 if (ret)
>>>                         return ret;
>>>         }
>>>
>>>         return 0;
>>> }
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Question] About memory.c: process_huge_page
2025-09-28 0:48 ` Huang, Ying
@ 2025-09-28 10:07 ` Zhu Haoran
2025-10-09 1:23 ` Huang, Ying
0 siblings, 1 reply; 9+ messages in thread
From: Zhu Haoran @ 2025-09-28 10:07 UTC (permalink / raw)
To: ying.huang; +Cc: dev.jain, linux-mm, zhr1502
"Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>
>> "Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>>>Hi, Haoran,
>>>
>>>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>>>
>>>> Hi!
>>>>
>>>> I recently noticed the process_huge_page function in memory.c, which is
>>>> intended to keep the target page cache-hot after processing. I compared
>>>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>>>> process_huge_page and sequential processing (code posted below).
>>>>
>>>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>>>> default process_huge_page, the average bandwidth is 1148 MB/s. However,
>>>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>>>> only one-third the cache-miss rate of the default.
>>>>
>>>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>>>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>>>> for the default and 1230 MB/s for sequential. However, the cache-miss
>>>> rate went the other way here: sequential processing saw about 3 times
>>>> more misses than the default.
>>>>
>>>> These results seem inconsistent with what is described in your patchset
>>>> [1]. What factors might explain this behavior?
>>>
>>>One possible difference is cache topology. Can you try to bind the test
>>>process to the CPUs in one CCX (that is, sharing one LLC)? That makes it
>>>possible to hit the local cache.
>>
>> Thank you for the suggestion.
>>
>> I reduced the test to 16 vCPUs and bound them to one CCX on the epyc-9654. The
>> rerun results are:
>>
>>                  sequential  process_huge_page
>> BW (MB/s)            523.88        531.60  ( + 1.47%)
>> user cachemiss       0.318%        0.446%  ( +40.25%)
>> kernel cachemiss     1.405%       18.406%  ( + 1310%)
>> usertime              26.72         18.76  ( -29.79%)
>> systime               35.97         42.64  ( +18.54%)
>>
>> I was able to reproduce the much lower user time, but the bandwidth gap is
>> still not as significant as in your patch. The test was bottlenecked by
>> kernel cache misses and kernel execution time. One possible explanation is
>> that AMD has a less aggressive cache prefetcher, which fails to predict
>> the access pattern of the current process_huge_page in the kernel. To
>> verify that, I ran a microbench that iterates through 4K pages in
>> sequential or reverse order and accesses each page in seq/rev order
>> (4 combinations in total).
>>
>> cachemiss rate
>>             seq-seq   seq-rev   rev-seq   rev-rev
>> epyc-9654     0.08%     1.71%     1.98%     0.09%
>> epyc-7T83     1.07%    13.64%     6.23%     1.12%
>> i5-13500H    27.08%    28.87%    29.57%    25.35%
>>
>> I also ran anon-cow-seq on my laptop (i5-13500H) and all metrics aligned
>> well with your patch. So I guess this could be the root cause of why AMD
>> does not benefit from the patch?
>
>The cache size per process needs to be checked too. The smaller the
>cache size per process, the more the benefit.
Right. I reduced the task count when running anon-cow-seq on the Intel
Core so that the per-task cache size would be larger. The benefit has
indeed dropped.
1.125MB cache per task
          sequential  process_huge_page
Amean          664.7        740.1  (+11.34%)

2MB cache per task
          sequential  process_huge_page
Amean         1287.9       1350.5  ( +4.86%)

4.5MB cache per task
          sequential  process_huge_page
Amean         1373.2       1406.3  ( +2.41%)

9MB cache per task
          sequential  process_huge_page
Amean         2149.0       2070.4  ( -3.66%)
On the EPYC platforms, the 7T83 has a 4MB per-process cache share and the
9654 has only 2MB. On the 9654 there was only a slight improvement, but
on the 7T83 we even observed a ~10% regression with process_huge_page.

Do you think this performance issue is worth addressing, especially for
AMD, or is it acceptable as an architecture difference?
>>>> Thanks for your time.
>>>>
>>>> [1] https://lkml.org/lkml/2018/5/23/1072
>>>>
>>>> ---
>>>> Sincere,
>>>> Zhu Haoran
>>>>
>>>> ---
>>>>
>>>> static int process_huge_page(
>>>>         unsigned long addr_hint, unsigned int nr_pages,
>>>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>>         void *arg)
>>>> {
>>>>         int i, ret;
>>>>         unsigned long addr = addr_hint &
>>>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>
>>>>         might_sleep();
>>>>         for (i = 0; i < nr_pages; i++) {
>>>>                 cond_resched();
>>>>                 ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>>>                 if (ret)
>>>>                         return ret;
>>>>         }
>>>>
>>>>         return 0;
>>>> }
---
Sincere,
Zhu Haoran
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Question] About memory.c: process_huge_page
2025-09-28 10:07 ` Zhu Haoran
@ 2025-10-09 1:23 ` Huang, Ying
0 siblings, 0 replies; 9+ messages in thread
From: Huang, Ying @ 2025-10-09 1:23 UTC (permalink / raw)
To: Zhu Haoran; +Cc: dev.jain, linux-mm
Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
> "Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>>
>>> "Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>>>>Hi, Haoran,
>>>>
>>>>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>>>>
>>>>> Hi!
>>>>>
>>>>> I recently noticed the process_huge_page function in memory.c, which is
>>>>> intended to keep the target page cache-hot after processing. I compared
>>>>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>>>>> process_huge_page and sequential processing (code posted below).
>>>>>
>>>>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>>>>> default process_huge_page, the average bandwidth is 1148 MB/s. However,
>>>>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>>>>> only one-third the cache-miss rate of the default.
>>>>>
>>>>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>>>>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>>>>> for the default and 1230 MB/s for sequential. However, the cache-miss
>>>>> rate went the other way here: sequential processing saw about 3 times
>>>>> more misses than the default.
>>>>>
>>>>> These results seem inconsistent with what is described in your patchset
>>>>> [1]. What factors might explain this behavior?
>>>>
>>>>One possible difference is cache topology. Can you try to bind the test
>>>>process to the CPUs in one CCX (that is, sharing one LLC)? That makes it
>>>>possible to hit the local cache.
>>>
>>> Thank you for the suggestion.
>>>
>>> I reduced the test to 16 vCPUs and bound them to one CCX on the epyc-9654. The
>>> rerun results are:
>>>
>>>                  sequential  process_huge_page
>>> BW (MB/s)            523.88        531.60  ( + 1.47%)
>>> user cachemiss       0.318%        0.446%  ( +40.25%)
>>> kernel cachemiss     1.405%       18.406%  ( + 1310%)
>>> usertime              26.72         18.76  ( -29.79%)
>>> systime               35.97         42.64  ( +18.54%)
>>>
>>> I was able to reproduce the much lower user time, but the bandwidth gap is
>>> still not as significant as in your patch. The test was bottlenecked by
>>> kernel cache misses and kernel execution time. One possible explanation is
>>> that AMD has a less aggressive cache prefetcher, which fails to predict
>>> the access pattern of the current process_huge_page in the kernel. To
>>> verify that, I ran a microbench that iterates through 4K pages in
>>> sequential or reverse order and accesses each page in seq/rev order
>>> (4 combinations in total).
>>>
>>> cachemiss rate
>>>             seq-seq   seq-rev   rev-seq   rev-rev
>>> epyc-9654     0.08%     1.71%     1.98%     0.09%
>>> epyc-7T83     1.07%    13.64%     6.23%     1.12%
>>> i5-13500H    27.08%    28.87%    29.57%    25.35%
>>>
>>> I also ran anon-cow-seq on my laptop (i5-13500H) and all metrics aligned
>>> well with your patch. So I guess this could be the root cause of why AMD
>>> does not benefit from the patch?
>>
>>The cache size per process needs to be checked too. The smaller the
>>cache size per process, the more the benefit.
>
> Right. I reduced the task count when running anon-cow-seq on the Intel
> Core so that the per-task cache size would be larger. The benefit has
> indeed dropped.
>
> 1.125MB cache per task
>           sequential  process_huge_page
> Amean          664.7        740.1  (+11.34%)
>
> 2MB cache per task
>           sequential  process_huge_page
> Amean         1287.9       1350.5  ( +4.86%)
>
> 4.5MB cache per task
>           sequential  process_huge_page
> Amean         1373.2       1406.3  ( +2.41%)
>
> 9MB cache per task
>           sequential  process_huge_page
> Amean         2149.0       2070.4  ( -3.66%)
>
> On the EPYC platforms, the 7T83 has a 4MB per-process cache share and the
> 9654 has only 2MB. On the 9654 there was only a slight improvement, but
> on the 7T83 we even observed a ~10% regression with process_huge_page.
>
> Do you think this performance issue is worth addressing, especially for
> AMD, or is it acceptable as an architecture difference?
I think we can do some experiments at least. For example, process pages
more sequentially; that is, process the tail subpages sequentially instead
of in reverse order. You may try more variations if that doesn't help much.
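
To make the idea concrete: for, say, a huge page of 16 subpages with the
faulting address in subpage 3, only the direction of the tail walk would
change, while the target subpage is still processed last. A tiny userspace
helper (illustrative only; the "current" order is a paraphrase of the
in-kernel logic, untested) prints both orders:

#include <stdio.h>

/*
 * Print the subpage processing order of the current scheme versus the
 * "process the tail sequentially" experiment, for a target subpage in
 * the first half of the huge page.  Indices only, no memory is touched.
 */
static void print_orders(int nr_pages, int n)
{
        int i, base = 0, l = n;

        printf("current :");
        for (i = nr_pages - 1; i >= 2 * n; i--)         /* tail, reverse */
                printf(" %d", i);
        for (i = 0; i < l; i++)                         /* converge on target */
                printf(" %d %d", base + i, base + 2 * l - 1 - i);

        printf("\nproposed:");
        for (i = 2 * n; i < nr_pages; i++)              /* tail, forward */
                printf(" %d", i);
        for (i = 0; i < l; i++)
                printf(" %d %d", base + i, base + 2 * l - 1 - i);
        printf("\n");
}

int main(void)
{
        print_orders(16, 3);    /* 16 subpages, faulting address in subpage 3 */
        return 0;
}

For (16, 3) this prints the tail as 15 14 ... 6 for the current scheme and
6 7 ... 15 for the proposed one, with the converging 0 5 1 4 2 3 part
identical in both, so the target subpage (3) stays cache-hot either way.
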
>>>>> Thanks for your time.
>>>>>
>>>>> [1] https://lkml.org/lkml/2018/5/23/1072
>>>>>
>>>>> ---
>>>>> Sincere,
>>>>> Zhu Haoran
>>>>>
>>>>> ---
>>>>>
>>>>> static int process_huge_page(
>>>>>         unsigned long addr_hint, unsigned int nr_pages,
>>>>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>>>         void *arg)
>>>>> {
>>>>>         int i, ret;
>>>>>         unsigned long addr = addr_hint &
>>>>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>>>
>>>>>         might_sleep();
>>>>>         for (i = 0; i < nr_pages; i++) {
>>>>>                 cond_resched();
>>>>>                 ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>>>>                 if (ret)
>>>>>                         return ret;
>>>>>         }
>>>>>
>>>>>         return 0;
>>>>> }
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Question] About memory.c: process_huge_page
2025-09-25 1:32 ` Huang, Ying
2025-09-25 3:38 ` Dev Jain
2025-09-26 12:27 ` Zhu Haoran
@ 2025-09-26 12:38 ` Zhu Haoran
2 siblings, 0 replies; 9+ messages in thread
From: Zhu Haoran @ 2025-09-26 12:38 UTC (permalink / raw)
To: ying.huang; +Cc: linux-mm, zhr1502, dev.jain
"Huang, Ying" <ying.huang@linux.alibaba.com> writes:
>Hi, Haoran,
>
>Zhu Haoran <zhr1502@sjtu.edu.cn> writes:
>
>> Hi!
>>
>> I recently noticed the process_huge_page function in memory.c, which is
>> intended to keep the target page cache-hot after processing. I compared
>> the vm-scalability anon-cow-seq-hugetlb microbench using the default
>> process_huge_page and sequential processing (code posted below).
>>
>> I ran the test on an epyc-7T83 with 36 vCPUs and 64GB memory. Using the
>> default process_huge_page, the average bandwidth is 1148 MB/s. However,
>> sequential processing yielded a better bandwidth of about 1255 MB/s and
>> only one-third the cache-miss rate of the default.
>>
>> The same test was run on an epyc-9654 with 36 vCPUs and 64GB memory. The
>> bandwidth result was similar but the difference was smaller: 1170 MB/s
>> for the default and 1230 MB/s for sequential. However, the cache-miss
>> rate went the other way here: sequential processing saw about 3 times
>> more misses than the default.
>>
>> These results seem inconsistent with what is described in your patchset
>> [1]. What factors might explain this behavior?
>
>One possible difference is cache topology. Can you try to bind the test
>process to the CPUs in one CCX (that is, sharing one LLC)? That makes it
>possible to hit the local cache.
Thank you for the suggestion.
I reduced the test to 16 vCPUs and bound them to one CCX on the epyc-9654. The
rerun results are:
                 sequential  process_huge_page
BW (MB/s)            523.88        531.60  ( + 1.47%)
user cachemiss       0.318%        0.446%  ( +40.25%)
kernel cachemiss     1.405%       18.406%  ( + 1310%)
usertime              26.72         18.76  ( -29.79%)
systime               35.97         42.64  ( +18.54%)
I was able to reproduce the much lower user time, but the bandwidth gap is
still not as significant as in your patch. The test was bottlenecked by
kernel cache misses and kernel execution time. One possible explanation is
that AMD has a less aggressive cache prefetcher, which fails to predict
the access pattern of the current process_huge_page in the kernel. To
verify that, I ran a microbench that iterates through 4K pages in
sequential or reverse order and accesses each page in seq/rev order
(4 combinations in total).
cachemiss rate
            seq-seq   seq-rev   rev-seq   rev-rev
epyc-9654     0.08%     1.71%     1.98%     0.09%
epyc-7T83     1.07%    13.64%     6.23%     1.12%
i5-13500H    27.08%    28.87%    29.57%    25.35%
I also ran anon-cow-seq on my laptop (i5-13500H) and all metrics aligned
well with your patch. So I guess this could be the root cause of why AMD
does not benefit from the patch?
>> Thanks for your time.
>>
>> [1] https://lkml.org/lkml/2018/5/23/1072
>>
>> ---
>> Sincere,
>> Zhu Haoran
>>
>> ---
>>
>> static int process_huge_page(
>>         unsigned long addr_hint, unsigned int nr_pages,
>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>         void *arg)
>> {
>>         int i, ret;
>>         unsigned long addr = addr_hint &
>>                 ~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>
>>         might_sleep();
>>         for (i = 0; i < nr_pages; i++) {
>>                 cond_resched();
>>                 ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>                 if (ret)
>>                         return ret;
>>         }
>>
>>         return 0;
>> }
>
>---
>Best Regards,
>Huang, Ying
---
Sincere,
Zhu Haoran
^ permalink raw reply [flat|nested] 9+ messages in thread