linux-mm.kvack.org archive mirror
From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	Muchun Song <muchun.song@linux.dev>, <linux-mm@kvack.org>,
	Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
Date: Tue, 5 Nov 2024 10:06:19 +0800	[thread overview]
Message-ID: <addb25dc-fb1c-4005-a04c-02beb95240de@huawei.com> (raw)
In-Reply-To: <87ikt3914v.fsf@yhuang6-desk2.ccr.corp.intel.com>



On 2024/11/4 10:35, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
> 
>> On 2024/11/1 16:16, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>> [snip]
>>>>>>
>>>>>>>> 1) Will run some random tests to check the difference in performance, as
>>>>>>>> David suggested.
>>>>>>>> 2) Hope LKP can run more tests since it is very useful (more test
>>>>>>>> sets and different machines)
>>>>>>> I'm starting to use LKP to test.
>>>>>>
>>>>>> Great.
>>>>
>>>>
>>>> Sorry for the late reply,
>>>>
>>>>> I have run some tests with LKP.
>>>>> Firstly, there's almost no measurable difference between clearing
>>>>> pages from start to end or from end to start on an Intel server CPU.
>>>>> I guess that there's some similar optimization for both directions.
>>>>> For the multiple-process (same as the logical CPU number)
>>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>>> by about 22.4%.
>>>>
>>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>>> For the vm-scalability/anon-w-seq test case, it is, because the
>>> performance of forward and backward clearing is almost the same, and
>>> the user-space accesses get the cache-hot benefit.
>>>
>>>> Could you test the following case on x86?
>>>> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>>> mkdir -p /hugetlbfs/
>>>> mount none /hugetlbfs/ -t hugetlbfs
>>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test
>>> It's not trivial for me to do this test, because 0day wraps test
>>> cases.
>>> Do you know which existing test case provides this?  For example, in
>>> vm-scalability?
>>
>> I don't know of a public fallocate test; I will try to find an Intel
>> machine to test this case.
> 
> I don't expect it to change much, because we have observed that the
> performance of forward and backward clearing is similar on Intel.

I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz:

Caches (sum of all):
   L1d:                    1.1 MiB (36 instances)
   L1i:                    1.1 MiB (36 instances)
   L2:                     36 MiB (36 instances)
   L3:                     49.5 MiB (2 instances)
NUMA:
   NUMA node(s):           2
   NUMA node0 CPU(s):      0-17,36-53
   NUMA node1 CPU(s):      18-35,54-71


Before:

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

           3,856.93 msec task-clock                         #    0.997 CPUs utilized
                  6      context-switches                   #    1.556 /sec
                  1      cpu-migrations                     #    0.259 /sec
                132      page-faults                        #   34.224 /sec
     11,520,934,848      cycles                             #    2.987 GHz                        (19.95%)
        213,731,011      instructions                       #    0.02  insn per cycle             (24.96%)
         58,164,361      branches                           #   15.080 M/sec                      (24.96%)
            262,547      branch-misses                      #    0.45% of all branches            (24.97%)
         96,029,321      CPU_CLK_UNHALTED.REF_XCLK          #   24.898 M/sec
                                                            #    0.3 % tma_frontend_bound
                                                            #    3.3 % tma_retiring
                                                            #   96.4 % tma_backend_bound
                                                            #    0.0 % tma_bad_speculation        (24.99%)
        149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE        #   38.822 M/sec                      (25.01%)
          2,486,326      INT_MISC.RECOVERY_CYCLES_ANY       #  644.638 K/sec                      (20.01%)
         95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #   24.883 M/sec                      (20.01%)
     11,526,783,305      CPU_CLK_UNHALTED.THREAD            #    2.989 G/sec                      (20.01%)
      1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS          #  393.855 M/sec                      (20.01%)
      1,526,020,825      UOPS_ISSUED.ANY                    #  395.657 M/sec                      (20.01%)
         59,784,189      L1-dcache-loads                    #   15.500 M/sec                      (20.01%)
        337,479,254      L1-dcache-load-misses              #  564.50% of all L1-dcache accesses  (20.02%)
            175,954      LLC-loads                          #   45.620 K/sec                      (20.02%)
             51,955      LLC-load-misses                    #   29.53% of all L1-icache accesses  (20.02%)
    <not supported>      L1-icache-loads
          2,864,230      L1-icache-load-misses                                                    (20.02%)
         59,769,391      dTLB-loads                         #   15.497 M/sec                      (20.02%)
                819      dTLB-load-misses                   #    0.00% of all dTLB cache accesses (20.02%)
              2,459      iTLB-loads                         #  637.553 /sec                       (20.01%)
                370      iTLB-load-misses                   #   15.05% of all iTLB cache accesses (19.98%)

        3.870393637 seconds time elapsed

        0.000000000 seconds user
        3.833021000 seconds sys

After (using clear_gigantic_page()):

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

           4,426.18 msec task-clock                         #    0.994 CPUs utilized
                  8      context-switches                   #    1.807 /sec
                  1      cpu-migrations                     #    0.226 /sec
                131      page-faults                        #   29.597 /sec
     13,221,263,588      cycles                             #    2.987 GHz                        (19.98%)
        215,924,995      instructions                       #    0.02  insn per cycle             (25.00%)
         58,430,182      branches                           #   13.201 M/sec                      (25.01%)
            279,381      branch-misses                      #    0.48% of all branches            (25.03%)
        110,199,114      CPU_CLK_UNHALTED.REF_XCLK          #   24.897 M/sec
                                                            #    0.3 % tma_frontend_bound
                                                            #    2.9 % tma_retiring
                                                            #   96.8 % tma_backend_bound
                                                            #    0.0 % tma_bad_speculation        (25.06%)
        160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE        #   36.296 M/sec                      (25.07%)
          2,559,970      INT_MISC.RECOVERY_CYCLES_ANY       #  578.370 K/sec                      (20.05%)
        110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #   24.904 M/sec                      (20.05%)
     13,227,924,727      CPU_CLK_UNHALTED.THREAD            #    2.989 G/sec                      (20.03%)
      1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS          #  344.545 M/sec                      (20.01%)
      1,531,307,263      UOPS_ISSUED.ANY                    #  345.966 M/sec                      (19.98%)
         60,600,471      L1-dcache-loads                    #   13.691 M/sec                      (19.96%)
        337,576,917      L1-dcache-load-misses              #  557.05% of all L1-dcache accesses  (19.96%)
            177,157      LLC-loads                          #   40.025 K/sec                      (19.96%)
             48,056      LLC-load-misses                    #   27.13% of all L1-icache accesses  (19.97%)
    <not supported>      L1-icache-loads
          2,653,617      L1-icache-load-misses                                                    (19.97%)
         60,609,241      dTLB-loads                         #   13.693 M/sec                      (19.97%)
                530      dTLB-load-misses                   #    0.00% of all dTLB cache accesses (19.97%)
              1,952      iTLB-loads                         #  441.013 /sec                       (19.97%)
              3,059      iTLB-load-misses                   #  156.71% of all iTLB cache accesses (19.97%)

        4.450664421 seconds time elapsed

        0.000984000 seconds user
        4.397795000 seconds sys


This shows that backward clearing is better than forward clearing, at least on this CPU (about 3.87s vs 4.45s elapsed, i.e. forward is roughly 15% slower).
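
For anyone comparing the two runs above, here is a minimal sketch of the two
clearing orders being measured.  It is an illustration only, not the
mm/memory.c implementation: the real process_huge_page() clears the chunks
around the faulting address last, approaching from both ends, and the real
code uses nth_page() rather than plain pointer arithmetic.

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* "Forward": clear_gigantic_page() style, subpage 0 .. nr_pages - 1. */
static void sketch_clear_forward(struct page *page, unsigned long addr,
				 unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
		cond_resched();
	}
}

/*
 * "Backward": roughly what process_huge_page() degenerates to for this
 * access pattern -- clear the subpages far from the target address first,
 * so the subpage user space touches next is cleared last and stays
 * cache-hot.
 */
static void sketch_clear_backward(struct page *page, unsigned long addr,
				  unsigned int nr_pages)
{
	int i;

	for (i = nr_pages - 1; i >= 0; i--) {
		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
		cond_resched();
	}
}

On this machine the forward loop is the slower one, which matches the
elapsed-time difference above.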


> 
>>>
>>>>> For the multiple-process vm-scalability/anon-w-rand test case, there is
>>>>> no measurable difference in the benchmark score.
>>>>> So, the optimization mainly helps sequential workloads.
>>>>> In summary, on x86, process_huge_page() will not introduce any
>>>>> regression, and it helps some workloads.
>>>>> However, on ARM64, it does introduce some regression when clearing
>>>>> pages from end to start.  That needs to be addressed.  I guess that the
>>>>> regression can be resolved by doing more of the clearing from start to
>>>>> end (but not all of it).  For example, can you take a look at the patch
>>>>> below?  It uses a similar framework as before, but clears each small
>>>>> chunk (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check
>>>>> when the regression goes away.
>>>>> WARNING: the patch is only build tested.
>>>>
>>>>
>>>> Base: baseline
>>>> Change1: using clear_gigantic_page() for 2M PMD
>>>> Change2: your patch with MPAGE_NRPAGES=16
>>>> Change3: Case3 + fix[1]
>>> What is case3?
>>
>> Oh, it is Change2.
> 
> Got it.
> 
>>>
>>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>>
>>>> 1. For rand write,
>>>>      case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>>
>>>> 2. For seq write,
>>>>
>>>> 1) case-anon-w-seq-mt:
>>> Can you try case-anon-w-seq?  That may be more stable.
>>>
>>>> base:
>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>>
>>>> Change1:
>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>
>>>> Change2:
>>>> real	0m2.244s    0m2.384s	0m2.370s
>>>> user	2m39.413s   2m41.990s   2m42.229s
>>>> sys	0m19.826s   0m18.491s   0m18.053s
>>> It appears strange.  There's not much cache-hot benefit even if we
>>> clear pages from end to beginning (with a larger chunk).
>>> However, sys time improves a lot.  This shows that clearing pages with
>>> a large chunk helps on ARM64.
>>>
>>>> Change3:  // best performance
>>>> real	0m2.155s    0m2.204s	0m2.194s
>>>> user	3m2.640s    2m55.837s   3m0.902s
>>>> sys	0m17.346s   0m17.630s   0m18.197s
>>>>
>>>> Change4:
>>>> real	0m2.287s    0m2.377s	0m2.284s	
>>>> user	2m37.030s   2m52.868s   3m17.593s
>>>> sys	0m15.445s   0m34.430s   0m45.224s
>>> Change4 is essentially the same as Change1.  I don't know why they are
>>> different.  Is there some large variation from run to run?
>>
>> As shown above, I tested three times and the results are relatively
>> stable, at least for the real time.  I will try case-anon-w-seq.
> 
> Can you also show the score of vm-scalability?
> 
> TBH, I cannot understand your results.  For example, why is there a
> measurable difference between Change3 and Change4?  In both cases, the
> kernel clears pages from start to end.

OK, I will retest once I can access the machine again.
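
(Side note for anyone following along: my reading of the prototype that
Change2-Change4 benchmark is roughly the sketch below, with MPAGE_NRPAGES as
the tunable chunk size.  It is only an illustration of the idea, not Ying's
actual patch, and the helper name is made up.)

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>

#define MPAGE_NRPAGES	16	/* 16 in Change2/Change3, 64 in Change4 */

/*
 * Hypothetical sketch: keep the overall chunk order of process_huge_page()
 * (distant chunks first, so the data the faulting task touches next stays
 * cache-hot), but clear the subpages inside each chunk from start to end,
 * which is the direction this ARM64 machine prefers.  Assumes nr_pages is a
 * multiple of MPAGE_NRPAGES and contiguous struct pages.
 */
static void sketch_clear_huge_page_mpage(struct page *page, unsigned long addr,
					 unsigned int nr_pages)
{
	int chunk, nr_chunks = nr_pages / MPAGE_NRPAGES;
	unsigned int i;

	for (chunk = nr_chunks - 1; chunk >= 0; chunk--) {
		unsigned int first = chunk * MPAGE_NRPAGES;

		for (i = first; i < first + MPAGE_NRPAGES; i++)
			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
		cond_resched();
	}
}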

> 
>>> Can you further optimize the prototype patch below?  I think that it
>>> has
>>> potential to fix your issue.
>>
>> Yes, thanks for your help, but this will make process_huge_page() a
>> little more complicated :)
> 
> IMHO, we should try to root cause it, then try to find the proper
> solution and optimize (simplify) it.

From the above fallocate test, it seems that different microarchitectures
have different forward/backward clearing performance on Intel too.


Thread overview: 33+ messages
2024-10-26  5:43 Kefeng Wang
2024-10-26  5:43 ` [PATCH v2 2/2] mm: use aligned address in copy_user_gigantic_page() Kefeng Wang
2024-10-28 10:01   ` David Hildenbrand
2024-10-28  6:17 ` [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page() Huang, Ying
2024-10-28  6:35   ` Kefeng Wang
2024-10-28  7:03     ` Huang, Ying
2024-10-28  8:35       ` Kefeng Wang
2024-10-28 10:00 ` David Hildenbrand
2024-10-28 12:52   ` Kefeng Wang
2024-10-28 13:14     ` David Hildenbrand
2024-10-28 13:33       ` Kefeng Wang
2024-10-28 13:46         ` David Hildenbrand
2024-10-28 14:22           ` Kefeng Wang
2024-10-28 14:24             ` David Hildenbrand
2024-10-29 13:04               ` Kefeng Wang
2024-10-29 14:04                 ` David Hildenbrand
2024-10-30  1:04                   ` Huang, Ying
2024-10-30  3:04                     ` Kefeng Wang
2024-10-30  3:21                       ` Huang, Ying
2024-10-30  5:05                         ` Kefeng Wang
2024-10-31  8:39                           ` Huang, Ying
2024-11-01  7:43                             ` Kefeng Wang
2024-11-01  8:16                               ` Huang, Ying
2024-11-01  9:45                                 ` Kefeng Wang
2024-11-04  2:35                                   ` Huang, Ying
2024-11-05  2:06                                     ` Kefeng Wang [this message]
2024-12-01  2:15                             ` Andrew Morton
2024-12-01  5:37                               ` Huang, Ying
2024-12-02  1:03                                 ` Kefeng Wang
2024-12-06  1:47                                   ` Andrew Morton
2024-12-06  2:08                                     ` Kefeng Wang
2024-11-01  6:18                           ` Huang, Ying
2024-11-01  7:51                             ` Kefeng Wang
