From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Muchun Song <muchun.song@linux.dev>, <linux-mm@kvack.org>,
Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
Date: Tue, 5 Nov 2024 10:06:19 +0800 [thread overview]
Message-ID: <addb25dc-fb1c-4005-a04c-02beb95240de@huawei.com> (raw)
In-Reply-To: <87ikt3914v.fsf@yhuang6-desk2.ccr.corp.intel.com>
On 2024/11/4 10:35, Huang, Ying wrote:
> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>
>> On 2024/11/1 16:16, Huang, Ying wrote:
>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>
>>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>> [snip]
>>>>>>
>>>>>>>> 1) Will run some rand tests to check the difference in performance, as
>>>>>>>> David suggested.
>>>>>>>> 2) Hope the LKP can run more tests since it is very useful (more test
>>>>>>>> sets and different machines).
>>>>>>> I'm starting to use LKP to test.
>>>>>>
>>>>>> Great.
>>>>
>>>>
>>>> Sorry for the late reply,
>>>>
>>>>> I have run some tests with LKP.
>>>>> Firstly, there's almost no measurable difference between clearing pages
>>>>> from start to end or from end to start on an Intel server CPU. I guess
>>>>> that there's some similar optimization for both directions.
>>>>> For multiple processes (same as the logical CPU number), the
>>>>> vm-scalability/anon-w-seq benchmark score increases by about 22.4%.
>>>>
>>>> So process_huge_page() is better than clear_gigantic_page() on Intel?
>>> For the vm-scalability/anon-w-seq test case, it is. Because the
>>> performance of forward and backward clearing is almost the same, and
>>> the user-space accesses get a cache-hot benefit.
>>>
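To make the two orderings concrete, here is a minimal user-space sketch (illustrative only, not the kernel code; clear_subpage(), SUBPAGE_SIZE and the single-page convergence loop are simplifications of what the kernel actually does): clear_forward() sweeps front to back like clear_gigantic_page(), while clear_toward_fault() clears the subpages far from the faulting one first, so the faulted subpage is still cache-hot when userspace resumes, which is the idea behind process_huge_page().

/*
 * Illustrative sketch only, not the kernel implementation.
 * nr_pages subpages make up one huge page; fault_idx is the
 * subpage the user touched first.
 */
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE	4096UL

/* hypothetical helper standing in for clear_user_highpage() */
static void clear_subpage(char *base, size_t idx)
{
	memset(base + idx * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

/* clear_gigantic_page() style: simple forward sweep */
static void clear_forward(char *base, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		clear_subpage(base, i);
}

/*
 * process_huge_page() style ordering: clear the subpages far from
 * the faulting one first, converging on fault_idx so it is still
 * cache-hot when the faulting thread resumes.
 */
static void clear_toward_fault(char *base, size_t nr_pages, size_t fault_idx)
{
	size_t l = 0, r = nr_pages - 1;

	while (l < fault_idx || r > fault_idx) {
		if (r > fault_idx)
			clear_subpage(base, r--);	/* end-to-start portion */
		if (l < fault_idx)
			clear_subpage(base, l++);	/* start-to-end portion */
	}
	clear_subpage(base, fault_idx);	/* cleared last, so it stays warm */
}

The anon-w-seq gain on x86 comes from the cache-hot target of the second ordering; the ARM64 regression reported below appears to come from its end-to-start portion.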
>>>> Could you test the following case on x86?
>>>>
>>>> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>>> mkdir -p /hugetlbfs/
>>>> mount none /hugetlbfs/ -t hugetlbfs
>>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test
>>> It's not trivial for me to do this test, because 0day wraps test cases.
>>> Do you know which existing test case provides this? For example, in
>>> vm-scalability?
>>
>> I don't know of a public fallocate test; I will try to find an Intel
>> machine to test this case.
>
> I don't expect it to change much, because we have observed that the
> performance of forward and backward clearing is similar on Intel.
I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz:
Caches (sum of all):
L1d: 1.1 MiB (36 instances)
L1i: 1.1 MiB (36 instances)
L2: 36 MiB (36 instances)
L3: 49.5 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
Before:
 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,856.93 msec task-clock                          #    0.997 CPUs utilized
                 6      context-switches                    #    1.556 /sec
                 1      cpu-migrations                      #    0.259 /sec
               132      page-faults                         #   34.224 /sec
    11,520,934,848      cycles                              #    2.987 GHz                         (19.95%)
       213,731,011      instructions                        #    0.02  insn per cycle              (24.96%)
        58,164,361      branches                            #   15.080 M/sec                       (24.96%)
           262,547      branch-misses                       #    0.45% of all branches             (24.97%)
        96,029,321      CPU_CLK_UNHALTED.REF_XCLK           #   24.898 M/sec
                                                            #    0.3 % tma_frontend_bound
                                                            #    3.3 % tma_retiring
                                                            #   96.4 % tma_backend_bound
                                                            #    0.0 % tma_bad_speculation         (24.99%)
       149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE         #   38.822 M/sec                       (25.01%)
         2,486,326      INT_MISC.RECOVERY_CYCLES_ANY        #  644.638 K/sec                       (20.01%)
        95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE  #   24.883 M/sec                       (20.01%)
    11,526,783,305      CPU_CLK_UNHALTED.THREAD             #    2.989 G/sec                       (20.01%)
     1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS           #  393.855 M/sec                       (20.01%)
     1,526,020,825      UOPS_ISSUED.ANY                     #  395.657 M/sec                       (20.01%)
        59,784,189      L1-dcache-loads                     #   15.500 M/sec                       (20.01%)
       337,479,254      L1-dcache-load-misses               #  564.50% of all L1-dcache accesses   (20.02%)
           175,954      LLC-loads                           #   45.620 K/sec                       (20.02%)
            51,955      LLC-load-misses                     #   29.53% of all L1-icache accesses   (20.02%)
   <not supported>      L1-icache-loads
         2,864,230      L1-icache-load-misses                                                      (20.02%)
        59,769,391      dTLB-loads                          #   15.497 M/sec                       (20.02%)
               819      dTLB-load-misses                    #    0.00% of all dTLB cache accesses  (20.02%)
             2,459      iTLB-loads                          #  637.553 /sec                        (20.01%)
               370      iTLB-load-misses                    #   15.05% of all iTLB cache accesses  (19.98%)

       3.870393637 seconds time elapsed

       0.000000000 seconds user
       3.833021000 seconds sys
After (using clear_gigantic_page()):
 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          4,426.18 msec task-clock                          #    0.994 CPUs utilized
                 8      context-switches                    #    1.807 /sec
                 1      cpu-migrations                      #    0.226 /sec
               131      page-faults                         #   29.597 /sec
    13,221,263,588      cycles                              #    2.987 GHz                         (19.98%)
       215,924,995      instructions                        #    0.02  insn per cycle              (25.00%)
        58,430,182      branches                            #   13.201 M/sec                       (25.01%)
           279,381      branch-misses                       #    0.48% of all branches             (25.03%)
       110,199,114      CPU_CLK_UNHALTED.REF_XCLK           #   24.897 M/sec
                                                            #    0.3 % tma_frontend_bound
                                                            #    2.9 % tma_retiring
                                                            #   96.8 % tma_backend_bound
                                                            #    0.0 % tma_bad_speculation         (25.06%)
       160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE         #   36.296 M/sec                       (25.07%)
         2,559,970      INT_MISC.RECOVERY_CYCLES_ANY        #  578.370 K/sec                       (20.05%)
       110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE  #   24.904 M/sec                       (20.05%)
    13,227,924,727      CPU_CLK_UNHALTED.THREAD             #    2.989 G/sec                       (20.03%)
     1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS           #  344.545 M/sec                       (20.01%)
     1,531,307,263      UOPS_ISSUED.ANY                     #  345.966 M/sec                       (19.98%)
        60,600,471      L1-dcache-loads                     #   13.691 M/sec                       (19.96%)
       337,576,917      L1-dcache-load-misses               #  557.05% of all L1-dcache accesses   (19.96%)
           177,157      LLC-loads                           #   40.025 K/sec                       (19.96%)
            48,056      LLC-load-misses                     #   27.13% of all L1-icache accesses   (19.97%)
   <not supported>      L1-icache-loads
         2,653,617      L1-icache-load-misses                                                      (19.97%)
        60,609,241      dTLB-loads                          #   13.693 M/sec                       (19.97%)
               530      dTLB-load-misses                    #    0.00% of all dTLB cache accesses  (19.97%)
             1,952      iTLB-loads                          #  441.013 /sec                        (19.97%)
             3,059      iTLB-load-misses                    #  156.71% of all iTLB cache accesses  (19.97%)

       4.450664421 seconds time elapsed

       0.000984000 seconds user
       4.397795000 seconds sys
This shows that backward clearing is better than forward clearing, at least on this CPU (about 3.87 s vs 4.45 s elapsed, i.e. roughly 15% slower with clear_gigantic_page()).
>
>>>
>>>>> For the multiple-process vm-scalability/anon-w-rand test case, there is
>>>>> no measurable difference in the benchmark score.
>>>>> So, the optimization mainly helps sequential workloads.
>>>>> In summary, on x86, process_huge_page() will not introduce any
>>>>> regression, and it helps some workloads.
>>>>> However, on ARM64, it does introduce some regression when clearing
>>>>> pages from end to start. That needs to be addressed. I guess that the
>>>>> regression can be resolved by using more clearing from start to end
>>>>> (but not all). For example, can you take a look at the patch below?
>>>>> It uses a similar framework as before, but clears each small chunk
>>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>>> the regression is resolved.
>>>>> WARNING: the patch is only build tested.
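To make the prototype's idea concrete, here is a simplified sketch (illustrative only, not the posted patch; clear_subpage(), SUBPAGE_SIZE and the assumption that the subpage count is a multiple of MPAGE_NRPAGES are placeholders): the huge page is split into chunks of MPAGE_NRPAGES subpages, the chunk order still converges on the faulting chunk as before, but each chunk itself is cleared from start to end so the hardware keeps seeing forward streams.

/* Simplified sketch of the chunked-clearing idea, not the posted patch. */
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE	4096UL
#define MPAGE_NRPAGES	16	/* the tuning knob from the discussion */

static void clear_subpage(char *base, size_t idx)
{
	memset(base + idx * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

/* within a chunk, always sweep forward (start to end) */
static void clear_mpage(char *base, size_t first)
{
	for (size_t i = 0; i < MPAGE_NRPAGES; i++)
		clear_subpage(base, first + i);
}

/* chunk order: far from the faulting chunk first, faulting chunk last */
static void clear_huge_page_chunked(char *base, size_t nr_pages,
				    size_t fault_idx)
{
	size_t nr_mp = nr_pages / MPAGE_NRPAGES;	/* assume exact multiple */
	size_t fault_mp = fault_idx / MPAGE_NRPAGES;
	size_t l = 0, r = nr_mp - 1;

	while (l < fault_mp || r > fault_mp) {
		if (r > fault_mp)
			clear_mpage(base, r-- * MPAGE_NRPAGES);
		if (l < fault_mp)
			clear_mpage(base, l++ * MPAGE_NRPAGES);
	}
	clear_mpage(base, fault_mp * MPAGE_NRPAGES);
}

Raising MPAGE_NRPAGES lengthens the forward streams at the cost of some of the cache-hot benefit, which is presumably why comparing 16 and 64 on ARM64 is interesting.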
>>>>
>>>>
>>>> Base: baseline
>>>> Change1: using clear_gigantic_page() for 2M PMD
>>>> Change2: your patch with MPAGE_NRPAGES=16
>>>> Change3: Case3 + fix[1]
>>> What is case3?
>>
>> Oh, it is Change2.
>
> Got it.
>
>>>
>>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>>
>>>> 1. For rand write, case-anon-w-rand/case-anon-w-rand-hugetlb show no
>>>> measurable difference.
>>>>
>>>> 2. For seq write,
>>>>
>>>> 1) case-anon-w-seq-mt:
>>> Can you try case-anon-w-seq? That may be more stable.
>>>
>>>> base:
>>>> real 0m2.490s 0m2.254s 0m2.272s
>>>> user 1m59.980s 2m23.431s 2m18.739s
>>>> sys 1m3.675s 1m15.462s 1m15.030s
>>>>
>>>> Change1:
>>>> real 0m2.234s 0m2.225s 0m2.159s
>>>> user 2m56.105s 2m57.117s 3m0.489s
>>>> sys 0m17.064s 0m17.564s 0m16.150s
>>>>
>>>> Change2:
>>>> real 0m2.244s 0m2.384s 0m2.370s
>>>> user 2m39.413s 2m41.990s 2m42.229s
>>>> sys 0m19.826s 0m18.491s 0m18.053s
>>> It appears strange. There's not much cache-hot benefit even if we
>>> clear pages from end to beginning (with a larger chunk).
>>> However, sys time improves a lot. This shows that clearing pages with
>>> a large chunk helps on ARM64.
>>>
>>>> Change3: // best performance
>>>> real 0m2.155s 0m2.204s 0m2.194s
>>>> user 3m2.640s 2m55.837s 3m0.902s
>>>> sys 0m17.346s 0m17.630s 0m18.197s
>>>>
>>>> Change4:
>>>> real 0m2.287s 0m2.377s 0m2.284s
>>>> user 2m37.030s 2m52.868s 3m17.593s
>>>> sys 0m15.445s 0m34.430s 0m45.224s
>>> Change4 is essentially the same as Change1. I don't know why they are
>>> different. Is there some large variation from run to run?
>>
>> As shown above, I tested three times; the results are relatively
>> stable, at least for real time. I will try case-anon-w-seq.
>
> Can you also show the score of vm-scalability?
>
> TBH, I cannot understand your results. For example, why is there a
> measurable difference between Change3 and Change4? In both cases, the
> kernel clears pages from start to end.
OK, I will retest once I can access the machine again.
>
>>> Can you further optimize the prototype patch below? I think that it
>>> has the potential to fix your issue.
>>
>> Yes, thanks for your help, but this will make process_huge_page() a
>> little more complicated :)
>
> IMHO, we should try to root cause it, then try to find the proper
> solution and optimize (simplify) it.
From the above fallocate test on Intel, it seems that different
microarchitectures perform differently on Intel too.