From: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Tue, 5 Nov 2024 10:06:19 +0800
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
To: "Huang, Ying"
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
 linux-mm@kvack.org, Zi Yan
In-Reply-To: <87ikt3914v.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 2024/11/4 10:35, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/11/1 16:16, Huang, Ying wrote:
>>> Kefeng Wang writes:
>>>
>>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>>> Kefeng Wang writes:
>>>>> [snip]
>>>>>>
>>>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>>>> David suggested.
>>>>>>>> 2) Hope the LKP to run more tests since it is very useful (more test
>>>>>>>> set and different machines)
>>>>>>>
>>>>>>> I'm starting to use LKP to test.
>>>>>>
>>>>>> Greet.
>>>>
>>>> Sorry for the late,
>>>>
>>>>> I have run some tests with LKP to test.
>>>>> Firstly, there's almost no measurable difference between clearing pages
>>>>> from start to end or from end to start on Intel server CPU. I guess
>>>>> that there's some similar optimization for both direction.
>>>>> For multiple processes (same as logical CPU number)
>>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>>> about 22.4%.
>>>>
>>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>>>
>>> For vm-scalability/anon-w-seq test case, it is. Because the performance
>>> of forward and backward clearing is almost same, and the user space
>>> accessing has cache-hot benefit.
>>>
>>>> Could you test the following case on x86?
>>>> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>>> mkdir -p /hugetlbfs/
>>>> mount none /hugetlbfs/ -t hugetlbfs
>>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test &&
>>>> fallocate -d -l 20G /hugetlbfs/test &&
>>>> time taskset -c 10 fallocate -l 20G /hugetlbfs/test
>>>
>>> It's not trivial for me to do this test. Because 0day wraps test cases.
>>> Do you know which existing test cases provide this? For example, in
>>> vm-scalability?
>>
>> I don't know the public fallocate test, I will try to find a intel
>> machine to test this case.
>
> I don't expect it to change much, because we have observed that the
> performance of forward and backward clearing is similar on Intel.

I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz:

Caches (sum of all):
  L1d:  1.1 MiB (36 instances)
  L1i:  1.1 MiB (36 instances)
  L2:   36 MiB (36 instances)
  L3:   49.5 MiB (2 instances)
NUMA:
  NUMA node(s):      2
  NUMA node0 CPU(s): 0-17,36-53
  NUMA node1 CPU(s): 18-35,54-71

Before:

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

        3,856.93 msec task-clock                   #  0.997 CPUs utilized
               6      context-switches             #  1.556 /sec
               1      cpu-migrations               #  0.259 /sec
             132      page-faults                  #  34.224 /sec
  11,520,934,848      cycles                       #  2.987 GHz                          (19.95%)
     213,731,011      instructions                 #  0.02 insn per cycle                (24.96%)
      58,164,361      branches                     #  15.080 M/sec                       (24.96%)
         262,547      branch-misses                #  0.45% of all branches              (24.97%)
      96,029,321      CPU_CLK_UNHALTED.REF_XCLK    #  24.898 M/sec
                                                   #  0.3 % tma_frontend_bound
                                                   #  3.3 % tma_retiring
                                                   #  96.4 % tma_backend_bound
                                                   #  0.0 % tma_bad_speculation          (24.99%)
     149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE  #  38.822 M/sec                       (25.01%)
       2,486,326      INT_MISC.RECOVERY_CYCLES_ANY #  644.638 K/sec                      (20.01%)
      95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #  24.883 M/sec                 (20.01%)
  11,526,783,305      CPU_CLK_UNHALTED.THREAD      #  2.989 G/sec                        (20.01%)
   1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS    #  393.855 M/sec                      (20.01%)
   1,526,020,825      UOPS_ISSUED.ANY              #  395.657 M/sec                      (20.01%)
      59,784,189      L1-dcache-loads              #  15.500 M/sec                       (20.01%)
     337,479,254      L1-dcache-load-misses        #  564.50% of all L1-dcache accesses  (20.02%)
         175,954      LLC-loads                    #  45.620 K/sec                       (20.02%)
          51,955      LLC-load-misses              #  29.53% of all L1-icache accesses   (20.02%)
                      L1-icache-loads
       2,864,230      L1-icache-load-misses                                              (20.02%)
      59,769,391      dTLB-loads                   #  15.497 M/sec                       (20.02%)
             819      dTLB-load-misses             #  0.00% of all dTLB cache accesses   (20.02%)
           2,459      iTLB-loads                   #  637.553 /sec                       (20.01%)
             370      iTLB-load-misses             #  15.05% of all iTLB cache accesses  (19.98%)

     3.870393637 seconds time elapsed

     0.000000000 seconds user
     3.833021000 seconds sys

After (using clear_gigantic_page()):

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

        4,426.18 msec task-clock                   #  0.994 CPUs utilized
               8      context-switches             #  1.807 /sec
               1      cpu-migrations               #  0.226 /sec
             131      page-faults                  #  29.597 /sec
  13,221,263,588      cycles                       #  2.987 GHz                          (19.98%)
     215,924,995      instructions                 #  0.02 insn per cycle                (25.00%)
      58,430,182      branches                     #  13.201 M/sec                       (25.01%)
         279,381      branch-misses                #  0.48% of all branches              (25.03%)
     110,199,114      CPU_CLK_UNHALTED.REF_XCLK    #  24.897 M/sec
                                                   #  0.3 % tma_frontend_bound
                                                   #  2.9 % tma_retiring
                                                   #  96.8 % tma_backend_bound
                                                   #  0.0 % tma_bad_speculation          (25.06%)
     160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE  #  36.296 M/sec                       (25.07%)
       2,559,970      INT_MISC.RECOVERY_CYCLES_ANY #  578.370 K/sec                      (20.05%)
     110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #  24.904 M/sec                 (20.05%)
  13,227,924,727      CPU_CLK_UNHALTED.THREAD      #  2.989 G/sec                        (20.03%)
   1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS    #  344.545 M/sec                      (20.01%)
   1,531,307,263      UOPS_ISSUED.ANY              #  345.966 M/sec                      (19.98%)
      60,600,471      L1-dcache-loads              #  13.691 M/sec                       (19.96%)
     337,576,917      L1-dcache-load-misses        #  557.05% of all L1-dcache accesses  (19.96%)
         177,157      LLC-loads                    #  40.025 K/sec                       (19.96%)
          48,056      LLC-load-misses              #  27.13% of all L1-icache accesses   (19.97%)
                      L1-icache-loads
       2,653,617      L1-icache-load-misses                                              (19.97%)
      60,609,241      dTLB-loads                   #  13.693 M/sec                       (19.97%)
             530      dTLB-load-misses             #  0.00% of all dTLB cache accesses   (19.97%)
           1,952      iTLB-loads                   #  441.013 /sec                       (19.97%)
           3,059      iTLB-load-misses             #  156.71% of all iTLB cache accesses (19.97%)

     4.450664421 seconds time elapsed

     0.000984000 seconds user
     4.397795000 seconds sys

This shows that backward clearing is better than forward clearing, at
least on this CPU.
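A minimal user-space sketch of the same forward-vs-backward comparison
(illustrative only, not the kernel path that the perf numbers above
measure; the 1 GiB buffer, the 4 KiB chunk size and the memset()-based
clearing are arbitrary stand-ins for per-subpage clearing):

/* Hypothetical microbenchmark, not kernel code: clear a buffer in 4 KiB
 * chunks from start to end ("forward") and from end to start ("backward")
 * and report the time for each direction. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK	4096UL			/* stand-in for a subpage */
#define BUFSZ	(1UL << 30)		/* 1 GiB, arbitrary */

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *buf = malloc(BUFSZ);
	unsigned long off;
	double t;

	if (!buf)
		return 1;
	memset(buf, 1, BUFSZ);		/* fault the whole buffer in first */

	t = now();
	for (off = 0; off < BUFSZ; off += CHUNK)	/* forward */
		memset(buf + off, 0, CHUNK);
	printf("forward:  %.3f s\n", now() - t);

	t = now();
	for (off = BUFSZ; off > 0; off -= CHUNK)	/* backward */
		memset(buf + off - CHUNK, 0, CHUNK);
	printf("backward: %.3f s\n", now() - t);

	printf("check: %d\n", buf[0]);	/* keep the stores observable */
	free(buf);
	return 0;
}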
>
>>>
>>>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>>>> measurable difference for benchmark score.
>>>>> So, the optimization helps sequential workload mainly.
>>>>> In summary, on x86, process_huge_page() will not introduce any
>>>>> regression. And it helps some workload.
>>>>> However, on ARM64, it does introduce some regression for clearing
>>>>> pages from end to start. That needs to be addressed. I guess that the
>>>>> regression can be resolved via using more clearing from start to end
>>>>> (but not all). For example, can you take a look at the patch below?
>>>>> Which uses the similar framework as before, but clear each small trunk
>>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>>> the regression can be restored.
>>>>> WARNING: the patch is only build tested.
>>>>
>>>> Base: baseline
>>>> Change1: using clear_gigantic_page() for 2M PMD
>>>> Change2: your patch with MPAGE_NRPAGES=16
>>>> Change3: Case3 + fix[1]
>>> What is case3?
>>
>> Oh, it is Change2.
>
> Got it.
>
>>>
>>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>>
>>>> 1. For rand write,
>>>> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>>
>>>> 2. For seq write,
>>>>
>>>> 1) case-anon-w-seq-mt:
>>> Can you try case-anon-w-seq? That may be more stable.
>>>
>>>> base:
>>>> real 0m2.490s  0m2.254s  0m2.272s
>>>> user 1m59.980s 2m23.431s 2m18.739s
>>>> sys  1m3.675s  1m15.462s 1m15.030s
>>>>
>>>> Change1:
>>>> real 0m2.234s  0m2.225s  0m2.159s
>>>> user 2m56.105s 2m57.117s 3m0.489s
>>>> sys  0m17.064s 0m17.564s 0m16.150s
>>>>
>>>> Change2:
>>>> real 0m2.244s  0m2.384s  0m2.370s
>>>> user 2m39.413s 2m41.990s 2m42.229s
>>>> sys  0m19.826s 0m18.491s 0m18.053s
>>> It appears strange. There's no much cache hot benefit even if we clear
>>> pages from end to begin (with larger chunk).
>>> However, sys time improves a lot. This shows clearing page with large
>>> chunk helps on ARM64.
>>>
>>>> Change3: // best performance
>>>> real 0m2.155s  0m2.204s  0m2.194s
>>>> user 3m2.640s  2m55.837s 3m0.902s
>>>> sys  0m17.346s 0m17.630s 0m18.197s
>>>>
>>>> Change4:
>>>> real 0m2.287s  0m2.377s  0m2.284s
>>>> user 2m37.030s 2m52.868s 3m17.593s
>>>> sys  0m15.445s 0m34.430s 0m45.224s
>>> Change4 is essentially same as Change1. I don't know why they are
>>> different. Is there some large variation among run to run?
>>
>> As above shown, I test three times, the test results are relatively
>> stable, at least for real, I will try case-anon-w-seq.
>
> Can you also show the score of vm-scalability?
>
> TBH, I cannot understand your results. For example, why there are
> measurable difference between Change3 and Change4? In both cases, the
> kernel clears pages from start to end.

OK, I will retest once I can access the machine again.

>
>>> Can you further optimize the prototype patch below? I think that it
>>> has potential to fix your issue.
>>
>> Yes, thanks for you helper, but this will make process_huge_page() a
>> little more complicated :)
>
> IMHO, we should try to root cause it, then try to find the proper
> solution and optimize (simplifies) it.

From the above fallocate test on Intel, it seems that the clearing
direction also performs differently across different Intel
microarchitectures.
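For reference, the mpage/MPAGE_NRPAGES idea discussed above boils down
to roughly the ordering below. This is only a simplified user-space
sketch of the idea, not the actual prototype patch; NR_SUBPAGES, the
target index and the clear_subpage() stand-in are made up for
illustration.

/* Simplified sketch of the chunked-clearing order (not the real patch):
 * every chunk of MPAGE_NRPAGES subpages except the one holding the
 * faulting address is cleared first, walking chunks from the end of the
 * huge page towards the beginning; the chunk with the faulting address
 * is cleared last so its cache lines stay hot, and inside every chunk
 * the subpages are always cleared in ascending order. */
#include <stdio.h>

#define NR_SUBPAGES	64	/* illustrative; a 2M huge page has 512 4K subpages */
#define MPAGE_NRPAGES	16	/* tunable chunk size, as in the discussion */

static void clear_subpage(int idx)
{
	/* stand-in for clearing one subpage */
	printf("clear subpage %d\n", idx);
}

int main(void)
{
	int target = 3;		/* subpage index of the faulting address */
	int nr_chunks = NR_SUBPAGES / MPAGE_NRPAGES;
	int target_chunk = target / MPAGE_NRPAGES;
	int chunk, i;

	/* Clear every chunk except the target one, last chunk first. */
	for (chunk = nr_chunks - 1; chunk >= 0; chunk--) {
		if (chunk == target_chunk)
			continue;
		for (i = 0; i < MPAGE_NRPAGES; i++)	/* forward within a chunk */
			clear_subpage(chunk * MPAGE_NRPAGES + i);
	}

	/* Finally clear the chunk that contains the faulting address. */
	for (i = 0; i < MPAGE_NRPAGES; i++)
		clear_subpage(target_chunk * MPAGE_NRPAGES + i);

	return 0;
}

Whether that recovers the ARM64 numbers presumably depends on how large
MPAGE_NRPAGES has to be, which is what the Change2/Change4 runs above
are probing.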