From: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Tue, 5 Nov 2024 10:06:19 +0800
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
To: "Huang, Ying"
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
 linux-mm@kvack.org, Zi Yan
In-Reply-To: <87ikt3914v.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 2024/11/4 10:35, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/11/1 16:16, Huang, Ying wrote:
>>> Kefeng Wang writes:
>>>
>>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>>> Kefeng Wang writes:
>>>>> [snip]
>>>>>>
>>>>>>>> 1) Will test some rand test to check the different of performance as
>>>>>>>> David suggested.
>>>>>>>> 2) Hope the LKP to run more tests since it is very useful (more test
>>>>>>>> set and different machines)
>>>>>>>
>>>>>>> I'm starting to use LKP to test.
>>>>>>
>>>>>> Greet.
>>>>
>>>> Sorry for the late,
>>>>
>>>>> I have run some tests with LKP to test.
>>>>> Firstly, there's almost no measurable difference between clearing pages
>>>>> from start to end or from end to start on Intel server CPU. I guess
>>>>> that there's some similar optimization for both direction.
>>>>> For multiple processes (same as logical CPU number)
>>>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>>>> about 22.4%.
>>>>
>>>> So process_huge_page is better than clear_gigantic_page() on Intel?
>>>
>>> For vm-scalability/anon-w-seq test case, it is. Because the performance
>>> of forward and backward clearing is almost same, and the user space
>>> accessing has cache-hot benefit.
>>>
>>>> Could you test the following case on x86?
>>>> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>>> mkdir -p /hugetlbfs/
>>>> mount none /hugetlbfs/ -t hugetlbfs
>>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test &&
>>>> fallocate -d -l 20G /hugetlbfs/test &&
>>>> time taskset -c 10 fallocate -l 20G /hugetlbfs/test
>>>
>>> It's not trivial for me to do this test. Because 0day wraps test cases.
>>> Do you know which existing test cases provide this? For example, in
>>> vm-scalability?
>>
>> I don't know the public fallocate test, I will try to find a intel
>> machine to test this case.
>
> I don't expect it to change much, because we have observed that the
> performance of forward and backward clearing is similar on Intel.

I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz:

Caches (sum of all):
  L1d:  1.1 MiB (36 instances)
  L1i:  1.1 MiB (36 instances)
  L2:   36 MiB (36 instances)
  L3:   49.5 MiB (2 instances)
NUMA:
  NUMA node(s):      2
  NUMA node0 CPU(s): 0-17,36-53
  NUMA node1 CPU(s): 18-35,54-71

Before:

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

        3,856.93 msec task-clock                   #  0.997 CPUs utilized
               6      context-switches             #  1.556 /sec
               1      cpu-migrations               #  0.259 /sec
             132      page-faults                  #  34.224 /sec
  11,520,934,848      cycles                       #  2.987 GHz                          (19.95%)
     213,731,011      instructions                 #  0.02 insn per cycle                (24.96%)
      58,164,361      branches                     #  15.080 M/sec                       (24.96%)
         262,547      branch-misses                #  0.45% of all branches              (24.97%)
      96,029,321      CPU_CLK_UNHALTED.REF_XCLK    #  24.898 M/sec
                                                   #  0.3 % tma_frontend_bound
                                                   #  3.3 % tma_retiring
                                                   #  96.4 % tma_backend_bound
                                                   #  0.0 % tma_bad_speculation          (24.99%)
     149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE  #  38.822 M/sec                       (25.01%)
       2,486,326      INT_MISC.RECOVERY_CYCLES_ANY #  644.638 K/sec                      (20.01%)
      95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #  24.883 M/sec                 (20.01%)
  11,526,783,305      CPU_CLK_UNHALTED.THREAD      #  2.989 G/sec                        (20.01%)
   1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS    #  393.855 M/sec                      (20.01%)
   1,526,020,825      UOPS_ISSUED.ANY              #  395.657 M/sec                      (20.01%)
      59,784,189      L1-dcache-loads              #  15.500 M/sec                       (20.01%)
     337,479,254      L1-dcache-load-misses        #  564.50% of all L1-dcache accesses  (20.02%)
         175,954      LLC-loads                    #  45.620 K/sec                       (20.02%)
          51,955      LLC-load-misses              #  29.53% of all L1-icache accesses   (20.02%)
                      L1-icache-loads
       2,864,230      L1-icache-load-misses                                              (20.02%)
      59,769,391      dTLB-loads                   #  15.497 M/sec                       (20.02%)
             819      dTLB-load-misses             #  0.00% of all dTLB cache accesses   (20.02%)
           2,459      iTLB-loads                   #  637.553 /sec                       (20.01%)
             370      iTLB-load-misses             #  15.05% of all iTLB cache accesses  (19.98%)

     3.870393637 seconds time elapsed

     0.000000000 seconds user
     3.833021000 seconds sys

After (using clear_gigantic_page()):

 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

        4,426.18 msec task-clock                   #  0.994 CPUs utilized
               8      context-switches             #  1.807 /sec
               1      cpu-migrations               #  0.226 /sec
             131      page-faults                  #  29.597 /sec
  13,221,263,588      cycles                       #  2.987 GHz                          (19.98%)
     215,924,995      instructions                 #  0.02 insn per cycle                (25.00%)
      58,430,182      branches                     #  13.201 M/sec                       (25.01%)
         279,381      branch-misses                #  0.48% of all branches              (25.03%)
     110,199,114      CPU_CLK_UNHALTED.REF_XCLK    #  24.897 M/sec
                                                   #  0.3 % tma_frontend_bound
                                                   #  2.9 % tma_retiring
                                                   #  96.8 % tma_backend_bound
                                                   #  0.0 % tma_bad_speculation          (25.06%)
     160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE  #  36.296 M/sec                       (25.07%)
       2,559,970      INT_MISC.RECOVERY_CYCLES_ANY #  578.370 K/sec                      (20.05%)
     110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE #  24.904 M/sec                 (20.05%)
  13,227,924,727      CPU_CLK_UNHALTED.THREAD      #  2.989 G/sec                        (20.03%)
   1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS    #  344.545 M/sec                      (20.01%)
   1,531,307,263      UOPS_ISSUED.ANY              #  345.966 M/sec                      (19.98%)
      60,600,471      L1-dcache-loads              #  13.691 M/sec                       (19.96%)
     337,576,917      L1-dcache-load-misses        #  557.05% of all L1-dcache accesses  (19.96%)
         177,157      LLC-loads                    #  40.025 K/sec                       (19.96%)
          48,056      LLC-load-misses              #  27.13% of all L1-icache accesses   (19.97%)
                      L1-icache-loads
       2,653,617      L1-icache-load-misses                                              (19.97%)
      60,609,241      dTLB-loads                   #  13.693 M/sec                       (19.97%)
             530      dTLB-load-misses             #  0.00% of all dTLB cache accesses   (19.97%)
           1,952      iTLB-loads                   #  441.013 /sec                       (19.97%)
           3,059      iTLB-load-misses             #  156.71% of all iTLB cache accesses (19.97%)

     4.450664421 seconds time elapsed

     0.000984000 seconds user
     4.397795000 seconds sys

This shows that backward clearing is better than forward clearing, at
least on this CPU.
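A minimal user-space sketch of the same forward-vs-backward comparison
(illustrative only, not the kernel path that the perf numbers above
measure; the 1 GiB buffer, the 4 KiB chunk size and the memset()-based
clearing are arbitrary stand-ins for per-subpage clearing):

/* Hypothetical microbenchmark, not kernel code: clear a buffer in 4 KiB
 * chunks from start to end ("forward") and from end to start ("backward")
 * and report the time for each direction. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK	4096UL			/* stand-in for a subpage */
#define BUFSZ	(1UL << 30)		/* 1 GiB, arbitrary */

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *buf = malloc(BUFSZ);
	unsigned long off;
	double t;

	if (!buf)
		return 1;
	memset(buf, 1, BUFSZ);		/* fault the whole buffer in first */

	t = now();
	for (off = 0; off < BUFSZ; off += CHUNK)	/* forward */
		memset(buf + off, 0, CHUNK);
	printf("forward:  %.3f s\n", now() - t);

	t = now();
	for (off = BUFSZ; off > 0; off -= CHUNK)	/* backward */
		memset(buf + off - CHUNK, 0, CHUNK);
	printf("backward: %.3f s\n", now() - t);

	printf("check: %d\n", buf[0]);	/* keep the stores observable */
	free(buf);
	return 0;
}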
>
>>>
>>>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>>>> measurable difference for benchmark score.
>>>>> So, the optimization helps sequential workload mainly.
>>>>> In summary, on x86, process_huge_page() will not introduce any
>>>>> regression. And it helps some workload.
>>>>> However, on ARM64, it does introduce some regression for clearing
>>>>> pages from end to start. That needs to be addressed. I guess that the
>>>>> regression can be resolved via using more clearing from start to end
>>>>> (but not all). For example, can you take a look at the patch below?
>>>>> Which uses the similar framework as before, but clear each small trunk
>>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>>> the regression can be restored.
>>>>> WARNING: the patch is only build tested.
>>>>
>>>> Base: baseline
>>>> Change1: using clear_gigantic_page() for 2M PMD
>>>> Change2: your patch with MPAGE_NRPAGES=16
>>>> Change3: Case3 + fix[1]
>>> What is case3?
>>
>> Oh, it is Change2.
>
> Got it.
>
>>>
>>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>>
>>>> 1. For rand write,
>>>> case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
>>>>
>>>> 2. For seq write,
>>>>
>>>> 1) case-anon-w-seq-mt:
>>> Can you try case-anon-w-seq? That may be more stable.
>>>
>>>> base:
>>>> real 0m2.490s  0m2.254s  0m2.272s
>>>> user 1m59.980s 2m23.431s 2m18.739s
>>>> sys  1m3.675s  1m15.462s 1m15.030s
>>>>
>>>> Change1:
>>>> real 0m2.234s  0m2.225s  0m2.159s
>>>> user 2m56.105s 2m57.117s 3m0.489s
>>>> sys  0m17.064s 0m17.564s 0m16.150s
>>>>
>>>> Change2:
>>>> real 0m2.244s  0m2.384s  0m2.370s
>>>> user 2m39.413s 2m41.990s 2m42.229s
>>>> sys  0m19.826s 0m18.491s 0m18.053s
>>> It appears strange. There's no much cache hot benefit even if we clear
>>> pages from end to begin (with larger chunk).
>>> However, sys time improves a lot. This shows clearing page with large
>>> chunk helps on ARM64.
>>>
>>>> Change3: // best performance
>>>> real 0m2.155s  0m2.204s  0m2.194s
>>>> user 3m2.640s  2m55.837s 3m0.902s
>>>> sys  0m17.346s 0m17.630s 0m18.197s
>>>>
>>>> Change4:
>>>> real 0m2.287s  0m2.377s  0m2.284s
>>>> user 2m37.030s 2m52.868s 3m17.593s
>>>> sys  0m15.445s 0m34.430s 0m45.224s
>>> Change4 is essentially same as Change1. I don't know why they are
>>> different. Is there some large variation among run to run?
>>
>> As above shown, I test three times, the test results are relatively
>> stable, at least for real, I will try case-anon-w-seq.
>
> Can you also show the score of vm-scalability?
>
> TBH, I cannot understand your results. For example, why there are
> measurable difference between Change3 and Change4? In both cases, the
> kernel clears pages from start to end.

OK, I will retest once I can access the machine again.

>
>>> Can you further optimize the prototype patch below? I think that it
>>> has potential to fix your issue.
>>
>> Yes, thanks for you helper, but this will make process_huge_page() a
>> little more complicated :)
>
> IMHO, we should try to root cause it, then try to find the proper
> solution and optimize (simplifies) it.

From the above fallocate test on Intel, it seems that the clearing
direction also performs differently across different Intel
microarchitectures.
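For reference, the mpage/MPAGE_NRPAGES idea discussed above boils down
to roughly the ordering below. This is only a simplified user-space
sketch of the idea, not the actual prototype patch; NR_SUBPAGES, the
target index and the clear_subpage() stand-in are made up for
illustration.

/* Simplified sketch of the chunked-clearing order (not the real patch):
 * every chunk of MPAGE_NRPAGES subpages except the one holding the
 * faulting address is cleared first, walking chunks from the end of the
 * huge page towards the beginning; the chunk with the faulting address
 * is cleared last so its cache lines stay hot, and inside every chunk
 * the subpages are always cleared in ascending order. */
#include <stdio.h>

#define NR_SUBPAGES	64	/* illustrative; a 2M huge page has 512 4K subpages */
#define MPAGE_NRPAGES	16	/* tunable chunk size, as in the discussion */

static void clear_subpage(int idx)
{
	/* stand-in for clearing one subpage */
	printf("clear subpage %d\n", idx);
}

int main(void)
{
	int target = 3;		/* subpage index of the faulting address */
	int nr_chunks = NR_SUBPAGES / MPAGE_NRPAGES;
	int target_chunk = target / MPAGE_NRPAGES;
	int chunk, i;

	/* Clear every chunk except the target one, last chunk first. */
	for (chunk = nr_chunks - 1; chunk >= 0; chunk--) {
		if (chunk == target_chunk)
			continue;
		for (i = 0; i < MPAGE_NRPAGES; i++)	/* forward within a chunk */
			clear_subpage(chunk * MPAGE_NRPAGES + i);
	}

	/* Finally clear the chunk that contains the faulting address. */
	for (i = 0; i < MPAGE_NRPAGES; i++)
		clear_subpage(target_chunk * MPAGE_NRPAGES + i);

	return 0;
}

Whether that recovers the ARM64 numbers presumably depends on how large
MPAGE_NRPAGES has to be, which is what the Change2/Change4 runs above
are probing.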