From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <85b42de2-5461-4901-ad95-884422beb68e@huawei.com>
Date: Fri, 1 Nov 2024 15:51:58 +0800
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
To: "Huang, Ying"
CC: David Hildenbrand, Andrew Morton, Matthew Wilcox,
 Muchun Song, , Zi Yan
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
 <87r07v8oj5.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Kefeng Wang
In-Reply-To: <87r07v8oj5.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

On 2024/11/1 14:18, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/10/30 11:21, Huang, Ying wrote:
>>> Kefeng Wang writes:
>>>
>>>> On 2024/10/30 9:04, Huang, Ying wrote:
>>>>> David Hildenbrand writes:
>>>>>
>>>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure about this part, it uses nr_pages as the end and calculates
>>>>>>>>>>>>> the 'base'.
>>>>>>>>>>>>
>>>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>>>
>>>>>>>>>>> But process_huge_page() doesn't take an explicit folio argument; I'd like
>>>>>>>>>>> to move the aligned address calculation into folio_zero_user() and
>>>>>>>>>>> copy_user_large_folio() (will rename it to folio_copy_user()) in the
>>>>>>>>>>> following cleanup patches, or should that be done in the fix patches?
>>>>>>>>>>
>>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>>>> folio*? Because we like our code to be extra complicated to understand?
>>>>>>>>>> Or am I missing something important?
>>>>>>>>>
>>>>>>>>> folio_zero_user() was used for PMD-sized THP and HugeTLB before, and
>>>>>>>>> since anon mTHP was supported it is used for order-2 to PMD-order THP
>>>>>>>>> and HugeTLB, so it won't process a small folio, if I understand correctly.
>>>>>>>>
>>>>>>>> And unfortunately neither the documentation nor the function name
>>>>>>>> expresses that :(
>>>>>>>>
>>>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>>>
>>>>>>> Actually, could we drop process_huge_page() entirely? From my
>>>>>>> testcase[1], process_huge_page() is not better than clearing/copying pages
>>>>>>> from start to last, and sequential clearing/copying may be more
>>>>>>> beneficial to the hardware prefetching. Is there a way to let lkp
>>>>>>> run tests to check the performance? Since process_huge_page()
>>>>>>> was submitted by Ying, what's your opinion?
>>>>>
>>>>> I don't think that it's a good idea to revert the commit without
>>>>> studying and root causing the issues. I can work together with you on
>>>>> that. If we have solid and well explained data to prove
>>>>> process_huge_page() isn't beneficial, we can revert the commit.
>>>>
>>>>
>>>> Take 'fallocate 20G' as an example, before:
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>>
>>> IIUC, fallocate will zero pages, but will not touch them at all,
>>> right?
>>> If so, no cache benefit from clearing the referenced page last.
>>
>>
>> Yes, for this case, only clear page.
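To make the two orderings under discussion concrete, here is a minimal
userspace sketch (not the kernel implementation; clear_subpage() and
SUBPAGE_SIZE are hypothetical stand-ins for clear_user_highpage() and
PAGE_SIZE, and clear_target_last() only illustrates the "clear the hinted
subpage last so its cache lines stay warm" idea behind process_huge_page(),
not its exact ordering):

#include <string.h>

#define SUBPAGE_SIZE	4096UL

/* Hypothetical stand-in for clear_user_highpage(). */
void clear_subpage(char *base, unsigned long idx)
{
	memset(base + idx * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

/* Start-to-end clearing, the prefetcher-friendly pattern that
 * clear_gigantic_page() effectively performs. */
void clear_sequential(char *base, unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		clear_subpage(base, i);
}

/* Clear the subpage that the faulting access will touch first (the
 * address hint) last, so its cache lines are most likely still warm
 * afterwards, at the cost of a non-sequential store pattern. */
void clear_target_last(char *base, unsigned long nr_pages,
		       unsigned long target)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		if (i != target)
			clear_subpage(base, i);
	clear_subpage(base, target);
}

For fallocate on hugetlbfs nothing touches the pages afterwards, so only
the access pattern of the clearing loop itself matters, which is where the
sequential variant wins in the numbers below.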
>>>
>>>>           3,118.94 msec task-clock              #    0.999 CPUs utilized
>>>>                  30     context-switches        #    0.010 K/sec
>>>>                   1     cpu-migrations          #    0.000 K/sec
>>>>                 136     page-faults             #    0.044 K/sec
>>>>       8,092,075,873     cycles                  #    2.594 GHz                        (92.82%)
>>>>       1,624,587,663     instructions            #    0.20  insn per cycle             (92.83%)
>>>>         395,341,850     branches                #  126.755 M/sec                      (92.82%)
>>>>           3,872,302     branch-misses           #    0.98% of all branches            (92.83%)
>>>>       1,398,066,701     L1-dcache-loads         #  448.251 M/sec                      (92.82%)
>>>>          58,124,626     L1-dcache-load-misses   #    4.16% of all L1-dcache accesses  (92.82%)
>>>>           1,032,527     LLC-loads               #    0.331 M/sec                      (92.82%)
>>>>             498,684     LLC-load-misses         #   48.30% of all LL-cache accesses   (92.84%)
>>>>         473,689,004     L1-icache-loads         #  151.875 M/sec                      (92.82%)
>>>>             356,721     L1-icache-load-misses   #    0.08% of all L1-icache accesses  (92.85%)
>>>>       1,947,644,987     dTLB-loads              #  624.458 M/sec                      (92.95%)
>>>>              10,185     dTLB-load-misses        #    0.00% of all dTLB cache accesses (92.96%)
>>>>         474,622,896     iTLB-loads              #  152.174 M/sec                      (92.95%)
>>>>                  94     iTLB-load-misses        #    0.00% of all iTLB cache accesses (85.69%)
>>>>
>>>>         3.122844830 seconds time elapsed
>>>>
>>>>         0.000000000 seconds user
>>>>         3.107259000 seconds sys
>>>>
>>>> and after (clear from start to end):
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>>>
>>>>           1,135.53 msec task-clock              #    0.999 CPUs utilized
>>>>                  10     context-switches        #    0.009 K/sec
>>>>                   1     cpu-migrations          #    0.001 K/sec
>>>>                 137     page-faults             #    0.121 K/sec
>>>>       2,946,673,587     cycles                  #    2.595 GHz                        (92.67%)
>>>>       1,620,704,205     instructions            #    0.55  insn per cycle             (92.61%)
>>>>         394,595,772     branches                #  347.499 M/sec                      (92.60%)
>>>>             130,756     branch-misses           #    0.03% of all branches            (92.84%)
>>>>       1,396,726,689     L1-dcache-loads         # 1230.022 M/sec                      (92.96%)
>>>>             338,344     L1-dcache-load-misses   #    0.02% of all L1-dcache accesses  (92.95%)
>>>>             111,737     LLC-loads               #    0.098 M/sec                      (92.96%)
>>>>              67,486     LLC-load-misses         #   60.40% of all LL-cache accesses   (92.96%)
>>>>         418,198,663     L1-icache-loads         #  368.285 M/sec                      (92.96%)
>>>>             173,764     L1-icache-load-misses   #    0.04% of all L1-icache accesses  (92.96%)
>>>>       2,203,364,632     dTLB-loads              # 1940.385 M/sec                      (92.96%)
>>>>              17,195     dTLB-load-misses        #    0.00% of all dTLB cache accesses (92.95%)
>>>>         418,198,365     iTLB-loads              #  368.285 M/sec                      (92.96%)
>>>>                  79     iTLB-load-misses        #    0.00% of all iTLB cache accesses (85.34%)
>>>>
>>>>         1.137015760 seconds time elapsed
>>>>
>>>>         0.000000000 seconds user
>>>>         1.131266000 seconds sys
>>>>
>>>> The IPC improved a lot, with fewer LLC-loads and more L1-dcache-loads, but
>>>> this depends on the implementation of the microarchitecture.
>>>
>>> Anyway, we need to avoid (or at least reduce) the pure memory
>>> clearing performance regression. Have you double-checked whether
>>> process_huge_page() is inlined? A perf-profile result can be used to
>>> check this too.
>>>
>>
>> Yes, I'm sure process_huge_page() is inlined.
>>
>>> When you say from start to end, do you mean using clear_gigantic_page()
>>> directly, or changing process_huge_page() to clear pages from start to end?
>>>
>>
>> Using clear_gigantic_page() and changing process_huge_page() to clear
>> pages from start to end are both good for performance with sequential
>> clearing, but no random-access test so far.
>>
>>>> 1) Will run some random-access tests to check the performance difference,
>>>> as David suggested.
>>>>
>>>> 2) Hope the LKP can run more tests since it is very useful (more test
>>>> sets and different machines).
>>>
>>> I'm starting to use LKP to test.
>
> https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/
>
> Just remembered that we have discussed a similar issue for arm64 before.
> Can you take a look at it? There's more discussion and tests/results in
> the thread, I think that may be helpful.
>

Thanks for the tips, will check it.

> --
> Best Regards,
> Huang, Ying
>
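For the random-access case in 1) above, a rough userspace harness along
these lines could be used (hypothetical, not an existing LKP job; it reuses
the clear_sequential()/clear_target_last() helpers and SUBPAGE_SIZE from the
earlier sketch, and now_sec()/NR_SUBPAGES are likewise made up for
illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SUBPAGE_SIZE	4096UL
#define NR_SUBPAGES	512UL		/* one 2 MiB "huge page" */

/* Helpers from the earlier sketch. */
void clear_sequential(char *base, unsigned long nr_pages);
void clear_target_last(char *base, unsigned long nr_pages, unsigned long target);

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *buf;
	unsigned long i;
	volatile char sink = 0;
	double t0, t1, t2;

	if (posix_memalign((void **)&buf, 2UL << 20, NR_SUBPAGES * SUBPAGE_SIZE))
		return 1;

	t0 = now_sec();
	clear_sequential(buf, NR_SUBPAGES);	/* or clear_target_last(buf, NR_SUBPAGES, 0) */
	t1 = now_sec();

	/* Touch the subpages in pseudo-random order instead of a
	 * start-to-end scan, to mimic non-sequential access after the
	 * clearing step. */
	srand(0);
	for (i = 0; i < NR_SUBPAGES; i++)
		sink ^= buf[(rand() % NR_SUBPAGES) * SUBPAGE_SIZE];
	t2 = now_sec();

	printf("clear %.6fs, random touch %.6fs\n", t1 - t0, t2 - t1);
	(void)sink;
	free(buf);
	return 0;
}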