From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <85b42de2-5461-4901-ad95-884422beb68e@huawei.com>
Date: Fri, 1 Nov 2024 15:51:58 +0800
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
To: "Huang, Ying"
CC: David Hildenbrand, Andrew Morton, Matthew Wilcox,
 Muchun Song, , Zi Yan
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
 <87r07v8oj5.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Kefeng Wang
In-Reply-To: <87r07v8oj5.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

On 2024/11/1 14:18, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/10/30 11:21, Huang, Ying wrote:
>>> Kefeng Wang writes:
>>>
>>>> On 2024/10/30 9:04, Huang, Ying wrote:
>>>>> David Hildenbrand writes:
>>>>>
>>>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, let's fix the bug firstly,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure about this part, it uses nr_pages as the end and calculates
>>>>>>>>>>>>> the 'base'.
>>>>>>>>>>>>
>>>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>>>
>>>>>>>>>>> But process_huge_page() doesn't take an explicit folio argument; I'd like
>>>>>>>>>>> to move the aligned address calculation into folio_zero_user() and
>>>>>>>>>>> copy_user_large_folio() (will rename it to folio_copy_user()) in the
>>>>>>>>>>> following cleanup patches, or should that be done in the fix patches?
>>>>>>>>>>
>>>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a small
>>>>>>>>>> folio*? Because we like our code to be extra complicated to understand?
>>>>>>>>>> Or am I missing something important?
>>>>>>>>>
>>>>>>>>> folio_zero_user() was used for PMD-sized THP and HugeTLB before, and
>>>>>>>>> since anon mTHP was supported it is used for order-2 to PMD-order THP
>>>>>>>>> and HugeTLB, so it won't process a small folio, if I understand correctly.
>>>>>>>>
>>>>>>>> And unfortunately neither the documentation nor the function name
>>>>>>>> expresses that :(
>>>>>>>>
>>>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>>>
>>>>>>> Actually, could we drop process_huge_page() entirely? From my
>>>>>>> testcase[1], process_huge_page() is not better than clearing/copying pages
>>>>>>> from start to last, and sequential clearing/copying may be more
>>>>>>> beneficial to the hardware prefetching. Is there a way to let lkp
>>>>>>> run tests to check the performance? Since process_huge_page()
>>>>>>> was submitted by Ying, what's your opinion?
>>>>>
>>>>> I don't think that it's a good idea to revert the commit without
>>>>> studying and root causing the issues. I can work together with you on
>>>>> that. If we have solid and well explained data to prove
>>>>> process_huge_page() isn't beneficial, we can revert the commit.
>>>>
>>>>
>>>> Take 'fallocate 20G' as an example, before:
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>>
>>> IIUC, fallocate will zero pages, but will not touch them at all,
>>> right?
>>> If so, no cache benefit from clearing the referenced page last.
>>
>>
>> Yes, for this case, only clear page.
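To make the two orderings under discussion concrete, here is a minimal
userspace sketch (not the kernel implementation; clear_subpage() and
SUBPAGE_SIZE are hypothetical stand-ins for clear_user_highpage() and
PAGE_SIZE, and clear_target_last() only illustrates the "clear the hinted
subpage last so its cache lines stay warm" idea behind process_huge_page(),
not its exact ordering):

#include <string.h>

#define SUBPAGE_SIZE	4096UL

/* Hypothetical stand-in for clear_user_highpage(). */
void clear_subpage(char *base, unsigned long idx)
{
	memset(base + idx * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

/* Start-to-end clearing, the prefetcher-friendly pattern that
 * clear_gigantic_page() effectively performs. */
void clear_sequential(char *base, unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		clear_subpage(base, i);
}

/* Clear the subpage that the faulting access will touch first (the
 * address hint) last, so its cache lines are most likely still warm
 * afterwards, at the cost of a non-sequential store pattern. */
void clear_target_last(char *base, unsigned long nr_pages,
		       unsigned long target)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		if (i != target)
			clear_subpage(base, i);
	clear_subpage(base, target);
}

For fallocate on hugetlbfs nothing touches the pages afterwards, so only
the access pattern of the clearing loop itself matters, which is where the
sequential variant wins in the numbers below.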
>>>
>>>>           3,118.94 msec task-clock              #    0.999 CPUs utilized
>>>>                  30     context-switches        #    0.010 K/sec
>>>>                   1     cpu-migrations          #    0.000 K/sec
>>>>                 136     page-faults             #    0.044 K/sec
>>>>       8,092,075,873     cycles                  #    2.594 GHz                        (92.82%)
>>>>       1,624,587,663     instructions            #    0.20  insn per cycle             (92.83%)
>>>>         395,341,850     branches                #  126.755 M/sec                      (92.82%)
>>>>           3,872,302     branch-misses           #    0.98% of all branches            (92.83%)
>>>>       1,398,066,701     L1-dcache-loads         #  448.251 M/sec                      (92.82%)
>>>>          58,124,626     L1-dcache-load-misses   #    4.16% of all L1-dcache accesses  (92.82%)
>>>>           1,032,527     LLC-loads               #    0.331 M/sec                      (92.82%)
>>>>             498,684     LLC-load-misses         #   48.30% of all LL-cache accesses   (92.84%)
>>>>         473,689,004     L1-icache-loads         #  151.875 M/sec                      (92.82%)
>>>>             356,721     L1-icache-load-misses   #    0.08% of all L1-icache accesses  (92.85%)
>>>>       1,947,644,987     dTLB-loads              #  624.458 M/sec                      (92.95%)
>>>>              10,185     dTLB-load-misses        #    0.00% of all dTLB cache accesses (92.96%)
>>>>         474,622,896     iTLB-loads              #  152.174 M/sec                      (92.95%)
>>>>                  94     iTLB-load-misses        #    0.00% of all iTLB cache accesses (85.69%)
>>>>
>>>>         3.122844830 seconds time elapsed
>>>>
>>>>         0.000000000 seconds user
>>>>         3.107259000 seconds sys
>>>>
>>>> and after (clear from start to end):
>>>>
>>>> Performance counter stats for 'taskset -c 10 fallocate -l 20G
>>>> /mnt/hugetlbfs/test':
>>>>
>>>>           1,135.53 msec task-clock              #    0.999 CPUs utilized
>>>>                  10     context-switches        #    0.009 K/sec
>>>>                   1     cpu-migrations          #    0.001 K/sec
>>>>                 137     page-faults             #    0.121 K/sec
>>>>       2,946,673,587     cycles                  #    2.595 GHz                        (92.67%)
>>>>       1,620,704,205     instructions            #    0.55  insn per cycle             (92.61%)
>>>>         394,595,772     branches                #  347.499 M/sec                      (92.60%)
>>>>             130,756     branch-misses           #    0.03% of all branches            (92.84%)
>>>>       1,396,726,689     L1-dcache-loads         # 1230.022 M/sec                      (92.96%)
>>>>             338,344     L1-dcache-load-misses   #    0.02% of all L1-dcache accesses  (92.95%)
>>>>             111,737     LLC-loads               #    0.098 M/sec                      (92.96%)
>>>>              67,486     LLC-load-misses         #   60.40% of all LL-cache accesses   (92.96%)
>>>>         418,198,663     L1-icache-loads         #  368.285 M/sec                      (92.96%)
>>>>             173,764     L1-icache-load-misses   #    0.04% of all L1-icache accesses  (92.96%)
>>>>       2,203,364,632     dTLB-loads              # 1940.385 M/sec                      (92.96%)
>>>>              17,195     dTLB-load-misses        #    0.00% of all dTLB cache accesses (92.95%)
>>>>         418,198,365     iTLB-loads              #  368.285 M/sec                      (92.96%)
>>>>                  79     iTLB-load-misses        #    0.00% of all iTLB cache accesses (85.34%)
>>>>
>>>>         1.137015760 seconds time elapsed
>>>>
>>>>         0.000000000 seconds user
>>>>         1.131266000 seconds sys
>>>>
>>>> The IPC improved a lot, with fewer LLC-loads and more L1-dcache-loads, but
>>>> this depends on the implementation of the microarchitecture.
>>>
>>> Anyway, we need to avoid (or at least reduce) the pure memory
>>> clearing performance regression. Have you double-checked whether
>>> process_huge_page() is inlined? A perf-profile result can be used to
>>> check this too.
>>>
>>
>> Yes, I'm sure process_huge_page() is inlined.
>>
>>> When you say from start to end, do you mean using clear_gigantic_page()
>>> directly, or changing process_huge_page() to clear pages from start to end?
>>>
>>
>> Using clear_gigantic_page() and changing process_huge_page() to clear
>> pages from start to end are both good for performance with sequential
>> clearing, but no random-access test so far.
>>
>>>> 1) Will run some random-access tests to check the performance difference,
>>>> as David suggested.
>>>>
>>>> 2) Hope the LKP can run more tests since it is very useful (more test
>>>> sets and different machines).
>>>
>>> I'm starting to use LKP to test.
>
> https://lore.kernel.org/linux-mm/20200419155856.dtwxomdkyujljdfi@oneplus.com/
>
> Just remembered that we have discussed a similar issue for arm64 before.
> Can you take a look at it? There's more discussion and tests/results in
> the thread, I think that may be helpful.
>

Thanks for the tips, will check it.

> --
> Best Regards,
> Huang, Ying
>
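For the random-access case in 1) above, a rough userspace harness along
these lines could be used (hypothetical, not an existing LKP job; it reuses
the clear_sequential()/clear_target_last() helpers and SUBPAGE_SIZE from the
earlier sketch, and now_sec()/NR_SUBPAGES are likewise made up for
illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SUBPAGE_SIZE	4096UL
#define NR_SUBPAGES	512UL		/* one 2 MiB "huge page" */

/* Helpers from the earlier sketch. */
void clear_sequential(char *base, unsigned long nr_pages);
void clear_target_last(char *base, unsigned long nr_pages, unsigned long target);

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *buf;
	unsigned long i;
	volatile char sink = 0;
	double t0, t1, t2;

	if (posix_memalign((void **)&buf, 2UL << 20, NR_SUBPAGES * SUBPAGE_SIZE))
		return 1;

	t0 = now_sec();
	clear_sequential(buf, NR_SUBPAGES);	/* or clear_target_last(buf, NR_SUBPAGES, 0) */
	t1 = now_sec();

	/* Touch the subpages in pseudo-random order instead of a
	 * start-to-end scan, to mimic non-sequential access after the
	 * clearing step. */
	srand(0);
	for (i = 0; i < NR_SUBPAGES; i++)
		sink ^= buf[(rand() % NR_SUBPAGES) * SUBPAGE_SIZE];
	t2 = now_sec();

	printf("clear %.6fs, random touch %.6fs\n", t1 - t0, t2 - t1);
	(void)sink;
	free(buf);
	return 0;
}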