Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: "Huang, Ying"
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Muchun Song,
 linux-mm@kvack.org, Zi Yan
Date: Wed, 30 Oct 2024 13:05:33 +0800
Message-ID: <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
In-Reply-To: <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 2024/10/30 11:21, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/10/30 9:04, Huang, Ying wrote:
>>> David Hildenbrand writes:
>>>
>>>> On 29.10.24 14:04, Kefeng Wang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> That should all be cleaned up ... process_huge_page() likely
>>>>>>>>>>>> shouldn't
>>>>>>>>>>>
>>>>>>>>>>> Yes, let's fix the bug first,
>>>>>>>>>>>
>>>>>>>>>>>> be even consuming "nr_pages".
>>>>>>>>>>>
>>>>>>>>>>> Not sure about this part; it uses nr_pages as the end and
>>>>>>>>>>> calculates the 'base'.
>>>>>>>>>>
>>>>>>>>>> It should be using folio_nr_pages().
>>>>>>>>>
>>>>>>>>> But process_huge_page() has no explicit folio argument. I'd like to
>>>>>>>>> move the aligned-address calculation into folio_zero_user() and
>>>>>>>>> copy_user_large_folio() (which will be renamed to folio_copy_user())
>>>>>>>>> in the follow-up cleanup patches -- or should it be done in the fix
>>>>>>>>> patches?
>>>>>>>>
>>>>>>>> First, why does folio_zero_user() call process_huge_page() for *a
>>>>>>>> small folio*? Because we like our code to be extra complicated to
>>>>>>>> understand? Or am I missing something important?
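
For reference, the aligned-address fix being discussed boils down to
deriving the folio-aligned base address from the (possibly unaligned)
faulting address hint inside the clearing helper itself. A minimal
sketch with illustrative naming, not the exact patch:

    #include <linux/highmem.h>
    #include <linux/mm.h>

    /*
     * Sketch only: align the address hint down to the folio boundary,
     * then clear every subpage relative to that base, instead of
     * trusting the caller to pass an aligned address.
     */
    static void clear_gigantic_page_sketch(struct folio *folio,
                                           unsigned long addr_hint)
    {
            unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
            unsigned int i, nr_pages = folio_nr_pages(folio);

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    clear_user_highpage(folio_page(folio, i),
                                        addr + i * PAGE_SIZE);
            }
    }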
>>>>>>>
>>>>>>> folio_zero_user() was used for PMD-sized THP and HugeTLB before, and
>>>>>>> since anon mTHP support it is used for order-2 to PMD-order THP and
>>>>>>> HugeTLB, so it won't process a small folio, if I understand correctly.
>>>>>>
>>>>>> And unfortunately neither the documentation nor the function name
>>>>>> expresses that :(
>>>>>>
>>>>>> I'm happy to review any patches that improve the situation here :)
>>>>>>
>>>>> Actually, could we drop process_huge_page() entirely? From my
>>>>> testcase[1], process_huge_page() is no better than clearing/copying the
>>>>> page from start to end, and sequential clearing/copying may be more
>>>>> beneficial for hardware prefetching. And is there a way to have LKP run
>>>>> tests to check the performance? Since process_huge_page() was submitted
>>>>> by Ying, what's your opinion?
>>>
>>> I don't think that it's a good idea to revert the commit without
>>> studying and root-causing the issues. I can work together with you on
>>> that. If we have solid and well-explained data to prove that
>>> process_huge_page() isn't beneficial, we can revert the commit.
>>
>> Take 'fallocate 20G' as an example. Before:
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':
>
> IIUC, fallocate will zero pages, but will not touch them at all, right?
> If so, there is no cache benefit from clearing the referenced page last.

Yes, in this case we only clear the page.

>
>>          3,118.94 msec task-clock               #    0.999 CPUs utilized
>>                30      context-switches         #    0.010 K/sec
>>                 1      cpu-migrations           #    0.000 K/sec
>>               136      page-faults              #    0.044 K/sec
>>     8,092,075,873      cycles                   #    2.594 GHz                        (92.82%)
>>     1,624,587,663      instructions             #    0.20  insn per cycle             (92.83%)
>>       395,341,850      branches                 #  126.755 M/sec                      (92.82%)
>>         3,872,302      branch-misses            #    0.98% of all branches            (92.83%)
>>     1,398,066,701      L1-dcache-loads          #  448.251 M/sec                      (92.82%)
>>        58,124,626      L1-dcache-load-misses    #    4.16% of all L1-dcache accesses  (92.82%)
>>         1,032,527      LLC-loads                #    0.331 M/sec                      (92.82%)
>>           498,684      LLC-load-misses          #   48.30% of all LL-cache accesses   (92.84%)
>>       473,689,004      L1-icache-loads          #  151.875 M/sec                      (92.82%)
>>           356,721      L1-icache-load-misses    #    0.08% of all L1-icache accesses  (92.85%)
>>     1,947,644,987      dTLB-loads               #  624.458 M/sec                      (92.95%)
>>            10,185      dTLB-load-misses         #    0.00% of all dTLB cache accesses (92.96%)
>>       474,622,896      iTLB-loads               #  152.174 M/sec                      (92.95%)
>>                94      iTLB-load-misses         #    0.00% of all iTLB cache accesses (85.69%)
>>
>>        3.122844830 seconds time elapsed
>>
>>        0.000000000 seconds user
>>        3.107259000 seconds sys
>>
>> And after (clearing from start to end):
>>
>> Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':
>>
>>          1,135.53 msec task-clock               #    0.999 CPUs utilized
>>                10      context-switches         #    0.009 K/sec
>>                 1      cpu-migrations           #    0.001 K/sec
>>               137      page-faults              #    0.121 K/sec
>>     2,946,673,587      cycles                   #    2.595 GHz                        (92.67%)
>>     1,620,704,205      instructions             #    0.55  insn per cycle             (92.61%)
>>       394,595,772      branches                 #  347.499 M/sec                      (92.60%)
>>           130,756      branch-misses            #    0.03% of all branches            (92.84%)
>>     1,396,726,689      L1-dcache-loads          # 1230.022 M/sec                      (92.96%)
>>           338,344      L1-dcache-load-misses    #    0.02% of all L1-dcache accesses  (92.95%)
>>           111,737      LLC-loads                #    0.098 M/sec                      (92.96%)
>>            67,486      LLC-load-misses          #   60.40% of all LL-cache accesses   (92.96%)
>>       418,198,663      L1-icache-loads          #  368.285 M/sec                      (92.96%)
>>           173,764      L1-icache-load-misses    #    0.04% of all L1-icache accesses  (92.96%)
>>     2,203,364,632      dTLB-loads               # 1940.385 M/sec                      (92.96%)
>>            17,195      dTLB-load-misses         #    0.00% of all dTLB cache accesses (92.95%)
>>       418,198,365      iTLB-loads               #  368.285 M/sec                      (92.96%)
>>                79      iTLB-load-misses         #    0.00% of all iTLB cache accesses (85.34%)
>>
>>        1.137015760 seconds time elapsed
>>
>>        0.000000000 seconds user
>>        1.131266000 seconds sys
>>
>> The IPC improved a lot: fewer LLC-loads and more L1-dcache-loads, but
>> this depends on the implementation of the microarchitecture.
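
To make the comparison concrete, the two clearing orders are roughly
the following -- a simplified sketch, not the actual kernel code (the
real process_huge_page() is more elaborate and walks toward the target
subpage from both ends):

    /* a) Sequential: clear subpages 0..nr-1 in order, as
     *    clear_gigantic_page() does.
     */
    static void clear_in_order(struct folio *folio, unsigned long base)
    {
            unsigned int i, nr = folio_nr_pages(folio);

            for (i = 0; i < nr; i++) {
                    cond_resched();
                    clear_user_highpage(folio_page(folio, i),
                                        base + i * PAGE_SIZE);
            }
    }

    /* b) Target-last: clear every other subpage first, then the
     *    faulting subpage, so the data the user touches next is
     *    still cache-hot.
     */
    static void clear_target_last(struct folio *folio, unsigned long base,
                                  unsigned long addr_hint)
    {
            unsigned int nr = folio_nr_pages(folio);
            unsigned int target = (addr_hint - base) / PAGE_SIZE;
            unsigned int i;

            for (i = 0; i < nr; i++) {
                    if (i == target)
                            continue;
                    cond_resched();
                    clear_user_highpage(folio_page(folio, i),
                                        base + i * PAGE_SIZE);
            }
            clear_user_highpage(folio_page(folio, target),
                                base + target * PAGE_SIZE);
    }

For fallocate there is no subsequent user access to the cleared pages,
so ordering b) cannot help and only makes the access pattern less
friendly to hardware prefetching.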
>
> Anyway, we need to avoid (or at least reduce) the pure memory clearing
> performance regression. Have you double-checked whether
> process_huge_page() is inlined? A perf-profile result can be used to
> check this too.

Yes, I'm sure process_huge_page() is inlined.

> When you say from start to end, do you mean using clear_gigantic_page()
> directly, or changing process_huge_page() to clear the page from start
> to end?

Both using clear_gigantic_page() directly and changing
process_huge_page() to clear the page from start to end are good for
performance with sequential clearing, but I have not run a random-access
test so far.

>> 1) Will run some random-access tests to check the performance
>> difference, as David suggested.
>>
>> 2) Hope LKP can run more tests, since it is very useful (more test
>> sets and different machines).
>
> I'm starting to use LKP to test.

Great.
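
For the random-access test in 1), something along these lines might do:
a user-space sketch with made-up parameters (2MB PMD-sized THPs, first
touch at a random subpage so the fault path runs folio_zero_user()):

    #include <stdlib.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE  (2UL << 20)   /* assume 2MB PMD-sized THP */
    #define NR_HPAGES   4096UL        /* illustrative count */

    int main(void)
    {
            size_t len = NR_HPAGES * HPAGE_SIZE;
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            /* ask for THPs; the kernel may still fall back to 4K pages */
            madvise(buf, len, MADV_HUGEPAGE);

            for (size_t i = 0; i < NR_HPAGES; i++) {
                    /* first touch at a random 4K subpage of each huge page */
                    size_t sub = (size_t)rand() % (HPAGE_SIZE / 4096);
                    buf[i * HPAGE_SIZE + sub * 4096] = 1;
            }
            return 0;
    }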