From: "Huang, Ying" <ying.huang@intel.com>
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, "Muchun Song", Zi Yan
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
In-Reply-To: <848e4b40-f734-475f-9b1e-2f543e622a6c@huawei.com>
 (Kefeng Wang's message of "Fri, 1 Nov 2024 15:43:55 +0800")
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
 <87v7x88y3q.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <848e4b40-f734-475f-9b1e-2f543e622a6c@huawei.com>
Date: Fri, 01 Nov 2024 16:16:56 +0800
Message-ID: <87msij8j2f.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Kefeng Wang writes:

> On 2024/10/31 16:39, Huang, Ying wrote:
>> Kefeng Wang writes:
>>
>> [snip]
>>>
>>>>> 1) Will run some random-access tests to check the performance
>>>>> difference, as David suggested.
>>>>>
>>>>> 2) Hope the LKP can run more tests, since it is very useful (more
>>>>> test sets and different machines).
>>>> I'm starting to use LKP to test.
>>>
>>> Great.
>
> Sorry for the late reply,
>
>> I have run some tests with LKP.
>> Firstly, there's almost no measurable difference between clearing pages
>> from start to end or from end to start on an Intel server CPU. I guess
>> that there's some similar optimization for both directions.
>>
>> For the multiple-process (same as the logical CPU number)
>> vm-scalability/anon-w-seq test case, the benchmark score increases by
>> about 22.4%.
>
> So process_huge_page() is better than clear_gigantic_page() on Intel?

For the vm-scalability/anon-w-seq test case, it is, because the
performance of forward and backward clearing is almost the same, and the
user-space accesses get the cache-hot benefit.

> Could you test the following case on x86?
>
> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> mkdir -p /hugetlbfs/
> mount none /hugetlbfs/ -t hugetlbfs
> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test

It's not trivial for me to run this test, because 0day wraps test cases.
Do you know of an existing test case that provides this, for example in
vm-scalability?

>> For the multiple-process vm-scalability/anon-w-rand test case, there is
>> no measurable difference in the benchmark score. So, the optimization
>> mainly helps sequential workloads.
>>
>> In summary, on x86, process_huge_page() will not introduce any
>> regression, and it helps some workloads.
>>
>> However, on ARM64, it does introduce some regression for clearing pages
>> from end to start. That needs to be addressed. I guess that the
>> regression can be resolved by using more clearing from start to end
>> (but not all). For example, can you take a look at the patch below?
>> It uses a similar framework as before, but clears each small chunk
>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>> the regression is resolved.
>>
>> WARNING: the patch is only build tested.
>
> Base: baseline
> Change1: using clear_gigantic_page() for 2M PMD
> Change2: your patch with MPAGE_NRPAGES=16
> Change3: Case3 + fix[1]

What is case3?

> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>
> 1. For rand write,
>    case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference
>
> 2. For seq write,
>
> 1) case-anon-w-seq-mt:

Can you try case-anon-w-seq? That may be more stable.

> base:
> real	0m2.490s	0m2.254s	0m2.272s
> user	1m59.980s	2m23.431s	2m18.739s
> sys	1m3.675s	1m15.462s	1m15.030s
>
> Change1:
> real	0m2.234s	0m2.225s	0m2.159s
> user	2m56.105s	2m57.117s	3m0.489s
> sys	0m17.064s	0m17.564s	0m16.150s
>
> Change2:
> real	0m2.244s	0m2.384s	0m2.370s
> user	2m39.413s	2m41.990s	2m42.229s
> sys	0m19.826s	0m18.491s	0m18.053s

That looks strange. There is not much cache-hot benefit even though we
clear pages from end to beginning (with a larger chunk). However, the sys
time improves a lot. This shows that clearing pages in larger chunks helps
on ARM64.

> Change3:	// best performance
> real	0m2.155s	0m2.204s	0m2.194s
> user	3m2.640s	2m55.837s	3m0.902s
> sys	0m17.346s	0m17.630s	0m18.197s
>
> Change4:
> real	0m2.287s	0m2.377s	0m2.284s
> user	2m37.030s	2m52.868s	3m17.593s
> sys	0m15.445s	0m34.430s	0m45.224s

Change4 is essentially the same as Change1, so I don't understand why
their results differ. Is there large variation from run to run? Can you
further optimize the prototype patch below? I think it has the potential
to fix your issue.

> 2) case-anon-w-seq-hugetlb
> Very similar to 1); Change4 is slightly better than Change3, but not by
> much.
>
> 3) hugetlbfs fallocate 20G
> Change1 (0m1.136s) = Change3 (0m1.136s) = Change4 (0m1.135s) <
> Change2 (0m1.275s) < base (0m3.016s)
>
> In summary, Change3 is best and Change1 is good on my arm64 machine.
>
>> Best Regards,
>> Huang, Ying
>>
>> ----------------------------------8<----------------------------------
>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>> Subject: [PATCH] mpage clear
>>
>> ---
>>  mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 67 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3ccee51adfbb..1fdc548c4275 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>>
>> +#define MPAGE_NRPAGES	(1<<4)
>> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
>> +static inline int clear_huge_page(
>> +	unsigned long addr_hint, unsigned int nr_pages,
>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>> +	void *arg)
>> +{
>> +	int i, n, base, l, ret;
>> +	unsigned long addr = addr_hint &
>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>> +
>> +	/* Process target subpage last to keep its cache lines hot */
>> +	might_sleep();
>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>> +	if (2 * n <= nr_mpages) {
>> +		/* If target subpage in first half of huge page */
>> +		base = 0;
>> +		l = n;
>> +		/* Process subpages at the end of huge page */
>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	} else {
>> +		/* If target subpage in second half of huge page */
>> +		base = nr_mpages - 2 * (nr_mpages - n);
>> +		l = nr_mpages - n;
>> +		/* Process subpages at the begin of huge page */
>> +		for (i = 0; i < base; i++) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +	/*
>> +	 * Process remaining subpages in left-right-left-right pattern
>> +	 * towards the target subpage
>> +	 */
>> +	for (i = 0; i < l; i++) {
>> +		int left_idx = base + i;
>> +		int right_idx = base + 2 * l - 1 - i;
>> +
>> +		cond_resched();
>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>> +				      left_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +		cond_resched();
>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>> +				      right_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	return 0;
>> +}
>> +
>>  static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>  				unsigned int nr_pages)
>>  {
>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>  static int clear_subpage(unsigned long addr, int idx, void *arg)
>>  {
>>  	struct folio *folio = arg;
>> +	int i;
>>
>> -	clear_user_highpage(folio_page(folio, idx), addr);
>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>  	return 0;
>>  }
>>
>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>  {
>>  	unsigned int nr_pages = folio_nr_pages(folio);
>>
>> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>  		clear_gigantic_page(folio, addr_hint, nr_pages);
>>  	else
>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>  }
>>
>>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>
>
> [1] fix patch
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b22d4b83295b..aee99ede0c4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>  		base = 0;
>  		l = n;
>  		/* Process subpages at the end of huge page */
> -		for (i = nr_mpages - 1; i >= 2 * n; i--) {
> +		for (i = 2 * n; i < nr_mpages; i++) {
>  			cond_resched();
>  			ret = process_subpage(addr + i * MPAGE_SIZE,
>  					      i * MPAGE_NRPAGES, arg);