From: "Huang, Ying" <ying.huang@intel.com>
To: Kefeng Wang
Cc: Barry Song <21cnbao@gmail.com>, linux-mm@kvack.org
Subject: Re: [PATCH] mm: shmem: convert to use folio_zero_range()
In-Reply-To: <86f9f4e8-9c09-4333-ae4f-f51a71c3aca7@huawei.com> (Kefeng Wang's message of "Fri, 25 Oct 2024 21:35:11 +0800")
References: <06d99b89-17ad-447e-a8f1-8e220b5688ac@huawei.com>
 <20241022225603.10491-1-21cnbao@gmail.com>
 <31afe958-91eb-484a-90b9-91114991a9a2@huawei.com>
 <87iktg3n2i.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <5ccb295c-6ec9-4d00-8236-e3ba19221f40@huawei.com>
 <875xpg39q5.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <1a37d4a9-eef8-4fe0-aeb0-fa95c33b305a@huawei.com>
 <871q042x1h.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <86f9f4e8-9c09-4333-ae4f-f51a71c3aca7@huawei.com>
Date: Mon, 28 Oct 2024 10:39:08 +0800
Message-ID: <87ttcx0x4j.fsf@yhuang6-desk2.ccr.corp.intel.com>
Kefeng Wang writes:

> On 2024/10/25 20:21, Huang, Ying wrote:
>> Kefeng Wang writes:
>>
>>> On 2024/10/25 15:47, Huang, Ying wrote:
>>>> Kefeng Wang writes:
>>>>
>>>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>>>> Hi, Kefeng,
>>>>>>
>>>>>> Kefeng Wang writes:
>>>>>>
>>>>>>> +CC Huang Ying,
>>>>>>>
>>>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang wrote:
>>>>>>>>>
>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Right, I missed this; clear_page might be better than memset. I changed
>>>>>>>>>>>>>>>>>>>>>>> this one when looking at shmem_writepage(), which was already converted
>>>>>>>>>>>>>>>>>>>>>>> from clear_highpage() to folio_zero_range(). I also grepped for
>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(); there are some other users of it:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:  folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:  folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:  folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:                    folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:            folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:                  folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:                    folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> No performance test before, but I wrote a testcase:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>>>> 2) measure the time (us) to clear all N folios with
>>>>>>>>>>>>>>>>>>>>>>>    clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>> 3) release the N folios
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> the results (run 5 times) on my machine are shown below,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> N=1
>>>>>>>>>>>>>>>>>>>>>>>        clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>   1        69                74               177
>>>>>>>>>>>>>>>>>>>>>>>   2        57                62               168
>>>>>>>>>>>>>>>>>>>>>>>   3        54                58               234
>>>>>>>>>>>>>>>>>>>>>>>   4        54                58               157
>>>>>>>>>>>>>>>>>>>>>>>   5        56                62               148
>>>>>>>>>>>>>>>>>>>>>>> avg        58              62.8             176.8
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>>>        clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>   1     11015             11309             32833
>>>>>>>>>>>>>>>>>>>>>>>   2     10385             11110             49751
>>>>>>>>>>>>>>>>>>>>>>>   3     10369             11056             33095
>>>>>>>>>>>>>>>>>>>>>>>   4     10332             11017             33106
>>>>>>>>>>>>>>>>>>>>>>>   5     10483             11000             49032
>>>>>>>>>>>>>>>>>>>>>>> avg   10516.8           11098.4           39563.4
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>>>        clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>   1     55560             60055            156876
>>>>>>>>>>>>>>>>>>>>>>>   2     55485             60024            157132
>>>>>>>>>>>>>>>>>>>>>>>   3     55474             60129            156658
>>>>>>>>>>>>>>>>>>>>>>>   4     55555             59867            157259
>>>>>>>>>>>>>>>>>>>>>>>   5     55528             59932            157108
>>>>>>>>>>>>>>>>>>>>>>> avg   55520.4           60001.4          157006.6
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user has many cond_resched() calls, so its time fluctuates
>>>>>>>>>>>>>>>>>>>>>>> a lot; clear_highpage is better than folio_zero_range, as you said.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all
>>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(folio, 0, folio_size(folio)) callers
>>>>>>>>>>>>>>>>>>>>>>> to clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, from the above test, using clear_highpage/flush_dcache_folio is
>>>>>>>>>>>>>>>>>> better than using folio_zero_range() for zeroing a folio (especially a
>>>>>>>>>>>>>>>>>> large folio), so I'd like to add a new helper, maybe named folio_zero()
>>>>>>>>>>>>>>>>>> since it zeros the whole folio.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since it has many cond_resched() calls, its performance is the worst...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>>>>>> For a non-preemptible kernel, it helps keep the folio clearing
>>>>>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>>>>>>>>>>>>>              * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>>>>              */
>>>>>>>>>>>>>>             if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>>>                 flush_dcache_folio(folio);
>>>>>>>>>>>>>>                 folio_mark_uptodate(folio);
>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>>>
>>>>>>>>>>>>> This path is for SGP_FALLOC, where vmf = NULL, so we could use
>>>>>>>>>>>>> folio_zero_user(folio, 0). I think the performance is worse; I will
>>>>>>>>>>>>> retest once I can access the hardware.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> when using folio_zero_user(), the performance is very bad with the
>>>>>>>>>>> above fallocate test (mount huge=always),
>>>>>>>>>>>
>>>>>>>>>>>         folio_zero_range   clear_highpage   folio_zero_user
>>>>>>>>>>> real    0m1.214s           0m1.111s         0m3.159s
>>>>>>>>>>> user    0m0.000s           0m0.000s         0m0.000s
>>>>>>>>>>> sys     0m1.210s           0m1.109s         0m3.152s
>>>>>>>>>>>
>>>>>>>>>>> I tried with addr_hint = 0 and index << PAGE_SHIFT, no obvious difference.
>>>>>>>>>>
>>>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>>>> preemption_debug enabled?
>>>>>>>>>
>>>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>>>> this explains why the performance is much worse.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>>>
>>>>>>>>> Yes, maybe. folio_zero_user() (formerly clear_huge_page) is from
>>>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so the original
>>>>>>>>> clear_huge_page was used by HugeTLB, where clearing a PUD-sized page
>>>>>>>>> may take a long time. But for PMD-sized or other large folios,
>>>>>>>>> cond_resched() is not necessary, since we already have some
>>>>>>>>> folio_zero_range() callers clearing large folios and no issue
>>>>>>>>> was reported.
>>>>>>>> probably worth an optimization.
calling cond_resched() for each page
>>>>>>>> seems too aggressive and useless.
>>>>>>>
>>>>>>> After some tests, I think cond_resched() is not the root cause:
>>>>>>> there is no performance gain with batched cond_resched(), and even
>>>>>>> if I remove cond_resched() from process_huge_page entirely, there is
>>>>>>> no improvement.
>>>>>>>
>>>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>>>> folio_zero_user() (patched), there is a big improvement with the
>>>>>>> above fallocate test on tmpfs (mount huge=always). I also ran some
>>>>>>> other testcases:
>>>>>>>
>>>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>>>
>>>>>>> base:
>>>>>>> real 0m2.490s  0m2.254s  0m2.272s
>>>>>>> user 1m59.980s 2m23.431s 2m18.739s
>>>>>>> sys  1m3.675s  1m15.462s 1m15.030s
>>>>>>>
>>>>>>> patched:
>>>>>>> real 0m2.234s  0m2.225s  0m2.159s
>>>>>>> user 2m56.105s 2m57.117s 3m0.489s
>>>>>>> sys  0m17.064s 0m17.564s 0m16.150s
>>>>>>>
>>>>>>> The patched kernel wins on sys time and loses on user time, but real
>>>>>>> time is almost the same, maybe a little better than base.
>>>>>> We can see a user time difference. That means the original cache-hot
>>>>>> behavior still applies on your system.
>>>>>> However, it appears that the performance of clearing a page from end
>>>>>> to beginning is really bad on your system.
>>>>>> So, I suggest revising the current implementation to use sequential
>>>>>> clearing as much as possible.
>>>>>>
>>>>>
>>>>> I tested case-anon-cow-seq-hugetlb for copy_user_large_folio()
>>>>>
>>>>> base:
>>>>> real 0m6.259s  0m6.197s  0m6.316s
>>>>> user 1m31.176s 1m27.195s 1m29.594s
>>>>> sys  7m44.199s 7m51.490s 8m21.149s
>>>>>
>>>>> patched (use copy_user_gigantic_page for 2M hugetlb too):
>>>>> real 0m3.182s  0m3.002s  0m2.963s
>>>>> user 1m19.456s 1m3.107s  1m6.447s
>>>>> sys  2m59.222s 3m10.899s 3m1.027s
>>>>>
>>>>> and sequential copy is better than the current implementation,
>>>>> so I will use sequential clear and copy.
>>>> Sorry, it appears that you misunderstood my suggestion.
>>>> I suggest
>>>> revising process_huge_page() to use more sequential memory clearing
>>>> and copying to improve its performance on your platform.
>>>> --
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>>>>> 2) case-anon-w-seq-hugetlb: (2M PMD HugeTLB)
>>>>>>>
>>>>>>> base:
>>>>>>> real 0m5.175s  0m5.117s  0m4.856s
>>>>>>> user 5m15.943s 5m7.567s  4m29.273s
>>>>>>> sys  2m38.503s 2m21.949s 2m21.252s
>>>>>>>
>>>>>>> patched:
>>>>>>> real 0m4.966s  0m4.841s  0m4.561s
>>>>>>> user 6m30.123s 6m9.516s  5m49.733s
>>>>>>> sys  0m58.503s 0m47.847s 0m46.785s
>>>>>>>
>>>>>>> This case is similar to case 1.
>>>>>>>
>>>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>>>
>>>>>>> base:
>>>>>>> real 0m3.016s  0m3.019s  0m3.018s
>>>>>>> user 0m0.000s  0m0.000s  0m0.000s
>>>>>>> sys  0m3.009s  0m3.012s  0m3.010s
>>>>>>>
>>>>>>> patched:
>>>>>>> real 0m1.136s  0m1.136s  0m1.136s
>>>>>>> user 0m0.000s  0m0.000s  0m0.004s
>>>>>>> sys  0m1.133s  0m1.133s  0m1.129s
>>>>>>>
>>>>>>> There is a big win on the patched kernel, and it is similar to the
>>>>>>> above tmpfs test, so maybe we could revert commit c79b57e462b5
>>>>>>> ("mm: hugetlb: clear target sub-page last when clearing huge page").
>>>
>>> I tried the following changes,
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 66cf855dee3f..e5cc75adfa10 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>>>                 base = 0;
>>>                 l = n;
>>>                 /* Process subpages at the end of huge page */
>>> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
>>> +               for (i = 2 * n; i < nr_pages; i++) {
>>>                         cond_resched();
>>>                         ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>>                         if (ret)
>>>
>>> Since n = 0, the copying now goes from start to end, but there is no
>>> improvement for case-anon-cow-seq-hugetlb,
>>>
>>> and if I use copy_user_gigantic_page, the time is reduced from 6s to 3s:
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index fe21bd3beff5..2c6532d21d84 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst, struct folio *src,
>>>                 .vma = vma,
>>>         };
>>>
>>> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>> -               return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>>> -
>>> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
>>> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>>> }
>> It appears that we have a code generation issue here. Can you check it?
>> Is the code inlined in the same way?
>>
>
> No difference, and I checked the asm: both process_huge_page and
> copy_user_gigantic_page are inlined. it is strange...

It's not inlined in my configuration, and __always_inline below changes
that for me. If it's already inlined and the code is actually almost the
same, why is there a difference? Is it possible for you to do some
profiling or further analysis?

>> Maybe we can start with
>> modified mm/memory.c
>> @@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
>>   * operation. The target subpage will be processed last to keep its
>>   * cache lines hot.
>>   */
>> -static inline int process_huge_page(
>> +static __always_inline int process_huge_page(
>>         unsigned long addr_hint, unsigned int nr_pages,
>>         int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>         void *arg)

--
Best Regards,
Huang, Ying