From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Barry Song <21cnbao@gmail.com>,  <akpm@linux-foundation.org>,
  <baolin.wang@linux.alibaba.com>,  <david@redhat.com>,
  <hughd@google.com>,  <linux-mm@kvack.org>,  <willy@infradead.org>
Subject: Re: [PATCH] mm: shmem: convert to use folio_zero_range()
In-Reply-To: <1a37d4a9-eef8-4fe0-aeb0-fa95c33b305a@huawei.com> (Kefeng Wang's
	message of "Fri, 25 Oct 2024 18:21:44 +0800")
References: <06d99b89-17ad-447e-a8f1-8e220b5688ac@huawei.com>
	<20241022225603.10491-1-21cnbao@gmail.com>
	<31afe958-91eb-484a-90b9-91114991a9a2@huawei.com>
	<87iktg3n2i.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<5ccb295c-6ec9-4d00-8236-e3ba19221f40@huawei.com>
	<875xpg39q5.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<1a37d4a9-eef8-4fe0-aeb0-fa95c33b305a@huawei.com>
Date: Fri, 25 Oct 2024 20:21:14 +0800
Message-ID: <871q042x1h.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Kefeng Wang <wangkefeng.wang@huawei.com> writes:

> On 2024/10/25 15:47, Huang, Ying wrote:
>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>
>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>> Hi, Kefeng,
>>>> Kefeng Wang <wangkefeng.wang@huawei.com> writes:
>>>>
>>>>> +CC Huang Ying,
>>>>>
>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>>
>>>>> ...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to clean up code.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Right, I missed this; clear_page might be better than memset. I changed
>>>>>>>>>>>>>>>>>>>>> this one after looking at shmem_writepage(), which was already converted
>>>>>>>>>>>>>>>>>>>>> from clear_highpage() to folio_zero_range(). I also grepped for
>>>>>>>>>>>>>>>>>>>>> folio_zero_range(); there are some other callers:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:  folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:  folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:  folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:                    folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:            folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:                  folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:                    folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I had not run a performance test before, but I wrote a testcase:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>> 2) measure the time (us) taken to clear all N folios with
>>>>>>>>>>>>>>>>>>>>>    clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>>> 3) release the N folios
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The results (5 runs) on my machine are shown below:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>>>>>         clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>>>>>>    1    69              74                177
>>>>>>>>>>>>>>>>>>>>>    2    57              62                168
>>>>>>>>>>>>>>>>>>>>>    3    54              58                234
>>>>>>>>>>>>>>>>>>>>>    4    54              58                157
>>>>>>>>>>>>>>>>>>>>>    5    56              62                148
>>>>>>>>>>>>>>>>>>>>> avg     58              62.8              176.8
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>         clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>>>>>>    1    11015           11309             32833
>>>>>>>>>>>>>>>>>>>>>    2    10385           11110             49751
>>>>>>>>>>>>>>>>>>>>>    3    10369           11056             33095
>>>>>>>>>>>>>>>>>>>>>    4    10332           11017             33106
>>>>>>>>>>>>>>>>>>>>>    5    10483           11000             49032
>>>>>>>>>>>>>>>>>>>>> avg     10516.8         11098.4           39563.4
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>         clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>>>>>>    1    55560           60055             156876
>>>>>>>>>>>>>>>>>>>>>    2    55485           60024             157132
>>>>>>>>>>>>>>>>>>>>>    3    55474           60129             156658
>>>>>>>>>>>>>>>>>>>>>    4    55555           59867             157259
>>>>>>>>>>>>>>>>>>>>>    5    55528           59932             157108
>>>>>>>>>>>>>>>>>>>>> avg     55520.4         60001.4           157006.6
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> folio_zero_user() calls cond_resched() many times, so its time fluctuates
>>>>>>>>>>>>>>>>>>>>> a lot; clear_highpage() is better than folio_zero_range(), as you said.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper and convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>>>>>> folio_size(folio)) callers to use clear_highpage() + flush_dcache_folio()?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, from the above test, using clear_highpage()/flush_dcache_folio() is
>>>>>>>>>>>>>>>> better than using folio_zero_range() to zero a folio (especially a large
>>>>>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe named folio_zero(),
>>>>>>>>>>>>>>>> since it zeroes the whole folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>>>>>> it is not good enough?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since it calls cond_resched() many times, its performance is the worst...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>>>>>
>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>>>>>>>>>>>           * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>>           */
>>>>>>>>>>>>
>>>>>>>>>>>>          if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>>> -                long i, n = folio_nr_pages(folio);
>>>>>>>>>>>> -
>>>>>>>>>>>> -                for (i = 0; i < n; i++)
>>>>>>>>>>>> -                        clear_highpage(folio_page(folio, i));
>>>>>>>>>>>> +                folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>                  flush_dcache_folio(folio);
>>>>>>>>>>>>                  folio_mark_uptodate(folio);
>>>>>>>>>>>>          }
>>>>>>>>>>>>
>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>
>>>>>>>>>>> This path is for SGP_FALLOC, where vmf = NULL; we could use
>>>>>>>>>>> folio_zero_user(folio, 0). I think the performance will be worse; I will
>>>>>>>>>>> retest once I can access the hardware.
>>>>>>>>>>
>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> When using folio_zero_user(), the performance is very bad with the above
>>>>>>>>> fallocate test (mount huge=always):
>>>>>>>>>
>>>>>>>>>         folio_zero_range   clear_highpage   folio_zero_user
>>>>>>>>> real    0m1.214s           0m1.111s         0m3.159s
>>>>>>>>> user    0m0.000s           0m0.000s         0m0.000s
>>>>>>>>> sys     0m1.210s           0m1.109s         0m3.152s
>>>>>>>>>
>>>>>>>>> I tried addr_hint = 0 and index << PAGE_SHIFT; no obvious difference.
>>>>>>>>
>>>>>>>> Interesting. Does your kernel have preemption disabled or
>>>>>>>> preemption_debug enabled?
>>>>>>>
>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>> this explains why the performance is much worse.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> If not, it makes me wonder whether folio_zero_user() in
>>>>>>>> alloc_anon_folio() is actually improving performance as expected,
>>>>>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>
>>>>>>> Yes, maybe. folio_zero_user() (formerly clear_huge_page()) comes from
>>>>>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so the original
>>>>>>> clear_huge_page() was used for HugeTLB, where clearing a PUD-sized page
>>>>>>> may take a long time. But for PMD-sized or other large folios,
>>>>>>> cond_resched() is not necessary, since we already use folio_zero_range()
>>>>>>> to clear large folios and no issue has been reported.
>>>>>> probably worth an optimization. calling cond_resched() for each page
>>>>>> seems too aggressive and useless.
>>>>>
>>>>> After some testing, I think cond_resched() is not the root cause:
>>>>> no performance is gained with batched cond_resched(), and even if I remove
>>>>> cond_resched() from process_huge_page(), there is no improvement.
>>>>>
>>>>> But when I unconditionally use clear_gigantic_page() in
>>>>> folio_zero_user() (patched), there is a big improvement with the above
>>>>> fallocate on tmpfs (mount huge=always). I also ran some other testcases:
>>>>>
>>>>>
>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>
>>>>> base:
>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>>>
>>>>> patched:
>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>
>>>>> The patched kernel wins on sys time and loses on user time, but real time
>>>>> is almost the same, maybe a little better than base.
>>>> We can see a user time difference.  That means the original cache-hot
>>>> behavior still applies on your system.
>>>> However, it appears that the performance of clearing pages from end to
>>>> beginning is really bad on your system.
>>>> So, I suggest revising the current implementation to use sequential
>>>> clearing as much as possible.
>>>>
>>>
>>> I tested case-anon-cow-seq-hugetlb for copy_user_large_folio():
>>>
>>> base:
>>> real    0m6.259s    0m6.197s    0m6.316s
>>> user    1m31.176s   1m27.195s   1m29.594s
>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>
>>> patched(use copy_user_gigantic_page for 2M hugetlb too)
>>> real    0m3.182s    0m3.002s    0m2.963s
>>> user    1m19.456s   1m3.107s    1m6.447s
>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>
>>> and sequential copy is better than the current implementation,
>>> so I will use sequential clear and copy.
>> Sorry, it appears that you misunderstood my suggestion.  I suggest
>> revising process_huge_page() to use more sequential memory clearing and
>> copying to improve its performance on your platform.
>> --
>> Best Regards,
>> Huang, Ying
>>
>>>>> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>>>>>
>>>>> base:
>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>
>>>>> patched:
>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>
>>>>>
>>>>> This case is similar to the case1.
>>>>>
>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>
>>>>> base:
>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>
>>>>> patched:
>>>>>
>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>
>>>>>
>>>>> There is a big win on the patched kernel, and it is similar to the above
>>>>> tmpfs test, so maybe we could revert commit c79b57e462b5 ("mm: hugetlb:
>>>>> clear target sub-page last when clearing huge page").
>
> I tried the following changes,
> diff --git a/mm/memory.c b/mm/memory.c
> index 66cf855dee3f..e5cc75adfa10 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>                 base = 0;
>                 l = n;
>                 /* Process subpages at the end of huge page */
> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
> +               for (i = 2 * n; i < nr_pages; i++) {
>                         cond_resched();
>                         ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>                         if (ret)
>
> Since n = 0, the copying now goes from start to end, but there is no
> improvement for case-anon-cow-seq-hugetlb.
>
> If I use copy_user_gigantic_page() instead, the time is reduced from 6s to 3s:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fe21bd3beff5..2c6532d21d84 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst, struct folio *src,
>                 .vma = vma,
>         };
>
> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
> -               return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
> -
> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>  }
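
As a sanity check on the ordering claim above, the order in which
process_huge_page() visits subpages can be modelled in user space.  This
is only a sketch: visit_order() and its forward_tail flag are
hypothetical names, the loop structure follows my reading of
process_huge_page() in mm/memory.c, and cond_resched() and the
per-subpage callback are left out.

```c
#include <assert.h>

/*
 * Hypothetical user-space model, not kernel code.
 * Writes the subpage indices into out[] in visiting order and returns
 * how many were written.  n is the index of the target subpage;
 * forward_tail selects the patched (ascending) first loop.
 */
static int visit_order(int nr_pages, int n, int forward_tail, int *out)
{
	int i, base, l, cnt = 0;

	assert(nr_pages > 0 && n >= 0 && n < nr_pages);

	if (2 * n <= nr_pages) {
		/* Target in first half: handle the tail [2n, nr_pages) first. */
		base = 0;
		l = n;
		if (forward_tail) {
			for (i = 2 * n; i < nr_pages; i++)
				out[cnt++] = i;
		} else {
			for (i = nr_pages - 1; i >= 2 * n; i--)
				out[cnt++] = i;
		}
	} else {
		/* Target in second half: handle the head [0, base) first. */
		base = nr_pages - 2 * (nr_pages - n);
		l = nr_pages - n;
		for (i = 0; i < base; i++)
			out[cnt++] = i;
	}

	/* Converge on the target from both sides: left, right, left, ... */
	for (i = 0; i < l; i++) {
		out[cnt++] = base + i;
		out[cnt++] = base + 2 * l - 1 - i;
	}
	return cnt;
}
```

With nr_pages = 8 and n = 0, forward_tail = 1 gives 0 1 2 3 4 5 6 7 and
forward_tail = 0 gives 7 6 5 4 3 2 1 0.  With n = 3 the unpatched order
is 7 6 0 5 1 4 2 3, so the target subpage is still touched last.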

It appears that we have a code generation issue here.  Can you check
whether the code is inlined in the same way in both cases?

Maybe we can start with

modified   mm/memory.c
@@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
  * operation.  The target subpage will be processed last to keep its
  * cache lines hot.
  */
-static inline int process_huge_page(
+static __always_inline int process_huge_page(
 	unsigned long addr_hint, unsigned int nr_pages,
 	int (*process_subpage)(unsigned long addr, int idx, void *arg),
 	void *arg)
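
Why inlining could matter: process_huge_page() receives the per-subpage
operation as a function pointer.  If the compiler inlines it into its
caller, the callee is a compile-time constant and the call can become
direct (or be inlined as well); if it does not, every subpage may go
through an indirect call.  A user-space sketch of the pattern, with
hypothetical names (walk_subpages(), clear_one(), clear_all(),
SUBPAGE_SIZE, force_inline) and the GCC/Clang always_inline attribute
standing in for the kernel's __always_inline:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE 4096	/* assumed subpage size for this model */

/* Stand-in for the kernel's __always_inline (GCC/Clang attribute). */
#define force_inline inline __attribute__((__always_inline__))

typedef int (*subpage_fn)(unsigned char *subpage, int idx, void *arg);

/* Hypothetical per-subpage operation: zero one subpage and count it. */
static int clear_one(unsigned char *subpage, int idx, void *arg)
{
	int *cleared = arg;

	(void)idx;
	memset(subpage, 0, SUBPAGE_SIZE);
	(*cleared)++;
	return 0;
}

/*
 * Generic walker taking the operation as a function pointer.  Because it
 * is forced inline, each caller gets its own copy in which fn is a known
 * constant, so the compiler is free to call it directly or inline it.
 */
static force_inline int walk_subpages(unsigned char *base, int nr,
				      subpage_fn fn, void *arg)
{
	int i, ret;

	for (i = 0; i < nr; i++) {
		ret = fn(base + (size_t)i * SUBPAGE_SIZE, i, arg);
		if (ret)
			return ret;
	}
	return 0;
}

/* Caller: returns the number of subpages cleared, or a nonzero error. */
static int clear_all(unsigned char *base, int nr)
{
	int cleared = 0;
	int ret;

	assert(base != NULL && nr >= 0);
	ret = walk_subpages(base, nr, clear_one, &cleared);
	return ret ? ret : cleared;
}
```

Comparing the disassembly of the two kernel builds (for instance with
objdump -d) should show whether the per-subpage call is direct or
indirect in each.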

--
Best Regards,
Huang, Ying