From: "Huang, Ying"
To: Kefeng Wang
Cc: Barry Song <21cnbao@gmail.com>
Subject: Re: [PATCH] mm: shmem: convert to use folio_zero_range()
In-Reply-To: <31afe958-91eb-484a-90b9-91114991a9a2@huawei.com> (Kefeng Wang's message of "Thu, 24 Oct 2024 18:10:20 +0800")
References: <06d99b89-17ad-447e-a8f1-8e220b5688ac@huawei.com> <20241022225603.10491-1-21cnbao@gmail.com> <31afe958-91eb-484a-90b9-91114991a9a2@huawei.com>
Date: Fri, 25 Oct 2024 10:59:01 +0800
Message-ID: <87iktg3n2i.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi, Kefeng,

Kefeng Wang writes:

> +CC Huang Ying,
>
> On 2024/10/23 6:56, Barry Song wrote:
>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang wrote:
>>>
>>>
> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Right, I missing this, clear_page might be better than memset, I change
>>>>>>>>>>>>>>>>> this one when look at the shmem_writepage(), which already convert to
>>>>>>>>>>>>>>>>> use folio_zero_range() from clear_highpage(), also I grep
>>>>>>>>>>>>>>>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:    folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:    folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:    folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>> fs/libfs.c:                      folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:              folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>> mm/page_io.c:                    folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>> mm/shmem.c:                      folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No performance test before, but I write a testcase,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>> 2) then calculate the diff(us) when clear all N folios
>>>>>>>>>>>>>>>>>      clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>> 3) release N folios
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the result(run 5 times) shown below on my machine,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> N=1,
>>>>>>>>>>>>>>>>>       clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>> 1                 69                74              177
>>>>>>>>>>>>>>>>> 2                 57                62              168
>>>>>>>>>>>>>>>>> 3                 54                58              234
>>>>>>>>>>>>>>>>> 4                 54                58              157
>>>>>>>>>>>>>>>>> 5                 56                62              148
>>>>>>>>>>>>>>>>> avg               58              62.8            176.8
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>       clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>> 1              11015             11309            32833
>>>>>>>>>>>>>>>>> 2              10385             11110            49751
>>>>>>>>>>>>>>>>> 3              10369             11056            33095
>>>>>>>>>>>>>>>>> 4              10332             11017            33106
>>>>>>>>>>>>>>>>> 5              10483             11000            49032
>>>>>>>>>>>>>>>>> avg          10516.8           11098.4          39563.4
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>       clear_highpage  folio_zero_range  folio_zero_user
>>>>>>>>>>>>>>>>> 1              55560             60055           156876
>>>>>>>>>>>>>>>>> 2              55485             60024           157132
>>>>>>>>>>>>>>>>> 3              55474             60129           156658
>>>>>>>>>>>>>>>>> 4              55555             59867           157259
>>>>>>>>>>>>>>>>> 5              55528             59932           157108
>>>>>>>>>>>>>>>>> avg          55520.4           60001.4         157006.6
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>>>>>>>>>>>>>>>> clear_highpage is better folio_zero_range as you said.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>>> folio_size(folio))
>>>>>>>>>>>>>>>>> to use clear_highpage + flush_dcache_folio?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>>>>> hi Kefeng,
>>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, from above test, using clear_highpage/flush_dcache_folio is better
>>>>>>>>>>>> than using folio_zero_range() for folio zero(especially for large
>>>>>>>>>>>> folio), so I'd like to add a new helper, maybe name it folio_zero()
>>>>>>>>>>>> since it zero the whole folio.
>>>>>>>>>>>
>>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>>> it is not good enough?
>>>>>>>>>>
>>>>>>>>>> Since it is with many cond_resched(), the performance is worst...
>>>>>>>>>
>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>>
>>>>>>>> --- a/mm/shmem.c
>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>
>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode
>>>>>>>> *inode, pgoff_t index,
>>>>>>>>          * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>          */
>>>>>>>>
>>>>>>>>         if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>> -
>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>                 flush_dcache_folio(folio);
>>>>>>>>                 folio_mark_uptodate(folio);
>>>>>>>>         }
>>>>>>>>
>>>>>>>> Do we perform better or worse with the following?
>>>>>>>
>>>>>>> Here is for SGP_FALLOC, vmf = NULL, we could use folio_zero_user(folio,
>>>>>>> 0), I think the performance is worse, will retest once I can access
>>>>>>> hardware.
>>>>>>
>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using
>>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>>>
>>>>>
>>>>> when use folio_zero_user(), the performance is vary bad with above
>>>>> fallocate test(mount huge=always),
>>>>>
>>>>>         folio_zero_range   clear_highpage   folio_zero_user
>>>>> real    0m1.214s           0m1.111s         0m3.159s
>>>>> user    0m0.000s           0m0.000s         0m0.000s
>>>>> sys     0m1.210s           0m1.109s         0m3.152s
>>>>>
>>>>> I tried with addr_hint = 0/index << PAGE_SHIFT, no obvious different.
>>>>
>>>> Interesting. Does your kernel have preemption disabled or
>>>> preemption_debug enabled?
>>>
>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>> this explains why the performance is much worse.
>>
>>>
>>>>
>>>> If not, it makes me wonder whether folio_zero_user() in
>>>> alloc_anon_folio() is actually improving performance as expected,
>>>> compared to the simpler folio_zero() you plan to implement. :-)
>>>
>>> Yes, maybe, the folio_zero_user(was clear_huge_page) is from
>>> 47ad8475c000 ("thp: clear_copy_huge_page"), so original clear_huge_page
>>> is used in HugeTLB, clear PUD size maybe spend many time, but for PMD or
>>> other size of large folio, cond_resched is not necessary since we
>>> already have some folio_zero_range() to clear large folio, and no issue
>>> was reported.
>> probably worth an optimization. calling cond_resched() for each page
>> seems too aggressive and useless.
>
> After some test, I think the cond_resched() is not the root cause,
> no performance gained with batched cond_resched(), even I kill
> cond_resched() from process_huge_page, no improvement.
>
> But when I unconditionally use clear_gigantic_page() in
> folio_zero_user(patched), there is big improvement with above
> fallocate on tmpfs(mount huge=always), also I test some other testcase,
>
>
> 1) case-anon-w-seq-mt: (2M PMD THP)
>
> base:
> real    0m2.490s     0m2.254s     0m2.272s
> user    1m59.980s    2m23.431s    2m18.739s
> sys     1m3.675s     1m15.462s    1m15.030s
>
> patched:
> real    0m2.234s     0m2.225s     0m2.159s
> user    2m56.105s    2m57.117s    3m0.489s
> sys     0m17.064s    0m17.564s    0m16.150s
>
> Patched kernel win on sys and bad in user, but real is almost same,
> maybe a little better than base.

The user time difference is clearly visible: user time goes up on the
patched kernel.  That means the original cache-hot behavior still applies
on your system.  However, it appears that clearing pages from end to
beginning performs really badly on your system.  So I suggest revising the
current implementation to use sequential clearing as much as possible.

> 2) case-anon-w-seq-hugetlb:(2M PMD HugeTLB)
>
> base:
> real    0m5.175s     0m5.117s     0m4.856s
> user    5m15.943s    5m7.567s     4m29.273s
> sys     2m38.503s    2m21.949s    2m21.252s
>
> patched:
> real    0m4.966s     0m4.841s     0m4.561s
> user    6m30.123s    6m9.516s     5m49.733s
> sys     0m58.503s    0m47.847s    0m46.785s
>
>
> This case is similar to the case1.
>
> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>
> base:
> real    0m3.016s     0m3.019s     0m3.018s
> user    0m0.000s     0m0.000s     0m0.000s
> sys     0m3.009s     0m3.012s     0m3.010s
>
> patched:
>
> real    0m1.136s     0m1.136s     0m1.136s
> user    0m0.000s     0m0.000s     0m0.004s
> sys     0m1.133s     0m1.133s     0m1.129s
>
>
> There is big win on patched kernel, and it is similar to above tmpfs
> test, so maybe we could revert the commit c79b57e462b5 ("mm: hugetlb:
> clear target sub-page last when clearing huge page").

--
Best Regards,
Huang, Ying
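
P.S. To make the two clearing orders discussed above concrete, here is a
rough userspace sketch, not kernel code.  The helper names
fill_sequential_order() and fill_target_last_order() are hypothetical and
exist only for this illustration.  The second one is a simplified model, as
I understand it, of the "clear target sub-page last" idea: subpages far from
the target first, then the rest converging on the target.  It only records
the order in which subpage indexes would be visited; it clears nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: visit subpages 0, 1, ..., nr_pages - 1 in order. */
static void fill_sequential_order(size_t nr_pages, size_t *order)
{
    for (size_t i = 0; i < nr_pages; i++)
        order[i] = i;
}

/*
 * Hypothetical, simplified model of a "target subpage last" order.
 * The subpages far away from the target are visited first, then the
 * remaining ones in a left/right alternating pattern ending on target.
 */
static void fill_target_last_order(size_t nr_pages, size_t target,
                                   size_t *order)
{
    size_t pos = 0, base, l;

    assert(target < nr_pages);

    if (2 * target <= nr_pages) {
        /* Target in the first half: walk the tail from end to begin. */
        base = 0;
        l = target;
        for (size_t i = nr_pages; i > 2 * target; i--)
            order[pos++] = i - 1;
    } else {
        /* Target in the second half: walk the head from the begin. */
        base = nr_pages - 2 * (nr_pages - target);
        l = nr_pages - target;
        for (size_t i = 0; i < base; i++)
            order[pos++] = i;
    }

    /* Converge on the target from both sides; the target comes last. */
    for (size_t i = 0; i < l; i++) {
        order[pos++] = base + i;
        order[pos++] = base + 2 * l - 1 - i;
    }
}
```

In this model a target of 0, which is what an addr_hint of 0 amounts to,
degenerates into walking the whole folio from the last subpage to the first.
That is the end-to-begin pattern that appears to be slow on your system, and
fill_sequential_order() is the order I am suggesting to use as much as
possible.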