From: "Huang, Ying" <ying.huang@intel.com>
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, "Muchun Song", Zi Yan
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
In-Reply-To: <848e4b40-f734-475f-9b1e-2f543e622a6c@huawei.com>
 (Kefeng Wang's message of "Fri, 1 Nov 2024 15:43:55 +0800")
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
 <87v7x88y3q.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <848e4b40-f734-475f-9b1e-2f543e622a6c@huawei.com>
Date: Fri, 01 Nov 2024 16:16:56 +0800
Message-ID: <87msij8j2f.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Kefeng Wang writes:

> On 2024/10/31 16:39, Huang, Ying wrote:
>> Kefeng Wang writes:
>>
>> [snip]
>>>
>>>>> 1) Will run some random-access tests to check the performance
>>>>> difference, as David suggested.
>>>>>
>>>>> 2) Hope the LKP can run more tests, since it is very useful (more
>>>>> test sets and different machines).
>>>> I'm starting to use LKP to test.
>>>
>>> Great.
>
> Sorry for the late reply,
>
>> I have run some tests with LKP.
>> Firstly, there's almost no measurable difference between clearing pages
>> from start to end or from end to start on an Intel server CPU. I guess
>> that there's some similar optimization for both directions.
>>
>> For the multiple-process (same as the logical CPU number)
>> vm-scalability/anon-w-seq test case, the benchmark score increases by
>> about 22.4%.
>
> So process_huge_page() is better than clear_gigantic_page() on Intel?

For the vm-scalability/anon-w-seq test case, it is, because the
performance of forward and backward clearing is almost the same, and the
user-space accesses get the cache-hot benefit.

> Could you test the following case on x86?
>
> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> mkdir -p /hugetlbfs/
> mount none /hugetlbfs/ -t hugetlbfs
> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test

It's not trivial for me to run this test, because 0day wraps test cases.
Do you know of an existing test case that provides this, for example in
vm-scalability?

>> For the multiple-process vm-scalability/anon-w-rand test case, there is
>> no measurable difference in the benchmark score. So, the optimization
>> mainly helps sequential workloads.
>>
>> In summary, on x86, process_huge_page() will not introduce any
>> regression, and it helps some workloads.
>>
>> However, on ARM64, it does introduce some regression for clearing pages
>> from end to start. That needs to be addressed. I guess that the
>> regression can be resolved by using more clearing from start to end
>> (but not all). For example, can you take a look at the patch below?
>> It uses a similar framework as before, but clears each small chunk
>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>> the regression is resolved.
>>
>> WARNING: the patch is only build tested.
>
> Base: baseline
> Change1: using clear_gigantic_page() for 2M PMD
> Change2: your patch with MPAGE_NRPAGES=16
> Change3: Case3 + fix[1]

What is case3?

> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>
> 1. For rand write,
>    case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference
>
> 2. For seq write,
>
> 1) case-anon-w-seq-mt:

Can you try case-anon-w-seq? That may be more stable.

> base:
> real	0m2.490s	0m2.254s	0m2.272s
> user	1m59.980s	2m23.431s	2m18.739s
> sys	1m3.675s	1m15.462s	1m15.030s
>
> Change1:
> real	0m2.234s	0m2.225s	0m2.159s
> user	2m56.105s	2m57.117s	3m0.489s
> sys	0m17.064s	0m17.564s	0m16.150s
>
> Change2:
> real	0m2.244s	0m2.384s	0m2.370s
> user	2m39.413s	2m41.990s	2m42.229s
> sys	0m19.826s	0m18.491s	0m18.053s

That looks strange. There is not much cache-hot benefit even though we
clear pages from end to beginning (with a larger chunk). However, the sys
time improves a lot. This shows that clearing pages in larger chunks helps
on ARM64.

> Change3:	// best performance
> real	0m2.155s	0m2.204s	0m2.194s
> user	3m2.640s	2m55.837s	3m0.902s
> sys	0m17.346s	0m17.630s	0m18.197s
>
> Change4:
> real	0m2.287s	0m2.377s	0m2.284s
> user	2m37.030s	2m52.868s	3m17.593s
> sys	0m15.445s	0m34.430s	0m45.224s

Change4 is essentially the same as Change1, so I don't understand why
their results differ. Is there large variation from run to run? Can you
further optimize the prototype patch below? I think it has the potential
to fix your issue.

> 2) case-anon-w-seq-hugetlb
> Very similar to 1); Change4 is slightly better than Change3, but not by
> much.
>
> 3) hugetlbfs fallocate 20G
> Change1 (0m1.136s) = Change3 (0m1.136s) = Change4 (0m1.135s) <
> Change2 (0m1.275s) < base (0m3.016s)
>
> In summary, Change3 is best and Change1 is good on my arm64 machine.
>
>> Best Regards,
>> Huang, Ying
>>
>> ----------------------------------8<----------------------------------
>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>> From: Huang Ying <ying.huang@intel.com>
>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>> Subject: [PATCH] mpage clear
>>
>> ---
>>  mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 67 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3ccee51adfbb..1fdc548c4275 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>>
>> +#define MPAGE_NRPAGES	(1<<4)
>> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
>> +static inline int clear_huge_page(
>> +	unsigned long addr_hint, unsigned int nr_pages,
>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>> +	void *arg)
>> +{
>> +	int i, n, base, l, ret;
>> +	unsigned long addr = addr_hint &
>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>> +
>> +	/* Process target subpage last to keep its cache lines hot */
>> +	might_sleep();
>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>> +	if (2 * n <= nr_mpages) {
>> +		/* If target subpage in first half of huge page */
>> +		base = 0;
>> +		l = n;
>> +		/* Process subpages at the end of huge page */
>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	} else {
>> +		/* If target subpage in second half of huge page */
>> +		base = nr_mpages - 2 * (nr_mpages - n);
>> +		l = nr_mpages - n;
>> +		/* Process subpages at the begin of huge page */
>> +		for (i = 0; i < base; i++) {
>> +			cond_resched();
>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>> +					      i * MPAGE_NRPAGES, arg);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +	/*
>> +	 * Process remaining subpages in left-right-left-right pattern
>> +	 * towards the target subpage
>> +	 */
>> +	for (i = 0; i < l; i++) {
>> +		int left_idx = base + i;
>> +		int right_idx = base + 2 * l - 1 - i;
>> +
>> +		cond_resched();
>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>> +				      left_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +		cond_resched();
>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>> +				      right_idx * MPAGE_NRPAGES, arg);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +	return 0;
>> +}
>> +
>>  static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>  				unsigned int nr_pages)
>>  {
>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>  static int clear_subpage(unsigned long addr, int idx, void *arg)
>>  {
>>  	struct folio *folio = arg;
>> +	int i;
>>
>> -	clear_user_highpage(folio_page(folio, idx), addr);
>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>  	return 0;
>>  }
>>
>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>  {
>>  	unsigned int nr_pages = folio_nr_pages(folio);
>>
>> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>  		clear_gigantic_page(folio, addr_hint, nr_pages);
>>  	else
>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>  }
>>
>>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>
>
> [1] fix patch
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b22d4b83295b..aee99ede0c4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>  		base = 0;
>  		l = n;
>  		/* Process subpages at the end of huge page */
> -		for (i = nr_mpages - 1; i >= 2 * n; i--) {
> +		for (i = 2 * n; i < nr_mpages; i++) {
>  			cond_resched();
>  			ret = process_subpage(addr + i * MPAGE_SIZE,
>  					      i * MPAGE_NRPAGES, arg);