From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, "Muchun Song", Zi Yan
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
In-Reply-To: <5f752263-ba3c-4d85-939e-6da7648a9011@huawei.com> (Kefeng Wang's message of "Fri, 1 Nov 2024 17:45:42 +0800")
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
 <87v7x88y3q.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <848e4b40-f734-475f-9b1e-2f543e622a6c@huawei.com>
 <87msij8j2f.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <5f752263-ba3c-4d85-939e-6da7648a9011@huawei.com>
Date: Mon, 04 Nov 2024 10:35:44 +0800
Message-ID: <87ikt3914v.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Kefeng Wang writes:

> On 2024/11/1 16:16, Huang, Ying wrote:
>> Kefeng Wang writes:
>>
>>> On 2024/10/31 16:39, Huang, Ying wrote:
>>>> Kefeng Wang writes:
>>>> [snip]
>>>>>
>>>>>>> 1) Will run some random tests to check the difference in performance,
>>>>>>> as David suggested.
>>>>
>>>>>>> 2) Hope LKP can run more tests since it is very useful (more test
>>>>>>> sets and different machines)
>>>>>> I'm starting to use LKP to test.
>>>>>
>>>>> Great.
>>>
>>>
>>> Sorry for the late reply,
>>>
>>>> I have run some tests with LKP.
>>>> Firstly, there's almost no measurable difference between clearing pages
>>>> from start to end or from end to start on an Intel server CPU. I guess
>>>> that there's some similar optimization for both directions.
>>>>
>>>> For the multiple-process (same as the logical CPU number)
>>>> vm-scalability/anon-w-seq test case, the benchmark score increases by
>>>> about 22.4%.
>>>
>>> So process_huge_page() is better than clear_gigantic_page() on Intel?
>>
>> For the vm-scalability/anon-w-seq test case, it is, because the
>> performance of forward and backward clearing is almost the same, and the
>> user-space accesses get the cache-hot benefit.
>>
>>> Could you test the following case on x86?
>>> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> mkdir -p /hugetlbfs/
>>> mount none /hugetlbfs/ -t hugetlbfs
>>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>>> /hugetlbfs/test
>>
>> It's not trivial for me to do this test, because 0day wraps test cases.
>> Do you know which existing test cases provide this? For example, in
>> vm-scalability?
>
> I don't know of a public fallocate test; I will try to find an Intel
> machine to test this case.

I don't expect it to change much, because we have observed that the
performance of forward and backward clearing is similar on Intel.

>>
>>>> For the multiple-process vm-scalability/anon-w-rand test case, there is
>>>> no measurable difference in the benchmark score.
>>>>
>>>> So, the optimization mainly helps sequential workloads.
>>>>
>>>> In summary, on x86, process_huge_page() will not introduce any
>>>> regression, and it helps some workloads.
>>>>
>>>> However, on ARM64, it does introduce some regression for clearing pages
>>>> from end to start. That needs to be addressed. I guess that the
>>>> regression can be resolved by using more clearing from start to end
>>>> (but not all). For example, can you take a look at the patch below?
>>>> It uses a similar framework as before, but clears each small chunk
>>>> (mpage) from start to end. You can adjust MPAGE_NRPAGES to check when
>>>> the regression is resolved.
>>>>
>>>> WARNING: the patch is only build tested.
>>>
>>>
>>> Base: baseline
>>> Change1: using clear_gigantic_page() for 2M PMD
>>> Change2: your patch with MPAGE_NRPAGES=16
>>> Change3: Case3 + fix[1]
>>
>> What is case3?
>
> Oh, it is Change2.

Got it.

>>
>>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>>
>>> 1. For random write,
>>> case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference
>>>
>>> 2. For sequential write,
>>>
>>> 1) case-anon-w-seq-mt:
>>
>> Can you try case-anon-w-seq? That may be more stable.
>>
>>> base:
>>> real 0m2.490s  0m2.254s  0m2.272s
>>> user 1m59.980s 2m23.431s 2m18.739s
>>> sys  1m3.675s  1m15.462s 1m15.030s
>>>
>>> Change1:
>>> real 0m2.234s  0m2.225s  0m2.159s
>>> user 2m56.105s 2m57.117s 3m0.489s
>>> sys  0m17.064s 0m17.564s 0m16.150s
>>>
>>> Change2:
>>> real 0m2.244s  0m2.384s  0m2.370s
>>> user 2m39.413s 2m41.990s 2m42.229s
>>> sys  0m19.826s 0m18.491s 0m18.053s
>>
>> That appears strange. There's not much cache-hot benefit even though we
>> clear pages from end to beginning (with a larger chunk).
>>
>> However, sys time improves a lot. This shows that clearing pages in
>> large chunks helps on ARM64.
>>
>>> Change3: // best performance
>>> real 0m2.155s  0m2.204s  0m2.194s
>>> user 3m2.640s  2m55.837s 3m0.902s
>>> sys  0m17.346s 0m17.630s 0m18.197s
>>>
>>> Change4:
>>> real 0m2.287s  0m2.377s  0m2.284s
>>> user 2m37.030s 2m52.868s 3m17.593s
>>> sys  0m15.445s 0m34.430s 0m45.224s
>>
>> Change4 is essentially the same as Change1. I don't know why they
>> differ. Is there some large variation from run to run?
>
> As shown above, I ran the test three times; the results are relatively
> stable, at least for real. I will try case-anon-w-seq.

Can you also show the vm-scalability score? TBH, I cannot understand
your results.
For example, why is there a measurable difference between Change3 and
Change4? In both cases, the kernel clears pages from start to end.

>> Can you further optimize the prototype patch below? I think that it
>> has the potential to fix your issue.
>
> Yes, thanks for your help, but this will make process_huge_page() a
> little more complicated :)

IMHO, we should try to root cause it first, then find the proper
solution and optimize (simplify) it.

--
Best Regards,
Huang, Ying

>>
>>> 2) case-anon-w-seq-hugetlb
>>> very similar to 1); Change4 is slightly better than Change3, but not by
>>> much.
>>>
>>> 3) hugetlbfs fallocate 20G
>>> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) <
>>> Change2(0m1.275s) < base(0m3.016s)
>>>
>>> In summary, Change3 is best and Change1 is good on my arm64 machine.
>>>
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>> -----------------------------------8<----------------------------------------
>>>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>>>> From: Huang Ying
>>>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>>>> Subject: [PATCH] mpage clear
>>>>
>>>> ---
>>>>  mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 67 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 3ccee51adfbb..1fdc548c4275 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>>>  	return 0;
>>>>  }
>>>>
>>>> +#define MPAGE_NRPAGES	(1<<4)
>>>> +#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
>>>> +static inline int clear_huge_page(
>>>> +	unsigned long addr_hint, unsigned int nr_pages,
>>>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>> +	void *arg)
>>>> +{
>>>> +	int i, n, base, l, ret;
>>>> +	unsigned long addr = addr_hint &
>>>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>>>> +
>>>> +	/* Process target subpage last to keep its cache lines hot */
>>>> +	might_sleep();
>>>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>>>> +	if (2 * n <= nr_mpages) {
>>>> +		/* If target subpage in first half of huge page */
>>>> +		base = 0;
>>>> +		l = n;
>>>> +		/* Process subpages at the end of huge page */
>>>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>>> +			cond_resched();
>>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> +					      i * MPAGE_NRPAGES, arg);
>>>> +			if (ret)
>>>> +				return ret;
>>>> +		}
>>>> +	} else {
>>>> +		/* If target subpage in second half of huge page */
>>>> +		base = nr_mpages - 2 * (nr_mpages - n);
>>>> +		l = nr_mpages - n;
>>>> +		/* Process subpages at the begin of huge page */
>>>> +		for (i = 0; i < base; i++) {
>>>> +			cond_resched();
>>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>>> +					      i * MPAGE_NRPAGES, arg);
>>>> +			if (ret)
>>>> +				return ret;
>>>> +		}
>>>> +	}
>>>> +	/*
>>>> +	 * Process remaining subpages in left-right-left-right pattern
>>>> +	 * towards the target subpage
>>>> +	 */
>>>> +	for (i = 0; i < l; i++) {
>>>> +		int left_idx = base + i;
>>>> +		int right_idx = base + 2 * l - 1 - i;
>>>> +
>>>> +		cond_resched();
>>>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>>>> +				      left_idx * MPAGE_NRPAGES, arg);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +		cond_resched();
>>>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>>>> +				      right_idx * MPAGE_NRPAGES, arg);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +	}
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>>  				unsigned int nr_pages)
>>>>  {
>>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>>  static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>>  {
>>>>  	struct folio *folio = arg;
>>>> +	int i;
>>>>
>>>> -	clear_user_highpage(folio_page(folio, idx), addr);
>>>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>>>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>>>  	return 0;
>>>>  }
>>>>
>>>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>>  {
>>>>  	unsigned int nr_pages = folio_nr_pages(folio);
>>>>
>>>> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>>>  		clear_gigantic_page(folio, addr_hint, nr_pages);
>>>>  	else
>>>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>>  }
>>>>
>>>>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>>>
>>>
>>> [1] fix patch
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b22d4b83295b..aee99ede0c4f 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>>>  		base = 0;
>>>  		l = n;
>>>  		/* Process subpages at the end of huge page */
>>> -		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>> +		for (i = 2 * n; i < nr_mpages; i++) {
>>>  			cond_resched();
>>>  			ret = process_subpage(addr + i * MPAGE_SIZE,
>>>  					      i * MPAGE_NRPAGES, arg);
>>