Message-ID: <5f752263-ba3c-4d85-939e-6da7648a9011@huawei.com>
Date: Fri, 1 Nov 2024 17:45:42 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()
To: "Huang, Ying"
CC: David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Muchun Song, , Zi Yan
References: <20241026054307.3896926-1-wangkefeng.wang@huawei.com>
 <54f5f3ee-8442-4c49-ab4e-c46e8db73576@huawei.com>
 <4219a788-52ad-4d80-82e6-35a64c980d50@redhat.com>
 <127d4a00-29cc-4b45-aa96-eea4e0adaed2@huawei.com>
 <9b06805b-4f4f-4b37-861f-681e3ab9d470@huawei.com>
 <113d3cb9-0391-48ab-9389-f2fd1773ab73@redhat.com>
 <878qu6wgcm.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sese9sy9.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <64f1c69d-3706-41c5-a29f-929413e3dfa2@huawei.com>
 <87v7x88y3q.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <848e4b40-f734-475f-9b1e-2f543e622a6c@huawei.com>
 <87msij8j2f.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Language: en-US
From: Kefeng Wang
In-Reply-To: <87msij8j2f.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

On 2024/11/1 16:16, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/10/31 16:39, Huang, Ying wrote:
>>> Kefeng Wang writes:
>>> [snip]
>>>>
>>>>>> 1) Will test some rand test to check the difference in performance as
>>>>>> David suggested.
>>>>
>>>>>> 2) Hope the LKP to run more tests since it is very useful (more test
>>>>>> sets and different machines)
>>>>> I'm starting to use LKP to test.
>>>>
>>>> Great.
>>
>>
>> Sorry for the late reply,
>>
>>> I have run some tests with LKP to test.
>>> Firstly, there's almost no measurable difference between clearing pages
>>> from start to end or from end to start on Intel server CPU.  I guess
>>> that there's some similar optimization for both directions.
>>> For multiple processes (same as logical CPU number)
>>> vm-scalability/anon-w-seq test case, the benchmark score increases
>>> about 22.4%.
>>
>> So process_huge_page() is better than clear_gigantic_page() on Intel?
>
> For vm-scalability/anon-w-seq test case, it is.  Because the performance
> of forward and backward clearing is almost the same, and the user space
> accessing has a cache-hot benefit.
>
>> Could you test the following case on x86?
>> echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> mkdir -p /hugetlbfs/
>> mount none /hugetlbfs/ -t hugetlbfs
>> rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate
>> -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G
>> /hugetlbfs/test
>
> It's not trivial for me to do this test, because 0day wraps test cases.
> Do you know which existing test cases provide this?  For example, in
> vm-scalability?

I don't know of a public fallocate test; I will try to find an Intel
machine to test this case.

>
>>> For multiple processes vm-scalability/anon-w-rand test case, no
>>> measurable difference for benchmark score.
>>> So, the optimization helps sequential workloads mainly.
>>> In summary, on x86, process_huge_page() will not introduce any
>>> regression.  And it helps some workloads.
>>> However, on ARM64, it does introduce some regression for clearing pages
>>> from end to start.  That needs to be addressed.  I guess that the
>>> regression can be resolved via using more clearing from start to end
>>> (but not all).  For example, can you take a look at the patch below?
>>> It uses the similar framework as before, but clears each small chunk
>>> (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check when
>>> the regression can be restored.
>>> WARNING: the patch is only build tested.
>>
>>
>> Base: baseline
>> Change1: using clear_gigantic_page() for 2M PMD
>> Change2: your patch with MPAGE_NRPAGES=16
>> Change3: Case3 + fix[1]
>
> What is case3?

Oh, it is Change2.

>
>> Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
>>
>> 1. For rand write,
>> case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference
>>
>> 2. For seq write,
>>
>> 1) case-anon-w-seq-mt:
>
> Can you try case-anon-w-seq?  That may be more stable.
>
>> base:
>> real  0m2.490s   0m2.254s   0m2.272s
>> user  1m59.980s  2m23.431s  2m18.739s
>> sys   1m3.675s   1m15.462s  1m15.030s
>>
>> Change1:
>> real  0m2.234s   0m2.225s   0m2.159s
>> user  2m56.105s  2m57.117s  3m0.489s
>> sys   0m17.064s  0m17.564s  0m16.150s
>>
>> Change2:
>> real  0m2.244s   0m2.384s   0m2.370s
>> user  2m39.413s  2m41.990s  2m42.229s
>> sys   0m19.826s  0m18.491s  0m18.053s
>
> It appears strange.  There's not much cache-hot benefit even if we clear
> pages from end to begin (with a larger chunk).
>
> However, sys time improves a lot.  This shows clearing pages with a large
> chunk helps on ARM64.
>
>> Change3: // best performance
>> real  0m2.155s   0m2.204s   0m2.194s
>> user  3m2.640s   2m55.837s  3m0.902s
>> sys   0m17.346s  0m17.630s  0m18.197s
>>
>> Change4:
>> real  0m2.287s   0m2.377s   0m2.284s
>> user  2m37.030s  2m52.868s  3m17.593s
>> sys   0m15.445s  0m34.430s  0m45.224s
>
> Change4 is essentially same as Change1.  I don't know why they are
> different.  Is there some large variation among run to run?

As shown above, I tested three times; the results are relatively stable,
at least for real time.  I will try case-anon-w-seq.

>
> Can you further optimize the prototype patch below?  I think that it has
> potential to fix your issue.

Yes, thanks for your help, but this will make process_huge_page() a
little more complicated :)

>
>> 2) case-anon-w-seq-hugetlb
>> very similar to 1), Change4 slightly better than Change3, but not a big
>> difference.
>>
>> 3) hugetlbfs fallocate 20G
>> Change1(0m1.136s) = Change3(0m1.136s) = Change4(0m1.135s) <
>> Change2(0m1.275s) < base(0m3.016s)
>>
>> In summary, Change3 is best and Change1 is good on my arm64 machine.
>>
>>> Best Regards,
>>> Huang, Ying
>>>
>>> -----------------------------------8<----------------------------------------
>>> From 406bcd1603987fdd7130d2df6f7d4aee4cc6b978 Mon Sep 17 00:00:00 2001
>>> From: Huang Ying
>>> Date: Thu, 31 Oct 2024 11:13:57 +0800
>>> Subject: [PATCH] mpage clear
>>>
>>> ---
>>>  mm/memory.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>  1 file changed, 67 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 3ccee51adfbb..1fdc548c4275 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6769,6 +6769,68 @@ static inline int process_huge_page(
>>>  	return 0;
>>>  }
>>> +#define MPAGE_NRPAGES (1<<4)
>>> +#define MPAGE_SIZE (PAGE_SIZE * MPAGE_NRPAGES)
>>> +static inline int clear_huge_page(
>>> +	unsigned long addr_hint, unsigned int nr_pages,
>>> +	int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>> +	void *arg)
>>> +{
>>> +	int i, n, base, l, ret;
>>> +	unsigned long addr = addr_hint &
>>> +		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
>>> +	unsigned long nr_mpages = ((unsigned long)nr_pages << PAGE_SHIFT) / MPAGE_SIZE;
>>> +
>>> +	/* Process target subpage last to keep its cache lines hot */
>>> +	might_sleep();
>>> +	n = (addr_hint - addr) / MPAGE_SIZE;
>>> +	if (2 * n <= nr_mpages) {
>>> +		/* If target subpage in first half of huge page */
>>> +		base = 0;
>>> +		l = n;
>>> +		/* Process subpages at the end of huge page */
>>> +		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>>> +			cond_resched();
>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>> +					      i * MPAGE_NRPAGES, arg);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +	} else {
>>> +		/* If target subpage in second half of huge page */
>>> +		base = nr_mpages - 2 * (nr_mpages - n);
>>> +		l = nr_mpages - n;
>>> +		/* Process subpages at the begin of huge page */
>>> +		for (i = 0; i < base; i++) {
>>> +			cond_resched();
>>> +			ret = process_subpage(addr + i * MPAGE_SIZE,
>>> +					      i * MPAGE_NRPAGES, arg);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +	}
>>> +	/*
>>> +	 * Process remaining subpages in left-right-left-right pattern
>>> +	 * towards the target subpage
>>> +	 */
>>> +	for (i = 0; i < l; i++) {
>>> +		int left_idx = base + i;
>>> +		int right_idx = base + 2 * l - 1 - i;
>>> +
>>> +		cond_resched();
>>> +		ret = process_subpage(addr + left_idx * MPAGE_SIZE,
>>> +				      left_idx * MPAGE_NRPAGES, arg);
>>> +		if (ret)
>>> +			return ret;
>>> +		cond_resched();
>>> +		ret = process_subpage(addr + right_idx * MPAGE_SIZE,
>>> +				      right_idx * MPAGE_NRPAGES, arg);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>>  static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>  				unsigned int nr_pages)
>>>  {
>>> @@ -6784,8 +6846,10 @@ static void clear_gigantic_page(struct folio *folio, unsigned long addr,
>>>  static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>  {
>>>  	struct folio *folio = arg;
>>> +	int i;
>>> -	clear_user_highpage(folio_page(folio, idx), addr);
>>> +	for (i = 0; i < MPAGE_NRPAGES; i++)
>>> +		clear_user_highpage(folio_page(folio, idx + i), addr + i * PAGE_SIZE);
>>>  	return 0;
>>>  }
>>> @@ -6798,10 +6862,10 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>  {
>>>  	unsigned int nr_pages = folio_nr_pages(folio);
>>> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>> +	if (unlikely(nr_pages != HPAGE_PMD_NR))
>>>  		clear_gigantic_page(folio, addr_hint, nr_pages);
>>>  	else
>>> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>> +		clear_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>  }
>>>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>>
>>
>> [1] fix patch
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b22d4b83295b..aee99ede0c4f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6816,7 +6816,7 @@ static inline int clear_huge_page(
>>  		base = 0;
>>  		l = n;
>>  		/* Process subpages at the end of huge page */
>> -		for (i = nr_mpages - 1; i >= 2 * n; i--) {
>> +		for (i = 2 * n; i < nr_mpages; i++) {
>>  			cond_resched();
>>  			ret = process_subpage(addr + i * MPAGE_SIZE,
>>  					      i * MPAGE_NRPAGES, arg);
>
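
By the way, to compare just the two clearing directions outside the
kernel, something like the small userspace sketch below could be used.
It is purely illustrative: the buffer size, MPAGE_NRPAGES value,
iteration count and the mbench_clear.c name are my assumptions, and it
does not model clear_user_highpage(), page faults or the cache-hot
target-subpage trick of process_huge_page().  It only memset()s a 2MiB
buffer in MPAGE-sized chunks front-to-back versus back-to-front, which
is the ordering that fix [1] changes, so running it on both the arm64
and Intel machines might show how much the direction alone matters.

/*
 * mbench_clear.c - minimal userspace sketch, NOT the kernel code above.
 *
 * Clears a 2MiB buffer in MPAGE-sized chunks, once front-to-back and
 * once back-to-front, and reports the wall time of each ordering.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SIZE	4096UL
#define HUGE_SIZE	(2UL * 1024 * 1024)	/* PMD-sized folio */
#define MPAGE_NRPAGES	16			/* chunk size under test */
#define MPAGE_SIZE	(PAGE_SIZE * MPAGE_NRPAGES)
#define NR_MPAGES	(HUGE_SIZE / MPAGE_SIZE)
#define ITERS		10000

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Clear chunk by chunk from the start of the buffer towards the end. */
static void clear_forward(char *buf)
{
	for (unsigned long i = 0; i < NR_MPAGES; i++)
		memset(buf + i * MPAGE_SIZE, 0, MPAGE_SIZE);
}

/* Clear chunk by chunk from the end of the buffer towards the start. */
static void clear_backward(char *buf)
{
	for (long i = NR_MPAGES - 1; i >= 0; i--)
		memset(buf + i * MPAGE_SIZE, 0, MPAGE_SIZE);
}

int main(void)
{
	char *buf = aligned_alloc(HUGE_SIZE, HUGE_SIZE);
	double t0, t1, t2;

	if (!buf)
		return 1;
	memset(buf, 1, HUGE_SIZE);		/* fault all pages in up front */

	t0 = now_sec();
	for (int i = 0; i < ITERS; i++) {
		clear_forward(buf);
		asm volatile("" ::: "memory");	/* keep the stores from being elided */
	}
	t1 = now_sec();
	for (int i = 0; i < ITERS; i++) {
		clear_backward(buf);
		asm volatile("" ::: "memory");
	}
	t2 = now_sec();

	printf("forward : %.3fs\nbackward: %.3fs\n", t1 - t0, t2 - t1);
	free(buf);
	return 0;
}

Build and run pinned to one CPU, e.g.:
gcc -O2 mbench_clear.c -o mbench_clear && taskset -c 10 ./mbench_clear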