From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <9318769a-0d51-4c03-a808-fc3a3f09d492@huawei.com>
Date: Mon, 28 Oct 2024 14:37:21 +0800
Subject: Re: [PATCH] mm: shmem: convert to use folio_zero_range()
From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: "Huang, Ying"
CC: Barry Song <21cnbao@gmail.com>, ...
References: <06d99b89-17ad-447e-a8f1-8e220b5688ac@huawei.com>
 <20241022225603.10491-1-21cnbao@gmail.com>
 <31afe958-91eb-484a-90b9-91114991a9a2@huawei.com>
 <87iktg3n2i.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <5ccb295c-6ec9-4d00-8236-e3ba19221f40@huawei.com>
 <875xpg39q5.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <1a37d4a9-eef8-4fe0-aeb0-fa95c33b305a@huawei.com>
 <871q042x1h.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <86f9f4e8-9c09-4333-ae4f-f51a71c3aca7@huawei.com>
 <87ttcx0x4j.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87ttcx0x4j.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

On 2024/10/28 10:39, Huang, Ying wrote:
> Kefeng Wang writes:
>
>> On 2024/10/25 20:21, Huang, Ying wrote:
>>> Kefeng Wang writes:
>>>
>>>> On 2024/10/25 15:47, Huang, Ying wrote:
>>>>> Kefeng Wang writes:
>>>>>
>>>>>> On 2024/10/25 10:59, Huang, Ying wrote:
>>>>>>> Hi, Kefeng,
>>>>>>> Kefeng Wang writes:
>>>>>>>
>>>>>>>> +CC Huang Ying,
>>>>>>>>
>>>>>>>> On 2024/10/23 6:56, Barry Song wrote:
>>>>>>>>> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang wrote:
>>>>>>>>>>
>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to cleanup code.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this? clear_highpage() is often optimised in ways that we can't optimise for a plain memset(). On the other hand, if the folio is large, maybe a modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>>>>>>>>>
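(For context on the clear_highpage() vs. memset() point: in current kernels the two paths boil down to roughly the following, simplified here from include/linux/highmem.h. clear_highpage() uses the architecture's optimised clear_page(), while folio_zero_range() goes through zero_user_segments(), i.e. a plain memset() per page.)

static inline void clear_highpage(struct page *page)
{
	void *kaddr = kmap_local_page(page);

	clear_page(kaddr);	/* arch-optimised full-page clear */
	kunmap_local(kaddr);
}

static inline void folio_zero_range(struct folio *folio,
		size_t start, size_t length)
{
	/* ends up in memset() on each mapped page */
	zero_user_segments(&folio->page, start, start + length, 0, 0);
}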
>>>>>>>>>>>>>>>>>>>>>>>> Right, I missed this; clear_page() might be better than memset(). I changed this one when looking at shmem_writepage(), which was already converted from clear_highpage() to folio_zero_range(). Also, grepping for folio_zero_range(), there are some other users of it:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:   folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:   folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>>>>>>>>>> fs/libfs.c:                     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> mm/page_io.c:                   folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>> mm/shmem.c:                     folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> No performance test before, but I wrote a testcase:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>>>>>>>>>> 2) measure the time (us) to clear all N folios with
>>>>>>>>>>>>>>>>>>>>>>>>    clear_highpage/folio_zero_range/folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>> 3) release the N folios
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The results (run 5 times) on my machine are shown below.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> N=1
>>>>>>>>>>>>>>>>>>>>>>>>          clear_highpage   folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>     1         69                 74                177
>>>>>>>>>>>>>>>>>>>>>>>>     2         57                 62                168
>>>>>>>>>>>>>>>>>>>>>>>>     3         54                 58                234
>>>>>>>>>>>>>>>>>>>>>>>>     4         54                 58                157
>>>>>>>>>>>>>>>>>>>>>>>>     5         56                 62                148
>>>>>>>>>>>>>>>>>>>>>>>>   avg         58                 62.8              176.8
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>>>>>>>>>          clear_highpage   folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>     1      11015              11309              32833
>>>>>>>>>>>>>>>>>>>>>>>>     2      10385              11110              49751
>>>>>>>>>>>>>>>>>>>>>>>>     3      10369              11056              33095
>>>>>>>>>>>>>>>>>>>>>>>>     4      10332              11017              33106
>>>>>>>>>>>>>>>>>>>>>>>>     5      10483              11000              49032
>>>>>>>>>>>>>>>>>>>>>>>>   avg    10516.8            11098.4            39563.4
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>>>>>>>>>          clear_highpage   folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>>>>>>>>>     1      55560              60055             156876
>>>>>>>>>>>>>>>>>>>>>>>>     2      55485              60024             157132
>>>>>>>>>>>>>>>>>>>>>>>>     3      55474              60129             156658
>>>>>>>>>>>>>>>>>>>>>>>>     4      55555              59867             157259
>>>>>>>>>>>>>>>>>>>>>>>>     5      55528              59932             157108
>>>>>>>>>>>>>>>>>>>>>>>>   avg    55520.4            60001.4           157006.6
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> folio_zero_user() has many cond_resched() calls, so its time fluctuates a lot; clear_highpage() is better than folio_zero_range(), as you said.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Maybe add a new helper to convert all the folio_zero_range(folio, 0, folio_size(folio)) users to clear_highpage + flush_dcache_folio?
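(A minimal sketch of the kind of helper being discussed; the folio_zero() name and exact form are assumptions for illustration, not an actual patch from this thread:)

/* Zero a whole folio via the arch-optimised clear_page() path. */
static inline void folio_zero(struct folio *folio)
{
	long i, n = folio_nr_pages(folio);

	for (i = 0; i < n; i++)
		clear_highpage(folio_page(folio, i));
	flush_dcache_folio(folio);
}

Callers such as mm/shmem.c could then replace folio_zero_range(folio, 0, folio_size(folio)) with folio_zero(folio).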
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If this also improves performance for other existing callers of folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Kefeng, what's your point? Providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, from the above test, using clear_highpage/flush_dcache_folio is better than using folio_zero_range() for zeroing a folio (especially a large folio), so I'd like to add a new helper, maybe named folio_zero() since it zeroes the whole folio.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We already have a helper like folio_zero_user()? Is it not good enough?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since it has many cond_resched() calls, its performance is the worst...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel. For a non-preemptible kernel, it helps avoid the folio clearing occupying the CPU and starving other processes, right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --- a/mm/shmem.c
>>>>>>>>>>>>>>> +++ b/mm/shmem.c
>>>>>>>>>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>>>>>>>>>>>>>>           * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>>>>>>>>>           */
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>>>>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>>>>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>>>>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>>>>>>>>>                  flush_dcache_folio(folio);
>>>>>>>>>>>>>>>                  folio_mark_uptodate(folio);
>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do we perform better or worse with the following?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This path is also reached for SGP_FALLOC, where vmf = NULL, so we could use folio_zero_user(folio, 0). I think the performance is worse; I will retest once I can access the hardware.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps, since the current code uses clear_hugepage(). Does using index << PAGE_SHIFT as the addr_hint offer any benefit?
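(For illustration only: one way the index-based hint just mentioned could be wired into the hunk above, with a fallback for the SGP_FALLOC/vmf == NULL case. The vmf check and the index << PAGE_SHIFT hint are assumptions for this sketch, not a posted patch:)

	if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
		/* No faulting address for SGP_FALLOC: fall back to a file-offset-derived hint. */
		unsigned long addr_hint = vmf ? vmf->address :
					  (unsigned long)index << PAGE_SHIFT;

		folio_zero_user(folio, addr_hint);
		flush_dcache_folio(folio);
		folio_mark_uptodate(folio);
	}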
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> When using folio_zero_user(), the performance is very bad with the above fallocate test (mount huge=always):
>>>>>>>>>>>>
>>>>>>>>>>>>         folio_zero_range   clear_highpage   folio_zero_user
>>>>>>>>>>>> real        0m1.214s          0m1.111s          0m3.159s
>>>>>>>>>>>> user        0m0.000s          0m0.000s          0m0.000s
>>>>>>>>>>>> sys         0m1.210s          0m1.109s          0m3.152s
>>>>>>>>>>>>
>>>>>>>>>>>> I tried with addr_hint = 0 / index << PAGE_SHIFT, no obvious difference.
>>>>>>>>>>>
>>>>>>>>>>> Interesting. Does your kernel have preemption disabled or preemption_debug enabled?
>>>>>>>>>>
>>>>>>>>>> ARM64 server, CONFIG_PREEMPT_NONE=y
>>>>>>>>> This explains why the performance is much worse.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If not, it makes me wonder whether folio_zero_user() in alloc_anon_folio() is actually improving performance as expected, compared to the simpler folio_zero() you plan to implement. :-)
>>>>>>>>>>
>>>>>>>>>> Yes, maybe. folio_zero_user() (was clear_huge_page()) comes from 47ad8475c000 ("thp: clear_copy_huge_page"), so the original clear_huge_page() was used for HugeTLB, where clearing a PUD-sized page may take a long time. But for PMD-sized or other large folios, cond_resched() is not necessary, since we already have some folio_zero_range() users that clear large folios and no issue was reported.
>>>>>>>>> Probably worth an optimization; calling cond_resched() for each page seems too aggressive and useless.
>>>>>>>>
>>>>>>>> After some tests, I think cond_resched() is not the root cause: no performance is gained with batched cond_resched(), and even if I remove cond_resched() from process_huge_page() there is no improvement.
>>>>>>>>
>>>>>>>> But when I unconditionally use clear_gigantic_page() in folio_zero_user() (patched), there is a big improvement with the above fallocate on tmpfs (mount huge=always). I also ran some other test cases:
>>>>>>>>
>>>>>>>> 1) case-anon-w-seq-mt: (2M PMD THP)
>>>>>>>>
>>>>>>>> base:
>>>>>>>> real    0m2.490s    0m2.254s    0m2.272s
>>>>>>>> user    1m59.980s   2m23.431s   2m18.739s
>>>>>>>> sys     1m3.675s    1m15.462s   1m15.030s
>>>>>>>>
>>>>>>>> patched:
>>>>>>>> real    0m2.234s    0m2.225s    0m2.159s
>>>>>>>> user    2m56.105s   2m57.117s   3m0.489s
>>>>>>>> sys     0m17.064s   0m17.564s   0m16.150s
>>>>>>>>
>>>>>>>> The patched kernel wins on sys time and loses on user time, but real time is almost the same, maybe a little better than base.
>>>>>>>
>>>>>>> We can see the user time difference. That means the original cache-hot behavior still applies on your system. However, it appears that the performance of clearing pages from end to beginning is really bad on your system. So, I suggest revising the current implementation to use sequential clearing as much as possible.
>>>>>>>
>>>>>>
>>>>>> I tested case-anon-cow-seq-hugetlb for copy_user_large_folio():
>>>>>>
>>>>>> base:
>>>>>> real    0m6.259s    0m6.197s    0m6.316s
>>>>>> user    1m31.176s   1m27.195s   1m29.594s
>>>>>> sys     7m44.199s   7m51.490s   8m21.149s
>>>>>>
>>>>>> patched (use copy_user_gigantic_page() for 2M hugetlb too):
>>>>>> real    0m3.182s    0m3.002s    0m2.963s
>>>>>> user    1m19.456s   1m3.107s    1m6.447s
>>>>>> sys     2m59.222s   3m10.899s   3m1.027s
>>>>>>
>>>>>> The sequential copy is better than the current implementation, so I will use sequential clear and copy.
>>>>>
>>>>> Sorry, it appears that you misunderstood my suggestion. I suggest revising process_huge_page() to use more sequential memory clearing and copying to improve its performance on your platform.
>>>>> --
>>>>> Best Regards,
>>>>> Huang, Ying
>>>>>
>>>>>>>> 2) case-anon-w-seq-hugetlb: (2M PMD HugeTLB)
>>>>>>>>
>>>>>>>> base:
>>>>>>>> real    0m5.175s    0m5.117s    0m4.856s
>>>>>>>> user    5m15.943s   5m7.567s    4m29.273s
>>>>>>>> sys     2m38.503s   2m21.949s   2m21.252s
>>>>>>>>
>>>>>>>> patched:
>>>>>>>> real    0m4.966s    0m4.841s    0m4.561s
>>>>>>>> user    6m30.123s   6m9.516s    5m49.733s
>>>>>>>> sys     0m58.503s   0m47.847s   0m46.785s
>>>>>>>>
>>>>>>>> This case is similar to case 1.
>>>>>>>>
>>>>>>>> 3) fallocate hugetlb 20G (2M PMD HugeTLB)
>>>>>>>>
>>>>>>>> base:
>>>>>>>> real    0m3.016s    0m3.019s    0m3.018s
>>>>>>>> user    0m0.000s    0m0.000s    0m0.000s
>>>>>>>> sys     0m3.009s    0m3.012s    0m3.010s
>>>>>>>>
>>>>>>>> patched:
>>>>>>>> real    0m1.136s    0m1.136s    0m1.136s
>>>>>>>> user    0m0.000s    0m0.000s    0m0.004s
>>>>>>>> sys     0m1.133s    0m1.133s    0m1.129s
>>>>>>>>
>>>>>>>> There is a big win on the patched kernel, similar to the above tmpfs test, so maybe we could revert commit c79b57e462b5 ("mm: hugetlb: clear target sub-page last when clearing huge page").
>>>>
>>>> I tried the following changes,
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 66cf855dee3f..e5cc75adfa10 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6777,7 +6777,7 @@ static inline int process_huge_page(
>>>>                  base = 0;
>>>>                  l = n;
>>>>                  /* Process subpages at the end of huge page */
>>>> -               for (i = nr_pages - 1; i >= 2 * n; i--) {
>>>> +               for (i = 2 * n; i < nr_pages; i++) {
>>>>                          cond_resched();
>>>>                          ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
>>>>                          if (ret)
>>>>
>>>> Since n = 0, the copying is from start to end now, but there is no improvement for case-anon-cow-seq-hugetlb.
>>>>
>>>> And if I use copy_user_gigantic_page(), the time is reduced from 6s to 3s:
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index fe21bd3beff5..2c6532d21d84 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6876,10 +6876,7 @@ int copy_user_large_folio(struct folio *dst, struct folio *src,
>>>>                  .vma = vma,
>>>>          };
>>>>
>>>> -       if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>>>> -               return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>>>> -
>>>> -       return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
>>>> +       return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
>>>>  }
>>>
>>> It appears that we have a code generation issue here. Can you check it? Is the code inlined in the same way?
>>>
>>
>> No difference, and I checked the asm: both process_huge_page and copy_user_gigantic_page are inlined. It is strange...
>
> It's not inlined in my configuration. And __always_inline below changes it for me.
>
> If it's already inlined and the code is actually almost the same, why is there a difference? Is it possible for you to do some profiling or further analysis?

Yes, will continue to debug this.

>
>>> Maybe we can start with
>>> modified   mm/memory.c
>>> @@ -6714,7 +6714,7 @@ EXPORT_SYMBOL(__might_fault);
>>>   * operation. The target subpage will be processed last to keep its
>>>   * cache lines hot.
>>>   */
>>> -static inline int process_huge_page(
>>> +static __always_inline int process_huge_page(
>>>          unsigned long addr_hint, unsigned int nr_pages,
>>>          int (*process_subpage)(unsigned long addr, int idx, void *arg),
>>>          void *arg)
>
> --
> Best Regards,
> Huang, Ying
>
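(To make the "more sequential" idea discussed above concrete, here is a rough, untested sketch, not a patch from this thread: process every subpage in ascending order, as clear_gigantic_page() does, dropping the target-last ordering that process_huge_page() currently uses. The function name is hypothetical.)

static int process_huge_page_seq(unsigned long addr_hint, unsigned int nr_pages,
		int (*process_subpage)(unsigned long addr, int idx, void *arg),
		void *arg)
{
	/* Align down to the start of the huge page, as process_huge_page() does. */
	unsigned long addr = addr_hint &
		~(((unsigned long)nr_pages << PAGE_SHIFT) - 1);
	int i, ret;

	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		ret = process_subpage(addr + i * PAGE_SIZE, i, arg);
		if (ret)
			return ret;
	}
	return 0;
}

Whether the cache-hot target-last ordering is worth keeping on top of this could then be decided by the profiling requested above.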