Message-ID: <31afe958-91eb-484a-90b9-91114991a9a2@huawei.com>
Date: Thu, 24 Oct 2024 18:10:20 +0800
Subject: Re: [PATCH] mm: shmem: convert to use folio_zero_range()
To: Barry Song <21cnbao@gmail.com>, Huang Ying
References: <06d99b89-17ad-447e-a8f1-8e220b5688ac@huawei.com> <20241022225603.10491-1-21cnbao@gmail.com>
From: Kefeng Wang <wangkefeng.wang@huawei.com>
In-Reply-To: <20241022225603.10491-1-21cnbao@gmail.com>

+CC Huang Ying,

On 2024/10/23 6:56, Barry Song wrote:
> On Wed, Oct 23, 2024 at 4:10 AM Kefeng Wang wrote:
>>
>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>>>>>>>>>>>>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>>>>>>>>>>>>>>> Directly use folio_zero_range() to clean up the code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are you sure there's no performance regression introduced by this?
>>>>>>>>>>>>>>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>>>>>>>>>>>>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>>>>>>>>>>>>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Right, I missed this: clear_page() might be better than memset(). I made
>>>>>>>>>>>>>>>> this change while looking at shmem_writepage(), which was already
>>>>>>>>>>>>>>>> converted from clear_highpage() to folio_zero_range(). Grepping for
>>>>>>>>>>>>>>>> folio_zero_range() shows some other callers that zero a whole folio:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:	folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:	folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>> fs/bcachefs/fs-io-buffered.c:	folio_zero_range(f, 0, folio_size(f));
>>>>>>>>>>>>>>>> fs/libfs.c:	folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>> fs/ntfs3/frecord.c:	folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>> mm/page_io.c:	folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>> mm/shmem.c:	folio_zero_range(folio, 0, folio_size(folio));
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> IOW, what performance testing have you done with this patch?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No performance testing before, but I wrote a testcase:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>>>>>>>>>>>>>>> 2) measure the time (us) to clear all N folios with
>>>>>>>>>>>>>>>>    clear_highpage / folio_zero_range / folio_zero_user
>>>>>>>>>>>>>>>> 3) release the N folios
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The results (5 runs) on my machine are shown below.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> N=1
>>>>>>>>>>>>>>>>        clear_highpage   folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>   1        69                 74                177
>>>>>>>>>>>>>>>>   2        57                 62                168
>>>>>>>>>>>>>>>>   3        54                 58                234
>>>>>>>>>>>>>>>>   4        54                 58                157
>>>>>>>>>>>>>>>>   5        56                 62                148
>>>>>>>>>>>>>>>> avg        58                 62.8              176.8
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> N=100
>>>>>>>>>>>>>>>>        clear_highpage   folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>   1      11015              11309              32833
>>>>>>>>>>>>>>>>   2      10385              11110              49751
>>>>>>>>>>>>>>>>   3      10369              11056              33095
>>>>>>>>>>>>>>>>   4      10332              11017              33106
>>>>>>>>>>>>>>>>   5      10483              11000              49032
>>>>>>>>>>>>>>>> avg    10516.8            11098.4            39563.4
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> N=512
>>>>>>>>>>>>>>>>        clear_highpage   folio_zero_range   folio_zero_user
>>>>>>>>>>>>>>>>   1      55560              60055             156876
>>>>>>>>>>>>>>>>   2      55485              60024             157132
>>>>>>>>>>>>>>>>   3      55474              60129             156658
>>>>>>>>>>>>>>>>   4      55555              59867             157259
>>>>>>>>>>>>>>>>   5      55528              59932             157108
>>>>>>>>>>>>>>>> avg    55520.4            60001.4           157006.6
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> folio_zero_user() does many cond_resched() calls, so its times fluctuate
>>>>>>>>>>>>>>>> a lot; clear_highpage() is better than folio_zero_range(), as you said.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>>>>>>>>>>>>>>> folio_size(folio)) callers to clear_highpage() + flush_dcache_folio()?
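(Such a helper might look roughly like the sketch below; the folio_zero() name
is only the suggestion from the quoted discussion, not an existing API.)

/*
 * Sketch of the proposed helper: zero a whole folio via the per-page
 * clear_highpage() path (which can use an optimised clear_page())
 * instead of memset() through folio_zero_range(). Name and placement
 * are tentative.
 */
static inline void folio_zero(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++)
		clear_highpage(folio_page(folio, i));
	flush_dcache_folio(folio);
}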
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If this also improves performance for other existing callers of
>>>>>>>>>>>>>>> folio_zero_range(), then that's a positive outcome.
>>>>>>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>>>> Hi Kefeng,
>>>>>>>>>>>> what's your point? providing a helper like clear_highfolio() or similar?
>>>>>>>>>>>
>>>>>>>>>>> Yes, from the test above, using clear_highpage()/flush_dcache_folio() is
>>>>>>>>>>> better than using folio_zero_range() for zeroing a folio (especially a
>>>>>>>>>>> large folio), so I'd like to add a new helper, maybe named folio_zero()
>>>>>>>>>>> since it zeroes the whole folio.
>>>>>>>>>>
>>>>>>>>>> we already have a helper like folio_zero_user()?
>>>>>>>>>> it is not good enough?
>>>>>>>>>
>>>>>>>>> Since it does many cond_resched() calls, its performance is the worst...
>>>>>>>>
>>>>>>>> Not exactly? It should have zero cost for a preemptible kernel.
>>>>>>>> For a non-preemptible kernel, it helps avoid clearing the folio
>>>>>>>> from occupying the CPU and starving other processes, right?
>>>>>>>
>>>>>>> --- a/mm/shmem.c
>>>>>>> +++ b/mm/shmem.c
>>>>>>>
>>>>>>> @@ -2393,10 +2393,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>>>>>>             * it now, lest undo on failure cancel our earlier guarantee.
>>>>>>>             */
>>>>>>>
>>>>>>>            if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>>>> -               long i, n = folio_nr_pages(folio);
>>>>>>> -
>>>>>>> -               for (i = 0; i < n; i++)
>>>>>>> -                       clear_highpage(folio_page(folio, i));
>>>>>>> +               folio_zero_user(folio, vmf->address);
>>>>>>>                    flush_dcache_folio(folio);
>>>>>>>                    folio_mark_uptodate(folio);
>>>>>>>            }
>>>>>>>
>>>>>>> Do we perform better or worse with the above?
>>>>>>
>>>>>> This path is for SGP_FALLOC, where vmf is NULL, so we could use
>>>>>> folio_zero_user(folio, 0). I think the performance is worse; I will
>>>>>> retest once I can access the hardware.
>>>>>
>>>>> Perhaps, since the current code uses clear_highpage(). Does using
>>>>> index << PAGE_SHIFT as the addr_hint offer any benefit?
>>>>
>>>> When using folio_zero_user(), the performance is very bad with the above
>>>> fallocate test (mount huge=always):
>>>>
>>>>        folio_zero_range   clear_highpage   folio_zero_user
>>>> real      0m1.214s           0m1.111s         0m3.159s
>>>> user      0m0.000s           0m0.000s         0m0.000s
>>>> sys       0m1.210s           0m1.109s         0m3.152s
>>>>
>>>> I tried addr_hint = 0 and index << PAGE_SHIFT; no obvious difference.
>>>
>>> Interesting. Does your kernel have preemption disabled or
>>> preemption_debug enabled?
>>
>> ARM64 server, CONFIG_PREEMPT_NONE=y
>
> this explains why the performance is much worse.
>
>>
>>>
>>> If not, it makes me wonder whether folio_zero_user() in
>>> alloc_anon_folio() is actually improving performance as expected,
>>> compared to the simpler folio_zero() you plan to implement. :-)
>>
>> Yes, maybe. folio_zero_user() (formerly clear_huge_page()) dates back to
>> commit 47ad8475c000 ("thp: clear_copy_huge_page"), where clear_huge_page()
>> was used for HugeTLB, and clearing a PUD-sized page may take a long time.
>> But for PMD-sized or other large folios, cond_resched() is not necessary,
>> since we already have folio_zero_range() callers that clear large folios
>> and no issue has been reported.
>
> probably worth an optimization. calling cond_resched() for each page
> seems too aggressive and useless.
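(By "batched cond_resched()" below I mean something along the lines of this
sketch: rescheduling once per batch of pages instead of once per page. The
helper name and batch size are made up for illustration.)

#define ZERO_RESCHED_BATCH	64

/* Illustrative only: clear a large folio page by page, but only offer to
 * reschedule every ZERO_RESCHED_BATCH pages rather than on every page. */
static void folio_zero_batched(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++) {
		clear_highpage(folio_page(folio, i));
		if (((i + 1) % ZERO_RESCHED_BATCH) == 0)
			cond_resched();
	}
	flush_dcache_folio(folio);
}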
After some testing, I think cond_resched() is not the root cause: no
performance is gained with a batched cond_resched(), and even if I remove
cond_resched() from process_huge_page() entirely, there is no improvement.

But when I unconditionally use clear_gigantic_page() in folio_zero_user()
(patched), there is a big improvement in the above fallocate test on tmpfs
(mounted with huge=always). I also ran some other test cases:

1) case-anon-w-seq-mt (2M PMD THP)

base:
real	0m2.490s    0m2.254s    0m2.272s
user	1m59.980s   2m23.431s   2m18.739s
sys	1m3.675s    1m15.462s   1m15.030s

patched:
real	0m2.234s    0m2.225s    0m2.159s
user	2m56.105s   2m57.117s   3m0.489s
sys	0m17.064s   0m17.564s   0m16.150s

The patched kernel wins on sys time and loses on user time, but real time is
almost the same, maybe a little better than base.

2) case-anon-w-seq-hugetlb (2M PMD HugeTLB)

base:
real	0m5.175s    0m5.117s    0m4.856s
user	5m15.943s   5m7.567s    4m29.273s
sys	2m38.503s   2m21.949s   2m21.252s

patched:
real	0m4.966s    0m4.841s    0m4.561s
user	6m30.123s   6m9.516s    5m49.733s
sys	0m58.503s   0m47.847s   0m46.785s

This case is similar to case 1.

3) fallocate hugetlb 20G (2M PMD HugeTLB)

base:
real	0m3.016s    0m3.019s    0m3.018s
user	0m0.000s    0m0.000s    0m0.000s
sys	0m3.009s    0m3.012s    0m3.010s

patched:
real	0m1.136s    0m1.136s    0m1.136s
user	0m0.000s    0m0.000s    0m0.004s
sys	0m1.133s    0m1.133s    0m1.129s

There is a big win on the patched kernel, similar to the tmpfs test above, so
maybe we could revert commit c79b57e462b5 ("mm: hugetlb: clear target sub-page
last when clearing huge page").
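(For reference, the "patched" behaviour above corresponds roughly to making
folio_zero_user() always clear the pages in straight-line order, as
clear_gigantic_page() does, instead of going through process_huge_page()'s
cache-locality ordering. A sketch, not the exact change:)

void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
	unsigned long i, nr_pages = folio_nr_pages(folio);
	unsigned long base = ALIGN_DOWN(addr_hint, folio_size(folio));

	for (i = 0; i < nr_pages; i++) {
		/* The per-page cond_resched() was found above not to be the
		 * bottleneck, so it is kept here. */
		cond_resched();
		clear_user_highpage(folio_page(folio, i),
				    base + i * PAGE_SIZE);
	}
}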