From mboxrd@z Thu Jan 1 00:00:00 1970
From: chenridong <chenridong@huawei.com>
Date: Mon, 21 Oct 2024 20:15:17 +0800
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
To: Barry Song <21cnbao@gmail.com>
CC: Chen Ridong, Kefeng Wang, Michal Hocko, Johannes Weiner, Yosry Ahmed,
 Yu Zhao, David Hildenbrand, Matthew Wilcox, Ryan Roberts
References: <20241010081802.290893-1-chenridong@huaweicloud.com>
 <62bd2564-76fa-4cb0-9c08-9eb2f96771b6@huawei.com>
 <28b3eae5-92e7-471f-9883-d03684e06d1b@huaweicloud.com>
On 2024/10/21 18:09, Barry Song wrote:
> On Mon, Oct 21, 2024 at 10:56 PM chenridong wrote:
>>
>> On 2024/10/21 17:42, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong wrote:
>>>>
>>>> On 2024/10/21 12:44, Barry Song wrote:
>>>>> On Fri, Oct 11, 2024 at 7:49 PM
>>>>> chenridong wrote:
>>>>>>
>>>>>> On 2024/10/11 0:17, Barry Song wrote:
>>>>>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang wrote:
>>>>>>>>
>>>>>>>> Hi Ridong,
>>>>>>>>
>>>>>>>> This should be the first version for upstream, and the issue only
>>>>>>>> occurs when a large folio is split.
>>>>>>>>
>>>>>>>> Adding more CCs to see if there's more feedback.
>>>>>>>>
>>>>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
>>>>>>>>> From: Chen Ridong
>>>>>>>>>
>>>>>>>>> An issue was found with the following testing steps:
>>>>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>>>>>>>> 2. Mount memcg v1, create a memcg named test_memcg, and set
>>>>>>>>>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>>>>>>>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
>>>>>>>>>
>>>>>>>>> It was found that:
>>>>>>>>>
>>>>>>>>> cat memory.usage_in_bytes
>>>>>>>>> 2144940032
>>>>>>>>> cat memory.memsw.usage_in_bytes
>>>>>>>>> 2255056896
>>>>>>>>>
>>>>>>>>> free -h
>>>>>>>>>        total   used    free
>>>>>>>>> Mem:   31Gi    2.1Gi   27Gi
>>>>>>>>> Swap:  1.0Gi   618Mi   405Mi
>>>>>>>>>
>>>>>>>>> As shown above, test_memcg is charged for only about 100M of swap,
>>>>>>>>> but 600M+ of swap is in use, which means that about 500M may be
>>>>>>>>> wasted because other memcgs cannot use that swap space.
>>>>>>>>>
>>>>>>>>> It can be explained as follows:
>>>>>>>>> 1. When entering shrink_inactive_list, it isolates folios from the
>>>>>>>>> LRU from tail to head. Suppose, to keep it simple, that it takes only
>>>>>>>>> folioN from the LRU.
>>>>>>>>>
>>>>>>>>> inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>>>>>>> isolated list: folioN
>>>>>>>>>
>>>>>>>>> 2. In the shrink_page_list function, if folioN is a THP, it may be
>>>>>>>>> split and added to the swap cache folio by folio. After a folio is
>>>>>>>>> added to the swap cache, writeback I/O to swap is submitted, which is
>>>>>>>>> asynchronous. When shrink_page_list finishes, the isolated folio list
>>>>>>>>> is moved back to the head of the inactive LRU.
>>>>>>>>> The inactive LRU may then look like this, with 512 folios having
>>>>>>>>> been moved to the head of the inactive LRU.
>>>>>>>>>
>>>>>>>>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>>>>>>>>
>>>>>>>>> 3. When a folio's writeback I/O completes, the folio may be rotated
>>>>>>>>> to the tail of the LRU. The following LRU list is expected, with the
>>>>>>>>> folios that were added to the swap cache rotated to the tail of the
>>>>>>>>> LRU so that they can be reclaimed as soon as possible.
>>>>>>>>>
>>>>>>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>>>>>>
>>>>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous.
>>>>>>>>> If a THP is split, shrink_page_list loops at least 512 times, which
>>>>>>>>> means that some folios may finish writeback before shrink_page_list
>>>>>>>>> completes, and rotating those folios to the tail of the LRU may
>>>>>>>>> fail. The LRU may then look as below:
>>>>>>>
>>>>>>> I assume you're referring to PMD-mapped THP, but your code also modifies
>>>>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>>>>>>>
>>>>>>>>>
>>>>>>>>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>>>>>>>>> folioN51<->folioN52<->...folioN511<->folioN512
>>>>>>>>>
>>>>>>>>> Although those folios (N1-N50) have finished writing back, they are
>>>>>>>>> still at the head of the LRU. When isolating folios from the LRU, it
>>>>>>>>> scans from tail to head, so it is difficult to scan those folios
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> The above may leave a large number of folios that have been added to
>>>>>>>>> the swap cache but cannot be reclaimed in time, which reduces
>>>>>>>>> reclaim efficiency and prevents other memcgs from using this swap
>>>>>>>>> space even if they trigger OOM.
>>>>>>>>>
>>>>>>>>> To fix this issue, it's better to stop looping if a THP has been
>>>>>>>>> split and nr_pageout is greater than nr_to_reclaim.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Chen Ridong
>>>>>>>>> ---
>>>>>>>>>  mm/vmscan.c | 16 +++++++++++++++-
>>>>>>>>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index 749cdc110c74..fd8ad251eda2 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>  	LIST_HEAD(demote_folios);
>>>>>>>>>  	unsigned int nr_reclaimed = 0;
>>>>>>>>>  	unsigned int pgactivate = 0;
>>>>>>>>> -	bool do_demote_pass;
>>>>>>>>> +	bool do_demote_pass, splited = false;
>>>>>>>>>  	struct swap_iocb *plug = NULL;
>>>>>>>>>
>>>>>>>>>  	folio_batch_init(&free_folios);
>>>>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>
>>>>>>>>>  		cond_resched();
>>>>>>>>>
>>>>>>>>> +		/*
>>>>>>>>> +		 * If a large folio has been split, many folios are added
>>>>>>>>> +		 * to folio_list. Looping through the entire list takes
>>>>>>>>> +		 * too much time, which may prevent folios that have completed
>>>>>>>>> +		 * writeback from rotating to the tail of the lru. Just
>>>>>>>>> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
>>>>>>>>> +		 */
>>>>>>>>> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>>>>>>>> +			break;
>>>>>>>
>>>>>>> I'm not entirely sure about the theory behind comparing stat->nr_pageout
>>>>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
>>>>>>> if you've split a relatively small "large folio," such as 16kB?
>>>>>>>
>>>>>>
>>>>>> Why compare stat->nr_pageout with sc->nr_to_reclaim? Because if all the
>>>>>> pages that have been paged out can be reclaimed, then enough pages will
>>>>>> be reclaimed once they all finish writeback. Thus, it may not have to
>>>>>> page out more.
>>>>>>
>>>>>> If a small large folio (16 kB) has been split, it may return early
>>>>>> without all the pages in folio_list being paged out, but I think that
>>>>>> is fine. It can page out more pages the next time it enters
>>>>>> shrink_folio_list if not enough pages have been reclaimed.
>>>>>>
>>>>>> However, if pages that have been paged out are still at the head of
>>>>>> the LRU, it is difficult to scan these pages again. In this case, not
>>>>>> only might it "waste" some swap memory, but it also has to page out
>>>>>> more pages.
>>>>>>
>>>>>> Considering the above, I sent this patch. It may not be a perfect
>>>>>> solution, but I think it's a good option to consider. And I am
>>>>>> wondering if anyone has a better solution.
>>>>>
>>>>> Hi Ridong,
>>>>> My overall understanding is that you have not described your problem
>>>>> clearly; in particular, I don't understand what your points 3 and 4 mean:
>>>>>
>>>>>> 3. When a folio's writeback I/O completes, the folio may be rotated
>>>>>> to the tail of the LRU. The following LRU list is expected, with the
>>>>>> folios that were added to the swap cache rotated to the tail of the
>>>>>> LRU so that they can be reclaimed as soon as possible.
>>>>>>
>>>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>>
>>>>> > 4. However, shrink_page_list and folio writeback are asynchronous.
>>>>> > If a THP is split, shrink_page_list loops at least 512 times, which
>>>>> > means that some folios may finish writeback before shrink_page_list
>>>>> > completes, and rotating those folios to the tail of the LRU may fail.
>>>>>
>>>>> Can you please describe it in a more readable way?
>>>>>
>>>>> I feel your diagram below is somehow wrong:
>>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>>
>>>>> You mentioned "rotate" -- how could "rotate" make
>>>>> folioN512<->folioN511<->...folioN1 in (2)
>>>>> become
>>>>> folioN1<->...folioN511<->folioN512 in (3)?
>>>>>
>>>>
>>>> I am sorry for any confusion.
>>>>
>>>> If a THP is split, folioN1, folioN2, folioN3, ... folioN512 are
>>>> committed to writeback one by one. Assume that folioN1, folioN2,
>>>> folioN3, ... folioN512 complete in order.
>>>>
>>>> Original:
>>>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>>>
>>>> folioN1 finishes and is rotated to the tail of the LRU:
>>>> folioN512<->folioN511<->...folioN2<->folio1<->folio2...<->folioN-1<->folioN1
>>>>
>>>> folioN2 finishes:
>>>> folioN512<->folioN511<->...folioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>>>>
>>>> folioN3 finishes:
>>>> folioN512<->folioN511<->...folioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->folioN3
>>>>
>>>> ...
>>>>
>>>> folioN512 finishes:
>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>
>>>> When all the folios have finished, the LRU might look just like this:
>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>
>>> understood, thanks!
>>>
>>> Let me try to understand the following part:
>>>
>>>> 4:
>>>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>>>> folioN51<->folioN52<->...folioN511<->folioN512
>>>
>>> > Although those folios (N1-N50) have finished writing back, they are
>>> > still at the head of the LRU. When isolating folios from the LRU, it
>>> > scans from tail to head, so it is difficult to scan those folios again.
>>>
>>> What is the reason that "those folios (N1-N50) have finished writing back,
>>> yet they remain at the head of the LRU"?
>>> Is it because their writeback_end occurred while we were still looping
>>> in shrink_folio_list(), causing folio_end_writeback()'s
>>> folio_rotate_reclaimable() to fail to move these folios, which are
>>> still on the "folio_list", to the tail of the LRU?
>>>
>>
>> Yes, you are right.
>>
>>>>
>>>>> BTW, writeback isn't always async. It can be synchronous for zram and
>>>>> sync-I/O swap devices. In that case, your patch might change the order
>>>>> of the LRU. I mean, for example, when an mTHP becomes cold, we always
>>>>> reclaim all of it, not part of it while putting some of the small
>>>>> folios back at the head of the LRU.
>>>>>
>>>>
>>>> Yes, the LRU order can be changed.
>>>> Although it may put part of the small folios back at the head of the
>>>> LRU, it can return from shrink_folio_list in time without causing much
>>>> additional I/O.
>>>>
>>>> If you have understood this issue, do you have any suggestions to fix
>>>> it? My patch may not be a perfect way to fix this issue.
>>>>
>>>
>>> My point is that synchronous I/O, like zRAM, doesn't have this issue and
>>> doesn't require this fix, as writeback is always completed without
>>> asynchronous latency.
>>>
>>
>> I have tested zRAM and found no issues.
>> This is version 1, and I don't know whether this fix will be accepted.
>> If it is accepted, perhaps this patch could be modified to apply only to
>> asynchronous I/O.
>
> Consider a 2MB THP: when it becomes cold, we detect that it is cold and
> decide to page it out. Even if we split it into 512 * 4KiB folios, the
> entire 2MB is still cold, so we want pageout() to be called for the
> entire 2MB. With your current approach, some parts of the 2MB are moved
> to the LRU head while we're still paging out other parts, which seems
> problematic.
>
> Could we address this in move_folios_to_lru()? Perhaps we could find
> a way to detect folios whose writeback has completed and move them to
> the tail instead of always placing them at the head.
>

I'll try to address this issue with this idea. Thank you.
Best regards,
Ridong

>>
>> Best regards,
>> Ridong
>>
>>>
>>>> Best regards,
>>>> Ridong
>>>>
>
> Thanks
> Barry