From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <62bd2564-76fa-4cb0-9c08-9eb2f96771b6@huawei.com>
Date: Fri, 11 Oct 2024 14:49:48 +0800
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been
 page_out
From: chenridong <chenridong@huawei.com>
To: Barry Song, Kefeng Wang
CC: Chen Ridong, Michal Hocko, Johannes Weiner, Yosry Ahmed, Yu Zhao,
 David Hildenbrand, Matthew Wilcox, Ryan Roberts
References: <20241010081802.290893-1-chenridong@huaweicloud.com>
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
On 2024/10/11 0:17, Barry Song wrote:
> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang wrote:
>>
>> Hi Ridong,
>>
>> This should be the first version for upstream, and the issue only
>> occurs when a large folio is split.
>>
>> Adding more CCs to see if there's more feedback.
>>
>>
>> On 2024/10/10 16:18, Chen Ridong wrote:
>>> From: Chen Ridong
>>>
>>> An issue was found with the following testing steps:
>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
>>> 2. Mount memcg v1, create a memcg named test_memcg, and set
>>>    limit_in_bytes=2.1G, memsw.limit_in_bytes=3G.
>>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
>>>
>>> It was found that:
>>>
>>>   cat memory.usage_in_bytes
>>>   2144940032
>>>   cat memory.memsw.usage_in_bytes
>>>   2255056896
>>>
>>>   free -h
>>>          total   used   free
>>>   Mem:    31Gi  2.1Gi   27Gi
>>>   Swap:  1.0Gi  618Mi  405Mi
>>>
>>> As shown above, test_memcg used only about 100M of swap, but 600M+ of
>>> swap was consumed, which means about 500M may be wasted because other
>>> memcgs cannot use that swap space.
>>>
>>> It can be explained as follows:
>>> 1. When entering shrink_inactive_list, folios are isolated from the
>>>    lru from tail to head. Suppose it only takes folioN (to keep it
>>>    simple).
>>>
>>>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>    isolated list: folioN
>>>
>>> 2. In shrink_page_list, if folioN is a THP, it may be split and added
>>>    to the swap cache folio by folio. After being added to the swap
>>>    cache, IO is submitted to write each folio back to swap, which is
>>>    asynchronous. When shrink_page_list finishes, the isolated folios
>>>    are moved back to the head of the inactive lru. The inactive lru
>>>    then looks like this, with the 512 split folios at the head:
>>>
>>>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>>
>>> 3. When a folio's writeback IO completes, the folio may be rotated to
>>>    the tail of the lru. The following lru list is expected, with the
>>>    folios that were added to the swap cache rotated to the tail so
>>>    that they can be reclaimed as soon as possible:
>>>
>>>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>
>>> 4. However, shrink_page_list and folio writeback are asynchronous.
>>>    If a THP is split, shrink_page_list loops at least 512 times, which
>>>    means that shrink_page_list has not yet completed while some folio
>>>    writebacks have already finished, and this may lead to a failure to
>>>    rotate those folios to the tail of the lru. The lru may look as
>>>    below:
>
> I assume you’re referring to PMD-mapped THP, but your code also modifies
> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>
>>>
>>>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>>>    folioN51<->folioN52<->...folioN511<->folioN512
>>>
>>> Although those folios (N1-N50) have finished writing back, they are
>>> still at the head of the lru. When isolating folios from the lru, the
>>> scan goes from tail to head, so it is difficult to scan those folios
>>> again.
>>>
>>> What is described above may leave a large number of folios that have
>>> been added to the swap cache but cannot be reclaimed in time, which
>>> may reduce reclaim efficiency and prevent other memcgs from using
>>> this swap space even if they trigger OOM.
>>>
>>> To fix this issue, it's better to stop looping if a THP has been
>>> split and nr_pageout is greater than nr_to_reclaim.
>>>
>>> Signed-off-by: Chen Ridong
>>> ---
>>>  mm/vmscan.c | 16 +++++++++++++++-
>>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 749cdc110c74..fd8ad251eda2 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>  	LIST_HEAD(demote_folios);
>>>  	unsigned int nr_reclaimed = 0;
>>>  	unsigned int pgactivate = 0;
>>> -	bool do_demote_pass;
>>> +	bool do_demote_pass, splited = false;
>>>  	struct swap_iocb *plug = NULL;
>>>
>>>  	folio_batch_init(&free_folios);
>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>
>>>  		cond_resched();
>>>
>>> +		/*
>>> +		 * If a large folio has been split, many folios are added
>>> +		 * to folio_list. Looping through the entire list takes
>>> +		 * too much time, which may prevent folios that have completed
>>> +		 * writeback from rotating to the tail of the lru. Just
>>> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
>>> +		 */
>>> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>> +			break;
>
> I’m not entirely sure about the theory behind comparing stat->nr_pageout
> with sc->nr_to_reclaim. However, the condition might still hold true even
> if you’ve split a relatively small “large folio,” such as 16kB?
>

Why compare stat->nr_pageout with sc->nr_to_reclaim? Because if all the
pages that have been paged out can be reclaimed, then enough pages will
have been reclaimed once they all finish writeback. Thus, it may not be
necessary to page out more.

If a small large folio (16kB) has been split, shrink_folio_list may
return early without all the pages in folio_list being paged out, but I
think that is fine. It can page out more pages the next time it enters
shrink_folio_list if not enough pages have been reclaimed. However, if
pages that have already been paged out are still at the head of the LRU,
it is difficult to scan those pages again. In that case, not only might
some swap space be "wasted", but more pages also have to be paged out.

Considering the above, I sent this patch. It may not be a perfect
solution, but I think it's a good option to consider. And I am wondering
if anyone has a better solution.
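To make the intended behaviour concrete, here is a small user-space model
of the early exit. Everything in it is invented for illustration
(toy_shrink, LIST_LEN and NR_TO_RECLAIM are not kernel names); only the
break condition mirrors the patch, and it assumes for simplicity that
every page is successfully paged out:

```c
/*
 * Toy user-space model of the proposed early exit; not kernel code.
 * Stop walking the isolated list once a large folio has been split
 * and nr_pageout already exceeds the reclaim target, instead of
 * looping over all 512 split pages.
 */
#include <assert.h>
#include <stdbool.h>

#define LIST_LEN      512  /* pages on folio_list after splitting one PMD THP */
#define NR_TO_RECLAIM  32  /* reclaim target for this shrink pass */

/* Returns how many loop iterations the shrink pass performed. */
static int toy_shrink(bool early_exit, int *nr_pageout_out)
{
	int nr_pageout = 0;
	bool splited = false;   /* set once a large folio is split */
	int iterations = 0;

	for (int i = 0; i < LIST_LEN; i++) {
		/* Proposed check: enough pages already submitted for pageout. */
		if (early_exit && splited && nr_pageout > NR_TO_RECLAIM)
			break;
		iterations++;
		/* In this model, the first entry is the THP that gets split. */
		if (i == 0)
			splited = true;
		nr_pageout++;   /* pretend every page is paged out */
	}
	*nr_pageout_out = nr_pageout;
	return iterations;
}
```

With the check, the pass stops after NR_TO_RECLAIM+1 pageouts rather
than walking all 512 entries, which is the whole point: return early so
completed writebacks can still be rotated to the LRU tail.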
Best regards,
Ridong

>>> +
>>>  		folio = lru_to_folio(folio_list);
>>>  		list_del(&folio->lru);
>>>
>>> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>  		if ((nr_pages > 1) && !folio_test_large(folio)) {
>>>  			sc->nr_scanned -= (nr_pages - 1);
>>>  			nr_pages = 1;
>>> +			splited = true;
>>>  		}
>>>
>>>  		/*
>>> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>  			if (nr_pages > 1 && !folio_test_large(folio)) {
>>>  				sc->nr_scanned -= (nr_pages - 1);
>>>  				nr_pages = 1;
>>> +				splited = true;
>>>  			}
>>>  			goto activate_locked;
>>>  		case PAGE_SUCCESS:
>>>  			if (nr_pages > 1 && !folio_test_large(folio)) {
>>>  				sc->nr_scanned -= (nr_pages - 1);
>>>  				nr_pages = 1;
>>> +				splited = true;
>>>  			}
>>>  			stat->nr_pageout += nr_pages;
>>>
>>> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>  		if (nr_pages > 1) {
>>>  			sc->nr_scanned -= (nr_pages - 1);
>>>  			nr_pages = 1;
>>> +			splited = true;
>>>  		}
>>>  activate_locked:
>>>  		/* Not a candidate for swapping, so reclaim swap space. */
>>
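P.S. The "stuck at the head" race in the changelog can also be modelled
in user space. The sketch below is purely illustrative
(toy_rotated_to_tail and NFOLIOS are made-up names); it assumes, as I
understand folio_rotate_reclaimable, that an IO completion can only
rotate a folio that is already back on an LRU, not one still sitting on
the isolated list:

```c
/*
 * Toy model: writeback completions that race with shrink_folio_list
 * cannot rotate their folios, because those folios are still isolated;
 * the later putback then leaves them at the LRU head.
 */
#include <stdbool.h>

#define NFOLIOS 8  /* split folios under writeback */

/*
 * Returns how many folios end up rotated to the LRU tail when the
 * first `completed_before_putback` of them finish IO while shrink is
 * still running (i.e. before the putback to the LRU head).
 */
static int toy_rotated_to_tail(int completed_before_putback)
{
	bool on_lru[NFOLIOS] = { false };  /* all isolated during shrink */
	int rotated = 0;

	/* Completions racing with shrink: rotation needs the folio on the LRU. */
	for (int i = 0; i < completed_before_putback; i++)
		if (on_lru[i])   /* never true here: still isolated */
			rotated++;

	/* shrink_folio_list returns; folios are put back (at the head). */
	for (int i = 0; i < NFOLIOS; i++)
		on_lru[i] = true;

	/* Completions arriving after putback can rotate to the tail. */
	for (int i = completed_before_putback; i < NFOLIOS; i++)
		if (on_lru[i])
			rotated++;

	return rotated;
}
```

The early completions are exactly the folios (N1-N50 in the changelog)
that stay at the head and are hard to scan again.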