From mboxrd@z Thu Jan 1 00:00:00 1970
From: chenridong <chenridong@huawei.com>
Date: Mon, 21 Oct 2024 20:15:17 +0800
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
To: Barry Song <21cnbao@gmail.com>
CC: Chen Ridong, Kefeng Wang, Michal Hocko, Johannes Weiner, Yosry Ahmed,
 Yu Zhao, David Hildenbrand, Matthew Wilcox, Ryan Roberts
References: <20241010081802.290893-1-chenridong@huaweicloud.com>
 <62bd2564-76fa-4cb0-9c08-9eb2f96771b6@huawei.com>
 <28b3eae5-92e7-471f-9883-d03684e06d1b@huaweicloud.com>
On 2024/10/21 18:09, Barry Song wrote:
> On Mon, Oct 21, 2024 at 10:56 PM chenridong wrote:
>>
>> On 2024/10/21 17:42, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 9:14 PM Chen Ridong wrote:
>>>>
>>>> On 2024/10/21 12:44, Barry Song wrote:
>>>>> On Fri, Oct 11, 2024 at 7:49 PM
>>>>> chenridong wrote:
>>>>>>
>>>>>> On 2024/10/11 0:17, Barry Song wrote:
>>>>>>> On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang wrote:
>>>>>>>>
>>>>>>>> Hi Ridong,
>>>>>>>>
>>>>>>>> This should be the first version for upstream, and the issue only
>>>>>>>> occurs when a large folio is split.
>>>>>>>>
>>>>>>>> Adding more CCs to see if there's more feedback.
>>>>>>>>
>>>>>>>> On 2024/10/10 16:18, Chen Ridong wrote:
>>>>>>>>> From: Chen Ridong
>>>>>>>>>
>>>>>>>>> An issue was found with the following testing steps:
>>>>>>>>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>>>>>>>>> 2. Mount memcg v1, create a memcg named test_memcg, and set
>>>>>>>>>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
>>>>>>>>> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
>>>>>>>>>
>>>>>>>>> It was found that:
>>>>>>>>>
>>>>>>>>> cat memory.usage_in_bytes
>>>>>>>>> 2144940032
>>>>>>>>> cat memory.memsw.usage_in_bytes
>>>>>>>>> 2255056896
>>>>>>>>>
>>>>>>>>> free -h
>>>>>>>>>        total   used    free
>>>>>>>>> Mem:   31Gi    2.1Gi   27Gi
>>>>>>>>> Swap:  1.0Gi   618Mi   405Mi
>>>>>>>>>
>>>>>>>>> As shown above, test_memcg is charged for only about 100M of swap,
>>>>>>>>> but 600M+ of swap is in use, which means that about 500M may be
>>>>>>>>> wasted because other memcgs cannot use that swap space.
>>>>>>>>>
>>>>>>>>> It can be explained as follows:
>>>>>>>>> 1. When entering shrink_inactive_list, it isolates folios from the
>>>>>>>>> LRU from tail to head. Suppose, to keep it simple, that it takes only
>>>>>>>>> folioN from the LRU.
>>>>>>>>>
>>>>>>>>> inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>>>>>>>> isolated list: folioN
>>>>>>>>>
>>>>>>>>> 2. In the shrink_page_list function, if folioN is a THP, it may be
>>>>>>>>> split and added to the swap cache folio by folio. After a folio is
>>>>>>>>> added to the swap cache, writeback I/O to swap is submitted, which is
>>>>>>>>> asynchronous. When shrink_page_list finishes, the isolated folio list
>>>>>>>>> is moved back to the head of the inactive LRU.
>>>>>>>>> The inactive LRU may then look like this, with 512 folios having
>>>>>>>>> been moved to the head of the inactive LRU.
>>>>>>>>>
>>>>>>>>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>>>>>>>>
>>>>>>>>> 3. When a folio's writeback I/O completes, the folio may be rotated
>>>>>>>>> to the tail of the LRU. The following LRU list is expected, with the
>>>>>>>>> folios that were added to the swap cache rotated to the tail of the
>>>>>>>>> LRU so that they can be reclaimed as soon as possible.
>>>>>>>>>
>>>>>>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>>>>>>
>>>>>>>>> 4. However, shrink_page_list and folio writeback are asynchronous.
>>>>>>>>> If a THP is split, shrink_page_list loops at least 512 times, which
>>>>>>>>> means that some folios may finish writeback before shrink_page_list
>>>>>>>>> completes, and rotating those folios to the tail of the LRU may
>>>>>>>>> fail. The LRU may then look as below:
>>>>>>>
>>>>>>> I assume you're referring to PMD-mapped THP, but your code also modifies
>>>>>>> mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
>>>>>>>
>>>>>>>>>
>>>>>>>>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>>>>>>>>> folioN51<->folioN52<->...folioN511<->folioN512
>>>>>>>>>
>>>>>>>>> Although those folios (N1-N50) have finished writing back, they are
>>>>>>>>> still at the head of the LRU. When isolating folios from the LRU, it
>>>>>>>>> scans from tail to head, so it is difficult to scan those folios
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> The above may leave a large number of folios that have been added to
>>>>>>>>> the swap cache but cannot be reclaimed in time, which reduces
>>>>>>>>> reclaim efficiency and prevents other memcgs from using this swap
>>>>>>>>> space even if they trigger OOM.
>>>>>>>>>
>>>>>>>>> To fix this issue, it's better to stop looping if a THP has been
>>>>>>>>> split and nr_pageout is greater than nr_to_reclaim.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Chen Ridong
>>>>>>>>> ---
>>>>>>>>>  mm/vmscan.c | 16 +++++++++++++++-
>>>>>>>>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index 749cdc110c74..fd8ad251eda2 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>  	LIST_HEAD(demote_folios);
>>>>>>>>>  	unsigned int nr_reclaimed = 0;
>>>>>>>>>  	unsigned int pgactivate = 0;
>>>>>>>>> -	bool do_demote_pass;
>>>>>>>>> +	bool do_demote_pass, splited = false;
>>>>>>>>>  	struct swap_iocb *plug = NULL;
>>>>>>>>>
>>>>>>>>>  	folio_batch_init(&free_folios);
>>>>>>>>> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>>>>
>>>>>>>>>  		cond_resched();
>>>>>>>>>
>>>>>>>>> +		/*
>>>>>>>>> +		 * If a large folio has been split, many folios are added
>>>>>>>>> +		 * to folio_list. Looping through the entire list takes
>>>>>>>>> +		 * too much time, which may prevent folios that have completed
>>>>>>>>> +		 * writeback from rotating to the tail of the lru. Just
>>>>>>>>> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
>>>>>>>>> +		 */
>>>>>>>>> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
>>>>>>>>> +			break;
>>>>>>>
>>>>>>> I'm not entirely sure about the theory behind comparing stat->nr_pageout
>>>>>>> with sc->nr_to_reclaim. However, the condition might still hold true even
>>>>>>> if you've split a relatively small "large folio," such as 16kB?
>>>>>>>
>>>>>>
>>>>>> Why compare stat->nr_pageout with sc->nr_to_reclaim? Because if all the
>>>>>> pages that have been paged out can be reclaimed, then enough pages will
>>>>>> be reclaimed once they all finish writeback. Thus, it may not have to
>>>>>> page out more.
>>>>>>
>>>>>> If a small large folio (16 kB) has been split, it may return early
>>>>>> without all the pages in folio_list being paged out, but I think that
>>>>>> is fine. It can page out more pages the next time it enters
>>>>>> shrink_folio_list if not enough pages have been reclaimed.
>>>>>>
>>>>>> However, if pages that have been paged out are still at the head of
>>>>>> the LRU, it is difficult to scan these pages again. In this case, not
>>>>>> only might it "waste" some swap memory, but it also has to page out
>>>>>> more pages.
>>>>>>
>>>>>> Considering the above, I sent this patch. It may not be a perfect
>>>>>> solution, but I think it's a good option to consider. And I am
>>>>>> wondering if anyone has a better solution.
>>>>>
>>>>> Hi Ridong,
>>>>> My overall understanding is that you have not described your problem
>>>>> clearly; in particular, I don't understand what your points 3 and 4 mean:
>>>>>
>>>>>> 3. When a folio's writeback I/O completes, the folio may be rotated
>>>>>> to the tail of the LRU. The following LRU list is expected, with the
>>>>>> folios that were added to the swap cache rotated to the tail of the
>>>>>> LRU so that they can be reclaimed as soon as possible.
>>>>>>
>>>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>>
>>>>> > 4. However, shrink_page_list and folio writeback are asynchronous.
>>>>> > If a THP is split, shrink_page_list loops at least 512 times, which
>>>>> > means that some folios may finish writeback before shrink_page_list
>>>>> > completes, and rotating those folios to the tail of the LRU may fail.
>>>>>
>>>>> Can you please describe it in a more readable way?
>>>>>
>>>>> I feel your diagram below is somehow wrong:
>>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>>
>>>>> You mentioned "rotate" -- how could "rotate" make
>>>>> folioN512<->folioN511<->...folioN1 in (2)
>>>>> become
>>>>> folioN1<->...folioN511<->folioN512 in (3)?
>>>>>
>>>>
>>>> I am sorry for any confusion.
>>>>
>>>> If a THP is split, folioN1, folioN2, folioN3, ... folioN512 are
>>>> committed to writeback one by one. Assume that folioN1, folioN2,
>>>> folioN3, ... folioN512 complete in order.
>>>>
>>>> Original:
>>>> folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>>>
>>>> folioN1 finishes and is rotated to the tail of the LRU:
>>>> folioN512<->folioN511<->...folioN2<->folio1<->folio2...<->folioN-1<->folioN1
>>>>
>>>> folioN2 finishes:
>>>> folioN512<->folioN511<->...folioN3<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2
>>>>
>>>> folioN3 finishes:
>>>> folioN512<->folioN511<->...folioN4<->folio1<->folio2...<->folioN-1<->folioN1<->folioN2<->folioN3
>>>>
>>>> ...
>>>>
>>>> folioN512 finishes:
>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>>
>>>> When all the folios have finished, the LRU might look just like this:
>>>> folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>>
>>> understood, thanks!
>>>
>>> Let me try to understand the following part:
>>>
>>>> 4:
>>>> folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>>>> folioN51<->folioN52<->...folioN511<->folioN512
>>>
>>> > Although those folios (N1-N50) have finished writing back, they are
>>> > still at the head of the LRU. When isolating folios from the LRU, it
>>> > scans from tail to head, so it is difficult to scan those folios again.
>>>
>>> What is the reason that "those folios (N1-N50) have finished writing back,
>>> yet they remain at the head of the LRU"?
>>> Is it because their writeback_end occurred while we were still looping
>>> in shrink_folio_list(), causing folio_end_writeback()'s
>>> folio_rotate_reclaimable() to fail to move these folios, which are
>>> still on the "folio_list", to the tail of the LRU?
>>>
>>
>> Yes, you are right.
>>
>>>>
>>>>> BTW, writeback isn't always async. It can be synchronous for zram and
>>>>> sync-I/O swap devices. In that case, your patch might change the order
>>>>> of the LRU. I mean, for example, when an mTHP becomes cold, we always
>>>>> reclaim all of it, not part of it while putting some of the small
>>>>> folios back at the head of the LRU.
>>>>>
>>>>
>>>> Yes, the LRU order can be changed.
>>>> Although it may put part of the small folios back at the head of the
>>>> LRU, it can return from shrink_folio_list in time without causing much
>>>> additional I/O.
>>>>
>>>> If you have understood this issue, do you have any suggestions to fix
>>>> it? My patch may not be a perfect way to fix this issue.
>>>>
>>>
>>> My point is that synchronous I/O, like zRAM, doesn't have this issue and
>>> doesn't require this fix, as writeback is always completed without
>>> asynchronous latency.
>>>
>>
>> I have tested zRAM and found no issues.
>> This is version 1, and I don't know whether this fix will be accepted.
>> If it is accepted, perhaps this patch could be modified to apply only to
>> asynchronous I/O.
>
> Consider a 2MB THP: when it becomes cold, we detect that it is cold and
> decide to page it out. Even if we split it into 512 * 4KiB folios, the
> entire 2MB is still cold, so we want pageout() to be called for the
> entire 2MB. With your current approach, some parts of the 2MB are moved
> to the LRU head while we're still paging out other parts, which seems
> problematic.
>
> Could we address this in move_folios_to_lru()? Perhaps we could find
> a way to detect folios whose writeback has completed and move them to
> the tail instead of always placing them at the head.
>

I'll try to address this issue with this idea. Thank you.
Best regards,
Ridong

>>
>> Best regards,
>> Ridong
>>
>>>
>>>> Best regards,
>>>> Ridong
>>>>
>
> Thanks
> Barry