From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 10 Oct 2024 16:59:10 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
To: Chen Ridong
CC: Michal Hocko, Johannes Weiner, Yosry Ahmed, Yu Zhao, David Hildenbrand,
 Matthew Wilcox, Ryan Roberts, Barry Song
References: <20241010081802.290893-1-chenridong@huaweicloud.com>
Content-Language: en-US
From: Kefeng Wang
In-Reply-To: <20241010081802.290893-1-chenridong@huaweicloud.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit

Hi Ridong,

This should be the first version for upstream, and the issue only occurs
when a large folio is split. Adding more CCs to see if there's more
feedback.

On 2024/10/10 16:18, Chen Ridong wrote:
> From: Chen Ridong
>
> An issue was found with the following testing steps:
> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> 2. Mount memcg v1, create a memcg named test_memcg and set
>    usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
> 3. Create a 1G swap file and allocate 2.2G of anon memory in test_memcg.
>
> It was found that:
>
> cat memory.usage_in_bytes
> 2144940032
> cat memory.memsw.usage_in_bytes
> 2255056896
>
> free -h
>        total   used   free
> Mem:    31Gi  2.1Gi   27Gi
> Swap:  1.0Gi  618Mi  405Mi
>
> As shown above, test_memcg is charged for only about 100M of swap, but
> 600M+ of swap is actually in use, which means roughly 500M may be wasted
> because other memcgs cannot use that swap space.
>
> It can be explained as follows:
> 1. When entering shrink_inactive_list, folios are isolated from the lru,
>    scanning from tail to head. To keep it simple, assume only folioN is
>    taken from the lru.
>
>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
>    isolated list: folioN
>
> 2. In the shrink_page_list function, if folioN is a THP, it may be split
>    and added to the swap cache folio by folio. After a folio is added to
>    the swap cache, IO is submitted to write it back to swap, which is
>    asynchronous. When shrink_page_list finishes, the isolated folio list
>    is moved back to the head of the inactive lru.
>    The inactive lru may then look like this, with the 512 split folios
>    moved to the head of the inactive lru:
>
>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
> 3. When a folio's writeback IO completes, the folio may be rotated to
>    the tail of the lru. The expected lru list is the following, with the
>    folios that were added to the swap cache rotated to the tail of the
>    lru so that they can be reclaimed as soon as possible:
>
>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> 4. However, shrink_page_list and folio writeback are asynchronous. If
>    the THP is split, shrink_page_list loops at least 512 times, which
>    means that some folios may finish writeback before shrink_page_list
>    completes, and rotating them to the tail of the lru can then fail.
>    The lru may end up looking like this:
>
>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>    folioN51<->folioN52<->...folioN511<->folioN512
>
> Although those folios (N1-N50) have finished writeback, they are still
> at the head of the lru. Since folios are isolated from the lru by
> scanning from tail to head, it is difficult to scan them again.
>
> All of the above can leave a large number of folios that have been added
> to the swap cache but cannot be reclaimed in time, which reduces reclaim
> efficiency and prevents other memcgs from using this swap space even if
> they trigger OOM.
>
> To fix this issue, stop the loop if a THP has been split and nr_pageout
> is greater than nr_to_reclaim.
>
> Signed-off-by: Chen Ridong
> ---
>  mm/vmscan.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c74..fd8ad251eda2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	LIST_HEAD(demote_folios);
>  	unsigned int nr_reclaimed = 0;
>  	unsigned int pgactivate = 0;
> -	bool do_demote_pass;
> +	bool do_demote_pass, splited = false;
>  	struct swap_iocb *plug = NULL;
>
>  	folio_batch_init(&free_folios);
> @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>
>  		cond_resched();
>
> +		/*
> +		 * If a large folio has been split, many folios are added
> +		 * to folio_list. Looping through the entire list takes
> +		 * too much time, which may prevent folios that have completed
> +		 * writeback from rotating to the tail of the lru. Just
> +		 * stop looping if nr_pageout is greater than nr_to_reclaim.
> +		 */
> +		if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> +			break;
> +
>  		folio = lru_to_folio(folio_list);
>  		list_del(&folio->lru);
>
> @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if ((nr_pages > 1) && !folio_test_large(folio)) {
>  			sc->nr_scanned -= (nr_pages - 1);
>  			nr_pages = 1;
> +			splited = true;
>  		}
>
>  		/*
> @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  			if (nr_pages > 1 && !folio_test_large(folio)) {
>  				sc->nr_scanned -= (nr_pages - 1);
>  				nr_pages = 1;
> +				splited = true;
>  			}
>  			goto activate_locked;
>  		case PAGE_SUCCESS:
>  			if (nr_pages > 1 && !folio_test_large(folio)) {
>  				sc->nr_scanned -= (nr_pages - 1);
>  				nr_pages = 1;
> +				splited = true;
>  			}
>  			stat->nr_pageout += nr_pages;
>
> @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  		if (nr_pages > 1) {
>  			sc->nr_scanned -= (nr_pages - 1);
>  			nr_pages = 1;
> +			splited = true;
>  		}
>  activate_locked:
>  		/* Not a candidate for swapping, so reclaim swap space. */
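
For anyone following steps 3 and 4 of the changelog, the rotation path that
fails for still-isolated folios looks roughly like the sketch below. This is
a simplified paraphrase of folio_end_writeback() and folio_rotate_reclaimable()
from mm/filemap.c and mm/swap.c, not the exact code, and the details differ
between kernel versions:

/* Simplified sketch of the writeback-completion rotation path. */

/* mm/filemap.c: called when writeback IO for a folio completes. */
void folio_end_writeback(struct folio *folio)
{
	/*
	 * PG_reclaim was set when reclaim submitted the writeback, so try
	 * to move the folio to the tail of the inactive lru.
	 */
	if (folio_test_reclaim(folio)) {
		folio_clear_reclaim(folio);
		folio_rotate_reclaimable(folio);
	}
	/* ... clear PG_writeback, wake waiters ... */
}

/* mm/swap.c: only a folio that is actually on an lru can be rotated. */
void folio_rotate_reclaimable(struct folio *folio)
{
	if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
	    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
		/* ... batch the folio and move it to the tail of its lru ... */
	}
	/*
	 * A folio still sitting on the isolated folio_list fails the
	 * folio_test_lru() check, PG_reclaim has already been cleared, and
	 * the rotation is lost; the folio is later put back at the head of
	 * the lru and stays there.
	 */
}

Breaking out of shrink_folio_list() early, as the patch does once nr_pageout
exceeds nr_to_reclaim, shortens the window in which writeback completions hit
still-isolated folios, so more of the split folios are back on the lru in
time to be rotated to the tail.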