Subject: Re: [RFC PATCH v3 2/2] mm: vmscan: retry folios written back while isolated
Date: Thu, 5 Dec 2024 10:06:24 +0800
Message-ID: <953e88cb-4c80-43d2-9afb-26150fea2da9@huawei.com>
To: Barry Song <21cnbao@gmail.com>, Chen Ridong
From: chenridong
References: <20241204040158.2768519-1-chenridong@huaweicloud.com>
 <20241204040158.2768519-3-chenridong@huaweicloud.com>

On 2024/12/4 18:45, Barry Song wrote:
> On Wed, Dec 4, 2024 at 5:11 PM Chen Ridong wrote:
>>
>> From: Chen Ridong
>>
>> An issue was found with the following testing steps:
>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y, CONFIG_LRU_GEN_ENABLED=n.
>> 2. Mount memcg v1, create a memcg named test_memcg, and set
>>    usage_in_bytes=2.1G and memsw.usage_in_bytes=3G.
>> 3. Use a file as swap, and create 1G of swap.
>> 4. Allocate 2.2G of anon memory in test_memcg (a minimal allocator is
>>    sketched below).
>>
>> It was found that:
>>
>> cat memory.usage_in_bytes
>> 2144940032
>> cat memory.memsw.usage_in_bytes
>> 2255056896
>>
>> free -h
>>               total        used        free
>> Mem:           31Gi       2.1Gi        27Gi
>> Swap:         1.0Gi       618Mi       405Mi
>>
>> As shown above, test_memcg used about 100M of swap, but 600M+ of swap
>> was in use, which means roughly 500M may be wasted because other memcgs
>> cannot use that swap space.
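
As a side note, the limits in step 2 are presumably set via
memory.limit_in_bytes and memory.memsw.limit_in_bytes (the *usage_in_bytes
files are read-only), and the allocation in step 4 can be done with a
trivial program along these lines. It is only an illustration and assumes
it is started inside test_memcg, e.g. after adding the shell's pid to the
cgroup's tasks file:

/*
 * Minimal anon-memory allocator for step 4 above; illustrative only.
 * It must be run from inside test_memcg so the memory is charged there.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t size = (size_t)2200 * 1024 * 1024;   /* roughly 2.2G of anon memory */
        long pagesz = sysconf(_SC_PAGESIZE);
        char *p;

        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Touch every page so the anon memory is really charged to the memcg. */
        for (size_t off = 0; off < size; off += pagesz)
                p[off] = 1;

        printf("allocated and touched %zu bytes, sleeping\n", size);
        pause();        /* keep the memory charged so swap-out can be observed */
        return 0;
}
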
>>
>> It can be explained as follows:
>> 1. When entering shrink_inactive_list, folios are isolated from the lru
>>    from tail to head. Suppose it just takes folioN from the lru (to keep
>>    it simple).
>>
>>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
>>    isolated list: folioN
>>
>> 2. In shrink_page_list, if folioN is a THP (2M), it may be split and
>>    added to the swap cache folio by folio. After a folio is added to the
>>    swap cache, io is submitted to write it back to swap, which is
>>    asynchronous. When shrink_page_list finishes, the isolated folios are
>>    moved back to the head of the inactive lru, which may then look like
>>    this, with 512 folios having been moved to the head:
>>
>>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>>
>>    IO was committed from folioN1 to folioN512; folios committed later
>>    were added to the head of 'ret_folios' in shrink_page_list, so the
>>    resulting order is folioN512->folioN511->...->folioN1.
>>
>> 3. When a folio's writeback io completes, the folio may be rotated to
>>    the tail of the lru. Assume folioN1, folioN2, ..., folioN512 complete
>>    in order (the order in which io was committed); they are then rotated
>>    to the tail of the LRU in that order (folioN1<->...folioN511<->folioN512),
>>    and folios at the tail of the lru are reclaimed as soon as possible.
>>
>>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>>
>> 4. However, shrink_page_list and folio writeback are asynchronous. If a
>>    THP is split, shrink_page_list loops at least 512 times, so some
>>    folios may finish writeback before shrink_page_list has finished, and
>>    rotating those folios to the tail of the lru fails. The lru may then
>>    look like this:
>>
>>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>>    folioN51<->folioN52<->...folioN511<->folioN512
>>
>>    Although folios N1-N50 have finished writeback, they are still at the
>>    head of the lru. Their writeback ended while shrink_folio_list() was
>>    still looping, so folio_end_writeback()'s folio_rotate_reclaimable()
>>    could not move them to the tail of the LRU: they were not on the LRU
>>    but still on the private 'folio_list'. Since isolation scans the lru
>>    from tail to head, it is difficult to scan those folios again.
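
To make the resulting order concrete, here is a toy userspace model. This
is not kernel code; folio1..folio4 and N1..N8 are just stand-ins, with 8
folios playing the role of the 512 folios from a split THP. Folios whose
writeback completes while they are isolated miss the rotation, so after
move_folios_to_lru() they sit near the head of the inactive lru even
though they are already clean:

/* Toy model of the missed-rotation ordering problem; NOT kernel code. */
#include <stdio.h>
#include <string.h>

#define MAX_LRU 32

static const char *lru[MAX_LRU];        /* lru[0] is the head, lru[nr - 1] the tail */
static int nr;

static void add_to_head(const char *name)
{
        memmove(&lru[1], &lru[0], nr * sizeof(lru[0]));
        lru[0] = name;
        nr++;
}

static void add_to_tail(const char *name)
{
        lru[nr++] = name;
}

static void rotate_to_tail(const char *name)
{
        for (int i = 0; i < nr; i++) {
                if (lru[i] == name) {
                        memmove(&lru[i], &lru[i + 1], (nr - i - 1) * sizeof(lru[0]));
                        nr--;
                        add_to_tail(name);
                        return;
                }
        }
        /* Not on the LRU (still isolated): the rotation is silently missed. */
}

static void print_lru(const char *when)
{
        printf("%-32s head ->", when);
        for (int i = 0; i < nr; i++)
                printf(" %s", lru[i]);
        printf(" <- tail\n");
}

int main(void)
{
        static const char *split[] = { "N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8" };

        /* Folios that stay on the inactive LRU; the THP at the tail was isolated. */
        add_to_tail("folio1");
        add_to_tail("folio2");
        add_to_tail("folio3");
        add_to_tail("folio4");

        /*
         * Writeback of N1..N3 completes while shrink_folio_list() is still
         * looping: they are isolated, not on the LRU, so nothing happens.
         */
        rotate_to_tail(split[0]);
        rotate_to_tail(split[1]);
        rotate_to_tail(split[2]);

        /* move_folios_to_lru(): the whole isolated batch goes back to the head. */
        for (int i = 0; i < 8; i++)
                add_to_head(split[i]);
        print_lru("after move_folios_to_lru:");

        /*
         * Writeback of N4..N8 completes afterwards: they are on the LRU now,
         * so rotation works and they move to the tail, where the next
         * tail-to-head scan will find them quickly.
         */
        for (int i = 3; i < 8; i++)
                rotate_to_tail(split[i]);
        print_lru("after remaining writebacks:");

        return 0;
}

Run as-is, it prints N3, N2, N1 stranded near the head of the list while
N4..N8 end up at the tail, which is the state described in step 4.
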
>
> I don’t think it’s necessary to focus so much on large folios. This
> issue affects both small and large folios alike. Splitting large
> folios simply lengthens the list, which increases the chances of
> missing rotation. It’s enough to note that commit 359a5e1416ca
> fixed this issue in mglru, but the same problem exists in the
> active/inactive LRU. As a result, we’re extracting the function in
> patch 1 to make it usable for both LRUs and applying the same fix
> to the active/inactive LRU. Mentioning that THP splitting can
> worsen the issue (since it makes the list longer) is sufficient;
> it’s not the main point.
>
> It’s better to have a single patch and refine the changelog to focus on
> the core and essential problem, avoiding too many unrelated details.
>

Thank you, will update.

Best regards,
Ridong

>>
>> This issue is fixed when CONFIG_LRU_GEN_ENABLED is enabled by commit
>> 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while
>> isolated"). This issue should be fixed for the active/inactive lru in
>> the same way.
>>
>> Signed-off-by: Chen Ridong
>> ---
>>  mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++--
>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index af1ff76f83e7..1f0d194f8b2f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1949,6 +1949,25 @@ static int current_may_throttle(void)
>>  	return !(current->flags & PF_LOCAL_THROTTLE);
>>  }
>>
>> +static inline void acc_reclaimed_stat(struct reclaim_stat *stat,
>> +				      struct reclaim_stat *curr)
>> +{
>> +	int i;
>> +
>> +	stat->nr_dirty += curr->nr_dirty;
>> +	stat->nr_unqueued_dirty += curr->nr_unqueued_dirty;
>> +	stat->nr_congested += curr->nr_congested;
>> +	stat->nr_writeback += curr->nr_writeback;
>> +	stat->nr_immediate += curr->nr_immediate;
>> +	stat->nr_pageout += curr->nr_pageout;
>> +	stat->nr_ref_keep += curr->nr_ref_keep;
>> +	stat->nr_unmap_fail += curr->nr_unmap_fail;
>> +	stat->nr_lazyfree_fail += curr->nr_lazyfree_fail;
>> +	stat->nr_demoted += curr->nr_demoted;
>> +	for (i = 0; i < ANON_AND_FILE; i++)
>> +		stat->nr_activate[i] = curr->nr_activate[i];
>> +}
>> +
>>  /*
>>   * shrink_inactive_list() is a helper for shrink_node(). It returns the number
>>   * of reclaimed pages
>> @@ -1958,14 +1977,16 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>  		enum lru_list lru)
>>  {
>>  	LIST_HEAD(folio_list);
>> +	LIST_HEAD(clean_list);
>>  	unsigned long nr_scanned;
>>  	unsigned int nr_reclaimed = 0;
>>  	unsigned long nr_taken;
>> -	struct reclaim_stat stat;
>> +	struct reclaim_stat stat, curr;
>>  	bool file = is_file_lru(lru);
>>  	enum vm_event_item item;
>>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>  	bool stalled = false;
>> +	bool skip_retry = false;
>>
>>  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
>>  		if (stalled)
>> @@ -1999,10 +2020,20 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>  	if (nr_taken == 0)
>>  		return 0;
>>
>> -	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>> +	memset(&stat, 0, sizeof(stat));
>> +retry:
>> +	nr_reclaimed += shrink_folio_list(&folio_list, pgdat, sc, &curr, false);
>> +	find_folios_written_back(&folio_list, &clean_list, skip_retry);
>> +	acc_reclaimed_stat(&stat, &curr);
>>
>>  	spin_lock_irq(&lruvec->lru_lock);
>>  	move_folios_to_lru(lruvec, &folio_list);
>> +	if (!list_empty(&clean_list)) {
>> +		list_splice_init(&clean_list, &folio_list);
>> +		skip_retry = true;
>> +		spin_unlock_irq(&lruvec->lru_lock);
>> +		goto retry;
>> +	}
>>
>>  	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>>  			   stat.nr_demoted);
>> --
>> 2.34.1
>>
>
> Thanks
> Barry