Message-ID: <72867779-5994-4b9d-b1d3-761ce303fc02@huawei.com>
Date: Mon, 13 Jan 2025 16:35:28 +0800
Subject: Re: [PATCH v7 mm-unstable] mm: vmscan: retry folios written back while isolated for traditional LRU
From: chenridong
To: Yu Zhao, Chen Ridong, Wei Xu
CC: <21cnbao@gmail.com>
References: <20250111091504.1363075-1-chenridong@huaweicloud.com>

On 2025/1/12 6:12, Yu Zhao wrote:
> On Sat, Jan 11, 2025 at 2:25 AM Chen Ridong wrote:
>>
>> From: Chen Ridong
>>
>> As commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
>> while isolated") mentioned:
>>
>> The page reclaim isolates a batch of folios from the tail of one of the
>> LRU lists and works on those folios one by one.
>> For a suitable
>> swap-backed folio, if the swap device is async, it queues that folio for
>> writeback. After the page reclaim finishes an entire batch, it puts back
>> the folios it queued for writeback to the head of the original LRU list.
>>
>> In the meantime, the page writeback flushes the queued folios also by
>> batches. Its batching logic is independent from that of the page
>> reclaim. For each of the folios it writes back, the page writeback calls
>> folio_rotate_reclaimable() which tries to rotate a folio to the tail.
>>
>> folio_rotate_reclaimable() only works for a folio after the page reclaim
>> has put it back. If an async swap device is fast enough, the page
>> writeback can finish with that folio while the page reclaim is still
>> working on the rest of the batch containing it. In this case, that folio
>> will remain at the head and the page reclaim will not retry it before
>> reaching there".
>>
>> The commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
>> while isolated") only fixed the issue for mglru. However, this issue
>> also exists in the traditional active/inactive LRU and was found at [1].
>
> The active/inactive LRU needs more careful thoughts due to its
> complexity. Details below.
>
>> It can be reproduced with below steps:
>>
>> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
>> 2. Mount memcg v1, and create memcg named test_memcg and set
>>    limit_in_bytes=1G, memsw.limit_in_bytes=2G.
>> 3. Create a 1G swap file, and allocate 1.05G anon memory in test_memcg.
>>
>> It was found that:
>>
>> cat memory.limit_in_bytes
>> 1073741824
>> cat memory.memsw.limit_in_bytes
>> 2147483648
>> cat memory.usage_in_bytes
>> 1073664000
>> cat memory.memsw.usage_in_bytes
>> 1129840640
>>
>> free -h
>>               total        used        free
>> Mem:           31Gi       1.2Gi        28Gi
>> Swap:         1.0Gi       1.0Gi       2.0Mi
>>
>> As shown above, the test_memcg used about 50M swap, but almost 1G swap
>> memory was used, which means that 900M+ may be wasted because other memcgs
>> can not use these swap memory.
>>
>> This issue should be fixed in the same way as mglru. Therefore, the common
>> logic was extracted to the 'find_folios_written_back' function firstly,
>> which is then reused in the 'shrink_inactive_list' function. Finally,
>> retry reclaiming those folios that may have missed the rotation for
>> traditional LRU.
>>
>> After change, the same test case. only 54M swap was used.
>>
>> cat memory.usage_in_bytes
>> 1073463296
>> cat memory.memsw.usage_in_bytes
>> 1129828352
>>
>> free -h
>>               total        used        free
>> Mem:           31Gi       1.2Gi        28Gi
>> Swap:         1.0Gi        54Mi       969Mi
>>
>> [1] https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
>> [2] https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
>> Signed-off-by: Chen Ridong
>> ---
>>
>> v6->v7:
>> - fix conflict based on mm-unstable.
>> - update the commit message(quote from YU's commit message, and add
>>   improvements after change.)
>> - restore 'is_retrying' to 'skip_retry' to keep original semantics.
>>
>> v6: https://lore.kernel.org/linux-kernel/20241223082004.3759152-1-chenridong@huaweicloud.com/
>>
>>  mm/vmscan.c | 114 ++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 76 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 01dce6f26..6861b6937 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -183,6 +183,9 @@ struct scan_control {
>>  	struct reclaim_state reclaim_state;
>>  };
>>
>> +static inline void find_folios_written_back(struct list_head *list,
>> +		struct list_head *clean, struct lruvec *lruvec, int type, bool is_retrying);
>> +
>>  #ifdef ARCH_HAS_PREFETCHW
>>  #define prefetchw_prev_lru_folio(_folio, _base, _field)	\
>>  	do {	\
>> @@ -1960,14 +1963,18 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>  		enum lru_list lru)
>>  {
>>  	LIST_HEAD(folio_list);
>> +	LIST_HEAD(clean_list);
>>  	unsigned long nr_scanned;
>> -	unsigned int nr_reclaimed = 0;
>> +	unsigned int nr_reclaimed, total_reclaimed = 0;
>> +	unsigned int nr_pageout = 0;
>> +	unsigned int nr_unqueued_dirty = 0;
>>  	unsigned long nr_taken;
>>  	struct reclaim_stat stat;
>>  	bool file = is_file_lru(lru);
>>  	enum vm_event_item item;
>>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>  	bool stalled = false;
>> +	bool skip_retry = false;
>>
>>  	while (unlikely(too_many_isolated(pgdat, file, sc))) {
>>  		if (stalled)
>> @@ -2001,22 +2008,47 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>  	if (nr_taken == 0)
>>  		return 0;
>>
>> +retry:
>>  	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>>
>> +	sc->nr.dirty += stat.nr_dirty;
>> +	sc->nr.congested += stat.nr_congested;
>> +	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>> +	sc->nr.writeback += stat.nr_writeback;
>
> I think this change breaks the tests on the stats above, e.g.,
> wakeup_flusher_threads(), because the same dirty/writeback folio can
> be counted twice.
> The reason for that is that
> folio_test_dirty/writeback() can't account for dirty/writeback buffer
> heads, which can only be done by folio_check_dirty_writeback().
>
> For MGLRU, it has been broken since day 1 and commit 1bc542c6a0d1
> ("mm/vmscan: wake up flushers conditionally to avoid cgroup OOM")
> doesn't account for this either. I'll get around to that.

Hi, Yu, thank you for your review.

Maybe nr_reclaimed is the only value we need to accumulate?

We only want to retry folios that may have missed
folio_rotate_reclaimable(), and these folios should be reclaimed and
freed. Therefore, we need to accumulate nr_reclaimed. For the other
fields in the stat, we should just keep the values that were obtained
the first time they were shrunk.

But I'm not sure if I'm missing something.

>
>> +	sc->nr.immediate += stat.nr_immediate;
>> +	total_reclaimed += nr_reclaimed;
>> +	nr_pageout += stat.nr_pageout;
>> +	nr_unqueued_dirty += stat.nr_unqueued_dirty;
>> +
>> +	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>> +			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>> +
>> +	find_folios_written_back(&folio_list, &clean_list, lruvec, 0, skip_retry);
>> +
>>  	spin_lock_irq(&lruvec->lru_lock);
>>  	move_folios_to_lru(lruvec, &folio_list);
>>
>>  	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>>  					stat.nr_demoted);
>> -	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>>  	item = PGSTEAL_KSWAPD + reclaimer_offset();
>>  	if (!cgroup_reclaim(sc))
>>  		__count_vm_events(item, nr_reclaimed);
>>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>>  	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
>> +
>> +	if (!list_empty(&clean_list)) {
>> +		list_splice_init(&clean_list, &folio_list);
>> +		skip_retry = true;
>> +		spin_unlock_irq(&lruvec->lru_lock);
>> +		goto retry;
>> +	}
>> +	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>>  	spin_unlock_irq(&lruvec->lru_lock);
>> +	sc->nr.taken += nr_taken;
>> +	if (file)
>> +		sc->nr.file_taken += nr_taken;
>>
>> -	lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
>> +	lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
>>
>>  	/*
>>  	 * If dirty folios are scanned that are not queued for IO, it
>> @@ -2029,7 +2061,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>  	 * the flushers simply cannot keep up with the allocation
>>  	 * rate. Nudge the flusher threads in case they are asleep.
>>  	 */
>> -	if (stat.nr_unqueued_dirty == nr_taken) {
>> +	if (nr_unqueued_dirty == nr_taken) {
>>  		wakeup_flusher_threads(WB_REASON_VMSCAN);
>>  		/*
>>  		 * For cgroupv1 dirty throttling is achieved by waking up
>> @@ -2044,18 +2076,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>  		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>>  	}
>>
>> -	sc->nr.dirty += stat.nr_dirty;
>> -	sc->nr.congested += stat.nr_congested;
>> -	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>> -	sc->nr.writeback += stat.nr_writeback;
>> -	sc->nr.immediate += stat.nr_immediate;
>> -	sc->nr.taken += nr_taken;
>> -	if (file)
>> -		sc->nr.file_taken += nr_taken;
>> -
>> -	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>> -			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>> -	return nr_reclaimed;
>> +	return total_reclaimed;
>>  }
>>
>>  /*
>> @@ -4637,8 +4658,6 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>>  	int reclaimed;
>>  	LIST_HEAD(list);
>>  	LIST_HEAD(clean);
>> -	struct folio *folio;
>> -	struct folio *next;
>>  	enum vm_event_item item;
>>  	struct reclaim_stat stat;
>>  	struct lru_gen_mm_walk *walk;
>> @@ -4668,26 +4687,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>>  			scanned, reclaimed, &stat, sc->priority,
>>  			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>>
>> -	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>> -		DEFINE_MIN_SEQ(lruvec);
>> -
>> -		if (!folio_evictable(folio)) {
>> -			list_del(&folio->lru);
>> -			folio_putback_lru(folio);
>> -			continue;
>> -		}
>> -
>> -		/* retry folios that may have missed folio_rotate_reclaimable() */
>> -		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
>> -		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
>> -			list_move(&folio->lru, &clean);
>> -			continue;
>> -		}
>> -
>> -		/* don't add rejected folios to the oldest generation */
>> -		if (lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
>> -			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
>> -	}
>> +	find_folios_written_back(&list, &clean, lruvec, type, skip_retry);
>>
>>  	spin_lock_irq(&lruvec->lru_lock);
>>
>> @@ -5706,6 +5706,44 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>>
>>  #endif /* CONFIG_LRU_GEN */
>>
>> +/**
>> + * find_folios_written_back - Find and move the written back folios to a new list.
>> + * @list: filios list
>> + * @clean: the written back folios list
>> + * @lruvec: the lruvec
>> + * @type: LRU_GEN_ANON/LRU_GEN_FILE, only for multi-gen LRU
>> + * @skip_retry: whether skip retry.
>> + */
>> +static inline void find_folios_written_back(struct list_head *list,
>> +		struct list_head *clean, struct lruvec *lruvec, int type, bool skip_retry)
>> +{
>> +	struct folio *folio;
>> +	struct folio *next;
>> +
>> +	list_for_each_entry_safe_reverse(folio, next, list, lru) {
>> +#ifdef CONFIG_LRU_GEN
>> +		DEFINE_MIN_SEQ(lruvec);
>> +#endif
>> +		if (!folio_evictable(folio)) {
>> +			list_del(&folio->lru);
>> +			folio_putback_lru(folio);
>> +			continue;
>> +		}
>> +
>> +		/* retry folios that may have missed folio_rotate_reclaimable() */
>> +		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
>> +		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
>
> Have you verified that this condition also holds for the
> active/inactive LRU or did you just assume it? IOW, how do we know the
> active/inactive LRU doesn't think this folio should be kept (and put
> back to the head of the inactive LRU list).
>

As the message shows, I tested my case and it worked for my case. I
added logs, and they could identify the folios that have missed
folio_rotate_reclaimable().

I think it's the same for both MGLRU and active/inactive LRU to identify
the folios that may have missed folio_rotate_reclaimable(). Or did I
miss something again?

Thank you again.

Best regards,
Ridong
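
For illustration only, the accounting question raised in the reply
(accumulate nr_reclaimed across the retry pass, but keep the other stat
fields from the first pass) can be modelled in a small userspace sketch.
This is not kernel code and not part of the patch: struct pass_stat,
struct scan_totals and account_pass() are hypothetical names, and the
sketch assumes the dirty/writeback counters are the only ones at risk of
being counted twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-pass counters in struct reclaim_stat. */
struct pass_stat {
	unsigned int nr_dirty;
	unsigned int nr_writeback;
	unsigned int nr_reclaimed;
};

/* Hypothetical stand-in for the sc->nr.* accumulators. */
struct scan_totals {
	unsigned int dirty;
	unsigned int writeback;
};

/*
 * Fold one shrink pass into the running totals.
 *
 * Only the first pass contributes dirty/writeback counts, so a folio that
 * is looked at again on the retry pass is not counted twice. Reclaimed
 * folios are summed over every pass, because the retry exists precisely
 * to free folios that the first pass could not.
 *
 * Returns the updated total of reclaimed folios.
 */
static unsigned int account_pass(struct scan_totals *totals,
				 const struct pass_stat *stat,
				 bool is_retry,
				 unsigned int total_reclaimed)
{
	assert(totals != NULL && stat != NULL);

	if (!is_retry) {
		totals->dirty += stat->nr_dirty;
		totals->writeback += stat->nr_writeback;
	}

	return total_reclaimed + stat->nr_reclaimed;
}
```

With a first pass reporting 4 dirty, 3 writeback and 10 reclaimed folios,
and a retry pass reporting 1, 1 and 5, this model ends with dirty=4,
writeback=3 and 15 reclaimed. Whether the first-pass values are the right
ones to keep for the flusher wakeup checks is exactly the open question
in this thread.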