From: Qi Zheng <qi.zheng@linux.dev>
To: hannes@cmpxchg.org, hughd@google.com, mhocko@suse.com,
	roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, david@kernel.org,
	lorenzo.stoakes@oracle.com, ziy@nvidia.com, harry.yoo@oracle.com,
	imran.f.khan@oracle.com, kamalesh.babulal@oracle.com,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	chenridong@huaweicloud.com, mkoutny@suse.com,
	akpm@linux-foundation.org, hamzamahfooz@linux.microsoft.com,
	apais@linux.microsoft.com, lance.yang@linux.dev
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [PATCH v2 25/28] mm: vmscan: prepare for reparenting MGLRU folios
Date: Wed, 17 Dec 2025 15:27:49 +0800
Message-ID: <93cf8a847992563a096fdf9b24b18529606c29ee.1765956026.git.zhengqi.arch@bytedance.com>
In-Reply-To: <cover.1765956025.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch@bytedance.com>

Similar to traditional LRU folios, in order to solve the dying memcg
problem, we also need to reparent MGLRU folios to the parent memcg when a
memcg goes offline.

However, this poses the following challenges:

1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, and
   the number of generations in the parent and child memcg may differ,
   so we cannot simply transfer MGLRU folios in the child memcg to the
   parent memcg as we do for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
   traverse these folios while holding the lru lock, otherwise we may
   trigger a softlockup.
3. In walk_update_folio(), the gen of a folio and the corresponding lru
   size may be updated, but the folio is not immediately moved to the
   corresponding lru list. Therefore, an LRU list may contain folios of
   different generations.
4. In lru_gen_del_folio(), the generation to which a folio belongs is
   determined from the generation information in folio->flags, and the
   corresponding LRU size is updated. Therefore, we need to update the
   lru size correctly during reparenting, otherwise the lru size may be
   updated incorrectly in lru_gen_del_folio().

Finally, this patch chooses a compromise: splice each lru list in the
child memcg onto the lru list of the same generation in the parent memcg
during reparenting. To ensure that the parent memcg has a matching set of
generations, we first increase the number of generations in the parent
memcg to MAX_NR_GENS before reparenting.

Of course, the same generation has different meanings in the parent and
child memcg, so this will muddle the hot and cold information of folios.
But other than that, this method is simple, the lru sizes remain correct,
and there is no need to consider certain concurrency issues (such as
lru_gen_del_folio()).

To prepare for the above work, this commit implements the helper
functions that will be used during reparenting.

Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Suggested-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mmzone.h |  16 +++++
 mm/vmscan.c            | 141 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 157 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 08132012aa8b8..67c0e55da1220 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,6 +628,9 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg);
 void lru_gen_offline_memcg(struct mem_cgroup *memcg);
 void lru_gen_release_memcg(struct mem_cgroup *memcg);
 void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg);
+void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst);
 
 #else /* !CONFIG_LRU_GEN */
 
@@ -668,6 +671,19 @@ static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
 {
 }
 
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+	return true;
+}
+
+static inline void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 struct lruvec {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5fd0f97c3719c..64a85eea26dc6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4466,6 +4466,147 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
 		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
 }
 
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+		int type;
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
+				      struct lruvec *lruvec)
+{
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
+	int swappiness = mem_cgroup_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	bool success = false;
+
+	/*
+	 * We are not iterating the mm_list here, updating mm_state->seq is just
+	 * to make mm walkers work properly.
+	 */
+	if (mm_state) {
+		spin_lock(&mm_list->lock);
+		VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+		if (max_seq > mm_state->seq) {
+			WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+			success = true;
+		}
+		spin_unlock(&mm_list->lock);
+	} else {
+		success = true;
+	}
+
+	if (success)
+		inc_max_seq(lruvec, max_seq, swappiness);
+}
+
+/*
+ * We need to ensure that the folios of a child memcg can be reparented to
+ * the same gen of the parent memcg, so the generations of the parent memcg
+ * need to be incremented to MAX_NR_GENS before reparenting.
+ */
+void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+		int type;
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
+				try_to_inc_max_seq_nowalk(memcg, lruvec);
+				cond_resched();
+			}
+		}
+	}
+}
+
+/*
+ * Compared to traditional LRU, MGLRU faces the following challenges:
+ *
+ * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, and
+ *    the number of generations in the parent and child memcg may differ,
+ *    so we cannot simply transfer MGLRU folios in the child memcg to the
+ *    parent memcg as we do for traditional LRU folios.
+ * 2. The generation information is stored in folio->flags, but we cannot
+ *    traverse these folios while holding the lru lock, otherwise we may
+ *    trigger a softlockup.
+ * 3. In walk_update_folio(), the gen of a folio and the corresponding lru
+ *    size may be updated, but the folio is not immediately moved to the
+ *    corresponding lru list. Therefore, an LRU list may contain folios of
+ *    different generations.
+ * 4. In lru_gen_del_folio(), the generation to which a folio belongs is
+ *    determined from the generation information in folio->flags, and the
+ *    corresponding LRU size is updated. Therefore, we need to update the
+ *    lru size correctly during reparenting, otherwise the lru size may be
+ *    updated incorrectly in lru_gen_del_folio().
+ *
+ * Finally, we choose a compromise: splice each lru list in the child memcg
+ * onto the lru list of the same generation in the parent memcg during
+ * reparenting.
+ *
+ * The same generation has different meanings in the parent and child memcg,
+ * so this compromise causes an LRU inversion problem. But as the system
+ * runs, the problem is corrected automatically.
+ */
+static void __lru_gen_reparent_memcg(struct lruvec *src_lruvec, struct lruvec *dst_lruvec,
+				     int zone, int type)
+{
+	struct lru_gen_folio *src_lrugen, *dst_lrugen;
+	enum lru_list lru = type * LRU_INACTIVE_FILE;
+	int i;
+
+	src_lrugen = &src_lruvec->lrugen;
+	dst_lrugen = &dst_lruvec->lrugen;
+
+	for (i = 0; i < get_nr_gens(src_lruvec, type); i++) {
+		int gen = lru_gen_from_seq(src_lrugen->max_seq - i);
+		long nr_pages = src_lrugen->nr_pages[gen][type][zone];
+		int src_lru_active = lru_gen_is_active(src_lruvec, gen) ? LRU_ACTIVE : 0;
+		int dst_lru_active = lru_gen_is_active(dst_lruvec, gen) ? LRU_ACTIVE : 0;
+
+		list_splice_tail_init(&src_lrugen->folios[gen][type][zone],
+				      &dst_lrugen->folios[gen][type][zone]);
+
+		WRITE_ONCE(src_lrugen->nr_pages[gen][type][zone], 0);
+		WRITE_ONCE(dst_lrugen->nr_pages[gen][type][zone],
+			   dst_lrugen->nr_pages[gen][type][zone] + nr_pages);
+
+		__update_lru_size(src_lruvec, lru + src_lru_active, zone, -nr_pages);
+		__update_lru_size(dst_lruvec, lru + dst_lru_active, zone, nr_pages);
+	}
+}
+
+void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *src_lruvec, *dst_lruvec;
+		int type, zone;
+
+		src_lruvec = get_lruvec(src, nid);
+		dst_lruvec = get_lruvec(dst, nid);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++)
+			for (type = 0; type < ANON_AND_FILE; type++)
+				__lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
+	}
+}
+
 #endif /* CONFIG_MEMCG */
 
 /******************************************************************************
-- 
2.20.1


