[PATCH v2 00/28] Eliminate Dying Memory Cgroup

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 00/28] Eliminate Dying Memory Cgroup
@ 2025-12-17  7:27 Qi Zheng
  2025-12-17  7:27 ` [PATCH v2 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Qi Zheng
                   ` (29 more replies)
  0 siblings, 30 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Changes in v2:
 - add [PATCH v2 04/28] and remove local_irq_disable() in evict_folios()
   (pointed by Harry Yoo)
 - recheck objcg in [PATCH v2 07/28] (pointed by Harry Yoo)
 - modify the commit message in [PATCH v2 12/28] and [PATCH v2 21/28]
   (pointed by Harry Yoo)
 - use rcu lock to protect mm_state in [PATCH v2 14/28] (pointed by Harry Yoo)
 - fix bad unlock balance warning in [PATCH v2 23/28]
 - change nr_pages type to long in [PATCH v2 25/28] (pointed by Harry Yoo)
 - incease mm_state->seq during reparenting to make mm walker work properly in
   [PATCH v2 25/28] (pointed by Harry Yoo)
 - add [PATCH v2 18/28] to fix WARNING in folio_memcg() (pointed by Harry Yoo)
 - collect Reviewed-bys
 - rebase onto the next-20251216

Changes in v1:
 - drop [PATCH RFC 02/28]
 - drop THP split queue related part, which has been merged as a separate
   patchset[2]
 - prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
   [PATCH v1 16/26]
 - Separate the reparenting function of traditional LRU folios to [PATCH v1 22/26]
 - adapted to the MGLRU scenarios in [PATCH v1 23/26]
 - refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
 - collect Acked-bys and Reviewed-bys
 - rebase onto the next-20251028

Hi all,

Introduction
============

This patchset is intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup. A consensus has already been
reached regarding this approach recently [1].

Background
==========

The issue of a dying memory cgroup refers to a situation where a memory
cgroup is no longer being used by users, but memory (the metadata
associated with memory cgroups) remains allocated to it. This situation
may potentially result in memory leaks or inefficiencies in memory
reclamation and has persisted as an issue for several years. Any memory
allocation that endures longer than the lifespan (from the users'
perspective) of a memory cgroup can lead to the issue of dying memory
cgroup. We have exerted greater efforts to tackle this problem by
introducing the infrastructure of object cgroup [2].

Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. The final allocations
for LRU pages (anonymous pages and file pages) are charged at allocation
time and continues to hold a reference to the original memory cgroup
until reclaimed.

File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of
the memory cgroup. The long-term pinning of file pages to memory cgroups
is a widespread issue that causes recurring problems in practical
scenarios [3]. File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate,
leading to memory wastage and significantly reducing the efficiency
of page reclamation.

Fundamentals
============

A folio will no longer pin its corresponding memory cgroup. It is necessary
to ensure that the memory cgroup or the lruvec associated with the memory
cgroup is not released when a user obtains a pointer to the memory cgroup
or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
to hold the RCU read lock or acquire a reference to the memory cgroup
associated with the folio to prevent its release if they are not concerned
about the binding stability between the folio and its corresponding memory
cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
desire a stable binding between the folio and its corresponding memory
cgroup. An approach is needed to ensure the stability of the binding while
the lruvec lock is held, and to detect the situation of holding the
incorrect lruvec lock when there is a race condition during memory cgroup
reparenting. The following four steps are taken to achieve these goals.

1. The first step  to be taken is to identify all users of both functions
   (folio_memcg() and folio_lruvec()) who are not concerned about binding
   stability and implement appropriate measures (such as holding a RCU read
   lock or temporarily obtaining a reference to the memory cgroup for a
   brief period) to prevent the release of the memory cgroup.

2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
   how to ensure the binding stability from the user's perspective of
   folio_lruvec().

   struct lruvec *folio_lruvec_lock(struct folio *folio)
   {
           struct lruvec *lruvec;

           rcu_read_lock();
   retry:
           lruvec = folio_lruvec(folio);
           spin_lock(&lruvec->lru_lock);
           if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                   spin_unlock(&lruvec->lru_lock);
                   goto retry;
           }

           return lruvec;
   }

   From the perspective of memory cgroup removal, the entire reparenting
   process (altering the binding relationship between folio and its memory
   cgroup and moving the LRU lists to its parental memory cgroup) should be
   carried out under both the lruvec lock of the memory cgroup being removed
   and the lruvec lock of its parent.

3. Finally, transfer the LRU pages to the object cgroup without holding a
   reference to the original memory cgroup.

Effect
======

Finally, it can be observed that the quantity of dying memory cgroups will
not experience a significant increase if the following test script is
executed to reproduce the issue.

```bash
#!/bin/bash

# Create a temporary file 'temp' filled with zero bytes
dd if=/dev/zero of=temp bs=4096 count=1

# Display memory-cgroup info from /proc/cgroups
cat /proc/cgroups | grep memory

for i in {0..2000}
do
    mkdir /sys/fs/cgroup/memory/test$i
    echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs

    # Append 'temp' file content to 'log'
    cat temp >> log

    echo $$ > /sys/fs/cgroup/memory/cgroup.procs

    # Potentially create a dying memory cgroup
    rmdir /sys/fs/cgroup/memory/test$i
done

# Display memory-cgroup info after test
cat /proc/cgroups | grep memory

rm -f temp log
```

Comments and suggestions are welcome!

Thanks,
Qi

[1].https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
[2].https://lwn.net/Articles/895431/
[3].https://github.com/systemd/systemd/pull/36827

Muchun Song (22):
  mm: memcontrol: remove dead code of checking parent memory cgroup
  mm: workingset: use folio_lruvec() in workingset_refault()
  mm: rename unlock_page_lruvec_irq and its variants
  mm: vmscan: refactor move_folios_to_lru()
  mm: memcontrol: allocate object cgroup for non-kmem case
  mm: memcontrol: return root object cgroup for root memory cgroup
  mm: memcontrol: prevent memory cgroup release in
    get_mem_cgroup_from_folio()
  buffer: prevent memory cgroup release in folio_alloc_buffers()
  writeback: prevent memory cgroup release in writeback module
  mm: memcontrol: prevent memory cgroup release in
    count_memcg_folio_events()
  mm: page_io: prevent memory cgroup release in page_io module
  mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  mm: mglru: prevent memory cgroup release in mglru
  mm: memcontrol: prevent memory cgroup release in
    mem_cgroup_swap_full()
  mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  mm: workingset: prevent lruvec release in workingset_refault()
  mm: zswap: prevent lruvec release in zswap_folio_swapin()
  mm: swap: prevent lruvec release in lru_gen_clear_refs()
  mm: workingset: prevent lruvec release in workingset_activation()
  mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
    folios
  mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

Qi Zheng (6):
  mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  mm: thp: prevent memory cgroup release in
    folio_split_queue_lock{_irqsave}()
  mm: zswap: prevent memory cgroup release in zswap_compress()
  mm: vmscan: prepare for reparenting traditional LRU folios
  mm: vmscan: prepare for reparenting MGLRU folios
  mm: memcontrol: refactor memcg_reparent_objcgs()

 fs/buffer.c                      |   4 +-
 fs/fs-writeback.c                |  22 +-
 include/linux/memcontrol.h       | 159 ++++++------
 include/linux/mm_inline.h        |   6 +
 include/linux/mmzone.h           |  20 ++
 include/trace/events/writeback.h |   3 +
 mm/compaction.c                  |  43 +++-
 mm/huge_memory.c                 |  18 +-
 mm/memcontrol-v1.c               |  15 +-
 mm/memcontrol.c                  | 405 ++++++++++++++++++-------------
 mm/migrate.c                     |   2 +
 mm/mlock.c                       |   2 +-
 mm/page_io.c                     |   8 +-
 mm/percpu.c                      |   2 +-
 mm/shrinker.c                    |   6 +-
 mm/swap.c                        |  20 +-
 mm/vmscan.c                      | 267 ++++++++++++++++----
 mm/workingset.c                  |  26 +-
 mm/zswap.c                       |   5 +
 19 files changed, 677 insertions(+), 356 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18 23:31   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 02/28] mm: workingset: use folio_lruvec() in workingset_refault() Qi Zheng
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Chen Ridong

From: Muchun Song <songmuchun@bytedance.com>

Since the no-hierarchy mode has been deprecated after the commit:

  commit bef8620cd8e0 ("mm: memcg: deprecate the non-hierarchical mode").

As a result, parent_mem_cgroup() will not return NULL except when passing
the root memcg, and the root memcg cannot be offline. Hence, it's safe to
remove the check on the returned value of parent_mem_cgroup(). Remove the
corresponding dead code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
---
 mm/memcontrol.c | 5 -----
 mm/shrinker.c   | 6 +-----
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2e49f4ec9e0e..ae234518d023c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3331,9 +3331,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		return;
 
 	parent = parent_mem_cgroup(memcg);
-	if (!parent)
-		parent = root_mem_cgroup;
-
 	memcg_reparent_list_lrus(memcg, parent);
 
 	/*
@@ -3624,8 +3621,6 @@ struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
 			break;
 		}
 		memcg = parent_mem_cgroup(memcg);
-		if (!memcg)
-			memcg = root_mem_cgroup;
 	}
 	return memcg;
 }
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 4a93fd433689a..e8e092a2f7f41 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -286,14 +286,10 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 {
 	int nid, index, offset;
 	long nr;
-	struct mem_cgroup *parent;
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct shrinker_info *child_info, *parent_info;
 	struct shrinker_info_unit *child_unit, *parent_unit;
 
-	parent = parent_mem_cgroup(memcg);
-	if (!parent)
-		parent = root_mem_cgroup;
-
 	/* Prevent from concurrent shrinker_info expand */
 	mutex_lock(&shrinker_mutex);
 	for_each_node(nid) {
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup
  2025-12-17  7:27 ` [PATCH v2 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Qi Zheng
@ 2025-12-18 23:31   ` Shakeel Butt
  0 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-18 23:31 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng, Chen Ridong

On Wed, Dec 17, 2025 at 03:27:25PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Since the no-hierarchy mode has been deprecated after the commit:
> 
>   commit bef8620cd8e0 ("mm: memcg: deprecate the non-hierarchical mode").
> 
> As a result, parent_mem_cgroup() will not return NULL except when passing
> the root memcg, and the root memcg cannot be offline. Hence, it's safe to
> remove the check on the returned value of parent_mem_cgroup(). Remove the
> corresponding dead code.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Chen Ridong <chenridong@huawei.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 02/28] mm: workingset: use folio_lruvec() in workingset_refault()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
  2025-12-17  7:27 ` [PATCH v2 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18 23:32   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants Qi Zheng
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

Use folio_lruvec() to simplify the code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/workingset.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index e9f05634747a7..e41b44e29944b 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -534,8 +534,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
 void workingset_refault(struct folio *folio, void *shadow)
 {
 	bool file = folio_is_file_lru(folio);
-	struct pglist_data *pgdat;
-	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 	bool workingset;
 	long nr;
@@ -557,10 +555,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	 * locked to guarantee folio_memcg() stability throughout.
 	 */
 	nr = folio_nr_pages(folio);
-	memcg = folio_memcg(folio);
-	pgdat = folio_pgdat(folio);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
+	lruvec = folio_lruvec(folio);
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
 	if (!workingset_test_recent(shadow, file, &workingset, true))
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 02/28] mm: workingset: use folio_lruvec() in workingset_refault()
  2025-12-17  7:27 ` [PATCH v2 02/28] mm: workingset: use folio_lruvec() in workingset_refault() Qi Zheng
@ 2025-12-18 23:32   ` Shakeel Butt
  0 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-18 23:32 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:26PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Use folio_lruvec() to simplify the code.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
  2025-12-17  7:27 ` [PATCH v2 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Qi Zheng
  2025-12-17  7:27 ` [PATCH v2 02/28] mm: workingset: use folio_lruvec() in workingset_refault() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18  9:00   ` David Hildenbrand (Red Hat)
  2025-12-18 23:34   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
                   ` (26 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Chen Ridong

From: Muchun Song <songmuchun@bytedance.com>

It is inappropriate to use folio_lruvec_lock() variants in conjunction
with unlock_page_lruvec() variants, as this involves the inconsistent
operation of locking a folio while unlocking a page. To rectify this, the
functions unlock_page_lruvec{_irq, _irqrestore} are renamed to
lruvec_unlock{_irq,_irqrestore}.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/memcontrol.h | 10 +++++-----
 mm/compaction.c            | 14 +++++++-------
 mm/huge_memory.c           |  2 +-
 mm/mlock.c                 |  2 +-
 mm/swap.c                  | 12 ++++++------
 mm/vmscan.c                |  4 ++--
 6 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6a48398a1f4e7..288dd6337f80f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1465,17 +1465,17 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
-static inline void unlock_page_lruvec(struct lruvec *lruvec)
+static inline void lruvec_unlock(struct lruvec *lruvec)
 {
 	spin_unlock(&lruvec->lru_lock);
 }
 
-static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+static inline void lruvec_unlock_irq(struct lruvec *lruvec)
 {
 	spin_unlock_irq(&lruvec->lru_lock);
 }
 
-static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
 		unsigned long flags)
 {
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
@@ -1497,7 +1497,7 @@ static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio,
 		if (folio_matches_lruvec(folio, locked_lruvec))
 			return locked_lruvec;
 
-		unlock_page_lruvec_irq(locked_lruvec);
+		lruvec_unlock_irq(locked_lruvec);
 	}
 
 	return folio_lruvec_lock_irq(folio);
@@ -1511,7 +1511,7 @@ static inline void folio_lruvec_relock_irqsave(struct folio *folio,
 		if (folio_matches_lruvec(folio, *lruvecp))
 			return;
 
-		unlock_page_lruvec_irqrestore(*lruvecp, *flags);
+		lruvec_unlock_irqrestore(*lruvecp, *flags);
 	}
 
 	*lruvecp = folio_lruvec_lock_irqsave(folio, flags);
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c6..c3e338aaa0ffb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -913,7 +913,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		if (!(low_pfn % COMPACT_CLUSTER_MAX)) {
 			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
+				lruvec_unlock_irqrestore(locked, flags);
 				locked = NULL;
 			}
 
@@ -964,7 +964,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			}
 			/* for alloc_contig case */
 			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
+				lruvec_unlock_irqrestore(locked, flags);
 				locked = NULL;
 			}
 
@@ -1053,7 +1053,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			if (unlikely(page_has_movable_ops(page)) &&
 			    !PageMovableOpsIsolated(page)) {
 				if (locked) {
-					unlock_page_lruvec_irqrestore(locked, flags);
+					lruvec_unlock_irqrestore(locked, flags);
 					locked = NULL;
 				}
 
@@ -1158,7 +1158,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		/* If we already hold the lock, we can skip some rechecking */
 		if (lruvec != locked) {
 			if (locked)
-				unlock_page_lruvec_irqrestore(locked, flags);
+				lruvec_unlock_irqrestore(locked, flags);
 
 			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
 			locked = lruvec;
@@ -1226,7 +1226,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			unlock_page_lruvec_irqrestore(locked, flags);
+			lruvec_unlock_irqrestore(locked, flags);
 			locked = NULL;
 		}
 		folio_put(folio);
@@ -1242,7 +1242,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
+				lruvec_unlock_irqrestore(locked, flags);
 				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
@@ -1274,7 +1274,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 isolate_abort:
 	if (locked)
-		unlock_page_lruvec_irqrestore(locked, flags);
+		lruvec_unlock_irqrestore(locked, flags);
 	if (folio) {
 		folio_set_lru(folio);
 		folio_put(folio);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21a..12b46215b30c1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3899,7 +3899,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 		folio_ref_unfreeze(folio, folio_cache_ref_count(folio) + 1);
 
 		if (do_lru)
-			unlock_page_lruvec(lruvec);
+			lruvec_unlock(lruvec);
 
 		if (ci)
 			swap_cluster_unlock(ci);
diff --git a/mm/mlock.c b/mm/mlock.c
index 2f699c3497a57..66740e16679c3 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -205,7 +205,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
 	}
 
 	if (lruvec)
-		unlock_page_lruvec_irq(lruvec);
+		lruvec_unlock_irq(lruvec);
 	folios_put(fbatch);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 2260dcd2775e7..ec0c654e128dc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -91,7 +91,7 @@ static void page_cache_release(struct folio *folio)
 
 	__page_cache_release(folio, &lruvec, &flags);
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 }
 
 void __folio_put(struct folio *folio)
@@ -175,7 +175,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 	}
 
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 	folios_put(fbatch);
 }
 
@@ -349,7 +349,7 @@ void folio_activate(struct folio *folio)
 
 	lruvec = folio_lruvec_lock_irq(folio);
 	lru_activate(lruvec, folio);
-	unlock_page_lruvec_irq(lruvec);
+	lruvec_unlock_irq(lruvec);
 	folio_set_lru(folio);
 }
 #endif
@@ -963,7 +963,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 
 		if (folio_is_zone_device(folio)) {
 			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec_unlock_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
 			if (folio_ref_sub_and_test(folio, nr_refs))
@@ -977,7 +977,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 		/* hugetlb has its own memcg */
 		if (folio_test_hugetlb(folio)) {
 			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec_unlock_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
 			free_huge_folio(folio);
@@ -991,7 +991,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 		j++;
 	}
 	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+		lruvec_unlock_irqrestore(lruvec, flags);
 	if (!j) {
 		folio_batch_reinit(folios);
 		return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5baa..28d9b3af47130 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1829,7 +1829,7 @@ bool folio_isolate_lru(struct folio *folio)
 		folio_get(folio);
 		lruvec = folio_lruvec_lock_irq(folio);
 		lruvec_del_folio(lruvec, folio);
-		unlock_page_lruvec_irq(lruvec);
+		lruvec_unlock_irq(lruvec);
 		ret = true;
 	}
 
@@ -7855,7 +7855,7 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
 	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		unlock_page_lruvec_irq(lruvec);
+		lruvec_unlock_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants
  2025-12-17  7:27 ` [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants Qi Zheng
@ 2025-12-18  9:00   ` David Hildenbrand (Red Hat)
  2025-12-18 23:34   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  9:00 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Chen Ridong

On 12/17/25 08:27, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> It is inappropriate to use folio_lruvec_lock() variants in conjunction
> with unlock_page_lruvec() variants, as this involves the inconsistent
> operation of locking a folio while unlocking a page. To rectify this, the
> functions unlock_page_lruvec{_irq, _irqrestore} are renamed to
> lruvec_unlock{_irq,_irqrestore}.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Chen Ridong <chenridong@huawei.com>
> ---

Nice cleanup

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants
  2025-12-17  7:27 ` [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants Qi Zheng
  2025-12-18  9:00   ` David Hildenbrand (Red Hat)
@ 2025-12-18 23:34   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-18 23:34 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng, Chen Ridong

On Wed, Dec 17, 2025 at 03:27:27PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> It is inappropriate to use folio_lruvec_lock() variants in conjunction
> with unlock_page_lruvec() variants, as this involves the inconsistent
> operation of locking a folio while unlocking a page. To rectify this, the
> functions unlock_page_lruvec{_irq, _irqrestore} are renamed to
> lruvec_unlock{_irq,_irqrestore}.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Chen Ridong <chenridong@huawei.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (2 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 21:13   ` Johannes Weiner
                     ` (3 more replies)
  2025-12-17  7:27 ` [PATCH v2 05/28] mm: vmscan: refactor move_folios_to_lru() Qi Zheng
                   ` (25 subsequent siblings)
  29 siblings, 4 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

After refactoring the move_folios_to_lru(), its caller no longer needs to
hold the lruvec lock, the disabling IRQ is only for __count_vm_events()
and __mod_node_page_state().

On the PREEMPT_RT kernel, the local_irq_disable() cannot be used. To
avoid using local_irq_disable() and reduce the critical section of
disabling IRQ, make all callers of move_folios_to_lru() use IRQ-safed
count_vm_events() and mod_node_page_state().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 28d9b3af47130..49e5661746213 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2021,12 +2021,12 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 
 	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 					stat.nr_demoted);
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, nr_reclaimed);
+		count_vm_events(item, nr_reclaimed);
 	count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
-	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
+	count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
 
 	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
 					nr_scanned - nr_reclaimed);
@@ -2171,10 +2171,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_activate = move_folios_to_lru(lruvec, &l_active);
 	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
 
-	__count_vm_events(PGDEACTIVATE, nr_deactivate);
+	count_vm_events(PGDEACTIVATE, nr_deactivate);
 	count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
 	lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
 	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
@@ -4751,9 +4751,9 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
-		__count_vm_events(item, reclaimed);
+		count_vm_events(item, reclaimed);
 	count_memcg_events(memcg, item, reclaimed);
-	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
+	count_vm_events(PGSTEAL_ANON + type, reclaimed);
 
 	spin_unlock_irq(&lruvec->lru_lock);
 
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  2025-12-17  7:27 ` [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
@ 2025-12-17 21:13   ` Johannes Weiner
  2025-12-18  9:04   ` David Hildenbrand (Red Hat)
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 21:13 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 03:27:28PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> After refactoring the move_folios_to_lru(), its caller no longer needs to
> hold the lruvec lock, the disabling IRQ is only for __count_vm_events()
> and __mod_node_page_state().
> 
> On the PREEMPT_RT kernel, the local_irq_disable() cannot be used. To
> avoid using local_irq_disable() and reduce the critical section of
> disabling IRQ, make all callers of move_folios_to_lru() use IRQ-safed
> count_vm_events() and mod_node_page_state().
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  2025-12-17  7:27 ` [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
  2025-12-17 21:13   ` Johannes Weiner
@ 2025-12-18  9:04   ` David Hildenbrand (Red Hat)
  2025-12-18  9:31     ` Qi Zheng
  2025-12-18 23:39   ` Shakeel Butt
  2025-12-25  3:45   ` Chen Ridong
  3 siblings, 1 reply; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  9:04 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

On 12/17/25 08:27, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> After refactoring the move_folios_to_lru(), its caller no longer needs to
> hold the lruvec lock, the disabling IRQ is only for __count_vm_events()
> and __mod_node_page_state().
> 
> On the PREEMPT_RT kernel, the local_irq_disable() cannot be used. To
> avoid using local_irq_disable() and reduce the critical section of
> disabling IRQ, make all callers of move_folios_to_lru() use IRQ-safed
> count_vm_events() and mod_node_page_state().

The patch description is a bit confusing for me.

I assume you mean something like

"Once we refactor move_folios_to_lru(), its callers will no longer have 
to hold the lruvec lock; disabling IRQs is then only required for 
__count_vm_events() and __mod_node_page_state().

To prepare for that, let's $YOURDETAILSHERE"

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  2025-12-18  9:04   ` David Hildenbrand (Red Hat)
@ 2025-12-18  9:31     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-18  9:31 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng



On 12/18/25 5:04 PM, David Hildenbrand (Red Hat) wrote:
> On 12/17/25 08:27, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> After refactoring the move_folios_to_lru(), its caller no longer needs to
>> hold the lruvec lock, the disabling IRQ is only for __count_vm_events()
>> and __mod_node_page_state().
>>
>> On the PREEMPT_RT kernel, the local_irq_disable() cannot be used. To
>> avoid using local_irq_disable() and reduce the critical section of
>> disabling IRQ, make all callers of move_folios_to_lru() use IRQ-safed
>> count_vm_events() and mod_node_page_state().
> 
> The patch description is a bit confusing for me.
> 
> I assume you mean something like
> 
> "Once we refactor move_folios_to_lru(), its callers will no longer have 
> to hold the lruvec lock; disabling IRQs is then only required for 
> __count_vm_events() and __mod_node_page_state().
> 
> To prepare for that, let's $YOURDETAILSHERE"

It is indeed clearer, will do in the next version.

> 



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  2025-12-17  7:27 ` [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
  2025-12-17 21:13   ` Johannes Weiner
  2025-12-18  9:04   ` David Hildenbrand (Red Hat)
@ 2025-12-18 23:39   ` Shakeel Butt
  2025-12-25  3:45   ` Chen Ridong
  3 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-18 23:39 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 03:27:28PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> After refactoring the move_folios_to_lru(), its caller no longer needs to
> hold the lruvec lock, the disabling IRQ is only for __count_vm_events()
> and __mod_node_page_state().
> 
> On the PREEMPT_RT kernel, the local_irq_disable() cannot be used. To
> avoid using local_irq_disable() and reduce the critical section of
> disabling IRQ, make all callers of move_folios_to_lru() use IRQ-safed
> count_vm_events() and mod_node_page_state().
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  2025-12-17  7:27 ` [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
                     ` (2 preceding siblings ...)
  2025-12-18 23:39   ` Shakeel Butt
@ 2025-12-25  3:45   ` Chen Ridong
  3 siblings, 0 replies; 149+ messages in thread
From: Chen Ridong @ 2025-12-25  3:45 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng



On 2025/12/17 15:27, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> After refactoring the move_folios_to_lru(), its caller no longer needs to
> hold the lruvec lock, the disabling IRQ is only for __count_vm_events()
> and __mod_node_page_state().
> 

nit:
For shrink_inactive_list(), shrink_active_list() and evict_folios(), IRQ disabling is only needed
for __count_vm_events() and __mod_node_page_state().

I think it can be clearer.

> On the PREEMPT_RT kernel, the local_irq_disable() cannot be used. To
> avoid using local_irq_disable() and reduce the critical section of
> disabling IRQ, make all callers of move_folios_to_lru() use IRQ-safed
> count_vm_events() and mod_node_page_state().
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/vmscan.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 28d9b3af47130..49e5661746213 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2021,12 +2021,12 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>  
>  	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
>  					stat.nr_demoted);
> -	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>  	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
>  	if (!cgroup_reclaim(sc))
> -		__count_vm_events(item, nr_reclaimed);
> +		count_vm_events(item, nr_reclaimed);
>  	count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> -	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> +	count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
>  
>  	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
>  					nr_scanned - nr_reclaimed);
> @@ -2171,10 +2171,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	nr_activate = move_folios_to_lru(lruvec, &l_active);
>  	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
>  
> -	__count_vm_events(PGDEACTIVATE, nr_deactivate);
> +	count_vm_events(PGDEACTIVATE, nr_deactivate);
>  	count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
>  
> -	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>  
>  	lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
>  	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
> @@ -4751,9 +4751,9 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  
>  	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
>  	if (!cgroup_reclaim(sc))
> -		__count_vm_events(item, reclaimed);
> +		count_vm_events(item, reclaimed);
>  	count_memcg_events(memcg, item, reclaimed);
> -	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
> +	count_vm_events(PGSTEAL_ANON + type, reclaimed);
>  
>  	spin_unlock_irq(&lruvec->lru_lock);
>  

Reviewed-by: Chen Ridong <chenridong@huawei.com>

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 05/28] mm: vmscan: refactor move_folios_to_lru()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (3 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-19  0:04   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In a subsequent patch, we'll reparent the LRU folios. The folios that are
moved to the appropriate LRU list can undergo reparenting during the
move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
a lruvec lock. Instead, we should utilize the more general interface of
folio_lruvec_relock_irq() to obtain the correct lruvec lock.

This patch involves only code refactoring and doesn't introduce any
functional changes.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 46 +++++++++++++++++++++-------------------------
 1 file changed, 21 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 49e5661746213..354b19f7365d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1883,24 +1883,27 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
 /*
  * move_folios_to_lru() moves folios from private @list to appropriate LRU list.
  *
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate lruvec.
+ *
+ * Note: The caller must not hold any lruvec lock.
  */
-static unsigned int move_folios_to_lru(struct lruvec *lruvec,
-		struct list_head *list)
+static unsigned int move_folios_to_lru(struct list_head *list)
 {
 	int nr_pages, nr_moved = 0;
+	struct lruvec *lruvec = NULL;
 	struct folio_batch free_folios;
 
 	folio_batch_init(&free_folios);
 	while (!list_empty(list)) {
 		struct folio *folio = lru_to_folio(list);
 
+		lruvec = folio_lruvec_relock_irq(folio, lruvec);
 		VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 		list_del(&folio->lru);
 		if (unlikely(!folio_evictable(folio))) {
-			spin_unlock_irq(&lruvec->lru_lock);
+			lruvec_unlock_irq(lruvec);
 			folio_putback_lru(folio);
-			spin_lock_irq(&lruvec->lru_lock);
+			lruvec = NULL;
 			continue;
 		}
 
@@ -1922,19 +1925,15 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 
 			folio_unqueue_deferred_split(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
-				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec_unlock_irq(lruvec);
 				mem_cgroup_uncharge_folios(&free_folios);
 				free_unref_folios(&free_folios);
-				spin_lock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
 			}
 
 			continue;
 		}
 
-		/*
-		 * All pages were isolated from the same lruvec (and isolation
-		 * inhibits memcg migration).
-		 */
 		VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
 		lruvec_add_folio(lruvec, folio);
 		nr_pages = folio_nr_pages(folio);
@@ -1943,11 +1942,12 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
+	if (lruvec)
+		lruvec_unlock_irq(lruvec);
+
 	if (free_folios.nr) {
-		spin_unlock_irq(&lruvec->lru_lock);
 		mem_cgroup_uncharge_folios(&free_folios);
 		free_unref_folios(&free_folios);
-		spin_lock_irq(&lruvec->lru_lock);
 	}
 
 	return nr_moved;
@@ -2016,8 +2016,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
 					 lruvec_memcg(lruvec));
 
-	spin_lock_irq(&lruvec->lru_lock);
-	move_folios_to_lru(lruvec, &folio_list);
+	move_folios_to_lru(&folio_list);
 
 	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 					stat.nr_demoted);
@@ -2028,6 +2027,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
 
+	spin_lock_irq(&lruvec->lru_lock);
 	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
 					nr_scanned - nr_reclaimed);
 
@@ -2166,16 +2166,14 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move folios back to the lru list.
 	 */
-	spin_lock_irq(&lruvec->lru_lock);
-
-	nr_activate = move_folios_to_lru(lruvec, &l_active);
-	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
+	nr_activate = move_folios_to_lru(&l_active);
+	nr_deactivate = move_folios_to_lru(&l_inactive);
 
 	count_vm_events(PGDEACTIVATE, nr_deactivate);
 	count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
-
 	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
+	spin_lock_irq(&lruvec->lru_lock);
 	lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
 	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
 			nr_deactivate, nr_rotated, sc->priority, file);
@@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
 	}
 
-	spin_lock_irq(&lruvec->lru_lock);
-
-	move_folios_to_lru(lruvec, &list);
+	move_folios_to_lru(&list);
 
 	walk = current->reclaim_state->mm_walk;
 	if (walk && walk->batched) {
 		walk->lruvec = lruvec;
+		spin_lock(&lruvec->lru_lock);
 		reset_batch_size(walk);
+		spin_unlock(&lruvec->lru_lock);
 	}
 
 	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
@@ -4755,8 +4753,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	count_memcg_events(memcg, item, reclaimed);
 	count_vm_events(PGSTEAL_ANON + type, reclaimed);
 
-	spin_unlock_irq(&lruvec->lru_lock);
-
 	list_splice_init(&clean, &list);
 
 	if (!list_empty(&list)) {
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 05/28] mm: vmscan: refactor move_folios_to_lru()
  2025-12-17  7:27 ` [PATCH v2 05/28] mm: vmscan: refactor move_folios_to_lru() Qi Zheng
@ 2025-12-19  0:04   ` Shakeel Butt
  0 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19  0:04 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:29PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In a subsequent patch, we'll reparent the LRU folios. The folios that are
> moved to the appropriate LRU list can undergo reparenting during the
> move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
> a lruvec lock. Instead, we should utilize the more general interface of
> folio_lruvec_relock_irq() to obtain the correct lruvec lock.
> 
> This patch involves only code refactoring and doesn't introduce any
> functional changes.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---

[...]
> +	spin_lock_irq(&lruvec->lru_lock);
>  	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
>  					nr_scanned - nr_reclaimed);

I know that this patch is not changing any functionality but it is
undoing the optimization done by the commit 3865301dc58ae ("mm: optimize
lru_note_cost() by adding lru_note_cost_unlock_irq()"). I think it is
fine as a transient state and I haven't checked the final state of the
code after this series but I think we should do something to restore the
optimization after the series.

Anyways, it's a nit and if no one comes to it, I will take a stab at
restoring the optimization.

For now, LGTM.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (4 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 05/28] mm: vmscan: refactor move_folios_to_lru() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 21:22   ` Johannes Weiner
                     ` (2 more replies)
  2025-12-17  7:27 ` [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
                   ` (23 subsequent siblings)
  29 siblings, 3 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

Pagecache pages are charged at allocation time and hold a reference
to the original memory cgroup until reclaimed. Depending on memory
pressure, page sharing patterns between different cgroups and cgroup
creation/destruction rates, many dying memory cgroups can be pinned
by pagecache pages, reducing page reclaim efficiency and wasting
memory. Converting LRU folios and most other raw memory cgroup pins
to the object cgroup direction can fix this long-living problem.

As a result, the objcg infrastructure is no longer solely applicable
to the kmem case. In this patch, we extend the scope of the objcg
infrastructure beyond the kmem case, enabling LRU folios to reuse
it for folio charging purposes.

It should be noted that LRU folios are not accounted for at the root
level, yet the folio->memcg_data points to the root_mem_cgroup. Hence,
the folio->memcg_data of LRU folios always points to a valid pointer.
However, the root_mem_cgroup does not possess an object cgroup.
Therefore, we also allocate an object cgroup for the root_mem_cgroup.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/memcontrol.c | 51 +++++++++++++++++++++++--------------------------
 1 file changed, 24 insertions(+), 27 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae234518d023c..544b3200db12d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -204,10 +204,10 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
 	return objcg;
 }
 
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
-				  struct mem_cgroup *parent)
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
 {
 	struct obj_cgroup *objcg, *iter;
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 
 	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
 
@@ -3294,30 +3294,17 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	return val;
 }
 
-static int memcg_online_kmem(struct mem_cgroup *memcg)
+static void memcg_online_kmem(struct mem_cgroup *memcg)
 {
-	struct obj_cgroup *objcg;
-
 	if (mem_cgroup_kmem_disabled())
-		return 0;
+		return;
 
 	if (unlikely(mem_cgroup_is_root(memcg)))
-		return 0;
-
-	objcg = obj_cgroup_alloc();
-	if (!objcg)
-		return -ENOMEM;
-
-	objcg->memcg = memcg;
-	rcu_assign_pointer(memcg->objcg, objcg);
-	obj_cgroup_get(objcg);
-	memcg->orig_objcg = objcg;
+		return;
 
 	static_branch_enable(&memcg_kmem_online_key);
 
 	memcg->kmemcg_id = memcg->id.id;
-
-	return 0;
 }
 
 static void memcg_offline_kmem(struct mem_cgroup *memcg)
@@ -3332,12 +3319,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 
 	parent = parent_mem_cgroup(memcg);
 	memcg_reparent_list_lrus(memcg, parent);
-
-	/*
-	 * Objcg's reparenting must be after list_lru's, make sure list_lru
-	 * helpers won't use parent's list_lru until child is drained.
-	 */
-	memcg_reparent_objcgs(memcg, parent);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -3854,9 +3835,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct obj_cgroup *objcg;
 
-	if (memcg_online_kmem(memcg))
-		goto remove_id;
+	memcg_online_kmem(memcg);
 
 	/*
 	 * A memcg must be visible for expand_shrinker_info()
@@ -3866,6 +3847,15 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (alloc_shrinker_info(memcg))
 		goto offline_kmem;
 
+	objcg = obj_cgroup_alloc();
+	if (!objcg)
+		goto free_shrinker;
+
+	objcg->memcg = memcg;
+	rcu_assign_pointer(memcg->objcg, objcg);
+	obj_cgroup_get(objcg);
+	memcg->orig_objcg = objcg;
+
 	if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
 		queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
 				   FLUSH_TIME);
@@ -3888,9 +3878,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	xa_store(&mem_cgroup_ids, memcg->id.id, memcg, GFP_KERNEL);
 
 	return 0;
+free_shrinker:
+	free_shrinker_info(memcg);
 offline_kmem:
 	memcg_offline_kmem(memcg);
-remove_id:
 	mem_cgroup_id_remove(memcg);
 	return -ENOMEM;
 }
@@ -3908,6 +3899,12 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 	memcg_offline_kmem(memcg);
 	reparent_deferred_split_queue(memcg);
+	/*
+	 * The reparenting of objcg must be after the reparenting of the
+	 * list_lru and deferred_split_queue above, which ensures that they will
+	 * not mistakenly get the parent list_lru and deferred_split_queue.
+	 */
+	memcg_reparent_objcgs(memcg);
 	reparent_shrinker_deferred(memcg);
 	wb_memcg_offline(memcg);
 	lru_gen_offline_memcg(memcg);
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case
  2025-12-17  7:27 ` [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
@ 2025-12-17 21:22   ` Johannes Weiner
  2025-12-18  6:25     ` Qi Zheng
  2025-12-19  0:23   ` Shakeel Butt
  2025-12-25  6:23   ` Chen Ridong
  2 siblings, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 21:22 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:30PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Pagecache pages are charged at allocation time and hold a reference
> to the original memory cgroup until reclaimed. Depending on memory
> pressure, page sharing patterns between different cgroups and cgroup
> creation/destruction rates, many dying memory cgroups can be pinned
> by pagecache pages, reducing page reclaim efficiency and wasting
> memory. Converting LRU folios and most other raw memory cgroup pins
> to the object cgroup direction can fix this long-living problem.

Not a big deal, but since the coverletter will be preserved in git, I
don't think you need to repeat the full thesis.

> As a result, the objcg infrastructure is no longer solely applicable
> to the kmem case. In this patch, we extend the scope of the objcg
> infrastructure beyond the kmem case, enabling LRU folios to reuse
> it for folio charging purposes.

"To allow LRU page reparenting, the objcg infrastructure [...]"

> It should be noted that LRU folios are not accounted for at the root
> level, yet the folio->memcg_data points to the root_mem_cgroup. Hence,
> the folio->memcg_data of LRU folios always points to a valid pointer.
> However, the root_mem_cgroup does not possess an object cgroup.
> Therefore, we also allocate an object cgroup for the root_mem_cgroup.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case
  2025-12-17 21:22   ` Johannes Weiner
@ 2025-12-18  6:25     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-18  6:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/18/25 5:22 AM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:30PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> Pagecache pages are charged at allocation time and hold a reference
>> to the original memory cgroup until reclaimed. Depending on memory
>> pressure, page sharing patterns between different cgroups and cgroup
>> creation/destruction rates, many dying memory cgroups can be pinned
>> by pagecache pages, reducing page reclaim efficiency and wasting
>> memory. Converting LRU folios and most other raw memory cgroup pins
>> to the object cgroup direction can fix this long-living problem.
> 
> Not a big deal, but since the coverletter will be preserved in git, I
> don't think you need to repeat the full thesis.

Got it.

> 
>> As a result, the objcg infrastructure is no longer solely applicable
>> to the kmem case. In this patch, we extend the scope of the objcg
>> infrastructure beyond the kmem case, enabling LRU folios to reuse
>> it for folio charging purposes.
> 
> "To allow LRU page reparenting, the objcg infrastructure [...]"

OK, will do.

> 
>> It should be noted that LRU folios are not accounted for at the root
>> level, yet the folio->memcg_data points to the root_mem_cgroup. Hence,
>> the folio->memcg_data of LRU folios always points to a valid pointer.
>> However, the root_mem_cgroup does not possess an object cgroup.
>> Therefore, we also allocate an object cgroup for the root_mem_cgroup.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> Looks good to me.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!




^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case
  2025-12-17  7:27 ` [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
  2025-12-17 21:22   ` Johannes Weiner
@ 2025-12-19  0:23   ` Shakeel Butt
  2025-12-25  6:23   ` Chen Ridong
  2 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19  0:23 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:30PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Pagecache pages are charged at allocation time and hold a reference
> to the original memory cgroup until reclaimed. Depending on memory
> pressure, page sharing patterns between different cgroups and cgroup
> creation/destruction rates, many dying memory cgroups can be pinned
> by pagecache pages, reducing page reclaim efficiency and wasting
> memory. Converting LRU folios and most other raw memory cgroup pins
> to the object cgroup direction can fix this long-living problem.
> 
> As a result, the objcg infrastructure is no longer solely applicable
> to the kmem case. In this patch, we extend the scope of the objcg
> infrastructure beyond the kmem case, enabling LRU folios to reuse
> it for folio charging purposes.
> 
> It should be noted that LRU folios are not accounted for at the root
> level, yet the folio->memcg_data points to the root_mem_cgroup. Hence,
> the folio->memcg_data of LRU folios always points to a valid pointer.
> However, the root_mem_cgroup does not possess an object cgroup.
> Therefore, we also allocate an object cgroup for the root_mem_cgroup.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case
  2025-12-17  7:27 ` [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
  2025-12-17 21:22   ` Johannes Weiner
  2025-12-19  0:23   ` Shakeel Butt
@ 2025-12-25  6:23   ` Chen Ridong
  2 siblings, 0 replies; 149+ messages in thread
From: Chen Ridong @ 2025-12-25  6:23 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng



On 2025/12/17 15:27, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Pagecache pages are charged at allocation time and hold a reference
> to the original memory cgroup until reclaimed. Depending on memory
> pressure, page sharing patterns between different cgroups and cgroup
> creation/destruction rates, many dying memory cgroups can be pinned
> by pagecache pages, reducing page reclaim efficiency and wasting
> memory. Converting LRU folios and most other raw memory cgroup pins
> to the object cgroup direction can fix this long-living problem.
> 
> As a result, the objcg infrastructure is no longer solely applicable
> to the kmem case. In this patch, we extend the scope of the objcg
> infrastructure beyond the kmem case, enabling LRU folios to reuse
> it for folio charging purposes.
> 
> It should be noted that LRU folios are not accounted for at the root
> level, yet the folio->memcg_data points to the root_mem_cgroup. Hence,
> the folio->memcg_data of LRU folios always points to a valid pointer.
> However, the root_mem_cgroup does not possess an object cgroup.
> Therefore, we also allocate an object cgroup for the root_mem_cgroup.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  mm/memcontrol.c | 51 +++++++++++++++++++++++--------------------------
>  1 file changed, 24 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae234518d023c..544b3200db12d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -204,10 +204,10 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
>  	return objcg;
>  }
>  
> -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> -				  struct mem_cgroup *parent)
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
>  {
>  	struct obj_cgroup *objcg, *iter;
> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>  
>  	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
>  
> @@ -3294,30 +3294,17 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  	return val;
>  }
>  
> -static int memcg_online_kmem(struct mem_cgroup *memcg)
> +static void memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> -	struct obj_cgroup *objcg;
> -
>  	if (mem_cgroup_kmem_disabled())
> -		return 0;
> +		return;
>  
>  	if (unlikely(mem_cgroup_is_root(memcg)))
> -		return 0;
> -
> -	objcg = obj_cgroup_alloc();
> -	if (!objcg)
> -		return -ENOMEM;
> -
> -	objcg->memcg = memcg;
> -	rcu_assign_pointer(memcg->objcg, objcg);
> -	obj_cgroup_get(objcg);
> -	memcg->orig_objcg = objcg;
> +		return;
>  
>  	static_branch_enable(&memcg_kmem_online_key);
>  

Not about this patch.

Should we add?
	if (!memcg_kmem_online())
		static_branch_enable(&memcg_kmem_online_key);

>  	memcg->kmemcg_id = memcg->id.id;
> -
> -	return 0;
>  }
>  
>  static void memcg_offline_kmem(struct mem_cgroup *memcg)
> @@ -3332,12 +3319,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
>  
>  	parent = parent_mem_cgroup(memcg);
>  	memcg_reparent_list_lrus(memcg, parent);
> -
> -	/*
> -	 * Objcg's reparenting must be after list_lru's, make sure list_lru
> -	 * helpers won't use parent's list_lru until child is drained.
> -	 */
> -	memcg_reparent_objcgs(memcg, parent);
>  }
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> @@ -3854,9 +3835,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct obj_cgroup *objcg;
>  
> -	if (memcg_online_kmem(memcg))
> -		goto remove_id;
> +	memcg_online_kmem(memcg);
>  
>  	/*
>  	 * A memcg must be visible for expand_shrinker_info()
> @@ -3866,6 +3847,15 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	if (alloc_shrinker_info(memcg))
>  		goto offline_kmem;
>  
> +	objcg = obj_cgroup_alloc();
> +	if (!objcg)
> +		goto free_shrinker;
> +
> +	objcg->memcg = memcg;
> +	rcu_assign_pointer(memcg->objcg, objcg);
> +	obj_cgroup_get(objcg);
> +	memcg->orig_objcg = objcg;
> +
>  	if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
>  		queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
>  				   FLUSH_TIME);
> @@ -3888,9 +3878,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	xa_store(&mem_cgroup_ids, memcg->id.id, memcg, GFP_KERNEL);
>  
>  	return 0;
> +free_shrinker:
> +	free_shrinker_info(memcg);
>  offline_kmem:
>  	memcg_offline_kmem(memcg);
> -remove_id:
>  	mem_cgroup_id_remove(memcg);
>  	return -ENOMEM;
>  }
> @@ -3908,6 +3899,12 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  
>  	memcg_offline_kmem(memcg);
>  	reparent_deferred_split_queue(memcg);
> +	/*
> +	 * The reparenting of objcg must be after the reparenting of the
> +	 * list_lru and deferred_split_queue above, which ensures that they will
> +	 * not mistakenly get the parent list_lru and deferred_split_queue.
> +	 */
> +	memcg_reparent_objcgs(memcg);
>  	reparent_shrinker_deferred(memcg);
>  	wb_memcg_offline(memcg);
>  	lru_gen_offline_memcg(memcg);


Reviewed-by: Chen Ridong <chenridong@huawei.com>

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (5 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 21:28   ` Johannes Weiner
                     ` (2 more replies)
  2025-12-17  7:27 ` [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Qi Zheng
                   ` (22 subsequent siblings)
  29 siblings, 3 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

Memory cgroup functions such as get_mem_cgroup_from_folio() and
get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
even for the root memory cgroup. In contrast, the situation for
object cgroups has been different.

Previously, the root object cgroup couldn't be returned because
it didn't exist. Now that a valid root object cgroup exists, for
the sake of consistency, it's necessary to align the behavior of
object-cgroup-related operations with that of memory cgroup APIs.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/memcontrol.h | 26 +++++++++++++++++-----
 mm/memcontrol.c            | 45 ++++++++++++++++++++------------------
 mm/percpu.c                |  2 +-
 3 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 288dd6337f80f..776d9be1f446a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -332,6 +332,7 @@ struct mem_cgroup {
 #define MEMCG_CHARGE_BATCH 64U
 
 extern struct mem_cgroup *root_mem_cgroup;
+extern struct obj_cgroup *root_obj_cgroup;
 
 enum page_memcg_data_flags {
 	/* page->memcg_data is a pointer to an slabobj_ext vector */
@@ -549,6 +550,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+	return objcg == root_obj_cgroup;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
@@ -773,23 +779,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 
 static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
 {
+	if (obj_cgroup_is_root(objcg))
+		return true;
 	return percpu_ref_tryget(&objcg->refcnt);
 }
 
-static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
+				       unsigned long nr)
 {
-	percpu_ref_get(&objcg->refcnt);
+	if (!obj_cgroup_is_root(objcg))
+		percpu_ref_get_many(&objcg->refcnt, nr);
 }
 
-static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
-				       unsigned long nr)
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
 {
-	percpu_ref_get_many(&objcg->refcnt, nr);
+	obj_cgroup_get_many(objcg, 1);
 }
 
 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
-	if (objcg)
+	if (objcg && !obj_cgroup_is_root(objcg))
 		percpu_ref_put(&objcg->refcnt);
 }
 
@@ -1084,6 +1093,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return true;
 }
 
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+	return true;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 544b3200db12d..21b5aad34cae7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
 struct mem_cgroup *root_mem_cgroup __read_mostly;
 EXPORT_SYMBOL(root_mem_cgroup);
 
+struct obj_cgroup *root_obj_cgroup __read_mostly;
+
 /* Active memory cgroup to use from an interrupt context */
 DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
@@ -2634,15 +2636,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
 
 static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 {
-	struct obj_cgroup *objcg = NULL;
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
 
-	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
-		objcg = rcu_dereference(memcg->objcg);
 		if (likely(objcg && obj_cgroup_tryget(objcg)))
-			break;
-		objcg = NULL;
+			return objcg;
 	}
-	return objcg;
+
+	return NULL;
 }
 
 static struct obj_cgroup *current_objcg_update(void)
@@ -2716,18 +2717,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
 		 * Objcg reference is kept by the task, so it's safe
 		 * to use the objcg by the current task.
 		 */
-		return objcg;
+		return objcg ? : root_obj_cgroup;
 	}
 
 	memcg = this_cpu_read(int_active_memcg);
 	if (unlikely(memcg))
 		goto from_memcg;
 
-	return NULL;
+	return root_obj_cgroup;
 
 from_memcg:
-	objcg = NULL;
-	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		/*
 		 * Memcg pointer is protected by scope (see set_active_memcg())
 		 * and is pinning the corresponding objcg, so objcg can't go
@@ -2736,10 +2736,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
 		 */
 		objcg = rcu_dereference_check(memcg->objcg, 1);
 		if (likely(objcg))
-			break;
+			return objcg;
 	}
 
-	return objcg;
+	return root_obj_cgroup;
 }
 
 struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
@@ -2753,14 +2753,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 		objcg = __folio_objcg(folio);
 		obj_cgroup_get(objcg);
 	} else {
-		struct mem_cgroup *memcg;
-
 		rcu_read_lock();
-		memcg = __folio_memcg(folio);
-		if (memcg)
-			objcg = __get_obj_cgroup_from_memcg(memcg);
-		else
-			objcg = NULL;
+		objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
 		rcu_read_unlock();
 	}
 	return objcg;
@@ -2863,7 +2857,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	int ret = 0;
 
 	objcg = current_obj_cgroup();
-	if (objcg) {
+	if (objcg && !obj_cgroup_is_root(objcg)) {
 		ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
 		if (!ret) {
 			obj_cgroup_get(objcg);
@@ -3164,7 +3158,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	 * obj_cgroup_get() is used to get a permanent reference.
 	 */
 	objcg = current_obj_cgroup();
-	if (!objcg)
+	if (!objcg || obj_cgroup_is_root(objcg))
 		return true;
 
 	/*
@@ -3851,6 +3845,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (!objcg)
 		goto free_shrinker;
 
+	if (unlikely(mem_cgroup_is_root(memcg)))
+		root_obj_cgroup = objcg;
+
 	objcg->memcg = memcg;
 	rcu_assign_pointer(memcg->objcg, objcg);
 	obj_cgroup_get(objcg);
@@ -5471,6 +5468,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
+	if (obj_cgroup_is_root(objcg))
+		return;
+
 	VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
 
 	/* PF_MEMALLOC context, charging must succeed */
@@ -5498,6 +5498,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
+	if (obj_cgroup_is_root(objcg))
+		return;
+
 	obj_cgroup_uncharge(objcg, size);
 
 	rcu_read_lock();
diff --git a/mm/percpu.c b/mm/percpu.c
index 81462ce5866e1..5c1a9b77d6b93 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 		return true;
 
 	objcg = current_obj_cgroup();
-	if (!objcg)
+	if (!objcg || obj_cgroup_is_root(objcg))
 		return true;
 
 	if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup
  2025-12-17  7:27 ` [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
@ 2025-12-17 21:28   ` Johannes Weiner
  2025-12-19  0:39   ` Shakeel Butt
  2025-12-26  1:03   ` Chen Ridong
  2 siblings, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 21:28 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:31PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Memory cgroup functions such as get_mem_cgroup_from_folio() and
> get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
> even for the root memory cgroup. In contrast, the situation for
> object cgroups has been different.
> 
> Previously, the root object cgroup couldn't be returned because
> it didn't exist. Now that a valid root object cgroup exists, for
> the sake of consistency, it's necessary to align the behavior of
> object-cgroup-related operations with that of memory cgroup APIs.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup
  2025-12-17  7:27 ` [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
  2025-12-17 21:28   ` Johannes Weiner
@ 2025-12-19  0:39   ` Shakeel Butt
  2025-12-26  1:03   ` Chen Ridong
  2 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19  0:39 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:31PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Memory cgroup functions such as get_mem_cgroup_from_folio() and
> get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
> even for the root memory cgroup. In contrast, the situation for
> object cgroups has been different.
> 
> Previously, the root object cgroup couldn't be returned because
> it didn't exist. Now that a valid root object cgroup exists, for
> the sake of consistency, it's necessary to align the behavior of
> object-cgroup-related operations with that of memory cgroup APIs.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup
  2025-12-17  7:27 ` [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
  2025-12-17 21:28   ` Johannes Weiner
  2025-12-19  0:39   ` Shakeel Butt
@ 2025-12-26  1:03   ` Chen Ridong
  2025-12-26  3:10     ` Muchun Song
  2 siblings, 1 reply; 149+ messages in thread
From: Chen Ridong @ 2025-12-26  1:03 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng



On 2025/12/17 15:27, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Memory cgroup functions such as get_mem_cgroup_from_folio() and
> get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
> even for the root memory cgroup. In contrast, the situation for
> object cgroups has been different.
> 
> Previously, the root object cgroup couldn't be returned because
> it didn't exist. Now that a valid root object cgroup exists, for
> the sake of consistency, it's necessary to align the behavior of
> object-cgroup-related operations with that of memory cgroup APIs.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/memcontrol.h | 26 +++++++++++++++++-----
>  mm/memcontrol.c            | 45 ++++++++++++++++++++------------------
>  mm/percpu.c                |  2 +-
>  3 files changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 288dd6337f80f..776d9be1f446a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -332,6 +332,7 @@ struct mem_cgroup {
>  #define MEMCG_CHARGE_BATCH 64U
>  
>  extern struct mem_cgroup *root_mem_cgroup;
> +extern struct obj_cgroup *root_obj_cgroup;
>  
>  enum page_memcg_data_flags {
>  	/* page->memcg_data is a pointer to an slabobj_ext vector */
> @@ -549,6 +550,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>  	return (memcg == root_mem_cgroup);
>  }
>  
> +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
> +{
> +	return objcg == root_obj_cgroup;
> +}
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>  	return !cgroup_subsys_enabled(memory_cgrp_subsys);
> @@ -773,23 +779,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>  
>  static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
>  {
> +	if (obj_cgroup_is_root(objcg))
> +		return true;
>  	return percpu_ref_tryget(&objcg->refcnt);
>  }
>  
> -static inline void obj_cgroup_get(struct obj_cgroup *objcg)
> +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
> +				       unsigned long nr)
>  {
> -	percpu_ref_get(&objcg->refcnt);
> +	if (!obj_cgroup_is_root(objcg))
> +		percpu_ref_get_many(&objcg->refcnt, nr);
>  }
>  
> -static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
> -				       unsigned long nr)
> +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>  {
> -	percpu_ref_get_many(&objcg->refcnt, nr);
> +	obj_cgroup_get_many(objcg, 1);
>  }
>  
>  static inline void obj_cgroup_put(struct obj_cgroup *objcg)
>  {
> -	if (objcg)
> +	if (objcg && !obj_cgroup_is_root(objcg))
>  		percpu_ref_put(&objcg->refcnt);
>  }
>  
> @@ -1084,6 +1093,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>  	return true;
>  }
>  
> +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
> +{
> +	return true;
> +}
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>  	return true;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 544b3200db12d..21b5aad34cae7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>  struct mem_cgroup *root_mem_cgroup __read_mostly;
>  EXPORT_SYMBOL(root_mem_cgroup);
>  
> +struct obj_cgroup *root_obj_cgroup __read_mostly;
> +
>  /* Active memory cgroup to use from an interrupt context */
>  DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>  EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
> @@ -2634,15 +2636,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
>  
>  static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>  {
> -	struct obj_cgroup *objcg = NULL;
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
>  
> -	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> -		objcg = rcu_dereference(memcg->objcg);
>  		if (likely(objcg && obj_cgroup_tryget(objcg)))
> -			break;
> -		objcg = NULL;
> +			return objcg;
>  	}
> -	return objcg;
> +
> +	return NULL;
>  }
>  
>  static struct obj_cgroup *current_objcg_update(void)
> @@ -2716,18 +2717,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>  		 * Objcg reference is kept by the task, so it's safe
>  		 * to use the objcg by the current task.
>  		 */
> -		return objcg;
> +		return objcg ? : root_obj_cgroup;
>  	}
>  
>  	memcg = this_cpu_read(int_active_memcg);
>  	if (unlikely(memcg))
>  		goto from_memcg;
>  
> -	return NULL;
> +	return root_obj_cgroup;
>  
>  from_memcg:
> -	objcg = NULL;
> -	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>  		/*
>  		 * Memcg pointer is protected by scope (see set_active_memcg())
>  		 * and is pinning the corresponding objcg, so objcg can't go
> @@ -2736,10 +2736,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>  		 */
>  		objcg = rcu_dereference_check(memcg->objcg, 1);
>  		if (likely(objcg))
> -			break;
> +			return objcg;
>  	}
>  
> -	return objcg;
> +	return root_obj_cgroup;
>  }
>  
>  struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
> @@ -2753,14 +2753,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
>  		objcg = __folio_objcg(folio);
>  		obj_cgroup_get(objcg);
>  	} else {
> -		struct mem_cgroup *memcg;
> -
>  		rcu_read_lock();
> -		memcg = __folio_memcg(folio);
> -		if (memcg)
> -			objcg = __get_obj_cgroup_from_memcg(memcg);
> -		else
> -			objcg = NULL;
> +		objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
>  		rcu_read_unlock();
>  	}
>  	return objcg;
> @@ -2863,7 +2857,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  	int ret = 0;
>  
>  	objcg = current_obj_cgroup();
> -	if (objcg) {
> +	if (objcg && !obj_cgroup_is_root(objcg)) {
>  		ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
>  		if (!ret) {
>  			obj_cgroup_get(objcg);
> @@ -3164,7 +3158,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  	 * obj_cgroup_get() is used to get a permanent reference.
>  	 */
>  	objcg = current_obj_cgroup();
> -	if (!objcg)
> +	if (!objcg || obj_cgroup_is_root(objcg))
>  		return true;
>  
>  	/*
> @@ -3851,6 +3845,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	if (!objcg)
>  		goto free_shrinker;
>  
> +	if (unlikely(mem_cgroup_is_root(memcg)))
> +		root_obj_cgroup = objcg;
> +
>  	objcg->memcg = memcg;
>  	rcu_assign_pointer(memcg->objcg, objcg);
>  	obj_cgroup_get(objcg);
> @@ -5471,6 +5468,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return;
>  
> +	if (obj_cgroup_is_root(objcg))
> +		return;
> +
>  	VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
>  
>  	/* PF_MEMALLOC context, charging must succeed */
> @@ -5498,6 +5498,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return;
>  
> +	if (obj_cgroup_is_root(objcg))
> +		return;
> +
>  	obj_cgroup_uncharge(objcg, size);
>  

If we modify zswap by adding MEMCG_ZSWAP_B and MEMCG_ZSWAPPED with obj_cgroup_charge_zswap , then
remove a control group (via rmdir) and reparent its objects to the root cgroup, then for the root
cgroup, obj_cgroup_uncharge_zswap will return directly due to the obj_cgroup_is_root check. Would
this cause us to miss decrementing MEMCG_ZSWAP_B and MEMCG_ZSWAPPED?

>  	rcu_read_lock();
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 81462ce5866e1..5c1a9b77d6b93 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
>  		return true;
>  
>  	objcg = current_obj_cgroup();
> -	if (!objcg)
> +	if (!objcg || obj_cgroup_is_root(objcg))
>  		return true;
>  
>  	if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup
  2025-12-26  1:03   ` Chen Ridong
@ 2025-12-26  3:10     ` Muchun Song
  2025-12-26  3:50       ` Chen Ridong
  0 siblings, 1 reply; 149+ messages in thread
From: Muchun Song @ 2025-12-26  3:10 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Qi Zheng,
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, mkoutny, akpm, hamzamahfooz,
	apais, lance.yang



On 2025/12/26 09:03, Chen Ridong wrote:
>
> On 2025/12/17 15:27, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> Memory cgroup functions such as get_mem_cgroup_from_folio() and
>> get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
>> even for the root memory cgroup. In contrast, the situation for
>> object cgroups has been different.
>>
>> Previously, the root object cgroup couldn't be returned because
>> it didn't exist. Now that a valid root object cgroup exists, for
>> the sake of consistency, it's necessary to align the behavior of
>> object-cgroup-related operations with that of memory cgroup APIs.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/memcontrol.h | 26 +++++++++++++++++-----
>>   mm/memcontrol.c            | 45 ++++++++++++++++++++------------------
>>   mm/percpu.c                |  2 +-
>>   3 files changed, 45 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 288dd6337f80f..776d9be1f446a 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -332,6 +332,7 @@ struct mem_cgroup {
>>   #define MEMCG_CHARGE_BATCH 64U
>>   
>>   extern struct mem_cgroup *root_mem_cgroup;
>> +extern struct obj_cgroup *root_obj_cgroup;
>>   
>>   enum page_memcg_data_flags {
>>   	/* page->memcg_data is a pointer to an slabobj_ext vector */
>> @@ -549,6 +550,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>>   	return (memcg == root_mem_cgroup);
>>   }
>>   
>> +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
>> +{
>> +	return objcg == root_obj_cgroup;
>> +}
>> +
>>   static inline bool mem_cgroup_disabled(void)
>>   {
>>   	return !cgroup_subsys_enabled(memory_cgrp_subsys);
>> @@ -773,23 +779,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>>   
>>   static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
>>   {
>> +	if (obj_cgroup_is_root(objcg))
>> +		return true;
>>   	return percpu_ref_tryget(&objcg->refcnt);
>>   }
>>   
>> -static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>> +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
>> +				       unsigned long nr)
>>   {
>> -	percpu_ref_get(&objcg->refcnt);
>> +	if (!obj_cgroup_is_root(objcg))
>> +		percpu_ref_get_many(&objcg->refcnt, nr);
>>   }
>>   
>> -static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
>> -				       unsigned long nr)
>> +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>>   {
>> -	percpu_ref_get_many(&objcg->refcnt, nr);
>> +	obj_cgroup_get_many(objcg, 1);
>>   }
>>   
>>   static inline void obj_cgroup_put(struct obj_cgroup *objcg)
>>   {
>> -	if (objcg)
>> +	if (objcg && !obj_cgroup_is_root(objcg))
>>   		percpu_ref_put(&objcg->refcnt);
>>   }
>>   
>> @@ -1084,6 +1093,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>>   	return true;
>>   }
>>   
>> +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
>> +{
>> +	return true;
>> +}
>> +
>>   static inline bool mem_cgroup_disabled(void)
>>   {
>>   	return true;
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 544b3200db12d..21b5aad34cae7 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>>   struct mem_cgroup *root_mem_cgroup __read_mostly;
>>   EXPORT_SYMBOL(root_mem_cgroup);
>>   
>> +struct obj_cgroup *root_obj_cgroup __read_mostly;
>> +
>>   /* Active memory cgroup to use from an interrupt context */
>>   DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>>   EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
>> @@ -2634,15 +2636,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
>>   
>>   static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>>   {
>> -	struct obj_cgroup *objcg = NULL;
>> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> +		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
>>   
>> -	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>> -		objcg = rcu_dereference(memcg->objcg);
>>   		if (likely(objcg && obj_cgroup_tryget(objcg)))
>> -			break;
>> -		objcg = NULL;
>> +			return objcg;
>>   	}
>> -	return objcg;
>> +
>> +	return NULL;
>>   }
>>   
>>   static struct obj_cgroup *current_objcg_update(void)
>> @@ -2716,18 +2717,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>>   		 * Objcg reference is kept by the task, so it's safe
>>   		 * to use the objcg by the current task.
>>   		 */
>> -		return objcg;
>> +		return objcg ? : root_obj_cgroup;
>>   	}
>>   
>>   	memcg = this_cpu_read(int_active_memcg);
>>   	if (unlikely(memcg))
>>   		goto from_memcg;
>>   
>> -	return NULL;
>> +	return root_obj_cgroup;
>>   
>>   from_memcg:
>> -	objcg = NULL;
>> -	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>   		/*
>>   		 * Memcg pointer is protected by scope (see set_active_memcg())
>>   		 * and is pinning the corresponding objcg, so objcg can't go
>> @@ -2736,10 +2736,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>>   		 */
>>   		objcg = rcu_dereference_check(memcg->objcg, 1);
>>   		if (likely(objcg))
>> -			break;
>> +			return objcg;
>>   	}
>>   
>> -	return objcg;
>> +	return root_obj_cgroup;
>>   }
>>   
>>   struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
>> @@ -2753,14 +2753,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
>>   		objcg = __folio_objcg(folio);
>>   		obj_cgroup_get(objcg);
>>   	} else {
>> -		struct mem_cgroup *memcg;
>> -
>>   		rcu_read_lock();
>> -		memcg = __folio_memcg(folio);
>> -		if (memcg)
>> -			objcg = __get_obj_cgroup_from_memcg(memcg);
>> -		else
>> -			objcg = NULL;
>> +		objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
>>   		rcu_read_unlock();
>>   	}
>>   	return objcg;
>> @@ -2863,7 +2857,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>>   	int ret = 0;
>>   
>>   	objcg = current_obj_cgroup();
>> -	if (objcg) {
>> +	if (objcg && !obj_cgroup_is_root(objcg)) {
>>   		ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
>>   		if (!ret) {
>>   			obj_cgroup_get(objcg);
>> @@ -3164,7 +3158,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>   	 * obj_cgroup_get() is used to get a permanent reference.
>>   	 */
>>   	objcg = current_obj_cgroup();
>> -	if (!objcg)
>> +	if (!objcg || obj_cgroup_is_root(objcg))
>>   		return true;
>>   
>>   	/*
>> @@ -3851,6 +3845,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>>   	if (!objcg)
>>   		goto free_shrinker;
>>   
>> +	if (unlikely(mem_cgroup_is_root(memcg)))
>> +		root_obj_cgroup = objcg;
>> +
>>   	objcg->memcg = memcg;
>>   	rcu_assign_pointer(memcg->objcg, objcg);
>>   	obj_cgroup_get(objcg);
>> @@ -5471,6 +5468,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
>>   	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>   		return;
>>   
>> +	if (obj_cgroup_is_root(objcg))
>> +		return;
>> +
>>   	VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
>>   
>>   	/* PF_MEMALLOC context, charging must succeed */
>> @@ -5498,6 +5498,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>>   	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>   		return;
>>   
>> +	if (obj_cgroup_is_root(objcg))
>> +		return;
>> +
>>   	obj_cgroup_uncharge(objcg, size);
>>   
> If we modify zswap by adding MEMCG_ZSWAP_B and MEMCG_ZSWAPPED with obj_cgroup_charge_zswap , then
> remove a control group (via rmdir) and reparent its objects to the root cgroup, then for the root
> cgroup, obj_cgroup_uncharge_zswap will return directly due to the obj_cgroup_is_root check. Would
> this cause us to miss decrementing MEMCG_ZSWAP_B and MEMCG_ZSWAPPED?

I'm not sure I fully understand the problem—how could this happen, given 
that
obj_cgroup_charge_zswap also checks for the root objcg?

Muchun,
Thanks.
>
>>   	rcu_read_lock();
>> diff --git a/mm/percpu.c b/mm/percpu.c
>> index 81462ce5866e1..5c1a9b77d6b93 100644
>> --- a/mm/percpu.c
>> +++ b/mm/percpu.c
>> @@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
>>   		return true;
>>   
>>   	objcg = current_obj_cgroup();
>> -	if (!objcg)
>> +	if (!objcg || obj_cgroup_is_root(objcg))
>>   		return true;
>>   
>>   	if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup
  2025-12-26  3:10     ` Muchun Song
@ 2025-12-26  3:50       ` Chen Ridong
  2025-12-26  3:58         ` Chen Ridong
  0 siblings, 1 reply; 149+ messages in thread
From: Chen Ridong @ 2025-12-26  3:50 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Qi Zheng,
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, mkoutny, akpm, hamzamahfooz,
	apais, lance.yang



On 2025/12/26 11:10, Muchun Song wrote:
> 
> 
> On 2025/12/26 09:03, Chen Ridong wrote:
>>
>> On 2025/12/17 15:27, Qi Zheng wrote:
>>> From: Muchun Song <songmuchun@bytedance.com>
>>>
>>> Memory cgroup functions such as get_mem_cgroup_from_folio() and
>>> get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
>>> even for the root memory cgroup. In contrast, the situation for
>>> object cgroups has been different.
>>>
>>> Previously, the root object cgroup couldn't be returned because
>>> it didn't exist. Now that a valid root object cgroup exists, for
>>> the sake of consistency, it's necessary to align the behavior of
>>> object-cgroup-related operations with that of memory cgroup APIs.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>   include/linux/memcontrol.h | 26 +++++++++++++++++-----
>>>   mm/memcontrol.c            | 45 ++++++++++++++++++++------------------
>>>   mm/percpu.c                |  2 +-
>>>   3 files changed, 45 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 288dd6337f80f..776d9be1f446a 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -332,6 +332,7 @@ struct mem_cgroup {
>>>   #define MEMCG_CHARGE_BATCH 64U
>>>     extern struct mem_cgroup *root_mem_cgroup;
>>> +extern struct obj_cgroup *root_obj_cgroup;
>>>     enum page_memcg_data_flags {
>>>       /* page->memcg_data is a pointer to an slabobj_ext vector */
>>> @@ -549,6 +550,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>>>       return (memcg == root_mem_cgroup);
>>>   }
>>>   +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
>>> +{
>>> +    return objcg == root_obj_cgroup;
>>> +}
>>> +
>>>   static inline bool mem_cgroup_disabled(void)
>>>   {
>>>       return !cgroup_subsys_enabled(memory_cgrp_subsys);
>>> @@ -773,23 +779,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>>>     static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
>>>   {
>>> +    if (obj_cgroup_is_root(objcg))
>>> +        return true;
>>>       return percpu_ref_tryget(&objcg->refcnt);
>>>   }
>>>   -static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>>> +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
>>> +                       unsigned long nr)
>>>   {
>>> -    percpu_ref_get(&objcg->refcnt);
>>> +    if (!obj_cgroup_is_root(objcg))
>>> +        percpu_ref_get_many(&objcg->refcnt, nr);
>>>   }
>>>   -static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
>>> -                       unsigned long nr)
>>> +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>>>   {
>>> -    percpu_ref_get_many(&objcg->refcnt, nr);
>>> +    obj_cgroup_get_many(objcg, 1);
>>>   }
>>>     static inline void obj_cgroup_put(struct obj_cgroup *objcg)
>>>   {
>>> -    if (objcg)
>>> +    if (objcg && !obj_cgroup_is_root(objcg))
>>>           percpu_ref_put(&objcg->refcnt);
>>>   }
>>>   @@ -1084,6 +1093,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>>>       return true;
>>>   }
>>>   +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
>>> +{
>>> +    return true;
>>> +}
>>> +
>>>   static inline bool mem_cgroup_disabled(void)
>>>   {
>>>       return true;
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 544b3200db12d..21b5aad34cae7 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>>>   struct mem_cgroup *root_mem_cgroup __read_mostly;
>>>   EXPORT_SYMBOL(root_mem_cgroup);
>>>   +struct obj_cgroup *root_obj_cgroup __read_mostly;
>>> +
>>>   /* Active memory cgroup to use from an interrupt context */
>>>   DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>>>   EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
>>> @@ -2634,15 +2636,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
>>>     static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>>>   {
>>> -    struct obj_cgroup *objcg = NULL;
>>> +    for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>> +        struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
>>>   -    for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>> -        objcg = rcu_dereference(memcg->objcg);
>>>           if (likely(objcg && obj_cgroup_tryget(objcg)))
>>> -            break;
>>> -        objcg = NULL;
>>> +            return objcg;
>>>       }
>>> -    return objcg;
>>> +
>>> +    return NULL;
>>>   }
>>>     static struct obj_cgroup *current_objcg_update(void)
>>> @@ -2716,18 +2717,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>>>            * Objcg reference is kept by the task, so it's safe
>>>            * to use the objcg by the current task.
>>>            */
>>> -        return objcg;
>>> +        return objcg ? : root_obj_cgroup;
>>>       }
>>>         memcg = this_cpu_read(int_active_memcg);
>>>       if (unlikely(memcg))
>>>           goto from_memcg;
>>>   -    return NULL;
>>> +    return root_obj_cgroup;
>>>     from_memcg:
>>> -    objcg = NULL;
>>> -    for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>> +    for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>>           /*
>>>            * Memcg pointer is protected by scope (see set_active_memcg())
>>>            * and is pinning the corresponding objcg, so objcg can't go
>>> @@ -2736,10 +2736,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>>>            */
>>>           objcg = rcu_dereference_check(memcg->objcg, 1);
>>>           if (likely(objcg))
>>> -            break;
>>> +            return objcg;
>>>       }
>>>   -    return objcg;
>>> +    return root_obj_cgroup;
>>>   }
>>>     struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
>>> @@ -2753,14 +2753,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
>>>           objcg = __folio_objcg(folio);
>>>           obj_cgroup_get(objcg);
>>>       } else {
>>> -        struct mem_cgroup *memcg;
>>> -
>>>           rcu_read_lock();
>>> -        memcg = __folio_memcg(folio);
>>> -        if (memcg)
>>> -            objcg = __get_obj_cgroup_from_memcg(memcg);
>>> -        else
>>> -            objcg = NULL;
>>> +        objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
>>>           rcu_read_unlock();
>>>       }
>>>       return objcg;
>>> @@ -2863,7 +2857,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>>>       int ret = 0;
>>>         objcg = current_obj_cgroup();
>>> -    if (objcg) {
>>> +    if (objcg && !obj_cgroup_is_root(objcg)) {
>>>           ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
>>>           if (!ret) {
>>>               obj_cgroup_get(objcg);
>>> @@ -3164,7 +3158,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>>        * obj_cgroup_get() is used to get a permanent reference.
>>>        */
>>>       objcg = current_obj_cgroup();
>>> -    if (!objcg)
>>> +    if (!objcg || obj_cgroup_is_root(objcg))
>>>           return true;
>>>         /*
>>> @@ -3851,6 +3845,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>>>       if (!objcg)
>>>           goto free_shrinker;
>>>   +    if (unlikely(mem_cgroup_is_root(memcg)))
>>> +        root_obj_cgroup = objcg;
>>> +
>>>       objcg->memcg = memcg;
>>>       rcu_assign_pointer(memcg->objcg, objcg);
>>>       obj_cgroup_get(objcg);
>>> @@ -5471,6 +5468,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
>>>       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>>           return;
>>>   +    if (obj_cgroup_is_root(objcg))
>>> +        return;
>>> +
>>>       VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
>>>         /* PF_MEMALLOC context, charging must succeed */
>>> @@ -5498,6 +5498,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>>>       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>>           return;
>>>   +    if (obj_cgroup_is_root(objcg))
>>> +        return;
>>> +
>>>       obj_cgroup_uncharge(objcg, size);
>>>   
>> If we modify zswap by adding MEMCG_ZSWAP_B and MEMCG_ZSWAPPED with obj_cgroup_charge_zswap , then
>> remove a control group (via rmdir) and reparent its objects to the root cgroup, then for the root
>> cgroup, obj_cgroup_uncharge_zswap will return directly due to the obj_cgroup_is_root check. Would
>> this cause us to miss decrementing MEMCG_ZSWAP_B and MEMCG_ZSWAPPED?
> 
> I'm not sure I fully understand the problem—how could this happen, given that
> obj_cgroup_charge_zswap also checks for the root objcg?
> 
> Muchun,
> Thanks.

That is:

1. memcg A is under the root memcg.
2. obj_cgroup_charge_zswap charges memcg A.
3. After rmdir A, the obj of A is reparented to the root memcg.
4. obj_cgroup_uncharge_zswap does nothing and returns, since the object is now associated with the root.

Thus, the root memcg will miss decrementing MEMCG_ZSWAP_B and MEMCG_ZSWAPPED, correct? Or am I
missing something?

>>
>>>       rcu_read_lock();
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index 81462ce5866e1..5c1a9b77d6b93 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
>>>           return true;
>>>         objcg = current_obj_cgroup();
>>> -    if (!objcg)
>>> +    if (!objcg || obj_cgroup_is_root(objcg))
>>>           return true;
>>>         if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
> 

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup
  2025-12-26  3:50       ` Chen Ridong
@ 2025-12-26  3:58         ` Chen Ridong
  0 siblings, 0 replies; 149+ messages in thread
From: Chen Ridong @ 2025-12-26  3:58 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Qi Zheng,
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, mkoutny, akpm, hamzamahfooz,
	apais, lance.yang



On 2025/12/26 11:50, Chen Ridong wrote:
> 
> 
> On 2025/12/26 11:10, Muchun Song wrote:
>>
>>
>> On 2025/12/26 09:03, Chen Ridong wrote:
>>>
>>> On 2025/12/17 15:27, Qi Zheng wrote:
>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>
>>>> Memory cgroup functions such as get_mem_cgroup_from_folio() and
>>>> get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
>>>> even for the root memory cgroup. In contrast, the situation for
>>>> object cgroups has been different.
>>>>
>>>> Previously, the root object cgroup couldn't be returned because
>>>> it didn't exist. Now that a valid root object cgroup exists, for
>>>> the sake of consistency, it's necessary to align the behavior of
>>>> object-cgroup-related operations with that of memory cgroup APIs.
>>>>
>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>>   include/linux/memcontrol.h | 26 +++++++++++++++++-----
>>>>   mm/memcontrol.c            | 45 ++++++++++++++++++++------------------
>>>>   mm/percpu.c                |  2 +-
>>>>   3 files changed, 45 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>> index 288dd6337f80f..776d9be1f446a 100644
>>>> --- a/include/linux/memcontrol.h
>>>> +++ b/include/linux/memcontrol.h
>>>> @@ -332,6 +332,7 @@ struct mem_cgroup {
>>>>   #define MEMCG_CHARGE_BATCH 64U
>>>>     extern struct mem_cgroup *root_mem_cgroup;
>>>> +extern struct obj_cgroup *root_obj_cgroup;
>>>>     enum page_memcg_data_flags {
>>>>       /* page->memcg_data is a pointer to an slabobj_ext vector */
>>>> @@ -549,6 +550,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>>>>       return (memcg == root_mem_cgroup);
>>>>   }
>>>>   +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
>>>> +{
>>>> +    return objcg == root_obj_cgroup;
>>>> +}
>>>> +
>>>>   static inline bool mem_cgroup_disabled(void)
>>>>   {
>>>>       return !cgroup_subsys_enabled(memory_cgrp_subsys);
>>>> @@ -773,23 +779,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>>>>     static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
>>>>   {
>>>> +    if (obj_cgroup_is_root(objcg))
>>>> +        return true;
>>>>       return percpu_ref_tryget(&objcg->refcnt);
>>>>   }
>>>>   -static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>>>> +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
>>>> +                       unsigned long nr)
>>>>   {
>>>> -    percpu_ref_get(&objcg->refcnt);
>>>> +    if (!obj_cgroup_is_root(objcg))
>>>> +        percpu_ref_get_many(&objcg->refcnt, nr);
>>>>   }
>>>>   -static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
>>>> -                       unsigned long nr)
>>>> +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
>>>>   {
>>>> -    percpu_ref_get_many(&objcg->refcnt, nr);
>>>> +    obj_cgroup_get_many(objcg, 1);
>>>>   }
>>>>     static inline void obj_cgroup_put(struct obj_cgroup *objcg)
>>>>   {
>>>> -    if (objcg)
>>>> +    if (objcg && !obj_cgroup_is_root(objcg))
>>>>           percpu_ref_put(&objcg->refcnt);
>>>>   }
>>>>   @@ -1084,6 +1093,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>>>>       return true;
>>>>   }
>>>>   +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
>>>> +{
>>>> +    return true;
>>>> +}
>>>> +
>>>>   static inline bool mem_cgroup_disabled(void)
>>>>   {
>>>>       return true;
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 544b3200db12d..21b5aad34cae7 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>>>>   struct mem_cgroup *root_mem_cgroup __read_mostly;
>>>>   EXPORT_SYMBOL(root_mem_cgroup);
>>>>   +struct obj_cgroup *root_obj_cgroup __read_mostly;
>>>> +
>>>>   /* Active memory cgroup to use from an interrupt context */
>>>>   DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>>>>   EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
>>>> @@ -2634,15 +2636,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
>>>>     static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>>>>   {
>>>> -    struct obj_cgroup *objcg = NULL;
>>>> +    for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>>> +        struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
>>>>   -    for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>>> -        objcg = rcu_dereference(memcg->objcg);
>>>>           if (likely(objcg && obj_cgroup_tryget(objcg)))
>>>> -            break;
>>>> -        objcg = NULL;
>>>> +            return objcg;
>>>>       }
>>>> -    return objcg;
>>>> +
>>>> +    return NULL;
>>>>   }
>>>>     static struct obj_cgroup *current_objcg_update(void)
>>>> @@ -2716,18 +2717,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>>>>            * Objcg reference is kept by the task, so it's safe
>>>>            * to use the objcg by the current task.
>>>>            */
>>>> -        return objcg;
>>>> +        return objcg ? : root_obj_cgroup;
>>>>       }
>>>>         memcg = this_cpu_read(int_active_memcg);
>>>>       if (unlikely(memcg))
>>>>           goto from_memcg;
>>>>   -    return NULL;
>>>> +    return root_obj_cgroup;
>>>>     from_memcg:
>>>> -    objcg = NULL;
>>>> -    for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>>> +    for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>>>           /*
>>>>            * Memcg pointer is protected by scope (see set_active_memcg())
>>>>            * and is pinning the corresponding objcg, so objcg can't go
>>>> @@ -2736,10 +2736,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
>>>>            */
>>>>           objcg = rcu_dereference_check(memcg->objcg, 1);
>>>>           if (likely(objcg))
>>>> -            break;
>>>> +            return objcg;
>>>>       }
>>>>   -    return objcg;
>>>> +    return root_obj_cgroup;
>>>>   }
>>>>     struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
>>>> @@ -2753,14 +2753,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
>>>>           objcg = __folio_objcg(folio);
>>>>           obj_cgroup_get(objcg);
>>>>       } else {
>>>> -        struct mem_cgroup *memcg;
>>>> -
>>>>           rcu_read_lock();
>>>> -        memcg = __folio_memcg(folio);
>>>> -        if (memcg)
>>>> -            objcg = __get_obj_cgroup_from_memcg(memcg);
>>>> -        else
>>>> -            objcg = NULL;
>>>> +        objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
>>>>           rcu_read_unlock();
>>>>       }
>>>>       return objcg;
>>>> @@ -2863,7 +2857,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>>>>       int ret = 0;
>>>>         objcg = current_obj_cgroup();
>>>> -    if (objcg) {
>>>> +    if (objcg && !obj_cgroup_is_root(objcg)) {
>>>>           ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
>>>>           if (!ret) {
>>>>               obj_cgroup_get(objcg);
>>>> @@ -3164,7 +3158,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>>>>        * obj_cgroup_get() is used to get a permanent reference.
>>>>        */
>>>>       objcg = current_obj_cgroup();
>>>> -    if (!objcg)
>>>> +    if (!objcg || obj_cgroup_is_root(objcg))
>>>>           return true;
>>>>         /*
>>>> @@ -3851,6 +3845,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>>>>       if (!objcg)
>>>>           goto free_shrinker;
>>>>   +    if (unlikely(mem_cgroup_is_root(memcg)))
>>>> +        root_obj_cgroup = objcg;
>>>> +
>>>>       objcg->memcg = memcg;
>>>>       rcu_assign_pointer(memcg->objcg, objcg);
>>>>       obj_cgroup_get(objcg);
>>>> @@ -5471,6 +5468,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
>>>>       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>>>           return;
>>>>   +    if (obj_cgroup_is_root(objcg))
>>>> +        return;
>>>> +
>>>>       VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
>>>>         /* PF_MEMALLOC context, charging must succeed */
>>>> @@ -5498,6 +5498,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>>>>       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>>>>           return;
>>>>   +    if (obj_cgroup_is_root(objcg))
>>>> +        return;
>>>> +
>>>>       obj_cgroup_uncharge(objcg, size);
>>>>   
>>> If we modify zswap by adding MEMCG_ZSWAP_B and MEMCG_ZSWAPPED with obj_cgroup_charge_zswap , then
>>> remove a control group (via rmdir) and reparent its objects to the root cgroup, then for the root
>>> cgroup, obj_cgroup_uncharge_zswap will return directly due to the obj_cgroup_is_root check. Would
>>> this cause us to miss decrementing MEMCG_ZSWAP_B and MEMCG_ZSWAPPED?
>>
>> I'm not sure I fully understand the problem—how could this happen, given that
>> obj_cgroup_charge_zswap also checks for the root objcg?
>>
>> Muchun,
>> Thanks.
> 
> That is:
> 
> 1. memcg A is under the root memcg.
> 2. obj_cgroup_charge_zswap charges memcg A.
> 3. After rmdir A, the obj of A is reparented to the root memcg.
> 4. obj_cgroup_uncharge_zswap does nothing and returns, since the object is now associated with the root.
> 
> Thus, the root memcg will miss decrementing MEMCG_ZSWAP_B and MEMCG_ZSWAPPED, correct? Or am I
> missing something?
> 

Oh, I understand now — please ignore my previous question. Apologies for the noise.

Even after reparenting, the object from A should not be treated as a root object.

>>>
>>>>       rcu_read_lock();
>>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>>> index 81462ce5866e1..5c1a9b77d6b93 100644
>>>> --- a/mm/percpu.c
>>>> +++ b/mm/percpu.c
>>>> @@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
>>>>           return true;
>>>>         objcg = current_obj_cgroup();
>>>> -    if (!objcg)
>>>> +    if (!objcg || obj_cgroup_is_root(objcg))
>>>>           return true;
>>>>         if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
>>
> 

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (6 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 21:45   ` Johannes Weiner
  2025-12-17  7:27 ` [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
                   ` (21 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in get_mem_cgroup_from_folio().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/memcontrol.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21b5aad34cae7..431b3154c70c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
  */
 struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 
 	if (mem_cgroup_disabled())
 		return NULL;
 
+	if (!folio_memcg_charged(folio))
+		return root_mem_cgroup;
+
 	rcu_read_lock();
-	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
-		memcg = root_mem_cgroup;
+retry:
+	memcg = folio_memcg(folio);
+	if (unlikely(!css_tryget(&memcg->css)))
+		goto retry;
 	rcu_read_unlock();
 	return memcg;
 }
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
  2025-12-17  7:27 ` [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Qi Zheng
@ 2025-12-17 21:45   ` Johannes Weiner
  2025-12-18  6:31     ` Qi Zheng
  2025-12-19  2:09     ` Shakeel Butt
  0 siblings, 2 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 21:45 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:32PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in get_mem_cgroup_from_folio().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  mm/memcontrol.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21b5aad34cae7..431b3154c70c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>   */
>  struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>  {
> -	struct mem_cgroup *memcg = folio_memcg(folio);
> +	struct mem_cgroup *memcg;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
>  
> +	if (!folio_memcg_charged(folio))
> +		return root_mem_cgroup;
> +
>  	rcu_read_lock();
> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
> -		memcg = root_mem_cgroup;
> +retry:
> +	memcg = folio_memcg(folio);
> +	if (unlikely(!css_tryget(&memcg->css)))
> +		goto retry;

So starting in patch 27, the tryget can fail if the memcg is offlined,
and the folio's objcg is reparented concurrently. We'll retry until we
find a memcg that isn't dead yet. There's always root_mem_cgroup.

It makes sense, but a loop like this begs the question of how it is
bounded. I pieced it together looking ahead. Since this is a small
diff, it would be nicer to fold it into 27. I didn't see anything in
between depending on it, but correct me if I'm wrong.

Minor style preference:

	/* Comment explaining the above */
	do {
		memcg = folio_memcg(folio);
	} while (!css_tryget(&memcg->css));


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
  2025-12-17 21:45   ` Johannes Weiner
@ 2025-12-18  6:31     ` Qi Zheng
  2025-12-19  2:09     ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-18  6:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/18/25 5:45 AM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:32PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. To ensure safety, it will only be appropriate to
>> hold the rcu read lock or acquire a reference to the memory cgroup
>> returned by folio_memcg(), thereby preventing it from being released.
>>
>> In the current patch, the rcu read lock is employed to safeguard
>> against the release of the memory cgroup in get_mem_cgroup_from_folio().
>>
>> This serves as a preparatory measure for the reparenting of the
>> LRU pages.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>> ---
>>   mm/memcontrol.c | 11 ++++++++---
>>   1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 21b5aad34cae7..431b3154c70c5 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>>    */
>>   struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>>   {
>> -	struct mem_cgroup *memcg = folio_memcg(folio);
>> +	struct mem_cgroup *memcg;
>>   
>>   	if (mem_cgroup_disabled())
>>   		return NULL;
>>   
>> +	if (!folio_memcg_charged(folio))
>> +		return root_mem_cgroup;
>> +
>>   	rcu_read_lock();
>> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
>> -		memcg = root_mem_cgroup;
>> +retry:
>> +	memcg = folio_memcg(folio);
>> +	if (unlikely(!css_tryget(&memcg->css)))
>> +		goto retry;
> 
> So starting in patch 27, the tryget can fail if the memcg is offlined,
> and the folio's objcg is reparented concurrently. We'll retry until we
> find a memcg that isn't dead yet. There's always root_mem_cgroup.
> 
> It makes sense, but a loop like this begs the question of how it is
> bounded. I pieced it together looking ahead. Since this is a small
> diff, it would be nicer to fold it into 27. I didn't see anything in
> between depending on it, but correct me if I'm wrong.

Right, will fold it into #27 in the next version.

> 
> Minor style preference:
> 
> 	/* Comment explaining the above */
> 	do {
> 		memcg = folio_memcg(folio);
> 	} while (!css_tryget(&memcg->css));

OK, will do.

Thanks,
Qi




^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
  2025-12-17 21:45   ` Johannes Weiner
  2025-12-18  6:31     ` Qi Zheng
@ 2025-12-19  2:09     ` Shakeel Butt
  2025-12-19  3:53       ` Johannes Weiner
  1 sibling, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19  2:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Qi Zheng, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 04:45:06PM -0500, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:32PM +0800, Qi Zheng wrote:
> > From: Muchun Song <songmuchun@bytedance.com>
> > 
> > In the near future, a folio will no longer pin its corresponding
> > memory cgroup. To ensure safety, it will only be appropriate to
> > hold the rcu read lock or acquire a reference to the memory cgroup
> > returned by folio_memcg(), thereby preventing it from being released.
> > 
> > In the current patch, the rcu read lock is employed to safeguard
> > against the release of the memory cgroup in get_mem_cgroup_from_folio().
> > 
> > This serves as a preparatory measure for the reparenting of the
> > LRU pages.
> > 
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> >  mm/memcontrol.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 21b5aad34cae7..431b3154c70c5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
> >   */
> >  struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
> >  {
> > -	struct mem_cgroup *memcg = folio_memcg(folio);
> > +	struct mem_cgroup *memcg;
> >  
> >  	if (mem_cgroup_disabled())
> >  		return NULL;
> >  
> > +	if (!folio_memcg_charged(folio))
> > +		return root_mem_cgroup;
> > +
> >  	rcu_read_lock();
> > -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
> > -		memcg = root_mem_cgroup;
> > +retry:
> > +	memcg = folio_memcg(folio);
> > +	if (unlikely(!css_tryget(&memcg->css)))
> > +		goto retry;
> 
> So starting in patch 27, the tryget can fail if the memcg is offlined,

offlined or on its way to free? It is css_tryget() without online.

> and the folio's objcg is reparented concurrently. We'll retry until we
> find a memcg that isn't dead yet. There's always root_mem_cgroup.
> 
> It makes sense, but a loop like this begs the question of how it is
> bounded. I pieced it together looking ahead. Since this is a small
> diff, it would be nicer to fold it into 27. I didn't see anything in
> between depending on it, but correct me if I'm wrong.

I agree to fold it in the patch where it is needed. Currently at this
point in series I don't see how css_tryget() can fail here.

> 
> Minor style preference:
> 
> 	/* Comment explaining the above */
> 	do {
> 		memcg = folio_memcg(folio);
> 	} while (!css_tryget(&memcg->css));


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
  2025-12-19  2:09     ` Shakeel Butt
@ 2025-12-19  3:53       ` Johannes Weiner
  2025-12-19  3:56         ` Johannes Weiner
  0 siblings, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-19  3:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Qi Zheng, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Thu, Dec 18, 2025 at 06:09:50PM -0800, Shakeel Butt wrote:
> On Wed, Dec 17, 2025 at 04:45:06PM -0500, Johannes Weiner wrote:
> > On Wed, Dec 17, 2025 at 03:27:32PM +0800, Qi Zheng wrote:
> > > From: Muchun Song <songmuchun@bytedance.com>
> > > 
> > > In the near future, a folio will no longer pin its corresponding
> > > memory cgroup. To ensure safety, it will only be appropriate to
> > > hold the rcu read lock or acquire a reference to the memory cgroup
> > > returned by folio_memcg(), thereby preventing it from being released.
> > > 
> > > In the current patch, the rcu read lock is employed to safeguard
> > > against the release of the memory cgroup in get_mem_cgroup_from_folio().
> > > 
> > > This serves as a preparatory measure for the reparenting of the
> > > LRU pages.
> > > 
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> > > ---
> > >  mm/memcontrol.c | 11 ++++++++---
> > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 21b5aad34cae7..431b3154c70c5 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
> > >   */
> > >  struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
> > >  {
> > > -	struct mem_cgroup *memcg = folio_memcg(folio);
> > > +	struct mem_cgroup *memcg;
> > >  
> > >  	if (mem_cgroup_disabled())
> > >  		return NULL;
> > >  
> > > +	if (!folio_memcg_charged(folio))
> > > +		return root_mem_cgroup;
> > > +
> > >  	rcu_read_lock();
> > > -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
> > > -		memcg = root_mem_cgroup;
> > > +retry:
> > > +	memcg = folio_memcg(folio);
> > > +	if (unlikely(!css_tryget(&memcg->css)))
> > > +		goto retry;
> > 
> > So starting in patch 27, the tryget can fail if the memcg is offlined,
> 
> offlined or on its way to free? It is css_tryget() without online.

Sorry, I did mean freeing.

But in the new scheme, they will happen much closer together than
before, since charges don't hold a reference to the css anymore.

So when css_killed_work_fn() does

		offline_css(css);
		css_put(css);

on rmdir, that's now the css_put() we expect to drop the refcount to 0
even with folios in circulation.

The race is then:

	get_mem_cgroup_from_folio()	cgroup_rmdir()
	  memcg = folio_memcg(folio);
            folio->objcg->memcg
					  offline_css()
                                            reparent_objcgs()
					      objcg->memcg = objcg->memcg->parent
					  css_put() -> 0
	  !css_tryget(&memcg->css)

and the retry ensures we'll look up objcg->memcg again and find the
live parent and new owner of the folio.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
  2025-12-19  3:53       ` Johannes Weiner
@ 2025-12-19  3:56         ` Johannes Weiner
  0 siblings, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-19  3:56 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Qi Zheng, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Thu, Dec 18, 2025 at 10:53:57PM -0500, Johannes Weiner wrote:
> On Thu, Dec 18, 2025 at 06:09:50PM -0800, Shakeel Butt wrote:
> > On Wed, Dec 17, 2025 at 04:45:06PM -0500, Johannes Weiner wrote:
> > > So starting in patch 27, the tryget can fail if the memcg is offlined,
> > 
> > offlined or on its way to free? It is css_tryget() without online.
> 
> Sorry, I did mean freeing.
> 
> But in the new scheme, they will happen much closer together than
> before, since charges don't hold a reference to the css anymore.
> 
> So when css_killed_work_fn() does
> 
> 		offline_css(css);
> 		css_put(css);
> 
> on rmdir, that's now the css_put() we expect to drop the refcount to 0
> even with folios in circulation.
> 
> The race is then:
> 
> 	get_mem_cgroup_from_folio()	cgroup_rmdir()
> 	  memcg = folio_memcg(folio);
>             folio->objcg->memcg
> 					  offline_css()
>                                             reparent_objcgs()
> 					      objcg->memcg = objcg->memcg->parent
> 					  css_put() -> 0
> 	  !css_tryget(&memcg->css)
> 
> and the retry ensures we'll look up objcg->memcg again and find the
> live parent and new owner of the folio.

But yes, none of this happens until patch 27.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (7 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 21:45   ` Johannes Weiner
  2025-12-19  2:14   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module Qi Zheng
                   ` (20 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the function get_mem_cgroup_from_folio() is
employed to safeguard against the release of the memory cgroup.
This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 fs/buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index fd53b806ab7eb..4552d9cab0dbd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -925,8 +925,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
 	long offset;
 	struct mem_cgroup *memcg, *old_memcg;
 
-	/* The folio lock pins the memcg */
-	memcg = folio_memcg(folio);
+	memcg = get_mem_cgroup_from_folio(folio);
 	old_memcg = set_active_memcg(memcg);
 
 	head = NULL;
@@ -947,6 +946,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
 	}
 out:
 	set_active_memcg(old_memcg);
+	mem_cgroup_put(memcg);
 	return head;
 /*
  * In case anything failed, we just free everything we got.
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers()
  2025-12-17  7:27 ` [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
@ 2025-12-17 21:45   ` Johannes Weiner
  2025-12-19  2:14   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 21:45 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:33PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the function get_mem_cgroup_from_folio() is
> employed to safeguard against the release of the memory cgroup.
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers()
  2025-12-17  7:27 ` [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
  2025-12-17 21:45   ` Johannes Weiner
@ 2025-12-19  2:14   ` Shakeel Butt
  2025-12-26  2:01     ` Chen Ridong
  1 sibling, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19  2:14 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:33PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the function get_mem_cgroup_from_folio() is
> employed to safeguard against the release of the memory cgroup.
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers()
  2025-12-19  2:14   ` Shakeel Butt
@ 2025-12-26  2:01     ` Chen Ridong
  0 siblings, 0 replies; 149+ messages in thread
From: Chen Ridong @ 2025-12-26  2:01 UTC (permalink / raw)
  To: Shakeel Butt, Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, mkoutny, akpm, hamzamahfooz,
	apais, lance.yang, linux-mm, linux-kernel, cgroups, Muchun Song,
	Qi Zheng



On 2025/12/19 10:14, Shakeel Butt wrote:
> On Wed, Dec 17, 2025 at 03:27:33PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. To ensure safety, it will only be appropriate to
>> hold the rcu read lock or acquire a reference to the memory cgroup
>> returned by folio_memcg(), thereby preventing it from being released.
>>
>> In the current patch, the function get_mem_cgroup_from_folio() is
>> employed to safeguard against the release of the memory cgroup.
>> This serves as a preparatory measure for the reparenting of the
>> LRU pages.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

Reviewed-by: Chen Ridong <chenridong@huawei.com>

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (8 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:08   ` Johannes Weiner
  2025-12-19  2:30   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
                   ` (19 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the function get_mem_cgroup_css_from_folio()
and the rcu read lock are employed to safeguard against the release
of the memory cgroup.

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 fs/fs-writeback.c                | 22 +++++++++++-----------
 include/linux/memcontrol.h       |  9 +++++++--
 include/trace/events/writeback.h |  3 +++
 mm/memcontrol.c                  | 14 ++++++++------
 4 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5dd6e89a6d29e..2e57b7e2b4453 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -279,15 +279,13 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio)
 	if (inode_cgwb_enabled(inode)) {
 		struct cgroup_subsys_state *memcg_css;
 
-		if (folio) {
-			memcg_css = mem_cgroup_css_from_folio(folio);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
-		} else {
-			/* must pin memcg_css, see wb_get_create() */
+		/* must pin memcg_css, see wb_get_create() */
+		if (folio)
+			memcg_css = get_mem_cgroup_css_from_folio(folio);
+		else
 			memcg_css = task_get_css(current, memory_cgrp_id);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
-			css_put(memcg_css);
-		}
+		wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+		css_put(memcg_css);
 	}
 
 	if (!wb)
@@ -979,16 +977,16 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
 	if (!wbc->wb || wbc->no_cgroup_owner)
 		return;
 
-	css = mem_cgroup_css_from_folio(folio);
+	css = get_mem_cgroup_css_from_folio(folio);
 	/* dead cgroups shouldn't contribute to inode ownership arbitration */
 	if (!css_is_online(css))
-		return;
+		goto out;
 
 	id = css->id;
 
 	if (id == wbc->wb_id) {
 		wbc->wb_bytes += bytes;
-		return;
+		goto out;
 	}
 
 	if (id == wbc->wb_lcand_id)
@@ -1001,6 +999,8 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
 		wbc->wb_tcand_bytes += bytes;
 	else
 		wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
+out:
+	css_put(css);
 }
 EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 776d9be1f446a..bc526e0d37e0b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -895,7 +895,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 	return match;
 }
 
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio);
 ino_t page_cgroup_ino(struct page *page);
 
 static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
@@ -1549,9 +1549,14 @@ static inline void mem_cgroup_track_foreign_dirty(struct folio *folio,
 	if (mem_cgroup_disabled())
 		return;
 
+	if (!folio_memcg_charged(folio))
+		return;
+
+	rcu_read_lock();
 	memcg = folio_memcg(folio);
-	if (unlikely(memcg && &memcg->css != wb->memcg_css))
+	if (unlikely(&memcg->css != wb->memcg_css))
 		mem_cgroup_track_foreign_dirty_slowpath(folio, wb);
+	rcu_read_unlock();
 }
 
 void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 311a341e6fe42..f5bfe8c1a160a 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -295,7 +295,10 @@ TRACE_EVENT(track_foreign_dirty,
 		__entry->ino		= inode ? inode->i_ino : 0;
 		__entry->memcg_id	= wb->memcg_css->id;
 		__entry->cgroup_ino	= __trace_wb_assign_cgroup(wb);
+
+		rcu_read_lock();
 		__entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);
+		rcu_read_unlock();
 	),
 
 	TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 431b3154c70c5..131f940c03fa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -241,7 +241,7 @@ DEFINE_STATIC_KEY_FALSE(memcg_bpf_enabled_key);
 EXPORT_SYMBOL(memcg_bpf_enabled_key);
 
 /**
- * mem_cgroup_css_from_folio - css of the memcg associated with a folio
+ * get_mem_cgroup_css_from_folio - acquire a css of the memcg associated with a folio
  * @folio: folio of interest
  *
  * If memcg is bound to the default hierarchy, css of the memcg associated
@@ -251,14 +251,16 @@ EXPORT_SYMBOL(memcg_bpf_enabled_key);
  * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
  * is returned.
  */
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio)
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 
-	if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		memcg = root_mem_cgroup;
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return &root_mem_cgroup->css;
 
-	return &memcg->css;
+	memcg = get_mem_cgroup_from_folio(folio);
+
+	return memcg ? &memcg->css : &root_mem_cgroup->css;
 }
 
 /**
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module
  2025-12-17  7:27 ` [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module Qi Zheng
@ 2025-12-17 22:08   ` Johannes Weiner
  2025-12-19  2:30   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:08 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:34PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the function get_mem_cgroup_css_from_folio()
> and the rcu read lock are employed to safeguard against the release
> of the memory cgroup.
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Looks sane to me.

The roo_mem_cgroup handling in get_mem_cgroup_css_from_folio() is
unusual - usually we do NULL for mem_cgroup_disabled(). But that's a
quirk you're inheriting from the existing writeback code.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module
  2025-12-17  7:27 ` [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module Qi Zheng
  2025-12-17 22:08   ` Johannes Weiner
@ 2025-12-19  2:30   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19  2:30 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:34PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the function get_mem_cgroup_css_from_folio()
> and the rcu read lock are employed to safeguard against the release
> of the memory cgroup.
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---

[...]

> @@ -1549,9 +1549,14 @@ static inline void mem_cgroup_track_foreign_dirty(struct folio *folio,
>  	if (mem_cgroup_disabled())
>  		return;
>  
> +	if (!folio_memcg_charged(folio))
> +		return;
> +
> +	rcu_read_lock();
>  	memcg = folio_memcg(folio);
> -	if (unlikely(memcg && &memcg->css != wb->memcg_css))
> +	if (unlikely(&memcg->css != wb->memcg_css))
>  		mem_cgroup_track_foreign_dirty_slowpath(folio, wb);

The slowpath in the name gave me a pause but it seems like it is safe to
be called within rcu lock.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (9 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:11   ` Johannes Weiner
                     ` (2 more replies)
  2025-12-17  7:27 ` [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
                   ` (18 subsequent siblings)
  29 siblings, 3 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in count_memcg_folio_events().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/memcontrol.h | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bc526e0d37e0b..69c4bcfb3c3cd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -974,10 +974,15 @@ void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 static inline void count_memcg_folio_events(struct folio *folio,
 		enum vm_event_item idx, unsigned long nr)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 
-	if (memcg)
-		count_memcg_events(memcg, idx, nr);
+	if (!folio_memcg_charged(folio))
+		return;
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	count_memcg_events(memcg, idx, nr);
+	rcu_read_unlock();
 }
 
 static inline void count_memcg_events_mm(struct mm_struct *mm,
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
  2025-12-17  7:27 ` [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
@ 2025-12-17 22:11   ` Johannes Weiner
  2025-12-19 23:31   ` Shakeel Butt
  2025-12-26  2:12   ` Chen Ridong
  2 siblings, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:11 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:35PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in count_memcg_folio_events().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
  2025-12-17  7:27 ` [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
  2025-12-17 22:11   ` Johannes Weiner
@ 2025-12-19 23:31   ` Shakeel Butt
  2025-12-26  2:12   ` Chen Ridong
  2 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19 23:31 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:35PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in count_memcg_folio_events().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
  2025-12-17  7:27 ` [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
  2025-12-17 22:11   ` Johannes Weiner
  2025-12-19 23:31   ` Shakeel Butt
@ 2025-12-26  2:12   ` Chen Ridong
  2 siblings, 0 replies; 149+ messages in thread
From: Chen Ridong @ 2025-12-26  2:12 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng



On 2025/12/17 15:27, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in count_memcg_folio_events().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  include/linux/memcontrol.h | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bc526e0d37e0b..69c4bcfb3c3cd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -974,10 +974,15 @@ void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  static inline void count_memcg_folio_events(struct folio *folio,
>  		enum vm_event_item idx, unsigned long nr)
>  {
> -	struct mem_cgroup *memcg = folio_memcg(folio);
> +	struct mem_cgroup *memcg;
>  
> -	if (memcg)
> -		count_memcg_events(memcg, idx, nr);
> +	if (!folio_memcg_charged(folio))
> +		return;
> +
> +	rcu_read_lock();
> +	memcg = folio_memcg(folio);
> +	count_memcg_events(memcg, idx, nr);
> +	rcu_read_unlock();
>  }
>  
>  static inline void count_memcg_events_mm(struct mm_struct *mm,

Reviewed-by: Chen Ridong <chenridong@huawei.com>

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (10 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:12   ` Johannes Weiner
  2025-12-19 23:44   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
                   ` (17 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in swap_writeout() and
bio_associate_blkg_from_page().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/page_io.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 3c342db77ce38..ec7720762042c 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -276,10 +276,14 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		goto out_unlock;
 	}
+
+	rcu_read_lock();
 	if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+		rcu_read_unlock();
 		folio_mark_dirty(folio);
 		return AOP_WRITEPAGE_ACTIVATE;
 	}
+	rcu_read_unlock();
 
 	__swap_writepage(folio, swap_plug);
 	return 0;
@@ -307,11 +311,11 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
 	struct cgroup_subsys_state *css;
 	struct mem_cgroup *memcg;
 
-	memcg = folio_memcg(folio);
-	if (!memcg)
+	if (!folio_memcg_charged(folio))
 		return;
 
 	rcu_read_lock();
+	memcg = folio_memcg(folio);
 	css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
 	bio_associate_blkg_from_css(bio, css);
 	rcu_read_unlock();
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module
  2025-12-17  7:27 ` [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
@ 2025-12-17 22:12   ` Johannes Weiner
  2025-12-19 23:44   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:12 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:36PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in swap_writeout() and
> bio_associate_blkg_from_page().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module
  2025-12-17  7:27 ` [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
  2025-12-17 22:12   ` Johannes Weiner
@ 2025-12-19 23:44   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19 23:44 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:36PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in swap_writeout() and
> bio_associate_blkg_from_page().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (11 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:14   ` Johannes Weiner
                     ` (2 more replies)
  2025-12-17  7:27 ` [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
                   ` (16 subsequent siblings)
  29 siblings, 3 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in folio_migrate_mapping().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/migrate.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f606..8bcd588c083ca 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		struct lruvec *old_lruvec, *new_lruvec;
 		struct mem_cgroup *memcg;
 
+		rcu_read_lock();
 		memcg = folio_memcg(folio);
 		old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
 		new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
@@ -698,6 +699,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 			mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
 			__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
 		}
+		rcu_read_unlock();
 	}
 	local_irq_enable();
 
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-17  7:27 ` [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
@ 2025-12-17 22:14   ` Johannes Weiner
  2025-12-18  9:09   ` David Hildenbrand (Red Hat)
  2025-12-19 23:51   ` Shakeel Butt
  2 siblings, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:14 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:37PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in folio_migrate_mapping().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-17  7:27 ` [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
  2025-12-17 22:14   ` Johannes Weiner
@ 2025-12-18  9:09   ` David Hildenbrand (Red Hat)
  2025-12-18  9:36     ` Qi Zheng
  2025-12-18 14:26     ` Johannes Weiner
  2025-12-19 23:51   ` Shakeel Butt
  2 siblings, 2 replies; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  9:09 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

On 12/17/25 08:27, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in folio_migrate_mapping().

We usually avoid talking about "patches".

In __folio_migrate_mapping(), the rcu read lock ...

> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>   mm/migrate.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5169f9717f606..8bcd588c083ca 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>   		struct lruvec *old_lruvec, *new_lruvec;
>   		struct mem_cgroup *memcg;
>   
> +		rcu_read_lock();
>   		memcg = folio_memcg(folio);

In general, LGTM

I wonder, though, whether we should embed that in the ABI.

Like "lock RCU and get the memcg" in one operation, to the "return memcg 
and unock rcu" in another operation.

Something like "start / end" semantics.

--
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18  9:09   ` David Hildenbrand (Red Hat)
@ 2025-12-18  9:36     ` Qi Zheng
  2025-12-18  9:43       ` David Hildenbrand (Red Hat)
  2025-12-18 14:26     ` Johannes Weiner
  1 sibling, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-18  9:36 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng



On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
> On 12/17/25 08:27, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. To ensure safety, it will only be appropriate to
>> hold the rcu read lock or acquire a reference to the memory cgroup
>> returned by folio_memcg(), thereby preventing it from being released.
>>
>> In the current patch, the rcu read lock is employed to safeguard
>> against the release of the memory cgroup in folio_migrate_mapping().
> 
> We usually avoid talking about "patches".

Got it.

> 
> In __folio_migrate_mapping(), the rcu read lock ...

Will do.

> 
>>
>> This serves as a preparatory measure for the reparenting of the
>> LRU pages.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>> ---
>>   mm/migrate.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 5169f9717f606..8bcd588c083ca 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct 
>> address_space *mapping,
>>           struct lruvec *old_lruvec, *new_lruvec;
>>           struct mem_cgroup *memcg;
>> +        rcu_read_lock();
>>           memcg = folio_memcg(folio);
> 
> In general, LGTM
> 
> I wonder, though, whether we should embed that in the ABI.
> 
> Like "lock RCU and get the memcg" in one operation, to the "return memcg 
> and unock rcu" in another operation.

Do you mean adding a helper function like get_mem_cgroup_from_folio()?

> 
> Something like "start / end" semantics.
> 
> -- 
> Cheers
> 
> David



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18  9:36     ` Qi Zheng
@ 2025-12-18  9:43       ` David Hildenbrand (Red Hat)
  2025-12-18 11:40         ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  9:43 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

On 12/18/25 10:36, Qi Zheng wrote:
> 
> 
> On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
>> On 12/17/25 08:27, Qi Zheng wrote:
>>> From: Muchun Song <songmuchun@bytedance.com>
>>>
>>> In the near future, a folio will no longer pin its corresponding
>>> memory cgroup. To ensure safety, it will only be appropriate to
>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>> returned by folio_memcg(), thereby preventing it from being released.
>>>
>>> In the current patch, the rcu read lock is employed to safeguard
>>> against the release of the memory cgroup in folio_migrate_mapping().
>>
>> We usually avoid talking about "patches".
> 
> Got it.
> 
>>
>> In __folio_migrate_mapping(), the rcu read lock ...
> 
> Will do.
> 
>>
>>>
>>> This serves as a preparatory measure for the reparenting of the
>>> LRU pages.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>> ---
>>>    mm/migrate.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 5169f9717f606..8bcd588c083ca 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
>>> address_space *mapping,
>>>            struct lruvec *old_lruvec, *new_lruvec;
>>>            struct mem_cgroup *memcg;
>>> +        rcu_read_lock();
>>>            memcg = folio_memcg(folio);
>>
>> In general, LGTM
>>
>> I wonder, though, whether we should embed that in the ABI.
>>
>> Like "lock RCU and get the memcg" in one operation, to the "return memcg
>> and unock rcu" in another operation.
> 
> Do you mean adding a helper function like get_mem_cgroup_from_folio()?

Right, something like

memcg = folio_memcg_begin(folio);
folio_memcg_end(memcg);

Maybe someone reading along has a better idea. Then you can nicely 
document the requirements in the kerneldocs, and it is clear why the RCU 
lock is used (internally).

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18  9:43       ` David Hildenbrand (Red Hat)
@ 2025-12-18 11:40         ` Qi Zheng
  2025-12-18 11:56           ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-18 11:40 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng



On 12/18/25 5:43 PM, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 10:36, Qi Zheng wrote:
>>
>>
>> On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
>>> On 12/17/25 08:27, Qi Zheng wrote:
>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>
>>>> In the near future, a folio will no longer pin its corresponding
>>>> memory cgroup. To ensure safety, it will only be appropriate to
>>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>>> returned by folio_memcg(), thereby preventing it from being released.
>>>>
>>>> In the current patch, the rcu read lock is employed to safeguard
>>>> against the release of the memory cgroup in folio_migrate_mapping().
>>>
>>> We usually avoid talking about "patches".
>>
>> Got it.
>>
>>>
>>> In __folio_migrate_mapping(), the rcu read lock ...
>>
>> Will do.
>>
>>>
>>>>
>>>> This serves as a preparatory measure for the reparenting of the
>>>> LRU pages.
>>>>
>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>>> ---
>>>>    mm/migrate.c | 2 ++
>>>>    1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 5169f9717f606..8bcd588c083ca 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
>>>> address_space *mapping,
>>>>            struct lruvec *old_lruvec, *new_lruvec;
>>>>            struct mem_cgroup *memcg;
>>>> +        rcu_read_lock();
>>>>            memcg = folio_memcg(folio);
>>>
>>> In general, LGTM
>>>
>>> I wonder, though, whether we should embed that in the ABI.
>>>
>>> Like "lock RCU and get the memcg" in one operation, to the "return memcg
>>> and unock rcu" in another operation.
>>
>> Do you mean adding a helper function like get_mem_cgroup_from_folio()?
> 
> Right, something like
> 
> memcg = folio_memcg_begin(folio);
> folio_memcg_end(memcg);

For some longer or might-sleep critical sections (such as those pointed
by Johannes), perhaps it can be defined like this:

struct mem_cgroup *folio_memcg_begin(struct folio *folio)
{
	return get_mem_cgroup_from_folio(folio);
}

void folio_memcg_end(struct mem_cgroup *memcg)
{
	mem_cgroup_put(memcg);
}

But for some short critical sections, using RCU lock directly might
be the most convention option?

> 
> Maybe someone reading along has a better idea. Then you can nicely 
> document the requirements in the kerneldocs, and it is clear why the RCU 
> lock is used (internally).
> 



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18 11:40         ` Qi Zheng
@ 2025-12-18 11:56           ` David Hildenbrand (Red Hat)
  2025-12-18 13:00             ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 11:56 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

On 12/18/25 12:40, Qi Zheng wrote:
> 
> 
> On 12/18/25 5:43 PM, David Hildenbrand (Red Hat) wrote:
>> On 12/18/25 10:36, Qi Zheng wrote:
>>>
>>>
>>> On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
>>>> On 12/17/25 08:27, Qi Zheng wrote:
>>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>>
>>>>> In the near future, a folio will no longer pin its corresponding
>>>>> memory cgroup. To ensure safety, it will only be appropriate to
>>>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>>>> returned by folio_memcg(), thereby preventing it from being released.
>>>>>
>>>>> In the current patch, the rcu read lock is employed to safeguard
>>>>> against the release of the memory cgroup in folio_migrate_mapping().
>>>>
>>>> We usually avoid talking about "patches".
>>>
>>> Got it.
>>>
>>>>
>>>> In __folio_migrate_mapping(), the rcu read lock ...
>>>
>>> Will do.
>>>
>>>>
>>>>>
>>>>> This serves as a preparatory measure for the reparenting of the
>>>>> LRU pages.
>>>>>
>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>>>> ---
>>>>>     mm/migrate.c | 2 ++
>>>>>     1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index 5169f9717f606..8bcd588c083ca 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
>>>>> address_space *mapping,
>>>>>             struct lruvec *old_lruvec, *new_lruvec;
>>>>>             struct mem_cgroup *memcg;
>>>>> +        rcu_read_lock();
>>>>>             memcg = folio_memcg(folio);
>>>>
>>>> In general, LGTM
>>>>
>>>> I wonder, though, whether we should embed that in the ABI.
>>>>
>>>> Like "lock RCU and get the memcg" in one operation, to the "return memcg
>>>> and unock rcu" in another operation.
>>>
>>> Do you mean adding a helper function like get_mem_cgroup_from_folio()?
>>
>> Right, something like
>>
>> memcg = folio_memcg_begin(folio);
>> folio_memcg_end(memcg);
> 
> For some longer or might-sleep critical sections (such as those pointed
> by Johannes), perhaps it can be defined like this:
> 
> struct mem_cgroup *folio_memcg_begin(struct folio *folio)
> {
> 	return get_mem_cgroup_from_folio(folio);
> }
> 
> void folio_memcg_end(struct mem_cgroup *memcg)
> {
> 	mem_cgroup_put(memcg);
> }
> 
> But for some short critical sections, using RCU lock directly might
> be the most convention option?
> 

Then put the rcu read locking in there instead?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18 11:56           ` David Hildenbrand (Red Hat)
@ 2025-12-18 13:00             ` Qi Zheng
  2025-12-18 13:04               ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-18 13:00 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng



On 12/18/25 7:56 PM, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 12:40, Qi Zheng wrote:
>>
>>
>> On 12/18/25 5:43 PM, David Hildenbrand (Red Hat) wrote:
>>> On 12/18/25 10:36, Qi Zheng wrote:
>>>>
>>>>
>>>> On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/17/25 08:27, Qi Zheng wrote:
>>>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>>>
>>>>>> In the near future, a folio will no longer pin its corresponding
>>>>>> memory cgroup. To ensure safety, it will only be appropriate to
>>>>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>>>>> returned by folio_memcg(), thereby preventing it from being released.
>>>>>>
>>>>>> In the current patch, the rcu read lock is employed to safeguard
>>>>>> against the release of the memory cgroup in folio_migrate_mapping().
>>>>>
>>>>> We usually avoid talking about "patches".
>>>>
>>>> Got it.
>>>>
>>>>>
>>>>> In __folio_migrate_mapping(), the rcu read lock ...
>>>>
>>>> Will do.
>>>>
>>>>>
>>>>>>
>>>>>> This serves as a preparatory measure for the reparenting of the
>>>>>> LRU pages.
>>>>>>
>>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>>>>> ---
>>>>>>     mm/migrate.c | 2 ++
>>>>>>     1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>> index 5169f9717f606..8bcd588c083ca 100644
>>>>>> --- a/mm/migrate.c
>>>>>> +++ b/mm/migrate.c
>>>>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
>>>>>> address_space *mapping,
>>>>>>             struct lruvec *old_lruvec, *new_lruvec;
>>>>>>             struct mem_cgroup *memcg;
>>>>>> +        rcu_read_lock();
>>>>>>             memcg = folio_memcg(folio);
>>>>>
>>>>> In general, LGTM
>>>>>
>>>>> I wonder, though, whether we should embed that in the ABI.
>>>>>
>>>>> Like "lock RCU and get the memcg" in one operation, to the "return 
>>>>> memcg
>>>>> and unock rcu" in another operation.
>>>>
>>>> Do you mean adding a helper function like get_mem_cgroup_from_folio()?
>>>
>>> Right, something like
>>>
>>> memcg = folio_memcg_begin(folio);
>>> folio_memcg_end(memcg);
>>
>> For some longer or might-sleep critical sections (such as those pointed
>> by Johannes), perhaps it can be defined like this:
>>
>> struct mem_cgroup *folio_memcg_begin(struct folio *folio)
>> {
>>     return get_mem_cgroup_from_folio(folio);
>> }
>>
>> void folio_memcg_end(struct mem_cgroup *memcg)
>> {
>>     mem_cgroup_put(memcg);
>> }
>>
>> But for some short critical sections, using RCU lock directly might
>> be the most convention option?
>>
> 
> Then put the rcu read locking in there instead?

So for some longer or might-sleep critical sections, using:

memcg = folio_memcg_begin(folio);
do_some_thing(memcg);
folio_memcg_end(folio);

for some short critical sections, using:

rcu_read_lock();
memcg = folio_memcg(folio);
do_some_thing(memcg);
rcu_read_unlock();

Right?


> 



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18 13:00             ` Qi Zheng
@ 2025-12-18 13:04               ` David Hildenbrand (Red Hat)
  2025-12-18 13:16                 ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:04 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

On 12/18/25 14:00, Qi Zheng wrote:
> 
> 
> On 12/18/25 7:56 PM, David Hildenbrand (Red Hat) wrote:
>> On 12/18/25 12:40, Qi Zheng wrote:
>>>
>>>
>>> On 12/18/25 5:43 PM, David Hildenbrand (Red Hat) wrote:
>>>> On 12/18/25 10:36, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
>>>>>> On 12/17/25 08:27, Qi Zheng wrote:
>>>>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>>>>
>>>>>>> In the near future, a folio will no longer pin its corresponding
>>>>>>> memory cgroup. To ensure safety, it will only be appropriate to
>>>>>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>>>>>> returned by folio_memcg(), thereby preventing it from being released.
>>>>>>>
>>>>>>> In the current patch, the rcu read lock is employed to safeguard
>>>>>>> against the release of the memory cgroup in folio_migrate_mapping().
>>>>>>
>>>>>> We usually avoid talking about "patches".
>>>>>
>>>>> Got it.
>>>>>
>>>>>>
>>>>>> In __folio_migrate_mapping(), the rcu read lock ...
>>>>>
>>>>> Will do.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> This serves as a preparatory measure for the reparenting of the
>>>>>>> LRU pages.
>>>>>>>
>>>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>>>>>> ---
>>>>>>>      mm/migrate.c | 2 ++
>>>>>>>      1 file changed, 2 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>> index 5169f9717f606..8bcd588c083ca 100644
>>>>>>> --- a/mm/migrate.c
>>>>>>> +++ b/mm/migrate.c
>>>>>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
>>>>>>> address_space *mapping,
>>>>>>>              struct lruvec *old_lruvec, *new_lruvec;
>>>>>>>              struct mem_cgroup *memcg;
>>>>>>> +        rcu_read_lock();
>>>>>>>              memcg = folio_memcg(folio);
>>>>>>
>>>>>> In general, LGTM
>>>>>>
>>>>>> I wonder, though, whether we should embed that in the ABI.
>>>>>>
>>>>>> Like "lock RCU and get the memcg" in one operation, to the "return
>>>>>> memcg
>>>>>> and unock rcu" in another operation.
>>>>>
>>>>> Do you mean adding a helper function like get_mem_cgroup_from_folio()?
>>>>
>>>> Right, something like
>>>>
>>>> memcg = folio_memcg_begin(folio);
>>>> folio_memcg_end(memcg);
>>>
>>> For some longer or might-sleep critical sections (such as those pointed
>>> by Johannes), perhaps it can be defined like this:
>>>
>>> struct mem_cgroup *folio_memcg_begin(struct folio *folio)
>>> {
>>>      return get_mem_cgroup_from_folio(folio);
>>> }
>>>
>>> void folio_memcg_end(struct mem_cgroup *memcg)
>>> {
>>>      mem_cgroup_put(memcg);
>>> }
>>>
>>> But for some short critical sections, using RCU lock directly might
>>> be the most convention option?
>>>
>>
>> Then put the rcu read locking in there instead?
> 
> So for some longer or might-sleep critical sections, using:
> 
> memcg = folio_memcg_begin(folio);
> do_some_thing(memcg);
> folio_memcg_end(folio);
> 
> for some short critical sections, using:
> 
> rcu_read_lock();
> memcg = folio_memcg(folio);
> do_some_thing(memcg);
> rcu_read_unlock();
> 
> Right?

What I mean is:

memcg = folio_memcg_begin(folio);
do_some_thing(memcg);
folio_memcg_end(folio);

but do the rcu_read_lock() in folio_memcg_begin() and the 
rcu_read_unlock() in folio_memcg_end().

You could also have (expensive) variants, as you describe, that mess 
with getting/dopping the memcg.

But my points was about hiding the rcu details in a set of helpers.

Sorry if what I say is confusing.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18 13:04               ` David Hildenbrand (Red Hat)
@ 2025-12-18 13:16                 ` Qi Zheng
  2025-12-19  4:12                   ` Harry Yoo
  0 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-18 13:16 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat),
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng



On 12/18/25 9:04 PM, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 14:00, Qi Zheng wrote:
>>
>>
>> On 12/18/25 7:56 PM, David Hildenbrand (Red Hat) wrote:
>>> On 12/18/25 12:40, Qi Zheng wrote:
>>>>
>>>>
>>>> On 12/18/25 5:43 PM, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/18/25 10:36, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
>>>>>>> On 12/17/25 08:27, Qi Zheng wrote:
>>>>>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>>>>>
>>>>>>>> In the near future, a folio will no longer pin its corresponding
>>>>>>>> memory cgroup. To ensure safety, it will only be appropriate to
>>>>>>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>>>>>>> returned by folio_memcg(), thereby preventing it from being 
>>>>>>>> released.
>>>>>>>>
>>>>>>>> In the current patch, the rcu read lock is employed to safeguard
>>>>>>>> against the release of the memory cgroup in 
>>>>>>>> folio_migrate_mapping().
>>>>>>>
>>>>>>> We usually avoid talking about "patches".
>>>>>>
>>>>>> Got it.
>>>>>>
>>>>>>>
>>>>>>> In __folio_migrate_mapping(), the rcu read lock ...
>>>>>>
>>>>>> Will do.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This serves as a preparatory measure for the reparenting of the
>>>>>>>> LRU pages.
>>>>>>>>
>>>>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>>>>>>> ---
>>>>>>>>      mm/migrate.c | 2 ++
>>>>>>>>      1 file changed, 2 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>>> index 5169f9717f606..8bcd588c083ca 100644
>>>>>>>> --- a/mm/migrate.c
>>>>>>>> +++ b/mm/migrate.c
>>>>>>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
>>>>>>>> address_space *mapping,
>>>>>>>>              struct lruvec *old_lruvec, *new_lruvec;
>>>>>>>>              struct mem_cgroup *memcg;
>>>>>>>> +        rcu_read_lock();
>>>>>>>>              memcg = folio_memcg(folio);
>>>>>>>
>>>>>>> In general, LGTM
>>>>>>>
>>>>>>> I wonder, though, whether we should embed that in the ABI.
>>>>>>>
>>>>>>> Like "lock RCU and get the memcg" in one operation, to the "return
>>>>>>> memcg
>>>>>>> and unock rcu" in another operation.
>>>>>>
>>>>>> Do you mean adding a helper function like 
>>>>>> get_mem_cgroup_from_folio()?
>>>>>
>>>>> Right, something like
>>>>>
>>>>> memcg = folio_memcg_begin(folio);
>>>>> folio_memcg_end(memcg);
>>>>
>>>> For some longer or might-sleep critical sections (such as those pointed
>>>> by Johannes), perhaps it can be defined like this:
>>>>
>>>> struct mem_cgroup *folio_memcg_begin(struct folio *folio)
>>>> {
>>>>      return get_mem_cgroup_from_folio(folio);
>>>> }
>>>>
>>>> void folio_memcg_end(struct mem_cgroup *memcg)
>>>> {
>>>>      mem_cgroup_put(memcg);
>>>> }
>>>>
>>>> But for some short critical sections, using RCU lock directly might
>>>> be the most convention option?
>>>>
>>>
>>> Then put the rcu read locking in there instead?
>>
>> So for some longer or might-sleep critical sections, using:
>>
>> memcg = folio_memcg_begin(folio);
>> do_some_thing(memcg);
>> folio_memcg_end(folio);
>>
>> for some short critical sections, using:
>>
>> rcu_read_lock();
>> memcg = folio_memcg(folio);
>> do_some_thing(memcg);
>> rcu_read_unlock();
>>
>> Right?
> 
> What I mean is:
> 
> memcg = folio_memcg_begin(folio);
> do_some_thing(memcg);
> folio_memcg_end(folio);
> 
> but do the rcu_read_lock() in folio_memcg_begin() and the 
> rcu_read_unlock() in folio_memcg_end().
> 
> You could also have (expensive) variants, as you describe, that mess 
> with getting/dopping the memcg.

Or simple use folio_memcg_begin(memcg)/folio_memcg_end(memcg) in all cases.

Or add a parameter to them:

struct mem_cgroup *folio_memcg_begin(struct folio *folio, bool get_refcnt)
{
	struct mem_cgroup *memcg;

	if (get_refcnt)
		memcg = get_mem_cgroup_from_folio(folio);
	else {
		rcu_read_lock();
		memcg = folio_memcg(folio);
	}

	return memcg;
}

void folio_memcg_end(struct mem_cgroup *memcg, bool get_refcnt)
{
	if (get_refcnt)
		mem_cgroup_put(memcg);
	else
		rcu_read_unlock();
}



> 
> But my points was about hiding the rcu details in a set of helpers.
> 
> Sorry if what I say is confusing.
> 



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18 13:16                 ` Qi Zheng
@ 2025-12-19  4:12                   ` Harry Yoo
  2025-12-19  6:18                     ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 149+ messages in thread
From: Harry Yoo @ 2025-12-19  4:12 UTC (permalink / raw)
  To: Qi Zheng
  Cc: David Hildenbrand (Red Hat),
	hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Thu, Dec 18, 2025 at 09:16:11PM +0800, Qi Zheng wrote:
> 
> 
> On 12/18/25 9:04 PM, David Hildenbrand (Red Hat) wrote:
> > On 12/18/25 14:00, Qi Zheng wrote:
> > > 
> > > 
> > > On 12/18/25 7:56 PM, David Hildenbrand (Red Hat) wrote:
> > > > On 12/18/25 12:40, Qi Zheng wrote:
> > > > > 
> > > > > 
> > > > > On 12/18/25 5:43 PM, David Hildenbrand (Red Hat) wrote:
> > > > > > On 12/18/25 10:36, Qi Zheng wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
> > > > > > > > On 12/17/25 08:27, Qi Zheng wrote:
> > > > > > > > > From: Muchun Song <songmuchun@bytedance.com>
> > > > > > > > > 
> > > > > > > > > In the near future, a folio will no longer pin its corresponding
> > > > > > > > > memory cgroup. To ensure safety, it will only be appropriate to
> > > > > > > > > hold the rcu read lock or acquire a reference to the memory cgroup
> > > > > > > > > returned by folio_memcg(), thereby
> > > > > > > > > preventing it from being released.
> > > > > > > > > 
> > > > > > > > > In the current patch, the rcu read lock is employed to safeguard
> > > > > > > > > against the release of the memory cgroup in
> > > > > > > > > folio_migrate_mapping().
> > > > > > > > 
> > > > > > > > We usually avoid talking about "patches".
> > > > > > > 
> > > > > > > Got it.
> > > > > > > 
> > > > > > > > 
> > > > > > > > In __folio_migrate_mapping(), the rcu read lock ...
> > > > > > > 
> > > > > > > Will do.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > This serves as a preparatory measure for the reparenting of the
> > > > > > > > > LRU pages.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > > > > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > > > > > > > Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> > > > > > > > > ---
> > > > > > > > >      mm/migrate.c | 2 ++
> > > > > > > > >      1 file changed, 2 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > > > > > index 5169f9717f606..8bcd588c083ca 100644
> > > > > > > > > --- a/mm/migrate.c
> > > > > > > > > +++ b/mm/migrate.c
> > > > > > > > > @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
> > > > > > > > > address_space *mapping,
> > > > > > > > >              struct lruvec *old_lruvec, *new_lruvec;
> > > > > > > > >              struct mem_cgroup *memcg;
> > > > > > > > > +        rcu_read_lock();
> > > > > > > > >              memcg = folio_memcg(folio);
> > > > > > > > 
> > > > > > > > In general, LGTM
> > > > > > > > 
> > > > > > > > I wonder, though, whether we should embed that in the ABI.
> > > > > > > > 
> > > > > > > > Like "lock RCU and get the memcg" in one operation, to the "return
> > > > > > > > memcg
> > > > > > > > and unock rcu" in another operation.
> > > > > > > 
> > > > > > > Do you mean adding a helper function like
> > > > > > > get_mem_cgroup_from_folio()?
> > > > > > 
> > > > > > Right, something like
> > > > > > 
> > > > > > memcg = folio_memcg_begin(folio);
> > > > > > folio_memcg_end(memcg);
> > > > > 
> > > > > For some longer or might-sleep critical sections (such as those pointed
> > > > > by Johannes), perhaps it can be defined like this:
> > > > > 
> > > > > struct mem_cgroup *folio_memcg_begin(struct folio *folio)
> > > > > {
> > > > >      return get_mem_cgroup_from_folio(folio);
> > > > > }
> > > > > 
> > > > > void folio_memcg_end(struct mem_cgroup *memcg)
> > > > > {
> > > > >      mem_cgroup_put(memcg);
> > > > > }
> > > > > 
> > > > > But for some short critical sections, using RCU lock directly might
> > > > > be the most convention option?
> > > > > 
> > > > 
> > > > Then put the rcu read locking in there instead?
> > > 
> > > So for some longer or might-sleep critical sections, using:
> > > 
> > > memcg = folio_memcg_begin(folio);
> > > do_some_thing(memcg);
> > > folio_memcg_end(folio);
> > > 
> > > for some short critical sections, using:
> > > 
> > > rcu_read_lock();
> > > memcg = folio_memcg(folio);
> > > do_some_thing(memcg);
> > > rcu_read_unlock();
> > > 
> > > Right?
> > 
> > What I mean is:
> > 
> > memcg = folio_memcg_begin(folio);
> > do_some_thing(memcg);
> > folio_memcg_end(folio);
> > 
> > but do the rcu_read_lock() in folio_memcg_begin() and the
> > rcu_read_unlock() in folio_memcg_end().
> > 
> > You could also have (expensive) variants, as you describe, that mess
> > with getting/dopping the memcg.
> 
> Or simple use folio_memcg_begin(memcg)/folio_memcg_end(memcg) in all cases.
> 
> Or add a parameter to them:
> 
> struct mem_cgroup *folio_memcg_begin(struct folio *folio, bool get_refcnt)
> {
> 	struct mem_cgroup *memcg;
> 
> 	if (get_refcnt)
> 		memcg = get_mem_cgroup_from_folio(folio);
> 	else {
> 		rcu_read_lock();
> 		memcg = folio_memcg(folio);
> 	}
> 
> 	return memcg;
> }
> 
> void folio_memcg_end(struct mem_cgroup *memcg, bool get_refcnt)
> {
> 	if (get_refcnt)
> 		mem_cgroup_put(memcg);
> 	else
> 		rcu_read_unlock();
> }

I would like to vote for open coding as we do now, because I think hiding
the RCU lock / refcount acquisition into a less obvious API doesn't make
it more readable.

No strong opinion on introducing new helpers, but at least it should be
obvious that each variant has different restrictions.
 
> > But my points was about hiding the rcu details in a set of helpers.
> > 
> > Sorry if what I say is confusing.

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-19  4:12                   ` Harry Yoo
@ 2025-12-19  6:18                     ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19  6:18 UTC (permalink / raw)
  To: Harry Yoo, Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On 12/19/25 05:12, Harry Yoo wrote:
> On Thu, Dec 18, 2025 at 09:16:11PM +0800, Qi Zheng wrote:
>>
>>
>> On 12/18/25 9:04 PM, David Hildenbrand (Red Hat) wrote:
>>> On 12/18/25 14:00, Qi Zheng wrote:
>>>>
>>>>
>>>> On 12/18/25 7:56 PM, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/18/25 12:40, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 12/18/25 5:43 PM, David Hildenbrand (Red Hat) wrote:
>>>>>>> On 12/18/25 10:36, Qi Zheng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/18/25 5:09 PM, David Hildenbrand (Red Hat) wrote:
>>>>>>>>> On 12/17/25 08:27, Qi Zheng wrote:
>>>>>>>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>>>>>>>
>>>>>>>>>> In the near future, a folio will no longer pin its corresponding
>>>>>>>>>> memory cgroup. To ensure safety, it will only be appropriate to
>>>>>>>>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>>>>>>>>> returned by folio_memcg(), thereby
>>>>>>>>>> preventing it from being released.
>>>>>>>>>>
>>>>>>>>>> In the current patch, the rcu read lock is employed to safeguard
>>>>>>>>>> against the release of the memory cgroup in
>>>>>>>>>> folio_migrate_mapping().
>>>>>>>>>
>>>>>>>>> We usually avoid talking about "patches".
>>>>>>>>
>>>>>>>> Got it.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> In __folio_migrate_mapping(), the rcu read lock ...
>>>>>>>>
>>>>>>>> Will do.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This serves as a preparatory measure for the reparenting of the
>>>>>>>>>> LRU pages.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>>>>>>>>> ---
>>>>>>>>>>       mm/migrate.c | 2 ++
>>>>>>>>>>       1 file changed, 2 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>>>>>> index 5169f9717f606..8bcd588c083ca 100644
>>>>>>>>>> --- a/mm/migrate.c
>>>>>>>>>> +++ b/mm/migrate.c
>>>>>>>>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct
>>>>>>>>>> address_space *mapping,
>>>>>>>>>>               struct lruvec *old_lruvec, *new_lruvec;
>>>>>>>>>>               struct mem_cgroup *memcg;
>>>>>>>>>> +        rcu_read_lock();
>>>>>>>>>>               memcg = folio_memcg(folio);
>>>>>>>>>
>>>>>>>>> In general, LGTM
>>>>>>>>>
>>>>>>>>> I wonder, though, whether we should embed that in the ABI.
>>>>>>>>>
>>>>>>>>> Like "lock RCU and get the memcg" in one operation, to the "return
>>>>>>>>> memcg
>>>>>>>>> and unock rcu" in another operation.
>>>>>>>>
>>>>>>>> Do you mean adding a helper function like
>>>>>>>> get_mem_cgroup_from_folio()?
>>>>>>>
>>>>>>> Right, something like
>>>>>>>
>>>>>>> memcg = folio_memcg_begin(folio);
>>>>>>> folio_memcg_end(memcg);
>>>>>>
>>>>>> For some longer or might-sleep critical sections (such as those pointed
>>>>>> by Johannes), perhaps it can be defined like this:
>>>>>>
>>>>>> struct mem_cgroup *folio_memcg_begin(struct folio *folio)
>>>>>> {
>>>>>>       return get_mem_cgroup_from_folio(folio);
>>>>>> }
>>>>>>
>>>>>> void folio_memcg_end(struct mem_cgroup *memcg)
>>>>>> {
>>>>>>       mem_cgroup_put(memcg);
>>>>>> }
>>>>>>
>>>>>> But for some short critical sections, using RCU lock directly might
>>>>>> be the most convention option?
>>>>>>
>>>>>
>>>>> Then put the rcu read locking in there instead?
>>>>
>>>> So for some longer or might-sleep critical sections, using:
>>>>
>>>> memcg = folio_memcg_begin(folio);
>>>> do_some_thing(memcg);
>>>> folio_memcg_end(folio);
>>>>
>>>> for some short critical sections, using:
>>>>
>>>> rcu_read_lock();
>>>> memcg = folio_memcg(folio);
>>>> do_some_thing(memcg);
>>>> rcu_read_unlock();
>>>>
>>>> Right?
>>>
>>> What I mean is:
>>>
>>> memcg = folio_memcg_begin(folio);
>>> do_some_thing(memcg);
>>> folio_memcg_end(folio);
>>>
>>> but do the rcu_read_lock() in folio_memcg_begin() and the
>>> rcu_read_unlock() in folio_memcg_end().
>>>
>>> You could also have (expensive) variants, as you describe, that mess
>>> with getting/dopping the memcg.
>>
>> Or simple use folio_memcg_begin(memcg)/folio_memcg_end(memcg) in all cases.
>>
>> Or add a parameter to them:
>>
>> struct mem_cgroup *folio_memcg_begin(struct folio *folio, bool get_refcnt)
>> {
>> 	struct mem_cgroup *memcg;
>>
>> 	if (get_refcnt)
>> 		memcg = get_mem_cgroup_from_folio(folio);
>> 	else {
>> 		rcu_read_lock();
>> 		memcg = folio_memcg(folio);
>> 	}
>>
>> 	return memcg;
>> }
>>
>> void folio_memcg_end(struct mem_cgroup *memcg, bool get_refcnt)
>> {
>> 	if (get_refcnt)
>> 		mem_cgroup_put(memcg);
>> 	else
>> 		rcu_read_unlock();
>> }
> 
> I would like to vote for open coding as we do now, because I think hiding
> the RCU lock / refcount acquisition into a less obvious API doesn't make
> it more readable.

I wouldn't do it in an API as proposed above.

I prefer to not have magical RCU locking in every caller. Easy to get wrong.

See how we did something similar in the pte_*map*() vs. pte_unmap() API, 
without requiring all callers to open-code this.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18  9:09   ` David Hildenbrand (Red Hat)
  2025-12-18  9:36     ` Qi Zheng
@ 2025-12-18 14:26     ` Johannes Weiner
  2025-12-22  3:42       ` Qi Zheng
  2025-12-30 20:07       ` David Hildenbrand (Red Hat)
  1 sibling, 2 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-18 14:26 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Qi Zheng, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Muchun Song, Qi Zheng

On Thu, Dec 18, 2025 at 10:09:21AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/17/25 08:27, Qi Zheng wrote:
> > From: Muchun Song <songmuchun@bytedance.com>
> > 
> > In the near future, a folio will no longer pin its corresponding
> > memory cgroup. To ensure safety, it will only be appropriate to
> > hold the rcu read lock or acquire a reference to the memory cgroup
> > returned by folio_memcg(), thereby preventing it from being released.
> > 
> > In the current patch, the rcu read lock is employed to safeguard
> > against the release of the memory cgroup in folio_migrate_mapping().
> 
> We usually avoid talking about "patches".
> 
> In __folio_migrate_mapping(), the rcu read lock ...
> 
> > 
> > This serves as a preparatory measure for the reparenting of the
> > LRU pages.
> > 
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> >   mm/migrate.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5169f9717f606..8bcd588c083ca 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >   		struct lruvec *old_lruvec, *new_lruvec;
> >   		struct mem_cgroup *memcg;
> >   
> > +		rcu_read_lock();
> >   		memcg = folio_memcg(folio);
> 
> In general, LGTM
> 
> I wonder, though, whether we should embed that in the ABI.
> 
> Like "lock RCU and get the memcg" in one operation, to the "return memcg 
> and unock rcu" in another operation.
> 
> Something like "start / end" semantics.

The advantage of open-coding this particular one is that 1)
rcu_read_lock() is something the caller could already be
holding/using, implicitly or explicitly; and 2) it's immediately
obvious that this is an atomic section (which was already useful in
spotting a bug in the workingset patch of this series).

"start/end" terminology hides this. "lock" we can't use because it
would suggest binding stability. The only other idea I'd have would be
to spell it all out:

memcg = folio_memcg_rcu_read_lock(folio);
stuff(memcg);
otherstuff();
rcu_read_unlock();

But that might not be worth it. Maybe somebody can think of a better
name. But I'd be hesitant to trade off the obviousness of what's going
on given how simple the locking + access scheme is.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18 14:26     ` Johannes Weiner
@ 2025-12-22  3:42       ` Qi Zheng
  2025-12-30 20:07       ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-22  3:42 UTC (permalink / raw)
  To: Johannes Weiner, David Hildenbrand (Red Hat)
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/18/25 10:26 PM, Johannes Weiner wrote:
> On Thu, Dec 18, 2025 at 10:09:21AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/17/25 08:27, Qi Zheng wrote:
>>> From: Muchun Song <songmuchun@bytedance.com>
>>>
>>> In the near future, a folio will no longer pin its corresponding
>>> memory cgroup. To ensure safety, it will only be appropriate to
>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>> returned by folio_memcg(), thereby preventing it from being released.
>>>
>>> In the current patch, the rcu read lock is employed to safeguard
>>> against the release of the memory cgroup in folio_migrate_mapping().
>>
>> We usually avoid talking about "patches".
>>
>> In __folio_migrate_mapping(), the rcu read lock ...
>>
>>>
>>> This serves as a preparatory measure for the reparenting of the
>>> LRU pages.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>> ---
>>>    mm/migrate.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 5169f9717f606..8bcd588c083ca 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>>>    		struct lruvec *old_lruvec, *new_lruvec;
>>>    		struct mem_cgroup *memcg;
>>>    
>>> +		rcu_read_lock();
>>>    		memcg = folio_memcg(folio);
>>
>> In general, LGTM
>>
>> I wonder, though, whether we should embed that in the ABI.
>>
>> Like "lock RCU and get the memcg" in one operation, to the "return memcg
>> and unock rcu" in another operation.
>>
>> Something like "start / end" semantics.
> 
> The advantage of open-coding this particular one is that 1)
> rcu_read_lock() is something the caller could already be
> holding/using, implicitly or explicitly; and 2) it's immediately
> obvious that this is an atomic section (which was already useful in
> spotting a bug in the workingset patch of this series).
> 
> "start/end" terminology hides this. "lock" we can't use because it
> would suggest binding stability. The only other idea I'd have would be
> to spell it all out:
> 
> memcg = folio_memcg_rcu_read_lock(folio);
> stuff(memcg);
> otherstuff();
> rcu_read_unlock();
> 
> But that might not be worth it. Maybe somebody can think of a better
> name. But I'd be hesitant to trade off the obviousness of what's going
> on given how simple the locking + access scheme is.

Agree. I also prefer to keep the open-coding method for now, and if a
better helper is available later, a cleanup patch can be added to
accomplish this.

Thanks,
Qi





^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-18 14:26     ` Johannes Weiner
  2025-12-22  3:42       ` Qi Zheng
@ 2025-12-30 20:07       ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-30 20:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Qi Zheng, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Muchun Song, Qi Zheng

On 12/18/25 15:26, Johannes Weiner wrote:
> On Thu, Dec 18, 2025 at 10:09:21AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/17/25 08:27, Qi Zheng wrote:
>>> From: Muchun Song <songmuchun@bytedance.com>
>>>
>>> In the near future, a folio will no longer pin its corresponding
>>> memory cgroup. To ensure safety, it will only be appropriate to
>>> hold the rcu read lock or acquire a reference to the memory cgroup
>>> returned by folio_memcg(), thereby preventing it from being released.
>>>
>>> In the current patch, the rcu read lock is employed to safeguard
>>> against the release of the memory cgroup in folio_migrate_mapping().
>>
>> We usually avoid talking about "patches".
>>
>> In __folio_migrate_mapping(), the rcu read lock ...
>>
>>>
>>> This serves as a preparatory measure for the reparenting of the
>>> LRU pages.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>> ---
>>>    mm/migrate.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 5169f9717f606..8bcd588c083ca 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -671,6 +671,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>>>    		struct lruvec *old_lruvec, *new_lruvec;
>>>    		struct mem_cgroup *memcg;
>>>    
>>> +		rcu_read_lock();
>>>    		memcg = folio_memcg(folio);
>>
>> In general, LGTM
>>
>> I wonder, though, whether we should embed that in the ABI.
>>
>> Like "lock RCU and get the memcg" in one operation, to the "return memcg
>> and unock rcu" in another operation.
>>
>> Something like "start / end" semantics.
> 
> The advantage of open-coding this particular one is that 1)
> rcu_read_lock() is something the caller could already be
> holding/using, implicitly or explicitly; and 2) it's immediately
> obvious that this is an atomic section (which was already useful in
> spotting a bug in the workingset patch of this series).
> 
> "start/end" terminology hides this. "lock" we can't use because it
> would suggest binding stability. The only other idea I'd have would be
> to spell it all out:
> 
> memcg = folio_memcg_rcu_read_lock(folio);
> stuff(memcg);
> otherstuff();
> rcu_read_unlock();
> 
> But that might not be worth it. Maybe somebody can think of a better
> name. But I'd be hesitant to trade off the obviousness of what's going
> on given how simple the locking + access scheme is.

I rather disagree that open-coding it is the better approach here, in 
particular when it comes to new users or code changes in the future -- 
just way, way easier to mess up.

Well, unless we have some other way to add safety-checks that the right 
locks are held when the memcg is getting used (e.g., passed into other 
functions). Maybe that is done already to minimize the chance for UAF etc.

I agree that naming is tricky, and that it needs some more thought, so 
I'm fine with keeping it as is.

-- 
Cheers

David



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  2025-12-17  7:27 ` [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
  2025-12-17 22:14   ` Johannes Weiner
  2025-12-18  9:09   ` David Hildenbrand (Red Hat)
@ 2025-12-19 23:51   ` Shakeel Butt
  2 siblings, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-19 23:51 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:37PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in folio_migrate_mapping().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (12 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:18   ` Johannes Weiner
  2025-12-17  7:27 ` [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
                   ` (15 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mglru.

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 354b19f7365d4..814498a2c1bd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3444,8 +3444,10 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 	if (folio_nid(folio) != pgdat->node_id)
 		return NULL;
 
+	rcu_read_lock();
 	if (folio_memcg(folio) != memcg)
-		return NULL;
+		folio = NULL;
+	rcu_read_unlock();
 
 	return folio;
 }
@@ -4202,12 +4204,12 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	unsigned long addr = pvmw->address;
 	struct vm_area_struct *vma = pvmw->vma;
 	struct folio *folio = pfn_folio(pvmw->pfn);
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat = folio_pgdat(folio);
-	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
-	DEFINE_MAX_SEQ(lruvec);
-	int gen = lru_gen_from_seq(max_seq);
+	struct lruvec *lruvec;
+	struct lru_gen_mm_state *mm_state;
+	unsigned long max_seq;
+	int gen;
 
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
@@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		}
 	}
 
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+	gen = lru_gen_from_seq(max_seq);
+	mm_state = get_mm_state(lruvec);
+
 	arch_enter_lazy_mmu_mode();
 
 	pte -= (addr - start) / PAGE_SIZE;
@@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	if (mm_state && suitable_to_scan(i, young))
 		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
 
+	rcu_read_unlock();
+
 	return true;
 }
 
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru
  2025-12-17  7:27 ` [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
@ 2025-12-17 22:18   ` Johannes Weiner
  2025-12-18  6:50     ` Qi Zheng
  2025-12-20  0:58     ` Shakeel Butt
  0 siblings, 2 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:18 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:38PM +0800, Qi Zheng wrote:
> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>  		}
>  	}
>  
> +	rcu_read_lock();
> +	memcg = folio_memcg(folio);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
> +	gen = lru_gen_from_seq(max_seq);
> +	mm_state = get_mm_state(lruvec);
> +
>  	arch_enter_lazy_mmu_mode();
>  
>  	pte -= (addr - start) / PAGE_SIZE;
> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>  	if (mm_state && suitable_to_scan(i, young))
>  		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>  
> +	rcu_read_unlock();
> +
>  	return true;

This seems a bit long to be holding the rcu lock. Maybe do a get and a
put instead?


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru
  2025-12-17 22:18   ` Johannes Weiner
@ 2025-12-18  6:50     ` Qi Zheng
  2025-12-20  0:58     ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-18  6:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/18/25 6:18 AM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:38PM +0800, Qi Zheng wrote:
>> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>   		}
>>   	}
>>   
>> +	rcu_read_lock();
>> +	memcg = folio_memcg(folio);
>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
>> +	gen = lru_gen_from_seq(max_seq);
>> +	mm_state = get_mm_state(lruvec);
>> +
>>   	arch_enter_lazy_mmu_mode();
>>   
>>   	pte -= (addr - start) / PAGE_SIZE;
>> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>   	if (mm_state && suitable_to_scan(i, young))
>>   		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>>   
>> +	rcu_read_unlock();
>> +
>>   	return true;
> 
> This seems a bit long to be holding the rcu lock. 

Indeed.

> Maybe do a get and a
> put instead?

OK, will use get_mem_cgroup_from_folio(folio) to do this.

This way, #8 doesn't need to be folded into #27.





^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru
  2025-12-17 22:18   ` Johannes Weiner
  2025-12-18  6:50     ` Qi Zheng
@ 2025-12-20  0:58     ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  0:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Qi Zheng, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 05:18:40PM -0500, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:38PM +0800, Qi Zheng wrote:
> > @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> >  		}
> >  	}
> >  
> > +	rcu_read_lock();
> > +	memcg = folio_memcg(folio);
> > +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
> > +	gen = lru_gen_from_seq(max_seq);
> > +	mm_state = get_mm_state(lruvec);
> > +
> >  	arch_enter_lazy_mmu_mode();
> >  
> >  	pte -= (addr - start) / PAGE_SIZE;
> > @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> >  	if (mm_state && suitable_to_scan(i, young))
> >  		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
> >  
> > +	rcu_read_unlock();
> > +
> >  	return true;
> 
> This seems a bit long to be holding the rcu lock. Maybe do a get and a
> put instead?

This function is called under ptl lock, so at least for non-RT kernels,
the preemption is disabled across the function. Anyways no strong
opinion either way.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (13 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:21   ` Johannes Weiner
  2025-12-20  1:05   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
                   ` (14 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mem_cgroup_swap_full().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/memcontrol.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 131f940c03fa0..f2c891c1f49d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5267,17 +5267,21 @@ bool mem_cgroup_swap_full(struct folio *folio)
 	if (do_memsw_account())
 		return false;
 
-	memcg = folio_memcg(folio);
-	if (!memcg)
+	if (!folio_memcg_charged(folio))
 		return false;
 
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
 	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
 		unsigned long usage = page_counter_read(&memcg->swap);
 
 		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
-		    usage * 2 >= READ_ONCE(memcg->swap.max))
+		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
+			rcu_read_unlock();
 			return true;
+		}
 	}
+	rcu_read_unlock();
 
 	return false;
 }
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
  2025-12-17  7:27 ` [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
@ 2025-12-17 22:21   ` Johannes Weiner
  2025-12-20  1:05   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:21 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:39PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in mem_cgroup_swap_full().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
  2025-12-17  7:27 ` [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
  2025-12-17 22:21   ` Johannes Weiner
@ 2025-12-20  1:05   ` Shakeel Butt
  2025-12-22  4:02     ` Qi Zheng
  2025-12-26  2:29     ` Chen Ridong
  1 sibling, 2 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  1:05 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:39PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in mem_cgroup_swap_full().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  mm/memcontrol.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 131f940c03fa0..f2c891c1f49d5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5267,17 +5267,21 @@ bool mem_cgroup_swap_full(struct folio *folio)
>  	if (do_memsw_account())
>  		return false;
>  
> -	memcg = folio_memcg(folio);
> -	if (!memcg)
> +	if (!folio_memcg_charged(folio))
>  		return false;
>  
> +	rcu_read_lock();
> +	memcg = folio_memcg(folio);
>  	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>  		unsigned long usage = page_counter_read(&memcg->swap);
>  
>  		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
> -		    usage * 2 >= READ_ONCE(memcg->swap.max))
> +		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
> +			rcu_read_unlock();
>  			return true;
> +		}
>  	}
> +	rcu_read_unlock();
>  
>  	return false;
>  }

How about the following?


 bool mem_cgroup_swap_full(struct folio *folio)
 {
 	struct mem_cgroup *memcg;
+	bool ret = false;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 
 	if (vm_swap_full())
 		return true;
-	if (do_memsw_account())
-		return false;
 
-	if (!folio_memcg_charged(folio))
-		return false;
+	if (do_memsw_account() || !folio_memcg_charged(folio))
+		return ret;
 
 	rcu_read_lock();
 	memcg = folio_memcg(folio);
@@ -5277,13 +5276,13 @@ bool mem_cgroup_swap_full(struct folio *folio)
 
 		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
 		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
-			rcu_read_unlock();
-			return true;
+			ret = true;
+			break;
 		}
 	}
 	rcu_read_unlock();
 
-	return false;
+	return ret;
 }
 

Anyways LGTM.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
  2025-12-20  1:05   ` Shakeel Butt
@ 2025-12-22  4:02     ` Qi Zheng
  2025-12-26  2:29     ` Chen Ridong
  1 sibling, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-22  4:02 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/20/25 9:05 AM, Shakeel Butt wrote:
> On Wed, Dec 17, 2025 at 03:27:39PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. To ensure safety, it will only be appropriate to
>> hold the rcu read lock or acquire a reference to the memory cgroup
>> returned by folio_memcg(), thereby preventing it from being released.
>>
>> In the current patch, the rcu read lock is employed to safeguard
>> against the release of the memory cgroup in mem_cgroup_swap_full().
>>
>> This serves as a preparatory measure for the reparenting of the
>> LRU pages.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>> ---
>>   mm/memcontrol.c | 10 +++++++---
>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 131f940c03fa0..f2c891c1f49d5 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5267,17 +5267,21 @@ bool mem_cgroup_swap_full(struct folio *folio)
>>   	if (do_memsw_account())
>>   		return false;
>>   
>> -	memcg = folio_memcg(folio);
>> -	if (!memcg)
>> +	if (!folio_memcg_charged(folio))
>>   		return false;
>>   
>> +	rcu_read_lock();
>> +	memcg = folio_memcg(folio);
>>   	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>   		unsigned long usage = page_counter_read(&memcg->swap);
>>   
>>   		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
>> -		    usage * 2 >= READ_ONCE(memcg->swap.max))
>> +		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
>> +			rcu_read_unlock();
>>   			return true;
>> +		}
>>   	}
>> +	rcu_read_unlock();
>>   
>>   	return false;
>>   }
> 
> How about the following?
> 
> 
>   bool mem_cgroup_swap_full(struct folio *folio)
>   {
>   	struct mem_cgroup *memcg;
> +	bool ret = false;
>   
>   	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>   
>   	if (vm_swap_full())
>   		return true;
> -	if (do_memsw_account())
> -		return false;
>   
> -	if (!folio_memcg_charged(folio))
> -		return false;
> +	if (do_memsw_account() || !folio_memcg_charged(folio))
> +		return ret;
>   
>   	rcu_read_lock();
>   	memcg = folio_memcg(folio);
> @@ -5277,13 +5276,13 @@ bool mem_cgroup_swap_full(struct folio *folio)
>   
>   		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
>   		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
> -			rcu_read_unlock();
> -			return true;
> +			ret = true;
> +			break;
>   		}
>   	}
>   	rcu_read_unlock();
>   
> -	return false;
> +	return ret;
>   }

LGTM, will do.

>   
> 
> Anyways LGTM.
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

Thanks!




^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
  2025-12-20  1:05   ` Shakeel Butt
  2025-12-22  4:02     ` Qi Zheng
@ 2025-12-26  2:29     ` Chen Ridong
  1 sibling, 0 replies; 149+ messages in thread
From: Chen Ridong @ 2025-12-26  2:29 UTC (permalink / raw)
  To: Shakeel Butt, Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, mkoutny, akpm, hamzamahfooz,
	apais, lance.yang, linux-mm, linux-kernel, cgroups, Muchun Song,
	Qi Zheng



On 2025/12/20 9:05, Shakeel Butt wrote:
> On Wed, Dec 17, 2025 at 03:27:39PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. To ensure safety, it will only be appropriate to
>> hold the rcu read lock or acquire a reference to the memory cgroup
>> returned by folio_memcg(), thereby preventing it from being released.
>>
>> In the current patch, the rcu read lock is employed to safeguard
>> against the release of the memory cgroup in mem_cgroup_swap_full().
>>
>> This serves as a preparatory measure for the reparenting of the
>> LRU pages.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>> ---
>>  mm/memcontrol.c | 10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 131f940c03fa0..f2c891c1f49d5 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5267,17 +5267,21 @@ bool mem_cgroup_swap_full(struct folio *folio)
>>  	if (do_memsw_account())
>>  		return false;
>>  
>> -	memcg = folio_memcg(folio);
>> -	if (!memcg)
>> +	if (!folio_memcg_charged(folio))
>>  		return false;
>>  
>> +	rcu_read_lock();
>> +	memcg = folio_memcg(folio);
>>  	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
>>  		unsigned long usage = page_counter_read(&memcg->swap);
>>  
>>  		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
>> -		    usage * 2 >= READ_ONCE(memcg->swap.max))
>> +		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
>> +			rcu_read_unlock();
>>  			return true;
>> +		}
>>  	}
>> +	rcu_read_unlock();
>>  
>>  	return false;
>>  }
> 
> How about the following?
> 
> 
>  bool mem_cgroup_swap_full(struct folio *folio)
>  {
>  	struct mem_cgroup *memcg;
> +	bool ret = false;
>  
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  
>  	if (vm_swap_full())
>  		return true;
> -	if (do_memsw_account())
> -		return false;
>  
> -	if (!folio_memcg_charged(folio))
> -		return false;
> +	if (do_memsw_account() || !folio_memcg_charged(folio))
> +		return ret;
>  
>  	rcu_read_lock();
>  	memcg = folio_memcg(folio);
> @@ -5277,13 +5276,13 @@ bool mem_cgroup_swap_full(struct folio *folio)
>  
>  		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
>  		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
> -			rcu_read_unlock();
> -			return true;
> +			ret = true;
> +			break;
>  		}
>  	}
>  	rcu_read_unlock();
>  
> -	return false;
> +	return ret;
>  }
>  
> 
> Anyways LGTM.
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

More compact.

LGTM.

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (14 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:23   ` Johannes Weiner
  2025-12-20  1:06   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
                   ` (13 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in lru_gen_eviction().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/workingset.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index e41b44e29944b..445fc634196d8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -241,11 +241,14 @@ static void *lru_gen_eviction(struct folio *folio)
 	int refs = folio_lru_refs(folio);
 	bool workingset = folio_test_workingset(folio);
 	int tier = lru_tier_from_refs(refs, workingset);
-	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat = folio_pgdat(folio);
+	unsigned short memcg_id;
 
 	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
 
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
 	min_seq = READ_ONCE(lrugen->min_seq[type]);
@@ -253,8 +256,10 @@ static void *lru_gen_eviction(struct folio *folio)
 
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+	memcg_id = mem_cgroup_id(memcg);
+	rcu_read_unlock();
 
-	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+	return pack_shadow(memcg_id, pgdat, token, workingset);
 }
 
 /*
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  2025-12-17  7:27 ` [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
@ 2025-12-17 22:23   ` Johannes Weiner
  2025-12-20  1:06   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:23 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:40PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in lru_gen_eviction().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  2025-12-17  7:27 ` [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
  2025-12-17 22:23   ` Johannes Weiner
@ 2025-12-20  1:06   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  1:06 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:40PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. To ensure safety, it will only be appropriate to
> hold the rcu read lock or acquire a reference to the memory cgroup
> returned by folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the memory cgroup in lru_gen_eviction().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (15 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:27   ` Johannes Weiner
  2025-12-18  9:10   ` David Hildenbrand (Red Hat)
  2025-12-17  7:27 ` [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
                   ` (12 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

In the near future, a folio will no longer pin its corresponding memory
cgroup. To ensure safety, it will only be appropriate to hold the rcu read
lock or acquire a reference to the memory cgroup returned by
folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard against
the release of the memory cgroup in folio_split_queue_lock{_irqsave}().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/huge_memory.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 12b46215b30c1..b9e6855ec0b6a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
 
 static struct deferred_split *folio_split_queue_lock(struct folio *folio)
 {
-	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
+	struct deferred_split *queue;
+
+	rcu_read_lock();
+	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
+	rcu_read_unlock();
+
+	return queue;
 }
 
 static struct deferred_split *
 folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
 {
-	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+	struct deferred_split *queue;
+
+	rcu_read_lock();
+	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+	rcu_read_unlock();
+
+	return queue;
 }
 
 static inline void split_queue_unlock(struct deferred_split *queue)
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
  2025-12-17  7:27 ` [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
@ 2025-12-17 22:27   ` Johannes Weiner
  2025-12-20  1:11     ` Shakeel Butt
  2025-12-18  9:10   ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:27 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 03:27:41PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding memory
> cgroup. To ensure safety, it will only be appropriate to hold the rcu read
> lock or acquire a reference to the memory cgroup returned by
> folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard against
> the release of the memory cgroup in folio_split_queue_lock{_irqsave}().
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  mm/huge_memory.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 12b46215b30c1..b9e6855ec0b6a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>  
>  static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>  {
> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
> +	struct deferred_split *queue;
> +
> +	rcu_read_lock();
> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
> +	rcu_read_unlock();

Ah, the memcg destruction path is acquiring the split queue lock for
reparenting. Once you have it locked, it's safe to drop the rcu lock.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
  2025-12-17 22:27   ` Johannes Weiner
@ 2025-12-20  1:11     ` Shakeel Butt
  2025-12-22  3:33       ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  1:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Qi Zheng, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 05:27:17PM -0500, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:41PM +0800, Qi Zheng wrote:
> > From: Qi Zheng <zhengqi.arch@bytedance.com>
> > 
> > In the near future, a folio will no longer pin its corresponding memory
> > cgroup. To ensure safety, it will only be appropriate to hold the rcu read
> > lock or acquire a reference to the memory cgroup returned by
> > folio_memcg(), thereby preventing it from being released.
> > 
> > In the current patch, the rcu read lock is employed to safeguard against
> > the release of the memory cgroup in folio_split_queue_lock{_irqsave}().
> > 
> > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> >  mm/huge_memory.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 12b46215b30c1..b9e6855ec0b6a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
> >  
> >  static struct deferred_split *folio_split_queue_lock(struct folio *folio)
> >  {
> > -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
> > +	struct deferred_split *queue;
> > +
> > +	rcu_read_lock();
> > +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
> > +	rcu_read_unlock();
> 
> Ah, the memcg destruction path is acquiring the split queue lock for
> reparenting. Once you have it locked, it's safe to drop the rcu lock.

Qi, please add the above explanation in a comment and with that:

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
  2025-12-20  1:11     ` Shakeel Butt
@ 2025-12-22  3:33       ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-22  3:33 UTC (permalink / raw)
  To: Shakeel Butt, Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng



On 12/20/25 9:11 AM, Shakeel Butt wrote:
> On Wed, Dec 17, 2025 at 05:27:17PM -0500, Johannes Weiner wrote:
>> On Wed, Dec 17, 2025 at 03:27:41PM +0800, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>
>>> In the near future, a folio will no longer pin its corresponding memory
>>> cgroup. To ensure safety, it will only be appropriate to hold the rcu read
>>> lock or acquire a reference to the memory cgroup returned by
>>> folio_memcg(), thereby preventing it from being released.
>>>
>>> In the current patch, the rcu read lock is employed to safeguard against
>>> the release of the memory cgroup in folio_split_queue_lock{_irqsave}().
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>>> ---
>>>   mm/huge_memory.c | 16 ++++++++++++++--
>>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 12b46215b30c1..b9e6855ec0b6a 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>>>   
>>>   static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>>   {
>>> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>> +	struct deferred_split *queue;
>>> +
>>> +	rcu_read_lock();
>>> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>> +	rcu_read_unlock();
>>
>> Ah, the memcg destruction path is acquiring the split queue lock for
>> reparenting. Once you have it locked, it's safe to drop the rcu lock.
> 
> Qi, please add the above explanation in a comment and with that:

OK, will do.

> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

Thanks!




^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
  2025-12-17  7:27 ` [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
  2025-12-17 22:27   ` Johannes Weiner
@ 2025-12-18  9:10   ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 149+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  9:10 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

On 12/17/25 08:27, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding memory
> cgroup. To ensure safety, it will only be appropriate to hold the rcu read
> lock or acquire a reference to the memory cgroup returned by
> folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard against
> the release of the memory cgroup in folio_split_queue_lock{_irqsave}().
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (16 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:27   ` Johannes Weiner
  2025-12-20  1:14   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 19/28] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
                   ` (11 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

In the near future, a folio will no longer pin its corresponding memory
cgroup. To ensure safety, it will only be appropriate to hold the rcu read
lock or acquire a reference to the memory cgroup returned by
folio_memcg(), thereby preventing it from being released.

In the current patch, the rcu read lock is employed to safeguard against
the release of the memory cgroup in zswap_compress().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/zswap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958d..b468046a90754 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -894,11 +894,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * to the active LRU list in the case.
 	 */
 	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
+		rcu_read_lock();
 		if (!mem_cgroup_zswap_writeback_enabled(
 					folio_memcg(page_folio(page)))) {
+			rcu_read_unlock();
 			comp_ret = comp_ret ? comp_ret : -EINVAL;
 			goto unlock;
 		}
+		rcu_read_unlock();
 		comp_ret = 0;
 		dlen = PAGE_SIZE;
 		dst = kmap_local_page(page);
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress()
  2025-12-17  7:27 ` [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
@ 2025-12-17 22:27   ` Johannes Weiner
  2025-12-20  1:14   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:27 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 03:27:42PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding memory
> cgroup. To ensure safety, it will only be appropriate to hold the rcu read
> lock or acquire a reference to the memory cgroup returned by
> folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard against
> the release of the memory cgroup in zswap_compress().
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress()
  2025-12-17  7:27 ` [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
  2025-12-17 22:27   ` Johannes Weiner
@ 2025-12-20  1:14   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  1:14 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 03:27:42PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding memory
> cgroup. To ensure safety, it will only be appropriate to hold the rcu read
> lock or acquire a reference to the memory cgroup returned by
> folio_memcg(), thereby preventing it from being released.
> 
> In the current patch, the rcu read lock is employed to safeguard against
> the release of the memory cgroup in zswap_compress().
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 19/28] mm: workingset: prevent lruvec release in workingset_refault()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (17 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:30   ` Johannes Weiner
  2025-12-17  7:27 ` [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
                   ` (10 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.

In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_refault().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/workingset.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 445fc634196d8..427ca1a5625e8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
 	 * locked to guarantee folio_memcg() stability throughout.
 	 */
 	nr = folio_nr_pages(folio);
+	rcu_read_lock();
 	lruvec = folio_lruvec(folio);
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
 	if (!workingset_test_recent(shadow, file, &workingset, true))
-		return;
+		goto out;
 
 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
@@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
 		lru_note_cost_refault(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
 	}
+out:
+	rcu_read_unlock();
 }
 
 /**
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 19/28] mm: workingset: prevent lruvec release in workingset_refault()
  2025-12-17  7:27 ` [PATCH v2 19/28] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
@ 2025-12-17 22:30   ` Johannes Weiner
  2025-12-18  6:57     ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:30 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:43PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in workingset_refault().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> ---
>  mm/workingset.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 445fc634196d8..427ca1a5625e8 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>  	 * locked to guarantee folio_memcg() stability throughout.
>  	 */
>  	nr = folio_nr_pages(folio);
> +	rcu_read_lock();
>  	lruvec = folio_lruvec(folio);
>  	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>  
>  	if (!workingset_test_recent(shadow, file, &workingset, true))

This might sleep. You have to get a reference here and unlock RCU.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 19/28] mm: workingset: prevent lruvec release in workingset_refault()
  2025-12-17 22:30   ` Johannes Weiner
@ 2025-12-18  6:57     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-18  6:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/18/25 6:30 AM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:43PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. So an lruvec returned by folio_lruvec() could be
>> released without the rcu read lock or a reference to its memory
>> cgroup.
>>
>> In the current patch, the rcu read lock is employed to safeguard
>> against the release of the lruvec in workingset_refault().
>>
>> This serves as a preparatory measure for the reparenting of the
>> LRU pages.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
>> ---
>>   mm/workingset.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/workingset.c b/mm/workingset.c
>> index 445fc634196d8..427ca1a5625e8 100644
>> --- a/mm/workingset.c
>> +++ b/mm/workingset.c
>> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>>   	 * locked to guarantee folio_memcg() stability throughout.
>>   	 */
>>   	nr = folio_nr_pages(folio);
>> +	rcu_read_lock();
>>   	lruvec = folio_lruvec(folio);
>>   	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>>   
>>   	if (!workingset_test_recent(shadow, file, &workingset, true))
> 
> This might sleep. You have to get a reference here and unlock RCU.

Indeed, this may sleep in css_rstat_flush().

Will do the following:

get_mem_cgroup_from_folio(folio);
lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));

/* xxx */

mem_cgroup_put(memcg);





^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (18 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 19/28] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:33   ` Johannes Weiner
  2025-12-20  1:23   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
                   ` (9 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Nhat Pham,
	Chengming Zhou, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.

In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in zswap_folio_swapin().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/zswap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/zswap.c b/mm/zswap.c
index b468046a90754..738d914e53549 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -664,8 +664,10 @@ void zswap_folio_swapin(struct folio *folio)
 	struct lruvec *lruvec;
 
 	if (folio) {
+		rcu_read_lock();
 		lruvec = folio_lruvec(folio);
 		atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins);
+		rcu_read_unlock();
 	}
 }
 
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
  2025-12-17  7:27 ` [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
@ 2025-12-17 22:33   ` Johannes Weiner
  2025-12-18  7:09     ` Qi Zheng
  2025-12-20  1:23   ` Shakeel Butt
  1 sibling, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:33 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Nhat Pham, Chengming Zhou, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:44PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in zswap_folio_swapin().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Btw, it would make the series shorter if you combined the changes to
workingset.c, zswap.c etc. It should still be easy to review as long
as you just stick to making folio_memcg(), folio_lruvec() calls safe.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
  2025-12-17 22:33   ` Johannes Weiner
@ 2025-12-18  7:09     ` Qi Zheng
  2025-12-18 13:02       ` Johannes Weiner
  0 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-18  7:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Nhat Pham, Chengming Zhou, Qi Zheng



On 12/18/25 6:33 AM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:44PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. So an lruvec returned by folio_lruvec() could be
>> released without the rcu read lock or a reference to its memory
>> cgroup.
>>
>> In the current patch, the rcu read lock is employed to safeguard
>> against the release of the lruvec in zswap_folio_swapin().
>>
>> This serves as a preparatory measure for the reparenting of the
>> LRU pages.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Acked-by: Nhat Pham <nphamcs@gmail.com>
>> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!

> 
> Btw, it would make the series shorter if you combined the changes to
> workingset.c, zswap.c etc. It should still be easy to review as long
> as you just stick to making folio_memcg(), folio_lruvec() calls safe.

I prefer to separate them. For example, as you pointed out, in some
places, it would be more appropriate to switch to use
get_mem_cgroup_from_folio() to handle it. Separating them also makes
subsequent updates and iterations easier.




^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
  2025-12-18  7:09     ` Qi Zheng
@ 2025-12-18 13:02       ` Johannes Weiner
  0 siblings, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-18 13:02 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Nhat Pham, Chengming Zhou, Qi Zheng

On Thu, Dec 18, 2025 at 03:09:04PM +0800, Qi Zheng wrote:
> 
> 
> On 12/18/25 6:33 AM, Johannes Weiner wrote:
> > On Wed, Dec 17, 2025 at 03:27:44PM +0800, Qi Zheng wrote:
> >> From: Muchun Song <songmuchun@bytedance.com>
> >>
> >> In the near future, a folio will no longer pin its corresponding
> >> memory cgroup. So an lruvec returned by folio_lruvec() could be
> >> released without the rcu read lock or a reference to its memory
> >> cgroup.
> >>
> >> In the current patch, the rcu read lock is employed to safeguard
> >> against the release of the lruvec in zswap_folio_swapin().
> >>
> >> This serves as a preparatory measure for the reparenting of the
> >> LRU pages.
> >>
> >> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >> Acked-by: Nhat Pham <nphamcs@gmail.com>
> >> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> >> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> >> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> > 
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Thanks!
> 
> > 
> > Btw, it would make the series shorter if you combined the changes to
> > workingset.c, zswap.c etc. It should still be easy to review as long
> > as you just stick to making folio_memcg(), folio_lruvec() calls safe.
> 
> I prefer to separate them. For example, as you pointed out, in some
> places, it would be more appropriate to switch to use
> get_mem_cgroup_from_folio() to handle it. Separating them also makes
> subsequent updates and iterations easier.

Ok, that works for me!


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
  2025-12-17  7:27 ` [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
  2025-12-17 22:33   ` Johannes Weiner
@ 2025-12-20  1:23   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  1:23 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Nhat Pham, Chengming Zhou, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:44PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in zswap_folio_swapin().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (19 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:34   ` Johannes Weiner
  2025-12-20  1:24   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
                   ` (8 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.

In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in lru_gen_clear_refs().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/swap.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index ec0c654e128dc..0606795f3ccf3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -412,18 +412,20 @@ static void lru_gen_inc_refs(struct folio *folio)
 
 static bool lru_gen_clear_refs(struct folio *folio)
 {
-	struct lru_gen_folio *lrugen;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
+	unsigned long seq;
 
 	if (gen < 0)
 		return true;
 
 	set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS | BIT(PG_workingset), 0);
 
-	lrugen = &folio_lruvec(folio)->lrugen;
+	rcu_read_lock();
+	seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]);
+	rcu_read_unlock();
 	/* whether can do without shuffling under the LRU lock */
-	return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
+	return gen == lru_gen_from_seq(seq);
 }
 
 #else /* !CONFIG_LRU_GEN */
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs()
  2025-12-17  7:27 ` [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
@ 2025-12-17 22:34   ` Johannes Weiner
  2025-12-20  1:24   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:34 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:45PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in lru_gen_clear_refs().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs()
  2025-12-17  7:27 ` [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
  2025-12-17 22:34   ` Johannes Weiner
@ 2025-12-20  1:24   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  1:24 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:45PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in lru_gen_clear_refs().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (20 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17 22:36   ` Johannes Weiner
  2025-12-20  1:25   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
                   ` (7 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.

In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_activation().

This serves as a preparatory measure for the reparenting of the
LRU pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/workingset.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 427ca1a5625e8..d6484f7a3ad28 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -595,8 +595,11 @@ void workingset_activation(struct folio *folio)
 	 * Filter non-memcg pages here, e.g. unmap can call
 	 * mark_page_accessed() on VDSO pages.
 	 */
-	if (mem_cgroup_disabled() || folio_memcg_charged(folio))
+	if (mem_cgroup_disabled() || folio_memcg_charged(folio)) {
+		rcu_read_lock();
 		workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
+		rcu_read_unlock();
+	}
 }
 
 /*
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation()
  2025-12-17  7:27 ` [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
@ 2025-12-17 22:36   ` Johannes Weiner
  2025-12-20  1:25   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-17 22:36 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:46PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in workingset_activation().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation()
  2025-12-17  7:27 ` [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
  2025-12-17 22:36   ` Johannes Weiner
@ 2025-12-20  1:25   ` Shakeel Butt
  1 sibling, 0 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  1:25 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:46PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in workingset_activation().
> 
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (21 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18 13:00   ` Johannes Weiner
  2025-12-20  2:03   ` Shakeel Butt
  2025-12-17  7:27 ` [PATCH v2 24/28] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
                   ` (6 subsequent siblings)
  29 siblings, 2 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

The following diagram illustrates how to ensure the safety of the folio
lruvec lock when LRU folios undergo reparenting.

In the folio_lruvec_lock(folio) function:
```
    rcu_read_lock();
retry:
    lruvec = folio_lruvec(folio);
    /* There is a possibility of folio reparenting at this point. */
    spin_lock(&lruvec->lru_lock);
    if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
        /*
         * The wrong lruvec lock was acquired, and a retry is required.
         * This is because the folio resides on the parent memcg lruvec
         * list.
         */
        spin_unlock(&lruvec->lru_lock);
        goto retry;
    }

    /* Reaching here indicates that folio_memcg() is stable. */
```

In the memcg_reparent_objcgs(memcg) function:
```
    spin_lock(&lruvec->lru_lock);
    spin_lock(&lruvec_parent->lru_lock);
    /* Transfer folios from the lruvec list to the parent's. */
    spin_unlock(&lruvec_parent->lru_lock);
    spin_unlock(&lruvec->lru_lock);
```

After acquiring the lruvec lock, it is necessary to verify whether
the folio has been reparented. If reparenting has occurred, the new
lruvec lock must be reacquired. During the LRU folio reparenting
process, the lruvec lock will also be acquired (this will be
implemented in a subsequent patch). Therefore, folio_memcg() remains
unchanged while the lruvec lock is held.

Given that lruvec_memcg(lruvec) is always equal to folio_memcg(folio)
after the lruvec lock is acquired, the lruvec_memcg_debug() check is
redundant. Hence, it is removed.

This patch serves as a preparation for the reparenting of LRU folios.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/memcontrol.h | 26 ++++++++-----------
 mm/compaction.c            | 29 ++++++++++++++++-----
 mm/memcontrol.c            | 53 +++++++++++++++++++-------------------
 3 files changed, 61 insertions(+), 47 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69c4bcfb3c3cd..85265b28c5d18 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -740,7 +740,11 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
  * folio_lruvec - return lruvec for isolating/putting an LRU folio
  * @folio: Pointer to the folio.
  *
- * This function relies on folio->mem_cgroup being stable.
+ * The user should hold an rcu read lock to protect lruvec associated with
+ * the folio from being released. But it does not prevent binding stability
+ * between the folio and the returned lruvec from being changed to its parent
+ * or ancestor (e.g. like folio_lruvec_lock() does that holds LRU lock to
+ * prevent the change).
  */
 static inline struct lruvec *folio_lruvec(struct folio *folio)
 {
@@ -763,15 +767,6 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 						unsigned long *flags);
 
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio);
-#else
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-#endif
-
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1194,11 +1189,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio)
 	return &pgdat->__lruvec;
 }
 
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1257,6 +1247,7 @@ static inline struct lruvec *folio_lruvec_lock(struct folio *folio)
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
+	rcu_read_lock();
 	spin_lock(&pgdat->__lruvec.lru_lock);
 	return &pgdat->__lruvec;
 }
@@ -1265,6 +1256,7 @@ static inline struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
+	rcu_read_lock();
 	spin_lock_irq(&pgdat->__lruvec.lru_lock);
 	return &pgdat->__lruvec;
 }
@@ -1274,6 +1266,7 @@ static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
+	rcu_read_lock();
 	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
 	return &pgdat->__lruvec;
 }
@@ -1487,17 +1480,20 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 static inline void lruvec_unlock(struct lruvec *lruvec)
 {
 	spin_unlock(&lruvec->lru_lock);
+	rcu_read_unlock();
 }
 
 static inline void lruvec_unlock_irq(struct lruvec *lruvec)
 {
 	spin_unlock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
 }
 
 static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
 		unsigned long flags)
 {
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+	rcu_read_unlock();
 }
 
 /* Test requires a stable folio->memcg binding, see folio_memcg() */
diff --git a/mm/compaction.c b/mm/compaction.c
index c3e338aaa0ffb..3648ce22c8072 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -518,6 +518,24 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
 	return true;
 }
 
+static struct lruvec *
+compact_folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags,
+				  struct compact_control *cc)
+{
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
+	compact_lock_irqsave(&lruvec->lru_lock, flags, cc);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+		goto retry;
+	}
+
+	return lruvec;
+}
+
 /*
  * Compaction requires the taking of some coarse locks that are potentially
  * very heavily contended. The lock should be periodically unlocked to avoid
@@ -839,7 +857,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 {
 	pg_data_t *pgdat = cc->zone->zone_pgdat;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 	struct lruvec *locked = NULL;
 	struct folio *folio = NULL;
@@ -1153,18 +1171,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if (!folio_test_clear_lru(folio))
 			goto isolate_fail_put;
 
-		lruvec = folio_lruvec(folio);
+		if (locked)
+			lruvec = folio_lruvec(folio);
 
 		/* If we already hold the lock, we can skip some rechecking */
-		if (lruvec != locked) {
+		if (lruvec != locked || !locked) {
 			if (locked)
 				lruvec_unlock_irqrestore(locked, flags);
 
-			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			lruvec = compact_folio_lruvec_lock_irqsave(folio, &flags, cc);
 			locked = lruvec;
 
-			lruvec_memcg_debug(lruvec, folio);
-
 			/*
 			 * Try get exclusive access under lock. If marked for
 			 * skip, the scan is aborted unless the current context
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f2c891c1f49d5..930dacd6ce31a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1184,23 +1184,6 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	}
 }
 
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-	struct mem_cgroup *memcg;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	memcg = folio_memcg(folio);
-
-	if (!memcg)
-		VM_BUG_ON_FOLIO(!mem_cgroup_is_root(lruvec_memcg(lruvec)), folio);
-	else
-		VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio);
-}
-#endif
-
 /**
  * folio_lruvec_lock - Lock the lruvec for a folio.
  * @folio: Pointer to the folio.
@@ -1210,14 +1193,20 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
  * - folio_test_lru false
  * - folio frozen (refcount of 0)
  *
- * Return: The lruvec this folio is on with its lock held.
+ * Return: The lruvec this folio is on with its lock held and rcu read lock held.
  */
 struct lruvec *folio_lruvec_lock(struct folio *folio)
 {
-	struct lruvec *lruvec = folio_lruvec(folio);
+	struct lruvec *lruvec;
 
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
 	spin_lock(&lruvec->lru_lock);
-	lruvec_memcg_debug(lruvec, folio);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock(&lruvec->lru_lock);
+		goto retry;
+	}
 
 	return lruvec;
 }
@@ -1232,14 +1221,20 @@ struct lruvec *folio_lruvec_lock(struct folio *folio)
  * - folio frozen (refcount of 0)
  *
  * Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
  */
 struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
 {
-	struct lruvec *lruvec = folio_lruvec(folio);
+	struct lruvec *lruvec;
 
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
 	spin_lock_irq(&lruvec->lru_lock);
-	lruvec_memcg_debug(lruvec, folio);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock_irq(&lruvec->lru_lock);
+		goto retry;
+	}
 
 	return lruvec;
 }
@@ -1255,15 +1250,21 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
  * - folio frozen (refcount of 0)
  *
  * Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
  */
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
 		unsigned long *flags)
 {
-	struct lruvec *lruvec = folio_lruvec(folio);
+	struct lruvec *lruvec;
 
+	rcu_read_lock();
+retry:
+	lruvec = folio_lruvec(folio);
 	spin_lock_irqsave(&lruvec->lru_lock, *flags);
-	lruvec_memcg_debug(lruvec, folio);
+	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+		spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+		goto retry;
+	}
 
 	return lruvec;
 }
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  2025-12-17  7:27 ` [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
@ 2025-12-18 13:00   ` Johannes Weiner
  2025-12-18 13:17     ` Qi Zheng
  2025-12-20  2:03   ` Shakeel Butt
  1 sibling, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-18 13:00 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:47PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> The following diagram illustrates how to ensure the safety of the folio
> lruvec lock when LRU folios undergo reparenting.
> 
> In the folio_lruvec_lock(folio) function:
> ```
>     rcu_read_lock();
> retry:
>     lruvec = folio_lruvec(folio);
>     /* There is a possibility of folio reparenting at this point. */
>     spin_lock(&lruvec->lru_lock);
>     if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>         /*
>          * The wrong lruvec lock was acquired, and a retry is required.
>          * This is because the folio resides on the parent memcg lruvec
>          * list.
>          */
>         spin_unlock(&lruvec->lru_lock);
>         goto retry;
>     }
> 
>     /* Reaching here indicates that folio_memcg() is stable. */
> ```
> 
> In the memcg_reparent_objcgs(memcg) function:
> ```
>     spin_lock(&lruvec->lru_lock);
>     spin_lock(&lruvec_parent->lru_lock);
>     /* Transfer folios from the lruvec list to the parent's. */
>     spin_unlock(&lruvec_parent->lru_lock);
>     spin_unlock(&lruvec->lru_lock);
> ```
> 
> After acquiring the lruvec lock, it is necessary to verify whether
> the folio has been reparented. If reparenting has occurred, the new
> lruvec lock must be reacquired. During the LRU folio reparenting
> process, the lruvec lock will also be acquired (this will be
> implemented in a subsequent patch). Therefore, folio_memcg() remains
> unchanged while the lruvec lock is held.
> 
> Given that lruvec_memcg(lruvec) is always equal to folio_memcg(folio)
> after the lruvec lock is acquired, the lruvec_memcg_debug() check is
> redundant. Hence, it is removed.
> 
> This patch serves as a preparation for the reparenting of LRU folios.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/memcontrol.h | 26 ++++++++-----------
>  mm/compaction.c            | 29 ++++++++++++++++-----
>  mm/memcontrol.c            | 53 +++++++++++++++++++-------------------
>  3 files changed, 61 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 69c4bcfb3c3cd..85265b28c5d18 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -740,7 +740,11 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>   * folio_lruvec - return lruvec for isolating/putting an LRU folio
>   * @folio: Pointer to the folio.
>   *
> - * This function relies on folio->mem_cgroup being stable.
> + * The user should hold an rcu read lock to protect lruvec associated with
> + * the folio from being released. But it does not prevent binding stability
> + * between the folio and the returned lruvec from being changed to its parent
> + * or ancestor (e.g. like folio_lruvec_lock() does that holds LRU lock to
> + * prevent the change).

Can you please make this separate paragraphs to highlight the two
distinct modes of access? Something like this:

Call with rcu_read_lock() held to ensure the lifetime of the returned
lruvec. Note that this alone will NOT guarantee the stability of the
folio->lruvec association; the folio can be reparented to an ancestor
if this races with cgroup deletion.

Use folio_lruvec_lock() to ensure both lifetime and stability of the
binding. Once a lruvec is locked, folio_lruvec() can be called on
other folios, and their binding is stable if the returned lruvec
matches the one the caller has locked. Useful for lock batching.

Everything else looks good to me.

Thanks for putting so much effort into making these patches clean,
well-documented, and the series so easy to review!

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  2025-12-18 13:00   ` Johannes Weiner
@ 2025-12-18 13:17     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-18 13:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/18/25 9:00 PM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:47PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> The following diagram illustrates how to ensure the safety of the folio
>> lruvec lock when LRU folios undergo reparenting.
>>
>> In the folio_lruvec_lock(folio) function:
>> ```
>>      rcu_read_lock();
>> retry:
>>      lruvec = folio_lruvec(folio);
>>      /* There is a possibility of folio reparenting at this point. */
>>      spin_lock(&lruvec->lru_lock);
>>      if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>>          /*
>>           * The wrong lruvec lock was acquired, and a retry is required.
>>           * This is because the folio resides on the parent memcg lruvec
>>           * list.
>>           */
>>          spin_unlock(&lruvec->lru_lock);
>>          goto retry;
>>      }
>>
>>      /* Reaching here indicates that folio_memcg() is stable. */
>> ```
>>
>> In the memcg_reparent_objcgs(memcg) function:
>> ```
>>      spin_lock(&lruvec->lru_lock);
>>      spin_lock(&lruvec_parent->lru_lock);
>>      /* Transfer folios from the lruvec list to the parent's. */
>>      spin_unlock(&lruvec_parent->lru_lock);
>>      spin_unlock(&lruvec->lru_lock);
>> ```
>>
>> After acquiring the lruvec lock, it is necessary to verify whether
>> the folio has been reparented. If reparenting has occurred, the new
>> lruvec lock must be reacquired. During the LRU folio reparenting
>> process, the lruvec lock will also be acquired (this will be
>> implemented in a subsequent patch). Therefore, folio_memcg() remains
>> unchanged while the lruvec lock is held.
>>
>> Given that lruvec_memcg(lruvec) is always equal to folio_memcg(folio)
>> after the lruvec lock is acquired, the lruvec_memcg_debug() check is
>> redundant. Hence, it is removed.
>>
>> This patch serves as a preparation for the reparenting of LRU folios.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/memcontrol.h | 26 ++++++++-----------
>>   mm/compaction.c            | 29 ++++++++++++++++-----
>>   mm/memcontrol.c            | 53 +++++++++++++++++++-------------------
>>   3 files changed, 61 insertions(+), 47 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 69c4bcfb3c3cd..85265b28c5d18 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -740,7 +740,11 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>>    * folio_lruvec - return lruvec for isolating/putting an LRU folio
>>    * @folio: Pointer to the folio.
>>    *
>> - * This function relies on folio->mem_cgroup being stable.
>> + * The user should hold an rcu read lock to protect lruvec associated with
>> + * the folio from being released. But it does not prevent binding stability
>> + * between the folio and the returned lruvec from being changed to its parent
>> + * or ancestor (e.g. like folio_lruvec_lock() does that holds LRU lock to
>> + * prevent the change).
> 
> Can you please make this separate paragraphs to highlight the two
> distinct modes of access? Something like this:
> 
> Call with rcu_read_lock() held to ensure the lifetime of the returned
> lruvec. Note that this alone will NOT guarantee the stability of the
> folio->lruvec association; the folio can be reparented to an ancestor
> if this races with cgroup deletion.
> 
> Use folio_lruvec_lock() to ensure both lifetime and stability of the
> binding. Once a lruvec is locked, folio_lruvec() can be called on
> other folios, and their binding is stable if the returned lruvec
> matches the one the caller has locked. Useful for lock batching.

OK, will do in the next version.

> 
> Everything else looks good to me.
> 
> Thanks for putting so much effort into making these patches clean,
> well-documented, and the series so easy to review!
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!




^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  2025-12-17  7:27 ` [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
  2025-12-18 13:00   ` Johannes Weiner
@ 2025-12-20  2:03   ` Shakeel Butt
  2025-12-23  6:14     ` Qi Zheng
  1 sibling, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-20  2:03 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:47PM +0800, Qi Zheng wrote:
> @@ -1232,14 +1221,20 @@ struct lruvec *folio_lruvec_lock(struct folio *folio)
>   * - folio frozen (refcount of 0)
>   *
>   * Return: The lruvec this folio is on with its lock held and interrupts
> - * disabled.
> + * disabled and rcu read lock held.
>   */
>  struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
>  {
> -	struct lruvec *lruvec = folio_lruvec(folio);
> +	struct lruvec *lruvec;
>  
> +	rcu_read_lock();
> +retry:
> +	lruvec = folio_lruvec(folio);
>  	spin_lock_irq(&lruvec->lru_lock);
> -	lruvec_memcg_debug(lruvec, folio);
> +	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> +		spin_unlock_irq(&lruvec->lru_lock);
> +		goto retry;
> +	}

So after this patch, all folio_lruvec_lock_irq() calls should be paired
with lruvec_unlock_irq() but lru_note_cost_refault() is calling
folio_lruvec_lock_irq() without the paired lruvec_unlock_irq(). It is
using lru_note_cost_unlock_irq() for unlocking lru lock and thus rcu
read unlock is missed.

Beside fixing this, I would suggest to add __acquire()/__release() tags
for both lru lock and rcu for all these functions.



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  2025-12-20  2:03   ` Shakeel Butt
@ 2025-12-23  6:14     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-23  6:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/20/25 10:03 AM, Shakeel Butt wrote:
> On Wed, Dec 17, 2025 at 03:27:47PM +0800, Qi Zheng wrote:
>> @@ -1232,14 +1221,20 @@ struct lruvec *folio_lruvec_lock(struct folio *folio)
>>    * - folio frozen (refcount of 0)
>>    *
>>    * Return: The lruvec this folio is on with its lock held and interrupts
>> - * disabled.
>> + * disabled and rcu read lock held.
>>    */
>>   struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
>>   {
>> -	struct lruvec *lruvec = folio_lruvec(folio);
>> +	struct lruvec *lruvec;
>>   
>> +	rcu_read_lock();
>> +retry:
>> +	lruvec = folio_lruvec(folio);
>>   	spin_lock_irq(&lruvec->lru_lock);
>> -	lruvec_memcg_debug(lruvec, folio);
>> +	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>> +		spin_unlock_irq(&lruvec->lru_lock);
>> +		goto retry;
>> +	}
> 
> So after this patch, all folio_lruvec_lock_irq() calls should be paired
> with lruvec_unlock_irq() but lru_note_cost_refault() is calling
> folio_lruvec_lock_irq() without the paired lruvec_unlock_irq(). It is
> using lru_note_cost_unlock_irq() for unlocking lru lock and thus rcu
> read unlock is missed.

Indeed. Will fix in the next version.

> 
> Beside fixing this, I would suggest to add __acquire()/__release() tags
> for both lru lock and rcu for all these functions.

OK, will do.


Thanks,
Qi

> 



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 24/28] mm: vmscan: prepare for reparenting traditional LRU folios
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (22 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18 13:32   ` Johannes Weiner
  2025-12-17  7:27 ` [PATCH v2 25/28] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

To reslove the dying memcg issue, we need to reparent LRU folios of child
memcg to its parent memcg. For traditional LRU list, each lruvec of every
memcg comprises four LRU lists. Due to the symmetry of the LRU lists, it
is feasible to transfer the LRU lists from a memcg to its parent memcg
during the reparenting process.

This commit implements the specific function, which will be used during
the reparenting process.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 include/linux/mmzone.h |  4 ++++
 mm/vmscan.c            | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307f..08132012aa8b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -366,6 +366,10 @@ enum lruvec_flags {
 	LRUVEC_NODE_CONGESTED,
 };
 
+#ifdef CONFIG_MEMCG
+void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst);
+#endif /* CONFIG_MEMCG */
+
 #endif /* !__GENERATING_BOUNDS_H */
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 814498a2c1bd6..5fd0f97c3719c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 			  lruvec_memcg(lruvec));
 }
 
+#ifdef CONFIG_MEMCG
+static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
+				enum lru_list lru)
+{
+	int zid;
+	struct mem_cgroup_per_node *mz_src, *mz_dst;
+
+	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
+	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
+
+	if (lru != LRU_UNEVICTABLE)
+		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
+		mz_src->lru_zone_size[zid][lru] = 0;
+	}
+}
+
+void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+	int nid;
+
+	for_each_node(nid) {
+		enum lru_list lru;
+		struct lruvec *src_lruvec, *dst_lruvec;
+
+		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
+		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
+		dst_lruvec->anon_cost += src_lruvec->anon_cost;
+		dst_lruvec->file_cost += src_lruvec->file_cost;
+
+		for_each_lru(lru)
+			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
+	}
+}
+#endif
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 24/28] mm: vmscan: prepare for reparenting traditional LRU folios
  2025-12-17  7:27 ` [PATCH v2 24/28] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
@ 2025-12-18 13:32   ` Johannes Weiner
  2025-12-22  3:55     ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-18 13:32 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 03:27:48PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> To reslove the dying memcg issue, we need to reparent LRU folios of child

     resolve

> memcg to its parent memcg. For traditional LRU list, each lruvec of every
> memcg comprises four LRU lists. Due to the symmetry of the LRU lists, it
> is feasible to transfer the LRU lists from a memcg to its parent memcg
> during the reparenting process.
> 
> This commit implements the specific function, which will be used during
> the reparenting process.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Overall looks sane to me. I have a few nits below, not nothing
major. With those resolved, please feel free to add

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>  			  lruvec_memcg(lruvec));
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
> +				enum lru_list lru)
> +{
> +	int zid;
> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
> +
> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
> +
> +	if (lru != LRU_UNEVICTABLE)
> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
> +
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
> +		mz_src->lru_zone_size[zid][lru] = 0;
> +	}
> +}
> +
> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)

I can see why you want to pass both src and dst for convenience, but
it makes the API look a lot more generic than it is. It can only
safely move LRUs from a cgroup to its parent.

As such, I'd slightly prefer only passing one pointer and doing the
parent lookup internally. It's dealer's choice.

However, if you'd like to keep both pointers for a centralized lookup,
can you please rename the parameters @child and @parent, and add

	VM_WARN_ON(parent != parent_mem_cgroup(child));

Also please add a comment explaining the expected caller locking.

Lastly, vmscan.c is the reclaim policy. Mechanical LRU shuffling like
this is better placed in mm/swap.c.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 24/28] mm: vmscan: prepare for reparenting traditional LRU folios
  2025-12-18 13:32   ` Johannes Weiner
@ 2025-12-22  3:55     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-22  3:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng



On 12/18/25 9:32 PM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:48PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> To reslove the dying memcg issue, we need to reparent LRU folios of child
> 
>       resolve

Got it.

> 
>> memcg to its parent memcg. For traditional LRU list, each lruvec of every
>> memcg comprises four LRU lists. Due to the symmetry of the LRU lists, it
>> is feasible to transfer the LRU lists from a memcg to its parent memcg
>> during the reparenting process.
>>
>> This commit implements the specific function, which will be used during
>> the reparenting process.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> Overall looks sane to me. I have a few nits below, not nothing
> major. With those resolved, please feel free to add
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!

> 
>> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>>   			  lruvec_memcg(lruvec));
>>   }
>>   
>> +#ifdef CONFIG_MEMCG
>> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
>> +				enum lru_list lru)
>> +{
>> +	int zid;
>> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
>> +
>> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
>> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
>> +
>> +	if (lru != LRU_UNEVICTABLE)
>> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
>> +
>> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
>> +		mz_src->lru_zone_size[zid][lru] = 0;
>> +	}
>> +}
>> +
>> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
> 
> I can see why you want to pass both src and dst for convenience, but
> it makes the API look a lot more generic than it is. It can only
> safely move LRUs from a cgroup to its parent.
> 
> As such, I'd slightly prefer only passing one pointer and doing the
> parent lookup internally. It's dealer's choice.

Make sense, will do.

> 
> However, if you'd like to keep both pointers for a centralized lookup,
> can you please rename the parameters @child and @parent, and add
> 
> 	VM_WARN_ON(parent != parent_mem_cgroup(child));
> 
> Also please add a comment explaining the expected caller locking.

OK.

> 
> Lastly, vmscan.c is the reclaim policy. Mechanical LRU shuffling like
> this is better placed in mm/swap.c.

OK, will move it to mm/swap.c.




^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 25/28] mm: vmscan: prepare for reparenting MGLRU folios
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (23 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 24/28] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-17  7:27 ` [PATCH v2 26/28] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Similar to traditional LRU folios, in order to solve the dying memcg
problem, we also need to reparenting MGLRU folios to the parent memcg when
memcg offline.

However, there are the following challenges:

1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the
   number of generations of the parent and child memcg may be different,
   so we cannot simply transfer MGLRU folios in the child memcg to the
   parent memcg as we did for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
   traverse these folios while holding the lru lock, otherwise it may
   cause softlockup.
3. In walk_update_folio(), the gen of folio and corresponding lru size
   may be updated, but the folio is not immediately moved to the
   corresponding lru list. Therefore, there may be folios of different
   generations on an LRU list.
4. In lru_gen_del_folio(), the generation to which the folio belongs is
   found based on the generation information in folio->flags, and the
   corresponding LRU size will be updated. Therefore, we need to update
   the lru size correctly during reparenting, otherwise the lru size may
   be updated incorrectly in lru_gen_del_folio().

Finally, this patch chose a compromise method, which is to splice the lru
list in the child memcg to the lru list of the same generation in the
parent memcg during reparenting. And in order to ensure that the parent
memcg has the same generation, we need to increase the generations in the
parent memcg to the MAX_NR_GENS before reparenting.

Of course, the same generation has different meanings in the parent and
child memcg, this will cause confusion in the hot and cold information of
folios. But other than that, this method is simple enough, the lru size
is correct, and there is no need to consider some concurrency issues (such
as lru_gen_del_folio()).

To prepare for the above work, this commit implements the specific
functions, which will be used during reparenting.

Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Suggested-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mmzone.h |  16 +++++
 mm/vmscan.c            | 141 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 157 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 08132012aa8b8..67c0e55da1220 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,6 +628,9 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg);
 void lru_gen_offline_memcg(struct mem_cgroup *memcg);
 void lru_gen_release_memcg(struct mem_cgroup *memcg);
 void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg);
+void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst);
 
 #else /* !CONFIG_LRU_GEN */
 
@@ -668,6 +671,19 @@ static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
 {
 }
 
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+	return true;
+}
+
+static inline void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 struct lruvec {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5fd0f97c3719c..64a85eea26dc6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4466,6 +4466,147 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
 		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
 }
 
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+		int type;
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
+				      struct lruvec *lruvec)
+{
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
+	int swappiness = mem_cgroup_swappiness(memcg);
+	DEFINE_MAX_SEQ(lruvec);
+	bool success = false;
+
+	/*
+	 * We are not iterating the mm_list here, updating mm_state->seq is just
+	 * to make mm walkers work properly.
+	 */
+	if (mm_state) {
+		spin_lock(&mm_list->lock);
+		VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+		if (max_seq > mm_state->seq) {
+			WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+			success = true;
+		}
+		spin_unlock(&mm_list->lock);
+	} else {
+		success = true;
+	}
+
+	if (success)
+		inc_max_seq(lruvec, max_seq, swappiness);
+}
+
+/*
+ * We need to ensure that the folios of child memcg can be reparented to the
+ * same gen of the parent memcg, so the gens of the parent memcg needed be
+ * incremented to the MAX_NR_GENS before reparenting.
+ */
+void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+		int type;
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
+				try_to_inc_max_seq_nowalk(memcg, lruvec);
+				cond_resched();
+			}
+		}
+	}
+}
+
+/*
+ * Compared to traditional LRU, MGLRU faces the following challenges:
+ *
+ * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the
+ *    number of generations of the parent and child memcg may be different,
+ *    so we cannot simply transfer MGLRU folios in the child memcg to the
+ *    parent memcg as we did for traditional LRU folios.
+ * 2. The generation information is stored in folio->flags, but we cannot
+ *    traverse these folios while holding the lru lock, otherwise it may
+ *    cause softlockup.
+ * 3. In walk_update_folio(), the gen of folio and corresponding lru size
+ *    may be updated, but the folio is not immediately moved to the
+ *    corresponding lru list. Therefore, there may be folios of different
+ *    generations on an LRU list.
+ * 4. In lru_gen_del_folio(), the generation to which the folio belongs is
+ *    found based on the generation information in folio->flags, and the
+ *    corresponding LRU size will be updated. Therefore, we need to update
+ *    the lru size correctly during reparenting, otherwise the lru size may
+ *    be updated incorrectly in lru_gen_del_folio().
+ *
+ * Finally, we choose a compromise method, which is to splice the lru list in
+ * the child memcg to the lru list of the same generation in the parent memcg
+ * during reparenting.
+ *
+ * The same generation has different meanings in the parent and child memcg,
+ * so this compromise method will cause the LRU inversion problem. But as the
+ * system runs, this problem will be fixed automatically.
+ */
+static void __lru_gen_reparent_memcg(struct lruvec *src_lruvec, struct lruvec *dst_lruvec,
+				     int zone, int type)
+{
+	struct lru_gen_folio *src_lrugen, *dst_lrugen;
+	enum lru_list lru = type * LRU_INACTIVE_FILE;
+	int i;
+
+	src_lrugen = &src_lruvec->lrugen;
+	dst_lrugen = &dst_lruvec->lrugen;
+
+	for (i = 0; i < get_nr_gens(src_lruvec, type); i++) {
+		int gen = lru_gen_from_seq(src_lrugen->max_seq - i);
+		long nr_pages = src_lrugen->nr_pages[gen][type][zone];
+		int src_lru_active = lru_gen_is_active(src_lruvec, gen) ? LRU_ACTIVE : 0;
+		int dst_lru_active = lru_gen_is_active(dst_lruvec, gen) ? LRU_ACTIVE : 0;
+
+		list_splice_tail_init(&src_lrugen->folios[gen][type][zone],
+				      &dst_lrugen->folios[gen][type][zone]);
+
+		WRITE_ONCE(src_lrugen->nr_pages[gen][type][zone], 0);
+		WRITE_ONCE(dst_lrugen->nr_pages[gen][type][zone],
+			   dst_lrugen->nr_pages[gen][type][zone] + nr_pages);
+
+		__update_lru_size(src_lruvec, lru + src_lru_active, zone, -nr_pages);
+		__update_lru_size(dst_lruvec, lru + dst_lru_active, zone, nr_pages);
+	}
+}
+
+void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+	int nid;
+
+	for_each_node(nid) {
+		struct lruvec *src_lruvec, *dst_lruvec;
+		int type, zone;
+
+		src_lruvec = get_lruvec(src, nid);
+		dst_lruvec = get_lruvec(dst, nid);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++)
+			for (type = 0; type < ANON_AND_FILE; type++)
+				__lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
+	}
+}
+
 #endif /* CONFIG_MEMCG */
 
 /******************************************************************************
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 26/28] mm: memcontrol: refactor memcg_reparent_objcgs()
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (24 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 25/28] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18 13:45   ` Johannes Weiner
  2025-12-17  7:27 ` [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
                   ` (3 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Refactor the memcg_reparent_objcgs() to facilitate subsequent reparenting
LRU folios here.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memcontrol.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 930dacd6ce31a..3daa99a0c65fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -206,24 +206,41 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
 	return objcg;
 }
 
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
+static inline void __memcg_reparent_objcgs(struct mem_cgroup *src,
+				    struct mem_cgroup *dst)
 {
 	struct obj_cgroup *objcg, *iter;
-	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-
-	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
-
-	spin_lock_irq(&objcg_lock);
 
+	objcg = rcu_replace_pointer(src->objcg, NULL, true);
 	/* 1) Ready to reparent active objcg. */
-	list_add(&objcg->list, &memcg->objcg_list);
+	list_add(&objcg->list, &src->objcg_list);
 	/* 2) Reparent active objcg and already reparented objcgs to parent. */
-	list_for_each_entry(iter, &memcg->objcg_list, list)
-		WRITE_ONCE(iter->memcg, parent);
+	list_for_each_entry(iter, &src->objcg_list, list)
+		WRITE_ONCE(iter->memcg, dst);
 	/* 3) Move already reparented objcgs to the parent's list */
-	list_splice(&memcg->objcg_list, &parent->objcg_list);
+	list_splice(&src->objcg_list, &dst->objcg_list);
+}
+
+static inline void reparent_locks(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+	spin_lock_irq(&objcg_lock);
+}
 
+static inline void reparent_unlocks(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
 	spin_unlock_irq(&objcg_lock);
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *src)
+{
+	struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
+	struct mem_cgroup *dst = parent_mem_cgroup(src);
+
+	reparent_locks(src, dst);
+
+	__memcg_reparent_objcgs(src, dst);
+
+	reparent_unlocks(src, dst);
 
 	percpu_ref_kill(&objcg->refcnt);
 }
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 26/28] mm: memcontrol: refactor memcg_reparent_objcgs()
  2025-12-17  7:27 ` [PATCH v2 26/28] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
@ 2025-12-18 13:45   ` Johannes Weiner
  2025-12-22  3:56     ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-18 13:45 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Wed, Dec 17, 2025 at 03:27:50PM +0800, Qi Zheng wrote:
> +static void memcg_reparent_objcgs(struct mem_cgroup *src)
> +{
> +	struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
> +	struct mem_cgroup *dst = parent_mem_cgroup(src);
> +
> +	reparent_locks(src, dst);
> +
> +	__memcg_reparent_objcgs(src, dst);

Please have __memcg_reparent_objcgs() return the dead objcg for the
percpu_ref_kill(), instead of doing the deref twice.

And please use @child, @parent (or @memcg, @parent) throughout instead
of @src and @dst.

> +
> +	reparent_unlocks(src, dst);
>  
>  	percpu_ref_kill(&objcg->refcnt);

With that,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 26/28] mm: memcontrol: refactor memcg_reparent_objcgs()
  2025-12-18 13:45   ` Johannes Weiner
@ 2025-12-22  3:56     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-22  3:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng



On 12/18/25 9:45 PM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:50PM +0800, Qi Zheng wrote:
>> +static void memcg_reparent_objcgs(struct mem_cgroup *src)
>> +{
>> +	struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
>> +	struct mem_cgroup *dst = parent_mem_cgroup(src);
>> +
>> +	reparent_locks(src, dst);
>> +
>> +	__memcg_reparent_objcgs(src, dst);
> 
> Please have __memcg_reparent_objcgs() return the dead objcg for the
> percpu_ref_kill(), instead of doing the deref twice.

OK, will do.

> 
> And please use @child, @parent (or @memcg, @parent) throughout instead
> of @src and @dst.

OK.

> 
>> +
>> +	reparent_unlocks(src, dst);
>>   
>>   	percpu_ref_kill(&objcg->refcnt);
> 
> With that,
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!




^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (25 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 26/28] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18 14:06   ` Johannes Weiner
  2025-12-17  7:27 ` [PATCH v2 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

Pagecache pages are charged at allocation time and hold a reference
to the original memory cgroup until reclaimed. Depending on memory
pressure, page sharing patterns between different cgroups and cgroup
creation/destruction rates, many dying memory cgroups can be pinned
by pagecache pages, reducing page reclaim efficiency and wasting
memory. Converting LRU folios and most other raw memory cgroup pins
to the object cgroup direction can fix this long-living problem.

Finally, folio->memcg_data of LRU folios and kmem folios will always
point to an object cgroup pointer. The folio->memcg_data of slab
folios will point to an vector of object cgroups.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/memcontrol.h |  77 +++++----------
 mm/memcontrol-v1.c         |  15 +--
 mm/memcontrol.c            | 189 +++++++++++++++++++++++--------------
 3 files changed, 150 insertions(+), 131 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 85265b28c5d18..9be52ce72f2c5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -369,9 +369,6 @@ enum objext_flags {
 #define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
 
 #ifdef CONFIG_MEMCG
-
-static inline bool folio_memcg_kmem(struct folio *folio);
-
 /*
  * After the initialization objcg->memcg is always pointing at
  * a valid memcg, but can be atomically swapped to the parent memcg.
@@ -385,43 +382,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
 }
 
 /*
- * __folio_memcg - Get the memory cgroup associated with a non-kmem folio
- * @folio: Pointer to the folio.
- *
- * Returns a pointer to the memory cgroup associated with the folio,
- * or NULL. This function assumes that the folio is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * kmem folios.
- */
-static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
-{
-	unsigned long memcg_data = folio->memcg_data;
-
-	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
-	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
-	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
-
-	return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
-}
-
-/*
- * __folio_objcg - get the object cgroup associated with a kmem folio.
+ * folio_objcg - get the object cgroup associated with a folio.
  * @folio: Pointer to the folio.
  *
  * Returns a pointer to the object cgroup associated with the folio,
  * or NULL. This function assumes that the folio is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * LRU folios.
+ * proper object cgroup pointer.
  */
-static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
+static inline struct obj_cgroup *folio_objcg(struct folio *folio)
 {
 	unsigned long memcg_data = folio->memcg_data;
 
 	VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
 	VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
-	VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
 
 	return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
 }
@@ -435,21 +408,30 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
  * proper memory cgroup pointer. It's not safe to call this function
  * against some type of folios, e.g. slab folios or ex-slab folios.
  *
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * For a folio any of the following ensures folio and objcg binding stability:
  *
  * - the folio lock
  * - LRU isolation
  * - exclusive reference
  *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * Based on the stable binding of folio and objcg, for a folio any of the
+ * following ensures folio and memcg binding stability:
+ *
+ * - cgroup_mutex
+ * - the lruvec lock
+ *
+ * If the caller only want to ensure that the page counters of memcg are
+ * updated correctly, ensure that the binding stability of folio and objcg
+ * is sufficient.
+ *
+ * Note: The caller should hold an rcu read lock or cgroup_mutex to protect
+ * memcg associated with a folio from being released.
  */
 static inline struct mem_cgroup *folio_memcg(struct folio *folio)
 {
-	if (folio_memcg_kmem(folio))
-		return obj_cgroup_memcg(__folio_objcg(folio));
-	return __folio_memcg(folio);
+	struct obj_cgroup *objcg = folio_objcg(folio);
+
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 /*
@@ -473,15 +455,10 @@ static inline bool folio_memcg_charged(struct folio *folio)
  * has an associated memory cgroup pointer or an object cgroups vector or
  * an object cgroup.
  *
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * The page and objcg or memcg binding rules can refer to folio_memcg().
  *
- * - the folio lock
- * - LRU isolation
- * - exclusive reference
- *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * A caller should hold an rcu read lock to protect memcg associated with a
+ * page from being released.
  */
 static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
 {
@@ -490,18 +467,14 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
 	 * for slabs, READ_ONCE() should be used here.
 	 */
 	unsigned long memcg_data = READ_ONCE(folio->memcg_data);
+	struct obj_cgroup *objcg;
 
 	if (memcg_data & MEMCG_DATA_OBJEXTS)
 		return NULL;
 
-	if (memcg_data & MEMCG_DATA_KMEM) {
-		struct obj_cgroup *objcg;
-
-		objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
-		return obj_cgroup_memcg(objcg);
-	}
+	objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
 
-	return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
+	return objcg ? obj_cgroup_memcg(objcg) : NULL;
 }
 
 static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff7426..23c07df2063c8 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -591,6 +591,7 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
 void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
+	struct obj_cgroup *objcg;
 	unsigned int nr_entries;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
@@ -602,12 +603,13 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 	if (!do_memsw_account())
 		return;
 
-	memcg = folio_memcg(folio);
-
-	VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
-	if (!memcg)
+	objcg = folio_objcg(folio);
+	VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+	if (!objcg)
 		return;
 
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
 	/*
 	 * In case the memcg owning these pages has been offlined and doesn't
 	 * have an ID allocated to it anymore, charge the closest online
@@ -625,7 +627,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
 
-	if (!mem_cgroup_is_root(memcg))
+	if (!obj_cgroup_is_root(objcg))
 		page_counter_uncharge(&memcg->memory, nr_entries);
 
 	if (memcg != swap_memcg) {
@@ -646,7 +648,8 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
 	preempt_enable_nested();
 	memcg1_check_events(memcg, folio_nid(folio));
 
-	css_put(&memcg->css);
+	rcu_read_unlock();
+	obj_cgroup_put(objcg);
 }
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3daa99a0c65fe..cd2f0f0c0f5ce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -223,22 +223,55 @@ static inline void __memcg_reparent_objcgs(struct mem_cgroup *src,
 
 static inline void reparent_locks(struct mem_cgroup *src, struct mem_cgroup *dst)
 {
+	int nid, nest = 0;
+
 	spin_lock_irq(&objcg_lock);
+	for_each_node(nid) {
+		spin_lock_nested(&mem_cgroup_lruvec(src,
+				 NODE_DATA(nid))->lru_lock, nest++);
+		spin_lock_nested(&mem_cgroup_lruvec(dst,
+				 NODE_DATA(nid))->lru_lock, nest++);
+	}
 }
 
 static inline void reparent_unlocks(struct mem_cgroup *src, struct mem_cgroup *dst)
 {
+	int nid;
+
+	for_each_node(nid) {
+		spin_unlock(&mem_cgroup_lruvec(dst, NODE_DATA(nid))->lru_lock);
+		spin_unlock(&mem_cgroup_lruvec(src, NODE_DATA(nid))->lru_lock);
+	}
 	spin_unlock_irq(&objcg_lock);
 }
 
+static void memcg_reparent_lru_folios(struct mem_cgroup *src,
+				      struct mem_cgroup *dst)
+{
+	if (lru_gen_enabled())
+		lru_gen_reparent_memcg(src, dst);
+	else
+		lru_reparent_memcg(src, dst);
+}
+
 static void memcg_reparent_objcgs(struct mem_cgroup *src)
 {
 	struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
 	struct mem_cgroup *dst = parent_mem_cgroup(src);
 
+retry:
+	if (lru_gen_enabled())
+		max_lru_gen_memcg(dst);
+
 	reparent_locks(src, dst);
+	if (lru_gen_enabled() && !recheck_lru_gen_max_memcg(dst)) {
+		reparent_unlocks(src, dst);
+		cond_resched();
+		goto retry;
+	}
 
 	__memcg_reparent_objcgs(src, dst);
+	memcg_reparent_lru_folios(src, dst);
 
 	reparent_unlocks(src, dst);
 
@@ -989,6 +1022,8 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
 /**
  * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
  * @folio: folio from which memcg should be extracted.
+ *
+ * The folio and objcg or memcg binding rules can refer to folio_memcg().
  */
 struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
 {
@@ -2557,17 +2592,17 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	return try_charge_memcg(memcg, gfp_mask, nr_pages);
 }
 
-static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
+static void commit_charge(struct folio *folio, struct obj_cgroup *objcg)
 {
 	VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio);
 	/*
-	 * Any of the following ensures page's memcg stability:
+	 * Any of the following ensures folio's objcg stability:
 	 *
 	 * - the page lock
 	 * - LRU isolation
 	 * - exclusive reference
 	 */
-	folio->memcg_data = (unsigned long)memcg;
+	folio->memcg_data = (unsigned long)objcg;
 }
 
 #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
@@ -2671,6 +2706,17 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 	return NULL;
 }
 
+static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg;
+
+	rcu_read_lock();
+	objcg = __get_obj_cgroup_from_memcg(memcg);
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static struct obj_cgroup *current_objcg_update(void)
 {
 	struct mem_cgroup *memcg;
@@ -2771,17 +2817,10 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	struct obj_cgroup *objcg;
 
-	if (!memcg_kmem_online())
-		return NULL;
-
-	if (folio_memcg_kmem(folio)) {
-		objcg = __folio_objcg(folio);
+	objcg = folio_objcg(folio);
+	if (objcg)
 		obj_cgroup_get(objcg);
-	} else {
-		rcu_read_lock();
-		objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
-		rcu_read_unlock();
-	}
+
 	return objcg;
 }
 
@@ -3288,7 +3327,7 @@ void folio_split_memcg_refs(struct folio *folio, unsigned old_order,
 		return;
 
 	new_refs = (1 << (old_order - new_order)) - 1;
-	css_get_many(&__folio_memcg(folio)->css, new_refs);
+	obj_cgroup_get_many(folio_objcg(folio), new_refs);
 }
 
 unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
@@ -4737,16 +4776,20 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 			gfp_t gfp)
 {
-	int ret;
-
-	ret = try_charge(memcg, gfp, folio_nr_pages(folio));
-	if (ret)
-		goto out;
+	int ret = 0;
+	struct obj_cgroup *objcg;
 
-	css_get(&memcg->css);
-	commit_charge(folio, memcg);
+	objcg = get_obj_cgroup_from_memcg(memcg);
+	/* Do not account at the root objcg level. */
+	if (!obj_cgroup_is_root(objcg))
+		ret = try_charge(memcg, gfp, folio_nr_pages(folio));
+	if (ret) {
+		obj_cgroup_put(objcg);
+		return ret;
+	}
+	commit_charge(folio, objcg);
 	memcg1_commit_charge(folio, memcg);
-out:
+
 	return ret;
 }
 
@@ -4832,7 +4875,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 }
 
 struct uncharge_gather {
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	unsigned long nr_memory;
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
@@ -4846,58 +4889,52 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 
 static void uncharge_batch(const struct uncharge_gather *ug)
 {
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(ug->objcg);
 	if (ug->nr_memory) {
-		memcg_uncharge(ug->memcg, ug->nr_memory);
+		memcg_uncharge(memcg, ug->nr_memory);
 		if (ug->nr_kmem) {
-			mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem);
-			memcg1_account_kmem(ug->memcg, -ug->nr_kmem);
+			mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem);
+			memcg1_account_kmem(memcg, -ug->nr_kmem);
 		}
-		memcg1_oom_recover(ug->memcg);
+		memcg1_oom_recover(memcg);
 	}
 
-	memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+	memcg1_uncharge_batch(memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+	rcu_read_unlock();
 
 	/* drop reference from uncharge_folio */
-	css_put(&ug->memcg->css);
+	obj_cgroup_put(ug->objcg);
 }
 
 static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 {
 	long nr_pages;
-	struct mem_cgroup *memcg;
 	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
-	 * folio memcg or objcg at this point, we have fully
-	 * exclusive access to the folio.
+	 * folio objcg at this point, we have fully exclusive
+	 * access to the folio.
 	 */
-	if (folio_memcg_kmem(folio)) {
-		objcg = __folio_objcg(folio);
-		/*
-		 * This get matches the put at the end of the function and
-		 * kmem pages do not hold memcg references anymore.
-		 */
-		memcg = get_mem_cgroup_from_objcg(objcg);
-	} else {
-		memcg = __folio_memcg(folio);
-	}
-
-	if (!memcg)
+	objcg = folio_objcg(folio);
+	if (!objcg)
 		return;
 
-	if (ug->memcg != memcg) {
-		if (ug->memcg) {
+	if (ug->objcg != objcg) {
+		if (ug->objcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
 		}
-		ug->memcg = memcg;
+		ug->objcg = objcg;
 		ug->nid = folio_nid(folio);
 
-		/* pairs with css_put in uncharge_batch */
-		css_get(&memcg->css);
+		/* pairs with obj_cgroup_put in uncharge_batch */
+		obj_cgroup_get(objcg);
 	}
 
 	nr_pages = folio_nr_pages(folio);
@@ -4905,20 +4942,17 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	if (folio_memcg_kmem(folio)) {
 		ug->nr_memory += nr_pages;
 		ug->nr_kmem += nr_pages;
-
-		folio->memcg_data = 0;
-		obj_cgroup_put(objcg);
 	} else {
 		/* LRU pages aren't accounted at the root level */
-		if (!mem_cgroup_is_root(memcg))
+		if (!obj_cgroup_is_root(objcg))
 			ug->nr_memory += nr_pages;
 		ug->pgpgout++;
 
 		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
-		folio->memcg_data = 0;
 	}
 
-	css_put(&memcg->css);
+	folio->memcg_data = 0;
+	obj_cgroup_put(objcg);
 }
 
 void __mem_cgroup_uncharge(struct folio *folio)
@@ -4942,7 +4976,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
 	uncharge_gather_clear(&ug);
 	for (i = 0; i < folios->nr; i++)
 		uncharge_folio(folios->folios[i], &ug);
-	if (ug.memcg)
+	if (ug.objcg)
 		uncharge_batch(&ug);
 }
 
@@ -4959,6 +4993,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
 void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
 {
 	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	long nr_pages = folio_nr_pages(new);
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
@@ -4973,21 +5008,24 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
 	if (folio_memcg_charged(new))
 		return;
 
-	memcg = folio_memcg(old);
-	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
-	if (!memcg)
+	objcg = folio_objcg(old);
+	VM_WARN_ON_ONCE_FOLIO(!objcg, old);
+	if (!objcg)
 		return;
 
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
 	/* Force-charge the new page. The old one will be freed soon */
-	if (!mem_cgroup_is_root(memcg)) {
+	if (!obj_cgroup_is_root(objcg)) {
 		page_counter_charge(&memcg->memory, nr_pages);
 		if (do_memsw_account())
 			page_counter_charge(&memcg->memsw, nr_pages);
 	}
 
-	css_get(&memcg->css);
-	commit_charge(new, memcg);
+	obj_cgroup_get(objcg);
+	commit_charge(new, objcg);
 	memcg1_commit_charge(new, memcg);
+	rcu_read_unlock();
 }
 
 /**
@@ -5003,7 +5041,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
  */
 void mem_cgroup_migrate(struct folio *old, struct folio *new)
 {
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
 	VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
@@ -5014,18 +5052,18 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	if (mem_cgroup_disabled())
 		return;
 
-	memcg = folio_memcg(old);
+	objcg = folio_objcg(old);
 	/*
-	 * Note that it is normal to see !memcg for a hugetlb folio.
+	 * Note that it is normal to see !objcg for a hugetlb folio.
 	 * For e.g, itt could have been allocated when memory_hugetlb_accounting
 	 * was not selected.
 	 */
-	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
-	if (!memcg)
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !objcg, old);
+	if (!objcg)
 		return;
 
-	/* Transfer the charge and the css ref */
-	commit_charge(new, memcg);
+	/* Transfer the charge and the objcg ref */
+	commit_charge(new, objcg);
 
 	/* Warning should never happen, so don't worry about refcount non-0 */
 	WARN_ON_ONCE(folio_unqueue_deferred_split(old));
@@ -5200,22 +5238,27 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 	unsigned int nr_pages = folio_nr_pages(folio);
 	struct page_counter *counter;
 	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 
 	if (do_memsw_account())
 		return 0;
 
-	memcg = folio_memcg(folio);
-
-	VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
-	if (!memcg)
+	objcg = folio_objcg(folio);
+	VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+	if (!objcg)
 		return 0;
 
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
 	if (!entry.val) {
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
+		rcu_read_unlock();
 		return 0;
 	}
 
 	memcg = mem_cgroup_id_get_online(memcg);
+	/* memcg is pined by memcg ID. */
+	rcu_read_unlock();
 
 	if (!mem_cgroup_is_root(memcg) &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
  2025-12-17  7:27 ` [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
@ 2025-12-18 14:06   ` Johannes Weiner
  2025-12-22  3:59     ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: Johannes Weiner @ 2025-12-18 14:06 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:51PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> Pagecache pages are charged at allocation time and hold a reference
> to the original memory cgroup until reclaimed. Depending on memory
> pressure, page sharing patterns between different cgroups and cgroup
> creation/destruction rates, many dying memory cgroups can be pinned
> by pagecache pages, reducing page reclaim efficiency and wasting
> memory. Converting LRU folios and most other raw memory cgroup pins
> to the object cgroup direction can fix this long-living problem.

This is already in the coverletter. Please describe here what the
patch itself does. IOW, now that everything is set up, switch
folio->memcg_data pointers to objcgs, update the accessors, and
execute reparenting on cgroup death.

> Finally, folio->memcg_data of LRU folios and kmem folios will always
> point to an object cgroup pointer. The folio->memcg_data of slab
> folios will point to an vector of object cgroups.

> @@ -223,22 +223,55 @@ static inline void __memcg_reparent_objcgs(struct mem_cgroup *src,
>  
>  static inline void reparent_locks(struct mem_cgroup *src, struct mem_cgroup *dst)
>  {
> +	int nid, nest = 0;
> +
>  	spin_lock_irq(&objcg_lock);
> +	for_each_node(nid) {
> +		spin_lock_nested(&mem_cgroup_lruvec(src,
> +				 NODE_DATA(nid))->lru_lock, nest++);
> +		spin_lock_nested(&mem_cgroup_lruvec(dst,
> +				 NODE_DATA(nid))->lru_lock, nest++);
> +	}
>  }

Looks okay to me. If this should turn out to be a scalability problem
in practice, we can make objcgs per-node, and then reparent lru/objcg
pairs on a per-node basis without nesting locks.

>  static inline void reparent_unlocks(struct mem_cgroup *src, struct mem_cgroup *dst)
>  {
> +	int nid;
> +
> +	for_each_node(nid) {
> +		spin_unlock(&mem_cgroup_lruvec(dst, NODE_DATA(nid))->lru_lock);
> +		spin_unlock(&mem_cgroup_lruvec(src, NODE_DATA(nid))->lru_lock);
> +	}
>  	spin_unlock_irq(&objcg_lock);
>  }
>  
> +static void memcg_reparent_lru_folios(struct mem_cgroup *src,
> +				      struct mem_cgroup *dst)
> +{
> +	if (lru_gen_enabled())
> +		lru_gen_reparent_memcg(src, dst);
> +	else
> +		lru_reparent_memcg(src, dst);
> +}
> +
>  static void memcg_reparent_objcgs(struct mem_cgroup *src)
>  {
>  	struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
>  	struct mem_cgroup *dst = parent_mem_cgroup(src);
>  
> +retry:
> +	if (lru_gen_enabled())
> +		max_lru_gen_memcg(dst);
> +
>  	reparent_locks(src, dst);
> +	if (lru_gen_enabled() && !recheck_lru_gen_max_memcg(dst)) {
> +		reparent_unlocks(src, dst);
> +		cond_resched();
> +		goto retry;
> +	}
>  
>  	__memcg_reparent_objcgs(src, dst);
> +	memcg_reparent_lru_folios(src, dst);

Please inline memcg_reparent_lru_folios() here, to keep the lru vs
lrugen switching as "flat" as possible:

	if (lru_gen_enabled()) {
		if (!recheck_lru_gen_max_memcgs(parent)) {
			reparent_unlocks(memcg, parent);
			cond_resched();
			goto retry;
		}
		lru_gen_reparent_memcg(memcg, parent);
	} else {
		lru_reparent_memcg(memcg, parent);
	}

> @@ -989,6 +1022,8 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>  /**
>   * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
>   * @folio: folio from which memcg should be extracted.
> + *
> + * The folio and objcg or memcg binding rules can refer to folio_memcg().

      See folio_memcg() for folio->objcg/memcg binding rules.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
  2025-12-18 14:06   ` Johannes Weiner
@ 2025-12-22  3:59     ` Qi Zheng
  0 siblings, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-22  3:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng



On 12/18/25 10:06 PM, Johannes Weiner wrote:
> On Wed, Dec 17, 2025 at 03:27:51PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> Pagecache pages are charged at allocation time and hold a reference
>> to the original memory cgroup until reclaimed. Depending on memory
>> pressure, page sharing patterns between different cgroups and cgroup
>> creation/destruction rates, many dying memory cgroups can be pinned
>> by pagecache pages, reducing page reclaim efficiency and wasting
>> memory. Converting LRU folios and most other raw memory cgroup pins
>> to the object cgroup direction can fix this long-living problem.
> 
> This is already in the coverletter. Please describe here what the
> patch itself does. IOW, now that everything is set up, switch
> folio->memcg_data pointers to objcgs, update the accessors, and
> execute reparenting on cgroup death.

Got it, will do.

> 
>> Finally, folio->memcg_data of LRU folios and kmem folios will always
>> point to an object cgroup pointer. The folio->memcg_data of slab
>> folios will point to an vector of object cgroups.
> 
>> @@ -223,22 +223,55 @@ static inline void __memcg_reparent_objcgs(struct mem_cgroup *src,
>>   
>>   static inline void reparent_locks(struct mem_cgroup *src, struct mem_cgroup *dst)
>>   {
>> +	int nid, nest = 0;
>> +
>>   	spin_lock_irq(&objcg_lock);
>> +	for_each_node(nid) {
>> +		spin_lock_nested(&mem_cgroup_lruvec(src,
>> +				 NODE_DATA(nid))->lru_lock, nest++);
>> +		spin_lock_nested(&mem_cgroup_lruvec(dst,
>> +				 NODE_DATA(nid))->lru_lock, nest++);
>> +	}
>>   }
> 
> Looks okay to me. If this should turn out to be a scalability problem
> in practice, we can make objcgs per-node, and then reparent lru/objcg
> pairs on a per-node basis without nesting locks.
> 
>>   static inline void reparent_unlocks(struct mem_cgroup *src, struct mem_cgroup *dst)
>>   {
>> +	int nid;
>> +
>> +	for_each_node(nid) {
>> +		spin_unlock(&mem_cgroup_lruvec(dst, NODE_DATA(nid))->lru_lock);
>> +		spin_unlock(&mem_cgroup_lruvec(src, NODE_DATA(nid))->lru_lock);
>> +	}
>>   	spin_unlock_irq(&objcg_lock);
>>   }
>>   
>> +static void memcg_reparent_lru_folios(struct mem_cgroup *src,
>> +				      struct mem_cgroup *dst)
>> +{
>> +	if (lru_gen_enabled())
>> +		lru_gen_reparent_memcg(src, dst);
>> +	else
>> +		lru_reparent_memcg(src, dst);
>> +}
>> +
>>   static void memcg_reparent_objcgs(struct mem_cgroup *src)
>>   {
>>   	struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
>>   	struct mem_cgroup *dst = parent_mem_cgroup(src);
>>   
>> +retry:
>> +	if (lru_gen_enabled())
>> +		max_lru_gen_memcg(dst);
>> +
>>   	reparent_locks(src, dst);
>> +	if (lru_gen_enabled() && !recheck_lru_gen_max_memcg(dst)) {
>> +		reparent_unlocks(src, dst);
>> +		cond_resched();
>> +		goto retry;
>> +	}
>>   
>>   	__memcg_reparent_objcgs(src, dst);
>> +	memcg_reparent_lru_folios(src, dst);
> 
> Please inline memcg_reparent_lru_folios() here, to keep the lru vs
> lrugen switching as "flat" as possible:
> 
> 	if (lru_gen_enabled()) {
> 		if (!recheck_lru_gen_max_memcgs(parent)) {
> 			reparent_unlocks(memcg, parent);
> 			cond_resched();
> 			goto retry;
> 		}
> 		lru_gen_reparent_memcg(memcg, parent);
> 	} else {
> 		lru_reparent_memcg(memcg, parent);
> 	}

Looks better, will change to this style.

> 
>> @@ -989,6 +1022,8 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>>   /**
>>    * get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
>>    * @folio: folio from which memcg should be extracted.
>> + *
>> + * The folio and objcg or memcg binding rules can refer to folio_memcg().
> 
>        See folio_memcg() for folio->objcg/memcg binding rules.

OK, will do.




^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (26 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
@ 2025-12-17  7:27 ` Qi Zheng
  2025-12-18 14:07   ` Johannes Weiner
  2025-12-23 20:04 ` [PATCH v2 00/28] Eliminate Dying Memory Cgroup Yosry Ahmed
  2025-12-30  1:36 ` Roman Gushchin
  29 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-17  7:27 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

We must ensure the folio is deleted from or added to the correct lruvec
list. So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users. The
VM_BUG_ON_PAGE() in move_pages_to_lru() can be removed as
add_page_to_lru_list() will perform the necessary check.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm_inline.h | 6 ++++++
 mm/vmscan.c               | 1 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fa2d6ba811b53..ad50688d89dba 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -342,6 +342,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
 	if (lru_gen_add_folio(lruvec, folio, false))
 		return;
 
@@ -356,6 +358,8 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
 	if (lru_gen_add_folio(lruvec, folio, true))
 		return;
 
@@ -370,6 +374,8 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
 {
 	enum lru_list lru = folio_lru_list(folio);
 
+	VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
 	if (lru_gen_del_folio(lruvec, folio, false))
 		return;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64a85eea26dc6..2dc3ae432a017 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1934,7 +1934,6 @@ static unsigned int move_folios_to_lru(struct list_head *list)
 			continue;
 		}
 
-		VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
 		lruvec_add_folio(lruvec, folio);
 		nr_pages = folio_nr_pages(folio);
 		nr_moved += nr_pages;
-- 
2.20.1



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
  2025-12-17  7:27 ` [PATCH v2 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
@ 2025-12-18 14:07   ` Johannes Weiner
  0 siblings, 0 replies; 149+ messages in thread
From: Johannes Weiner @ 2025-12-18 14:07 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Muchun Song, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:52PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> We must ensure the folio is deleted from or added to the correct lruvec
> list. So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users. The
> VM_BUG_ON_PAGE() in move_pages_to_lru() can be removed as
> add_page_to_lru_list() will perform the necessary check.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (27 preceding siblings ...)
  2025-12-17  7:27 ` [PATCH v2 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
@ 2025-12-23 20:04 ` Yosry Ahmed
  2025-12-23 23:20   ` Shakeel Butt
  2025-12-24  8:43   ` Harry Yoo
  2025-12-30  1:36 ` Roman Gushchin
  29 siblings, 2 replies; 149+ messages in thread
From: Yosry Ahmed @ 2025-12-23 20:04 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Wed, Dec 17, 2025 at 03:27:24PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Changes in v2:
>  - add [PATCH v2 04/28] and remove local_irq_disable() in evict_folios()
>    (pointed by Harry Yoo)
>  - recheck objcg in [PATCH v2 07/28] (pointed by Harry Yoo)
>  - modify the commit message in [PATCH v2 12/28] and [PATCH v2 21/28]
>    (pointed by Harry Yoo)
>  - use rcu lock to protect mm_state in [PATCH v2 14/28] (pointed by Harry Yoo)
>  - fix bad unlock balance warning in [PATCH v2 23/28]
>  - change nr_pages type to long in [PATCH v2 25/28] (pointed by Harry Yoo)
>  - incease mm_state->seq during reparenting to make mm walker work properly in
>    [PATCH v2 25/28] (pointed by Harry Yoo)
>  - add [PATCH v2 18/28] to fix WARNING in folio_memcg() (pointed by Harry Yoo)
>  - collect Reviewed-bys
>  - rebase onto the next-20251216
> 
> Changes in v1:
>  - drop [PATCH RFC 02/28]
>  - drop THP split queue related part, which has been merged as a separate
>    patchset[2]
>  - prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
>    [PATCH v1 16/26]
>  - Separate the reparenting function of traditional LRU folios to [PATCH v1 22/26]
>  - adapted to the MGLRU scenarios in [PATCH v1 23/26]
>  - refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
>  - collect Acked-bys and Reviewed-bys
>  - rebase onto the next-20251028
> 
> Hi all,
> 
> Introduction
> ============
> 
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already been
> reached regarding this approach recently [1].
> 
> Background
> ==========
> 
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the issue of dying memory
> cgroup. We have exerted greater efforts to tackle this problem by
> introducing the infrastructure of object cgroup [2].
> 
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continues to hold a reference to the original memory cgroup
> until reclaimed.
> 
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency
> of page reclamation.
> 
> Fundamentals
> ============
> 
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following four steps are taken to achieve these goals.
> 
> 1. The first step  to be taken is to identify all users of both functions
>    (folio_memcg() and folio_lruvec()) who are not concerned about binding
>    stability and implement appropriate measures (such as holding a RCU read
>    lock or temporarily obtaining a reference to the memory cgroup for a
>    brief period) to prevent the release of the memory cgroup.
> 
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>    how to ensure the binding stability from the user's perspective of
>    folio_lruvec().
> 
>    struct lruvec *folio_lruvec_lock(struct folio *folio)
>    {
>            struct lruvec *lruvec;
> 
>            rcu_read_lock();
>    retry:
>            lruvec = folio_lruvec(folio);
>            spin_lock(&lruvec->lru_lock);
>            if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>                    spin_unlock(&lruvec->lru_lock);
>                    goto retry;
>            }
> 
>            return lruvec;
>    }
> 
>    From the perspective of memory cgroup removal, the entire reparenting
>    process (altering the binding relationship between folio and its memory
>    cgroup and moving the LRU lists to its parental memory cgroup) should be
>    carried out under both the lruvec lock of the memory cgroup being removed
>    and the lruvec lock of its parent.
> 
> 3. Finally, transfer the LRU pages to the object cgroup without holding a
>    reference to the original memory cgroup.

I think there might be a problem with non-hierarchical stats on cgroup
v1, I brought it up previously [*]. I am not sure if this was addressed
but I couldn't immediately find anything.

In short, if memory is charged to a dying cgroup at the time of
reparenting, when the memory gets uncharged the stats updates will occur
at the parent. This will update both hierarchical and non-hierarchical
stats of the parent, which would corrupt the parent's non-hierarchical
stats (because those counters were never incremented when the memory was
charged).

I didn't track down which stats are affected by this, but off the top of
my head I think all stats tracking anon, file, etc.

The obvious solution is to flush and reparent the stats of a dying memcg
during reparenting, but I don't think this entirely fixes the problem
because the dying memcg stats can still be updated after its reparenting
(e.g. if a ref to the memcg has been held since before reparenting).

AFAICT, the stats of the dying memcg are only stable at release time,
but reparenting the stats at that point means that we have a potentially
large window (between reparenting and release) where the parent
non-hierarchical stats will be wrong and could even underflow.

[*]https://lore.kernel.org/all/CAJD7tkazvC+kZgGaV3idapQp-zPFaWBxoHwnrqTFoodHZGQcPA@mail.gmail.com/

> 
> Effect
> ======
> 
> Finally, it can be observed that the quantity of dying memory cgroups will
> not experience a significant increase if the following test script is
> executed to reproduce the issue.
> 
> ```bash
> #!/bin/bash
> 
> # Create a temporary file 'temp' filled with zero bytes
> dd if=/dev/zero of=temp bs=4096 count=1
> 
> # Display memory-cgroup info from /proc/cgroups
> cat /proc/cgroups | grep memory
> 
> for i in {0..2000}
> do
>     mkdir /sys/fs/cgroup/memory/test$i
>     echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
> 
>     # Append 'temp' file content to 'log'
>     cat temp >> log
> 
>     echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> 
>     # Potentially create a dying memory cgroup
>     rmdir /sys/fs/cgroup/memory/test$i
> done
> 
> # Display memory-cgroup info after test
> cat /proc/cgroups | grep memory
> 
> rm -f temp log
> ```
> 
> Comments and suggestions are welcome!
> 
> Thanks,
> Qi
> 
> [1].https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> [2].https://lwn.net/Articles/895431/
> [3].https://github.com/systemd/systemd/pull/36827
> 
> Muchun Song (22):
>   mm: memcontrol: remove dead code of checking parent memory cgroup
>   mm: workingset: use folio_lruvec() in workingset_refault()
>   mm: rename unlock_page_lruvec_irq and its variants
>   mm: vmscan: refactor move_folios_to_lru()
>   mm: memcontrol: allocate object cgroup for non-kmem case
>   mm: memcontrol: return root object cgroup for root memory cgroup
>   mm: memcontrol: prevent memory cgroup release in
>     get_mem_cgroup_from_folio()
>   buffer: prevent memory cgroup release in folio_alloc_buffers()
>   writeback: prevent memory cgroup release in writeback module
>   mm: memcontrol: prevent memory cgroup release in
>     count_memcg_folio_events()
>   mm: page_io: prevent memory cgroup release in page_io module
>   mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
>   mm: mglru: prevent memory cgroup release in mglru
>   mm: memcontrol: prevent memory cgroup release in
>     mem_cgroup_swap_full()
>   mm: workingset: prevent memory cgroup release in lru_gen_eviction()
>   mm: workingset: prevent lruvec release in workingset_refault()
>   mm: zswap: prevent lruvec release in zswap_folio_swapin()
>   mm: swap: prevent lruvec release in lru_gen_clear_refs()
>   mm: workingset: prevent lruvec release in workingset_activation()
>   mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
>   mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
>     folios
>   mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
> 
> Qi Zheng (6):
>   mm: vmscan: prepare for the refactoring the move_folios_to_lru()
>   mm: thp: prevent memory cgroup release in
>     folio_split_queue_lock{_irqsave}()
>   mm: zswap: prevent memory cgroup release in zswap_compress()
>   mm: vmscan: prepare for reparenting traditional LRU folios
>   mm: vmscan: prepare for reparenting MGLRU folios
>   mm: memcontrol: refactor memcg_reparent_objcgs()
> 
>  fs/buffer.c                      |   4 +-
>  fs/fs-writeback.c                |  22 +-
>  include/linux/memcontrol.h       | 159 ++++++------
>  include/linux/mm_inline.h        |   6 +
>  include/linux/mmzone.h           |  20 ++
>  include/trace/events/writeback.h |   3 +
>  mm/compaction.c                  |  43 +++-
>  mm/huge_memory.c                 |  18 +-
>  mm/memcontrol-v1.c               |  15 +-
>  mm/memcontrol.c                  | 405 ++++++++++++++++++-------------
>  mm/migrate.c                     |   2 +
>  mm/mlock.c                       |   2 +-
>  mm/page_io.c                     |   8 +-
>  mm/percpu.c                      |   2 +-
>  mm/shrinker.c                    |   6 +-
>  mm/swap.c                        |  20 +-
>  mm/vmscan.c                      | 267 ++++++++++++++++----
>  mm/workingset.c                  |  26 +-
>  mm/zswap.c                       |   5 +
>  19 files changed, 677 insertions(+), 356 deletions(-)
> 
> -- 
> 2.20.1
> 
> 


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-23 20:04 ` [PATCH v2 00/28] Eliminate Dying Memory Cgroup Yosry Ahmed
@ 2025-12-23 23:20   ` Shakeel Butt
  2025-12-24  0:07     ` Yosry Ahmed
  2025-12-29  7:48     ` Qi Zheng
  2025-12-24  8:43   ` Harry Yoo
  1 sibling, 2 replies; 149+ messages in thread
From: Shakeel Butt @ 2025-12-23 23:20 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
[...]
> 
> I think there might be a problem with non-hierarchical stats on cgroup
> v1, I brought it up previously [*]. I am not sure if this was addressed
> but I couldn't immediately find anything.

Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.

> 
> In short, if memory is charged to a dying cgroup 

Not sure why stats updates for dying cgroup is related. Isn't it simply
stat increase at the child memcg and then stat decrease at the parent
memcg would possibly show negative stat_local of the parent.

> at the time of
> reparenting, when the memory gets uncharged the stats updates will occur
> at the parent. This will update both hierarchical and non-hierarchical
> stats of the parent, which would corrupt the parent's non-hierarchical
> stats (because those counters were never incremented when the memory was
> charged).
> 
> I didn't track down which stats are affected by this, but off the top of
> my head I think all stats tracking anon, file, etc.

Let's start with what specific stats might be effected. First the stats
which are monotonically increasing should be fine, like
WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.

So, the following ones are the interesting ones:

NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.

> 
> The obvious solution is to flush and reparent the stats of a dying memcg
> during reparenting,

Again not sure how flushing will help here and what do you mean by
'reparent the stats'? Do you mean something like:

parent->vmstats->state_local += child->vmstats->state_local;

Hmm this seems fine and I think it should work.

> but I don't think this entirely fixes the problem
> because the dying memcg stats can still be updated after its reparenting
> (e.g. if a ref to the memcg has been held since before reparenting).

How can dying memcg stats can still be updated after reparenting? The
stats which we care about are the anon & file memory and this series is
reparenting them, so dying memcg will not see stats updates unless there
is a concurrent update happening and I think it is very easy to avoid
such situation by putting a grace period between reparenting the
file/anon folios and reparenting dying chils'd stats_local. Am I missing
something?

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-23 23:20   ` Shakeel Butt
@ 2025-12-24  0:07     ` Yosry Ahmed
  2025-12-24  0:36       ` Shakeel Butt
  2025-12-29  7:48     ` Qi Zheng
  1 sibling, 1 reply; 149+ messages in thread
From: Yosry Ahmed @ 2025-12-24  0:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Tue, Dec 23, 2025 at 03:20:47PM -0800, Shakeel Butt wrote:
> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> [...]
> > 
> > I think there might be a problem with non-hierarchical stats on cgroup
> > v1, I brought it up previously [*]. I am not sure if this was addressed
> > but I couldn't immediately find anything.
> 
> Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> 
> > 
> > In short, if memory is charged to a dying cgroup 
> 
> Not sure why stats updates for dying cgroup is related. Isn't it simply
> stat increase at the child memcg and then stat decrease at the parent
> memcg would possibly show negative stat_local of the parent.

Hmm not sure I understand what you mean here. Normally an update to the
child memcg should not update state_local of the parent. So outside the
context of dying cgroups and reparenting I don't see how state_local of
the parent can become negative.

> 
> > at the time of
> > reparenting, when the memory gets uncharged the stats updates will occur
> > at the parent. This will update both hierarchical and non-hierarchical
> > stats of the parent, which would corrupt the parent's non-hierarchical
> > stats (because those counters were never incremented when the memory was
> > charged).
> > 
> > I didn't track down which stats are affected by this, but off the top of
> > my head I think all stats tracking anon, file, etc.
> 
> Let's start with what specific stats might be effected. First the stats
> which are monotonically increasing should be fine, like
> WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.
> 
> So, the following ones are the interesting ones:
> 
> NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
> NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.
> 
> > 
> > The obvious solution is to flush and reparent the stats of a dying memcg
> > during reparenting,
> 
> Again not sure how flushing will help here and what do you mean by
> 'reparent the stats'? Do you mean something like:

Oh I meant we just need to do an rstat flush to aggregate per-CPU
counters before moving the stats from child to parent.

> 
> parent->vmstats->state_local += child->vmstats->state_local;
> 
> Hmm this seems fine and I think it should work.

Something like that, I didn't look too closely if there's anything else
that needs to be reparented.

> 
> > but I don't think this entirely fixes the problem
> > because the dying memcg stats can still be updated after its reparenting
> > (e.g. if a ref to the memcg has been held since before reparenting).
> 
> How can dying memcg stats can still be updated after reparenting? The
> stats which we care about are the anon & file memory and this series is
> reparenting them, so dying memcg will not see stats updates unless there
> is a concurrent update happening and I think it is very easy to avoid
> such situation by putting a grace period between reparenting the
> file/anon folios and reparenting dying chils'd stats_local. Am I missing
> something?

What prevents the code from obtaining a ref to a parent's memcg before
reparenting, and using it to update the stats after reparenting? A grace
period only works if the entire scope of using the memcg is within the
RCU critical section.

For example, __mem_cgroup_try_charge_swap() currently does this when
incrementing MEMCG_SWAP. While this specific example isn't problematic
because the reference won't be dropped until MEMCG_SWAP is decremented
again, the pattern of grabbing a ref to the memcg then updating a stat
could generally cause the problem.

Most stats are updated using lruvec_stat_mod_folio(), which updates the
stats in the same RCU critical section as obtaining the memcg pointer
from the folio, so it can be fixed with a grace period. However, I think
it can be easily missed in the future if other code paths update memcg
stats in a different way. We should try to enforce that stat updates
cannot only happen from the same RCU critical section where the memcg
pointer is acquired.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-24  0:07     ` Yosry Ahmed
@ 2025-12-24  0:36       ` Shakeel Butt
  2025-12-24  0:43         ` Yosry Ahmed
  0 siblings, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-24  0:36 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Wed, Dec 24, 2025 at 12:07:50AM +0000, Yosry Ahmed wrote:
> On Tue, Dec 23, 2025 at 03:20:47PM -0800, Shakeel Butt wrote:
> > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > [...]
> > > 
> > > I think there might be a problem with non-hierarchical stats on cgroup
> > > v1, I brought it up previously [*]. I am not sure if this was addressed
> > > but I couldn't immediately find anything.
> > 
> > Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> > 
> > > 
> > > In short, if memory is charged to a dying cgroup 
> > 
> > Not sure why stats updates for dying cgroup is related. Isn't it simply
> > stat increase at the child memcg and then stat decrease at the parent
> > memcg would possibly show negative stat_local of the parent.
> 
> Hmm not sure I understand what you mean here. Normally an update to the
> child memcg should not update state_local of the parent. So outside the
> context of dying cgroups and reparenting I don't see how state_local of
> the parent can become negative.

We might be talking about same thing. When you said "memory is charged
to a dying cgroup", it does not have to be dying cgroup, it can be alive
child which died later. So to give an example, let's say a process in
the child allocates a file page, NR_FILE_PAGES is increased at the
child and next child has been rmdir'ed, so when that specific file page
is freed, the NR_FILE_PAGES will be decreased at the parent after this
series.

> 
> > 
> > > at the time of
> > > reparenting, when the memory gets uncharged the stats updates will occur
> > > at the parent. This will update both hierarchical and non-hierarchical
> > > stats of the parent, which would corrupt the parent's non-hierarchical
> > > stats (because those counters were never incremented when the memory was
> > > charged).
> > > 
> > > I didn't track down which stats are affected by this, but off the top of
> > > my head I think all stats tracking anon, file, etc.
> > 
> > Let's start with what specific stats might be effected. First the stats
> > which are monotonically increasing should be fine, like
> > WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.
> > 
> > So, the following ones are the interesting ones:
> > 
> > NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
> > NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.
> > 
> > > 
> > > The obvious solution is to flush and reparent the stats of a dying memcg
> > > during reparenting,
> > 
> > Again not sure how flushing will help here and what do you mean by
> > 'reparent the stats'? Do you mean something like:
> 
> Oh I meant we just need to do an rstat flush to aggregate per-CPU
> counters before moving the stats from child to parent.
> 
> > 
> > parent->vmstats->state_local += child->vmstats->state_local;
> > 
> > Hmm this seems fine and I think it should work.
> 
> Something like that, I didn't look too closely if there's anything else
> that needs to be reparented.
> 
> > 
> > > but I don't think this entirely fixes the problem
> > > because the dying memcg stats can still be updated after its reparenting
> > > (e.g. if a ref to the memcg has been held since before reparenting).
> > 
> > How can dying memcg stats can still be updated after reparenting? The
> > stats which we care about are the anon & file memory and this series is
> > reparenting them, so dying memcg will not see stats updates unless there
> > is a concurrent update happening and I think it is very easy to avoid
> > such situation by putting a grace period between reparenting the
> > file/anon folios and reparenting dying chils'd stats_local. Am I missing
> > something?
> 
> What prevents the code from obtaining a ref to a parent's memcg

I think you meant child's memcg here.

> before
> reparenting, and using it to update the stats after reparenting? A grace
> period only works if the entire scope of using the memcg is within the
> RCU critical section.

Yeah this is an issue.

> 
> For example, __mem_cgroup_try_charge_swap() currently does this when
> incrementing MEMCG_SWAP. While this specific example isn't problematic
> because the reference won't be dropped until MEMCG_SWAP is decremented
> again, the pattern of grabbing a ref to the memcg then updating a stat
> could generally cause the problem.
> 
> Most stats are updated using lruvec_stat_mod_folio(), which updates the
> stats in the same RCU critical section as obtaining the memcg pointer
> from the folio, so it can be fixed with a grace period. However, I think
> it can be easily missed in the future if other code paths update memcg
> stats in a different way. We should try to enforce that stat updates
> cannot only happen from the same RCU critical section where the memcg
> pointer is acquired.

The core stats update functions are mod_memcg_state() and
mod_memcg_lruvec_state(). If for v1 only, we add additional check for
CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
this issue?





^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-24  0:36       ` Shakeel Butt
@ 2025-12-24  0:43         ` Yosry Ahmed
  2025-12-24  0:58           ` Shakeel Butt
  0 siblings, 1 reply; 149+ messages in thread
From: Yosry Ahmed @ 2025-12-24  0:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Tue, Dec 23, 2025 at 04:36:18PM -0800, Shakeel Butt wrote:
> On Wed, Dec 24, 2025 at 12:07:50AM +0000, Yosry Ahmed wrote:
> > On Tue, Dec 23, 2025 at 03:20:47PM -0800, Shakeel Butt wrote:
> > > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > > [...]
> > > > 
> > > > I think there might be a problem with non-hierarchical stats on cgroup
> > > > v1, I brought it up previously [*]. I am not sure if this was addressed
> > > > but I couldn't immediately find anything.
> > > 
> > > Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> > > 
> > > > 
> > > > In short, if memory is charged to a dying cgroup 
> > > 
> > > Not sure why stats updates for dying cgroup is related. Isn't it simply
> > > stat increase at the child memcg and then stat decrease at the parent
> > > memcg would possibly show negative stat_local of the parent.
> > 
> > Hmm not sure I understand what you mean here. Normally an update to the
> > child memcg should not update state_local of the parent. So outside the
> > context of dying cgroups and reparenting I don't see how state_local of
> > the parent can become negative.
> 
> We might be talking about same thing. When you said "memory is charged
> to a dying cgroup", it does not have to be dying cgroup, it can be alive
> child which died later. So to give an example, let's say a process in
> the child allocates a file page, NR_FILE_PAGES is increased at the
> child and next child has been rmdir'ed, so when that specific file page
> is freed, the NR_FILE_PAGES will be decreased at the parent after this
> series.

Yes this is exactly what I mean. Specifically, an update happens after
the cgroup becomes "dying".

> 
> > 
> > > 
> > > > at the time of
> > > > reparenting, when the memory gets uncharged the stats updates will occur
> > > > at the parent. This will update both hierarchical and non-hierarchical
> > > > stats of the parent, which would corrupt the parent's non-hierarchical
> > > > stats (because those counters were never incremented when the memory was
> > > > charged).
> > > > 
> > > > I didn't track down which stats are affected by this, but off the top of
> > > > my head I think all stats tracking anon, file, etc.
> > > 
> > > Let's start with what specific stats might be effected. First the stats
> > > which are monotonically increasing should be fine, like
> > > WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.
> > > 
> > > So, the following ones are the interesting ones:
> > > 
> > > NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
> > > NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.
> > > 
> > > > 
> > > > The obvious solution is to flush and reparent the stats of a dying memcg
> > > > during reparenting,
> > > 
> > > Again not sure how flushing will help here and what do you mean by
> > > 'reparent the stats'? Do you mean something like:
> > 
> > Oh I meant we just need to do an rstat flush to aggregate per-CPU
> > counters before moving the stats from child to parent.
> > 
> > > 
> > > parent->vmstats->state_local += child->vmstats->state_local;
> > > 
> > > Hmm this seems fine and I think it should work.
> > 
> > Something like that, I didn't look too closely if there's anything else
> > that needs to be reparented.
> > 
> > > 
> > > > but I don't think this entirely fixes the problem
> > > > because the dying memcg stats can still be updated after its reparenting
> > > > (e.g. if a ref to the memcg has been held since before reparenting).
> > > 
> > > How can dying memcg stats can still be updated after reparenting? The
> > > stats which we care about are the anon & file memory and this series is
> > > reparenting them, so dying memcg will not see stats updates unless there
> > > is a concurrent update happening and I think it is very easy to avoid
> > > such situation by putting a grace period between reparenting the
> > > file/anon folios and reparenting dying chils'd stats_local. Am I missing
> > > something?
> > 
> > What prevents the code from obtaining a ref to a parent's memcg
> 
> I think you meant child's memcg here.

Yes, sorry.

> 
> > before
> > reparenting, and using it to update the stats after reparenting? A grace
> > period only works if the entire scope of using the memcg is within the
> > RCU critical section.
> 
> Yeah this is an issue.
> 
> > 
> > For example, __mem_cgroup_try_charge_swap() currently does this when
> > incrementing MEMCG_SWAP. While this specific example isn't problematic
> > because the reference won't be dropped until MEMCG_SWAP is decremented
> > again, the pattern of grabbing a ref to the memcg then updating a stat
> > could generally cause the problem.
> > 
> > Most stats are updated using lruvec_stat_mod_folio(), which updates the
> > stats in the same RCU critical section as obtaining the memcg pointer
> > from the folio, so it can be fixed with a grace period. However, I think
> > it can be easily missed in the future if other code paths update memcg
> > stats in a different way. We should try to enforce that stat updates
> > cannot only happen from the same RCU critical section where the memcg
> > pointer is acquired.
> 
> The core stats update functions are mod_memcg_state() and
> mod_memcg_lruvec_state(). If for v1 only, we add additional check for
> CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
> this issue?

But this is still racy, right? The cgroup could become dying right after
we check CSS_DYING, no?


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-24  0:43         ` Yosry Ahmed
@ 2025-12-24  0:58           ` Shakeel Butt
  2025-12-29  9:42             ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-24  0:58 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Wed, Dec 24, 2025 at 12:43:00AM +0000, Yosry Ahmed wrote:
[...]
> > 
> > I think you meant child's memcg here.
> 
> Yes, sorry.
> 
> > 
> > > before
> > > reparenting, and using it to update the stats after reparenting? A grace
> > > period only works if the entire scope of using the memcg is within the
> > > RCU critical section.
> > 
> > Yeah this is an issue.
> > 
> > > 
> > > For example, __mem_cgroup_try_charge_swap() currently does this when
> > > incrementing MEMCG_SWAP. While this specific example isn't problematic
> > > because the reference won't be dropped until MEMCG_SWAP is decremented
> > > again, the pattern of grabbing a ref to the memcg then updating a stat
> > > could generally cause the problem.
> > > 
> > > Most stats are updated using lruvec_stat_mod_folio(), which updates the
> > > stats in the same RCU critical section as obtaining the memcg pointer
> > > from the folio, so it can be fixed with a grace period. However, I think
> > > it can be easily missed in the future if other code paths update memcg
> > > stats in a different way. We should try to enforce that stat updates
> > > cannot only happen from the same RCU critical section where the memcg
> > > pointer is acquired.
> > 
> > The core stats update functions are mod_memcg_state() and
> > mod_memcg_lruvec_state(). If for v1 only, we add additional check for
> > CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
> > this issue?
> 
> But this is still racy, right? The cgroup could become dying right after
> we check CSS_DYING, no?

We do reparenting in css_offline() callback and cgroup offlining
happen somewhat like this:

1. Set CSS_DYING
2. Trigger percpu ref kill
3. Kernel makes sure css ref killed is seen by all CPUs and then trigger
   css_offline callback.

So, if in the stats update function we check CSS_DYING flag and the
actual stats update within rcu, I think we are good.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-24  0:58           ` Shakeel Butt
@ 2025-12-29  9:42             ` Qi Zheng
  2025-12-29 10:52               ` Michal Koutný
  0 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-29  9:42 UTC (permalink / raw)
  To: Shakeel Butt, Yosry Ahmed
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng



On 12/24/25 8:58 AM, Shakeel Butt wrote:
> On Wed, Dec 24, 2025 at 12:43:00AM +0000, Yosry Ahmed wrote:
> [...]
>>>
>>> I think you meant child's memcg here.
>>
>> Yes, sorry.
>>
>>>
>>>> before
>>>> reparenting, and using it to update the stats after reparenting? A grace
>>>> period only works if the entire scope of using the memcg is within the
>>>> RCU critical section.
>>>
>>> Yeah this is an issue.
>>>
>>>>
>>>> For example, __mem_cgroup_try_charge_swap() currently does this when
>>>> incrementing MEMCG_SWAP. While this specific example isn't problematic
>>>> because the reference won't be dropped until MEMCG_SWAP is decremented
>>>> again, the pattern of grabbing a ref to the memcg then updating a stat
>>>> could generally cause the problem.
>>>>
>>>> Most stats are updated using lruvec_stat_mod_folio(), which updates the
>>>> stats in the same RCU critical section as obtaining the memcg pointer
>>>> from the folio, so it can be fixed with a grace period. However, I think
>>>> it can be easily missed in the future if other code paths update memcg
>>>> stats in a different way. We should try to enforce that stat updates
>>>> cannot only happen from the same RCU critical section where the memcg
>>>> pointer is acquired.
>>>
>>> The core stats update functions are mod_memcg_state() and
>>> mod_memcg_lruvec_state(). If for v1 only, we add additional check for
>>> CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
>>> this issue?
>>
>> But this is still racy, right? The cgroup could become dying right after
>> we check CSS_DYING, no?
> 
> We do reparenting in css_offline() callback and cgroup offlining
> happen somewhat like this:
> 
> 1. Set CSS_DYING
> 2. Trigger percpu ref kill
> 3. Kernel makes sure css ref killed is seen by all CPUs and then trigger
>     css_offline callback.

it seems that we can add the following to
mem_cgroup_css_free():

parent->vmstats->state_local += child->vmstats->state_local;

Right? I will continue to take a closer look.

Thanks,
Qi

> 
> So, if in the stats update function we check CSS_DYING flag and the
> actual stats update within rcu, I think we are good.



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-29  9:42             ` Qi Zheng
@ 2025-12-29 10:52               ` Michal Koutný
  0 siblings, 0 replies; 149+ messages in thread
From: Michal Koutný @ 2025-12-29 10:52 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Shakeel Butt, Yosry Ahmed, hannes, hughd, mhocko, roman.gushchin,
	muchun.song, david, lorenzo.stoakes, ziy, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	chenridong, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

[-- Attachment #1: Type: text/plain, Size: 1432 bytes --]

On Tue, Dec 23, 2025 at 04:36:18PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
...
> The core stats update functions are mod_memcg_state() and
> mod_memcg_lruvec_state(). If for v1 only, we add additional check for
> CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
> this issue?

...and go to first !CSS_DYING ancestor :-/ (as the whole chain of memcgs
can be offlined)

IIUC thanks to the reparenting charging (modifying state) to an offlined
memcg should be an exception...


On Mon, Dec 29, 2025 at 05:42:43PM +0800, Qi Zheng <qi.zheng@linux.dev> wrote:

> > We do reparenting in css_offline() callback and cgroup offlining
> > happen somewhat like this:
> > 
> > 1. Set CSS_DYING
> > 2. Trigger percpu ref kill
> > 3. Kernel makes sure css ref killed is seen by all CPUs and then trigger
> >     css_offline callback.
> 
> it seems that we can add the following to
> mem_cgroup_css_free():
> 
> parent->vmstats->state_local += child->vmstats->state_local;
> 
> Right? I will continue to take a closer look.

...and the time between offlining and free'ing a memcg should not be
arbitrarily long anymore (right?, the crux of the series).
So only transferring local stats in mem_cgroup_css_free should yield a
correct result after limited time range (with possible underflows
between) with no special precaution for CSS_DYING on charging side.

0.02€,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-23 23:20   ` Shakeel Butt
  2025-12-24  0:07     ` Yosry Ahmed
@ 2025-12-29  7:48     ` Qi Zheng
  2025-12-29  9:35       ` Harry Yoo
  1 sibling, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-29  7:48 UTC (permalink / raw)
  To: Shakeel Butt, Yosry Ahmed
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng



On 12/24/25 7:20 AM, Shakeel Butt wrote:
> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> [...]
>>
>> I think there might be a problem with non-hierarchical stats on cgroup
>> v1, I brought it up previously [*]. I am not sure if this was addressed
>> but I couldn't immediately find anything.
> 
> Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.

The memcg-v1 was originally planned to be removed, could we skip
supporting v1?

> 




^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-29  7:48     ` Qi Zheng
@ 2025-12-29  9:35       ` Harry Yoo
  2025-12-29  9:46         ` Qi Zheng
  2025-12-29 10:53         ` Michal Koutný
  0 siblings, 2 replies; 149+ messages in thread
From: Harry Yoo @ 2025-12-29  9:35 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Shakeel Butt, Yosry Ahmed, hannes, hughd, mhocko, roman.gushchin,
	muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Mon, Dec 29, 2025 at 03:48:26PM +0800, Qi Zheng wrote:
> 
> 
> On 12/24/25 7:20 AM, Shakeel Butt wrote:
> > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > [...]
> > > 
> > > I think there might be a problem with non-hierarchical stats on cgroup
> > > v1, I brought it up previously [*]. I am not sure if this was addressed
> > > but I couldn't immediately find anything.
> > 
> > Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> 
> The memcg-v1 was originally planned to be removed, could we skip
> supporting v1?

You mean not reparenting LRU pages if CONFIG_MEMCG_V1 is set?

That may work, but IMHO given that there is no clear timeline for removal
yet (some v1-specific features have been officially deprecated,
but memcg v1 as a whole hasn't), implementing Shakeel's suggestion [1]
may be a good option (it doesn't seem to add much complexity)

But it can be argued that it's not good enough reason to
delay the series; adding support for V1 later sounds sensible to me.

Anyway I'm fine either way and I'm not a memcg maintainer myself,
so this is just my two cents.

[1] https://lore.kernel.org/linux-mm/wvj4w7ifmrifnh5bvftdziudsj52fdnwlhbt2oifwmxmi4eore@ob3mrfahhnm5/

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-29  9:35       ` Harry Yoo
@ 2025-12-29  9:46         ` Qi Zheng
  2025-12-29 10:53         ` Michal Koutný
  1 sibling, 0 replies; 149+ messages in thread
From: Qi Zheng @ 2025-12-29  9:46 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Shakeel Butt, Yosry Ahmed, hannes, hughd, mhocko, roman.gushchin,
	muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups



On 12/29/25 5:35 PM, Harry Yoo wrote:
> On Mon, Dec 29, 2025 at 03:48:26PM +0800, Qi Zheng wrote:
>>
>>
>> On 12/24/25 7:20 AM, Shakeel Butt wrote:
>>> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
>>> [...]
>>>>
>>>> I think there might be a problem with non-hierarchical stats on cgroup
>>>> v1, I brought it up previously [*]. I am not sure if this was addressed
>>>> but I couldn't immediately find anything.
>>>
>>> Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
>>
>> The memcg-v1 was originally planned to be removed, could we skip
>> supporting v1?
> 
> You mean not reparenting LRU pages if CONFIG_MEMCG_V1 is set?
> 
> That may work, but IMHO given that there is no clear timeline for removal
> yet (some v1-specific features have been officially deprecated,
> but memcg v1 as a whole hasn't), implementing Shakeel's suggestion [1]
> may be a good option (it doesn't seem to add much complexity)
> 
> But it can be argued that it's not good enough reason to
> delay the series; adding support for V1 later sounds sensible to me.

Yeah, I will continue to take a closer look at how to support memcg-v1.

> 
> Anyway I'm fine either way and I'm not a memcg maintainer myself,
> so this is just my two cents.
> 
> [1] https://lore.kernel.org/linux-mm/wvj4w7ifmrifnh5bvftdziudsj52fdnwlhbt2oifwmxmi4eore@ob3mrfahhnm5/
> 



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-29  9:35       ` Harry Yoo
  2025-12-29  9:46         ` Qi Zheng
@ 2025-12-29 10:53         ` Michal Koutný
  1 sibling, 0 replies; 149+ messages in thread
From: Michal Koutný @ 2025-12-29 10:53 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Qi Zheng, Shakeel Butt, Yosry Ahmed, hannes, hughd, mhocko,
	roman.gushchin, muchun.song, david, lorenzo.stoakes, ziy,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	chenridong, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

[-- Attachment #1: Type: text/plain, Size: 504 bytes --]

On Mon, Dec 29, 2025 at 06:35:22PM +0900, Harry Yoo <harry.yoo@oracle.com> wrote:
> > The memcg-v1 was originally planned to be removed, could we skip
> > supporting v1?
> 
> You mean not reparenting LRU pages if CONFIG_MEMCG_V1 is set?

More precisely it should be dynamic
	!cgroup_subsys_on_dfl(memory_cgrp_subsys)

> That may work,

But would it make the code actually simpler? (Keeping two versions at
multiple places vs transferring local stats at the right moment.)

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-23 20:04 ` [PATCH v2 00/28] Eliminate Dying Memory Cgroup Yosry Ahmed
  2025-12-23 23:20   ` Shakeel Butt
@ 2025-12-24  8:43   ` Harry Yoo
  2025-12-24 14:51     ` Yosry Ahmed
  1 sibling, 1 reply; 149+ messages in thread
From: Harry Yoo @ 2025-12-24  8:43 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> I think there might be a problem with non-hierarchical stats on cgroup
> v1, I brought it up previously [*]. I am not sure if this was addressed
> but I couldn't immediately find anything.

Hi, Yosry. Thanks for bringing this up!

> In short, if memory is charged to a dying cgroup at the time of
> reparenting, when the memory gets uncharged the stats updates will occur
> at the parent. This will update both hierarchical and non-hierarchical
> stats of the parent, which would corrupt the parent's non-hierarchical
> stats (because those counters were never incremented when the memory was
> charged).

Hmm, I wonder if this only applies to LRU pages.

In theory we should have this problem for NR_SLAB{UN,}RECLAIMABLE_B
because we already reparent objcgs, or am I missing something?

> [*]https://lore.kernel.org/all/CAJD7tkazvC*kZgGaV3idapQp-zPFaWBxoHwnrqTFoodHZGQcPA@mail.gmail.com/


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-24  8:43   ` Harry Yoo
@ 2025-12-24 14:51     ` Yosry Ahmed
  2025-12-26 11:24       ` Harry Yoo
  0 siblings, 1 reply; 149+ messages in thread
From: Yosry Ahmed @ 2025-12-24 14:51 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

December 24, 2025 at 12:43 AM, "Harry Yoo" <harry.yoo@oracle.com> wrote:


> 
> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> 
> > 
> > I think there might be a problem with non-hierarchical stats on cgroup
> >  v1, I brought it up previously [*]. I am not sure if this was addressed
> >  but I couldn't immediately find anything.
> > 
> Hi, Yosry. Thanks for bringing this up!
> 
> > 
> > In short, if memory is charged to a dying cgroup at the time of
> >  reparenting, when the memory gets uncharged the stats updates will occur
> >  at the parent. This will update both hierarchical and non-hierarchical
> >  stats of the parent, which would corrupt the parent's non-hierarchical
> >  stats (because those counters were never incremented when the memory was
> >  charged).
> > 
> Hmm, I wonder if this only applies to LRU pages.
> 
> In theory we should have this problem for NR_SLAB{UN,}RECLAIMABLE_B
> because we already reparent objcgs, or am I missing something?

We do, but we don't expose these stats in cgroup v1, and we don't expose non-hierarchical stats in cgroup v2.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-24 14:51     ` Yosry Ahmed
@ 2025-12-26 11:24       ` Harry Yoo
  0 siblings, 0 replies; 149+ messages in thread
From: Harry Yoo @ 2025-12-26 11:24 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On Wed, Dec 24, 2025 at 02:51:12PM +0000, Yosry Ahmed wrote:
> December 24, 2025 at 12:43 AM, "Harry Yoo" <harry.yoo@oracle.com> wrote:
> 
> 
> > 
> > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > 
> > > 
> > > I think there might be a problem with non-hierarchical stats on cgroup
> > >  v1, I brought it up previously [*]. I am not sure if this was addressed
> > >  but I couldn't immediately find anything.
> > > 
> > Hi, Yosry. Thanks for bringing this up!
> > 
> > > 
> > > In short, if memory is charged to a dying cgroup at the time of
> > >  reparenting, when the memory gets uncharged the stats updates will occur
> > >  at the parent. This will update both hierarchical and non-hierarchical
> > >  stats of the parent, which would corrupt the parent's non-hierarchical
> > >  stats (because those counters were never incremented when the memory was
> > >  charged).
> > > 
> > Hmm, I wonder if this only applies to LRU pages.
> > 
> > In theory we should have this problem for NR_SLAB{UN,}RECLAIMABLE_B
> > because we already reparent objcgs, or am I missing something?
> 
> We do, but we don't expose these stats in cgroup v1, and we don't expose non-hierarchical stats in cgroup v2.

Oops, right.
I was missing that. Thanks!

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
                   ` (28 preceding siblings ...)
  2025-12-23 20:04 ` [PATCH v2 00/28] Eliminate Dying Memory Cgroup Yosry Ahmed
@ 2025-12-30  1:36 ` Roman Gushchin
  2025-12-30  2:44   ` Qi Zheng
  2025-12-30  4:01   ` Shakeel Butt
  29 siblings, 2 replies; 149+ messages in thread
From: Roman Gushchin @ 2025-12-30  1:36 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

Qi Zheng <qi.zheng@linux.dev> writes:

Hey!

I ran this patchset through AI review and it found few regression (which
can of course be false positives). When you'll have time, can you,
please, take a look and comment on which are real and which are not?

Thank you!

--

# Task
Date: 2025-12-29 19:55:20
Model: gemini-3-pro-preview
Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation about bpf kfunc parameter validation")
Commits to review:
- e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
- 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
- a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
- d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
- 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
- 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
- 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
- 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
- c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
- 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
- f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
- 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
- bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
- a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
- 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
- de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
- c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
- 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
- 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
- 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
- d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
- 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
- e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
- d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
- dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
- 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
- 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
- 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")

# Commit e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
No detailed regression report generated.

# Commit 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
No detailed regression report generated.

# Commit a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
No detailed regression report generated.

# Commit d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
No detailed regression report generated.

# Commit 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
commit 9b02a45b6fc8a2253e2edcccb24c8485d3dca672
Author: Muchun Song <muchun.song@linux.dev>

mm: vmscan: refactor move_folios_to_lru()

Refactors move_folios_to_lru() to handle its own locking via
folio_lruvec_relock_irq(), removing the requirement for the caller to hold
lruvec->lru_lock. Updates call sites to release the lock before calling.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 49e566174621..354b19f7365d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c

[ ... ]

> @@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>  	}
>  
> -	spin_lock_irq(&lruvec->lru_lock);
> -
> -	move_folios_to_lru(lruvec, &list);
> +	move_folios_to_lru(&list);
>  
>  	walk = current->reclaim_state->mm_walk;
>  	if (walk && walk->batched) {
>  		walk->lruvec = lruvec;
> +		spin_lock(&lruvec->lru_lock);

move_folios_to_lru() drops the lock and enables interrupts via
lruvec_unlock_irq() (which calls spin_unlock_irq()). Since lru_lock is
an IRQ-safe lock (used elsewhere with spin_lock_irq()), is it safe to
use spin_lock() here with interrupts enabled? If an interrupt occurs
and tries to acquire lru_lock, could this lead to a deadlock?

>  		reset_batch_size(walk);
> +		spin_unlock(&lruvec->lru_lock);
>  	}
>  
>  	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),

[ ... ]

# Commit 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
No detailed regression report generated.

# Commit 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
No detailed regression report generated.

# Commit 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
commit 8479f2eef5363fb93cc593038e9bc54dd5068ef2
Author: Muchun Song <muchun.song@linux.dev>

mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()

This patch modifies get_mem_cgroup_from_folio() to retry the lookup if
locking the memcg fails, serving as preparation for future changes where
folios may not pin their memcg.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21b5aad34cae..431b3154c70c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>   */
>  struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>  {
> -	struct mem_cgroup *memcg = folio_memcg(folio);
> +	struct mem_cgroup *memcg;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
>  
> +	if (!folio_memcg_charged(folio))
> +		return root_mem_cgroup;
> +
>  	rcu_read_lock();
> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
> -		memcg = root_mem_cgroup;
> +retry:
> +	memcg = folio_memcg(folio);
> +	if (unlikely(!css_tryget(&memcg->css)))
> +		goto retry;

If css_tryget() fails (e.g. refcount is 0), this loop spins indefinitely
with the RCU read lock held. Is it guaranteed that folio_memcg() will
return a different, alive memcg in subsequent iterations?

If the folio is isolated (e.g. via migrate_misplaced_folio()), it might be
missed by reparenting logic that iterates LRU lists. In that case, the
folio would continue pointing to the dying memcg, leading to a hard lockup.

Also, folio_memcg() calls __folio_memcg(), which reads folio->memcg_data
without READ_ONCE(). Since this loop waits for memcg_data to be updated
by another CPU (reparenting), could the compiler hoist the load out of
the loop, preventing the update from being seen?

Finally, the previous code fell back to root_mem_cgroup on failure. Is it
safe to remove that fallback? If css_tryget() fails unexpectedly, hanging
seems more severe than the previous behavior of warning and falling back.

>  	rcu_read_unlock();
>  	return memcg;
>  }
# Commit c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
No detailed regression report generated.

# Commit 65610d739afc ("writeback: prevent memory cgroup release in writeback module")

<regression>
  <description>
    Tight infinite loop in get_mem_cgroup_from_folio due to missing READ_ONCE and cpu_relax
  </description>
  <file>
    mm/memcontrol.c
  </file>
  <lines>
    991-993
  </lines>
  <reason>
    The retry loop in get_mem_cgroup_from_folio() continuously checks folio_memcg(folio)
    and tries to get a css reference. There are two issues:
    1. folio_memcg() reads folio->memcg_data without READ_ONCE(). The compiler may
       hoist this load out of the loop, leading to an infinite loop if the memcg
       is offlined and the loop condition depends on reparenting updating memcg_data.
    2. The loop does not contain cpu_relax(). If the update to folio->memcg_data
       is delayed (e.g., reparenting takes time), this creates a tight loop that
       burns CPU and stalls the RCU read-side critical section, potentially causing
       RCU stalls or soft lockups.
  </reason>
  <classification>
    Performance/Stability (Soft Lockup / Infinite Loop)
  </classification>
</regression>

<regression>
  <description>
    Removal of NULL check in mem_cgroup_track_foreign_dirty introduces potential crash
  </description>
  <file>
    include/linux/memcontrol.h
  </file>
  <lines>
    1557
  </lines>
  <reason>
    The check `if (unlikely(memcg && &memcg->css != wb->memcg_css))` was replaced with
    `if (unlikely(&memcg->css != wb->memcg_css))`.
    While folio_memcg_charged() is checked earlier, there is a race window between that check
    and the call to folio_memcg() where the folio could be uncharged (setting memcg_data to 0).
    If folio_memcg() returns NULL, accessing `&memcg->css` will dereference the NULL pointer
    (since css is at offset 0). The previous code safely handled NULL memcg.
  </reason>
  <classification>
    Robustness / Null Pointer Dereference
  </classification>
</regression>

# Commit f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
No detailed regression report generated.

# Commit 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
No detailed regression report generated.

# Commit bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
No detailed regression report generated.

# Commit a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
commit a1189dd21a568b45eaddafc17b982f6200ba28c0
Author: Muchun Song <muchun.song@linux.dev>

mm: mglru: prevent memory cgroup release in mglru

This patch uses the rcu read lock to safeguard against the release of
the memory cgroup in mglru, as a preparatory measure for the reparenting
of the LRU pages.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 354b19f7365d..814498a2c1bd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[ ... ]
> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>  		}
>  	}
>  
> +	rcu_read_lock();
> +	memcg = folio_memcg(folio);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
> +	gen = lru_gen_from_seq(max_seq);
> +	mm_state = get_mm_state(lruvec);
> +
>  	arch_enter_lazy_mmu_mode();
>  
>  	pte -= (addr - start) / PAGE_SIZE;
[ ... ]
> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>  	if (mm_state && suitable_to_scan(i, young))
>  		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>  
> +	rcu_read_unlock();
> +
>  	return true;
>  }

This critical section appears to encompass calls to walk_update_folio(),
which calls folio_activate(), which eventually calls folio_lruvec_lock_irq()
to acquire lruvec->lru_lock.

Since lruvec->lru_lock is a spinlock_t, it becomes a sleeping lock on
PREEMPT_RT kernels. Is it safe to acquire a sleeping lock inside an RCU
read-side critical section on PREEMPT_RT?

# Commit 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
No detailed regression report generated.

# Commit de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
No detailed regression report generated.

# Commit c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
commit c0cce04fd4dca3a2b00e8846cbbeb52bd2f27625
Author: Qi Zheng <zhengqi.arch@bytedance.com>

mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()

This patch wraps the queue lookup in rcu_read_lock/unlock to safeguard
against memory cgroup release during the lookup, as folios may no longer
pin the memcg.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 12b46215b30c..b9e6855ec0b6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>  
>  static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>  {
> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
> +	struct deferred_split *queue;
> +
> +	rcu_read_lock();
> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
> +	rcu_read_unlock();
> +
> +	return queue;
>  }

If the folio does not pin the memcg (as suggested by the commit message),
is it safe to return the queue pointer here? The queue is embedded in the
memcg, so if the memcg is freed after rcu_read_unlock(), wouldn't this
pointer become invalid while the caller still holds the lock?

>  
>  static struct deferred_split *
>  folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>  {
> -	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
> +	struct deferred_split *queue;
> +
> +	rcu_read_lock();
> +	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
> +	rcu_read_unlock();
> +
> +	return queue;
>  }

# Commit 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
No detailed regression report generated.

# Commit 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
commit 80bbd804adde7198f6c7c5fdf7b4349996586597
Author: Muchun Song <muchun.song@linux.dev>
mm: workingset: prevent lruvec release in workingset_refault()

In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.

> diff --git a/mm/workingset.c b/mm/workingset.c
> index 445fc634196d..427ca1a5625e 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>  	 * locked to guarantee folio_memcg() stability throughout.
>  	 */
>  	nr = folio_nr_pages(folio);
> +	rcu_read_lock();
>  	lruvec = folio_lruvec(folio);
>  	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>  
>  	if (!workingset_test_recent(shadow, file, &workingset, true))
> -		return;
> +		goto out;

Does calling workingset_test_recent() with flush=true while holding
rcu_read_lock() risk sleeping in atomic context?

workingset_test_recent() calls mem_cgroup_flush_stats_ratelimited() when
flush is true.  The flush path leads to css_rstat_flush(), which calls
might_sleep() and cond_resched().

>  
>  	folio_set_active(folio);
>  	workingset_age_nonresident(lruvec, nr);
> @@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
>  		lru_note_cost_refault(folio);
>  		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
>  	}
> +out:
> +	rcu_read_unlock();
>  }
>  
>  /**

# Commit 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
No detailed regression report generated.

# Commit d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
No detailed regression report generated.

# Commit 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
No detailed regression report generated.

# Commit e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
No detailed regression report generated.

# Commit d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
commit d04921029e6dc92824024356a9ed83a4f82d68dc
Author: Qi Zheng <zhengqi.arch@bytedance.com>

mm: vmscan: prepare for reparenting traditional LRU folios

This commit implements a specific function to reparent LRU folios of a
child memcg to its parent memcg during the reparenting process.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 814498a2c1bd..5fd0f97c3719 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>  			  lruvec_memcg(lruvec));
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
> +				enum lru_list lru)
> +{
> +	int zid;
> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
> +
> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
> +
> +	if (lru != LRU_UNEVICTABLE)
> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);

Is it safe to modify src->lists and dst->lists here without holding
src->lru_lock or dst->lru_lock?  Since dst is the parent memcg, it is
likely live and subject to concurrent access by reclaim or allocation paths.

> +
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
> +		mz_src->lru_zone_size[zid][lru] = 0;
> +	}

Also, lru_zone_size is typically protected by lru_lock.  Is it safe to
update these counters without locking?

> +}
> +
> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
> +{
> +	int nid;
> +
> +	for_each_node(nid) {
> +		enum lru_list lru;
> +		struct lruvec *src_lruvec, *dst_lruvec;
> +
> +		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
> +		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
> +		dst_lruvec->anon_cost += src_lruvec->anon_cost;
> +		dst_lruvec->file_cost += src_lruvec->file_cost;
> +
> +		for_each_lru(lru)
> +			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
> +	}
> +}
> +#endif

# Commit dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
Here are the findings for the provided patch.

1.  **Missing locking in `lru_gen_reparent_memcg`**
    In `mm/vmscan.c`, the function `lru_gen_reparent_memcg` (and its helper `__lru_gen_reparent_memcg`) modifies the LRU lists and statistics of `lruvec` structures without holding the `lru_lock`.

    Specifically:
    - `__lru_gen_reparent_memcg` calls `__update_lru_size`.
    - `__update_lru_size` has a `lockdep_assert_held(&lruvec->lru_lock)`.
    - `__lru_gen_reparent_memcg` calls `list_splice_tail_init` to move folios from the source lruvec to the destination lruvec.

    The destination lruvec (`dst_lruvec`) belongs to the parent memcg, which is active and shared. Modifying its lists and counters without locking will lead to data corruption (list corruption) and statistics drift, as well as triggering lockdep warnings.

    ```c
    void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
    {
            int nid;

            for_each_node(nid) {
                    struct lruvec *src_lruvec, *dst_lruvec;
                    /* ... */
                    src_lruvec = get_lruvec(src, nid);
                    dst_lruvec = get_lruvec(dst, nid);

                    for (zone = 0; zone < MAX_NR_ZONES; zone++)
                            for (type = 0; type < ANON_AND_FILE; type++)
                                    __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
            }
    }
    ```

    The `lruvec` lock must be acquired for each node before calling `__lru_gen_reparent_memcg`.

# Commit 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
No detailed regression report generated.

# Commit 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
file: mm/memcontrol.c
line: 224
type: Bug
category: Locking
description:
The `reparent_locks` function takes `lru_lock` for all NUMA nodes in a loop, utilizing `spin_lock_nested` with an incrementing `nest` counter. The `nest` counter increments for each lock taken (2 per node: src and dst). Since `MAX_LOCKDEP_SUBCLASSES` is 8, this code will trigger a Lockdep violation (and potential panic if `panic_on_warn` is set) on systems with more than 4 NUMA nodes (4 nodes * 2 locks = 8 subclasses). Furthermore, locking all nodes simultaneously is a scalability regression, blocking LRU operations globally during reparenting.

file: include/linux/memcontrol.h
line: 430
type: Risk
category: API
description:
The implementation of `folio_memcg` has changed to rely on `obj_cgroup_memcg`, which enforces that `rcu_read_lock` or `cgroup_mutex` is held via a lockdep assertion. Previously, for LRU folios, the memcg pointer was directly embedded and stable under the folio lock. Existing callers (e.g., in `mm/workingset.c`) relied on the folio lock for stability. While some callers may hold RCU, others might not, leading to lockdep warnings or races where `folio_memcg` returns a pointer to a memcg that is being reparented or freed. Additionally, the return value of `folio_memcg` is no longer constant for a locked folio; it can change if reparenting occurs, potentially breaking logic that assumes identity equality over time.

# Commit 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
No detailed regression report generated.

# Summary

| Commit                                                                                        | Regressions |
| :-------------------------------------------------------------------------------------------- | :---------- |
| e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")            | 0           |
| 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")                   | 0           |
| a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")                           | 0           |
| d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")             | 0           |
| 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")                                    | 1           |
| 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")                     | 0           |
| 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")             | 0           |
| 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()") | 1           |
| c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")               | 0           |
| 65610d739afc ("writeback: prevent memory cgroup release in writeback module")                 | 2           |
| f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")  | 0           |
| 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")                 | 0           |
| bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")        | 0           |
| a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")                            | 1           |
| 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")      | 0           |
| de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")          | 0           |
| c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()") | 1           |
| 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")                 | 0           |
| 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")               | 1           |
| 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")                    | 0           |
| d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")                     | 0           |
| 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")            | 0           |
| e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")            | 0           |
| d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")                   | 1           |
| dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")                             | 1           |
| 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")                             | 0           |
| 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")  | 2           |
| 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")                | 0           |

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  1:36 ` Roman Gushchin
@ 2025-12-30  2:44   ` Qi Zheng
  2025-12-30  4:20     ` Roman Gushchin
  2025-12-30  4:01   ` Shakeel Butt
  1 sibling, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-30  2:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, hughd, mhocko, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

Hi Roman,

On 12/30/25 9:36 AM, Roman Gushchin wrote:
> Qi Zheng <qi.zheng@linux.dev> writes:
> 
> Hey!
> 
> I ran this patchset through AI review and it found few regression (which
> can of course be false positives). When you'll have time, can you,
> please, take a look and comment on which are real and which are not?

Thank you for running the AI review for this patchset, but please do not
directly send the raw data from the AI review to the community, as this
is no different from automated review by a robot.

Thanks,
Qi

> 
> Thank you!
> 
> --
> 
> # Task
> Date: 2025-12-29 19:55:20
> Model: gemini-3-pro-preview
> Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation about bpf kfunc parameter validation")
> Commits to review:
> - e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
> - 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
> - a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
> - d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
> - 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
> - 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
> - 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
> - 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
> - c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
> - 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
> - f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
> - 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
> - bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
> - a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
> - 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
> - de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
> - c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
> - 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
> - 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
> - 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
> - d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
> - 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
> - e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
> - d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
> - dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
> - 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
> - 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
> - 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
> 
> 
> # Commit e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
> No detailed regression report generated.
> 
> # Commit 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
> No detailed regression report generated.
> 
> # Commit a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
> No detailed regression report generated.
> 
> # Commit d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
> No detailed regression report generated.
> 
> # Commit 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
> commit 9b02a45b6fc8a2253e2edcccb24c8485d3dca672
> Author: Muchun Song <muchun.song@linux.dev>
> 
> mm: vmscan: refactor move_folios_to_lru()
> 
> Refactors move_folios_to_lru() to handle its own locking via
> folio_lruvec_relock_irq(), removing the requirement for the caller to hold
> lruvec->lru_lock. Updates call sites to release the lock before calling.
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 49e566174621..354b19f7365d 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
> 
> [ ... ]
> 
>> @@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>   			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>>   	}
>>   
>> -	spin_lock_irq(&lruvec->lru_lock);
>> -
>> -	move_folios_to_lru(lruvec, &list);
>> +	move_folios_to_lru(&list);
>>   
>>   	walk = current->reclaim_state->mm_walk;
>>   	if (walk && walk->batched) {
>>   		walk->lruvec = lruvec;
>> +		spin_lock(&lruvec->lru_lock);
> 
> move_folios_to_lru() drops the lock and enables interrupts via
> lruvec_unlock_irq() (which calls spin_unlock_irq()). Since lru_lock is
> an IRQ-safe lock (used elsewhere with spin_lock_irq()), is it safe to
> use spin_lock() here with interrupts enabled? If an interrupt occurs
> and tries to acquire lru_lock, could this lead to a deadlock?
> 
>>   		reset_batch_size(walk);
>> +		spin_unlock(&lruvec->lru_lock);
>>   	}
>>   
>>   	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
> 
> [ ... ]
> 
> # Commit 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
> No detailed regression report generated.
> 
> # Commit 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
> No detailed regression report generated.
> 
> # Commit 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
> commit 8479f2eef5363fb93cc593038e9bc54dd5068ef2
> Author: Muchun Song <muchun.song@linux.dev>
> 
> mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
> 
> This patch modifies get_mem_cgroup_from_folio() to retry the lookup if
> locking the memcg fails, serving as preparation for future changes where
> folios may not pin their memcg.
> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 21b5aad34cae..431b3154c70c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>>    */
>>   struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>>   {
>> -	struct mem_cgroup *memcg = folio_memcg(folio);
>> +	struct mem_cgroup *memcg;
>>   
>>   	if (mem_cgroup_disabled())
>>   		return NULL;
>>   
>> +	if (!folio_memcg_charged(folio))
>> +		return root_mem_cgroup;
>> +
>>   	rcu_read_lock();
>> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
>> -		memcg = root_mem_cgroup;
>> +retry:
>> +	memcg = folio_memcg(folio);
>> +	if (unlikely(!css_tryget(&memcg->css)))
>> +		goto retry;
> 
> If css_tryget() fails (e.g. refcount is 0), this loop spins indefinitely
> with the RCU read lock held. Is it guaranteed that folio_memcg() will
> return a different, alive memcg in subsequent iterations?
> 
> If the folio is isolated (e.g. via migrate_misplaced_folio()), it might be
> missed by reparenting logic that iterates LRU lists. In that case, the
> folio would continue pointing to the dying memcg, leading to a hard lockup.
> 
> Also, folio_memcg() calls __folio_memcg(), which reads folio->memcg_data
> without READ_ONCE(). Since this loop waits for memcg_data to be updated
> by another CPU (reparenting), could the compiler hoist the load out of
> the loop, preventing the update from being seen?
> 
> Finally, the previous code fell back to root_mem_cgroup on failure. Is it
> safe to remove that fallback? If css_tryget() fails unexpectedly, hanging
> seems more severe than the previous behavior of warning and falling back.
> 
>>   	rcu_read_unlock();
>>   	return memcg;
>>   }
> # Commit c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
> No detailed regression report generated.
> 
> # Commit 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
> 
> <regression>
>    <description>
>      Tight infinite loop in get_mem_cgroup_from_folio due to missing READ_ONCE and cpu_relax
>    </description>
>    <file>
>      mm/memcontrol.c
>    </file>
>    <lines>
>      991-993
>    </lines>
>    <reason>
>      The retry loop in get_mem_cgroup_from_folio() continuously checks folio_memcg(folio)
>      and tries to get a css reference. There are two issues:
>      1. folio_memcg() reads folio->memcg_data without READ_ONCE(). The compiler may
>         hoist this load out of the loop, leading to an infinite loop if the memcg
>         is offlined and the loop condition depends on reparenting updating memcg_data.
>      2. The loop does not contain cpu_relax(). If the update to folio->memcg_data
>         is delayed (e.g., reparenting takes time), this creates a tight loop that
>         burns CPU and stalls the RCU read-side critical section, potentially causing
>         RCU stalls or soft lockups.
>    </reason>
>    <classification>
>      Performance/Stability (Soft Lockup / Infinite Loop)
>    </classification>
> </regression>
> 
> <regression>
>    <description>
>      Removal of NULL check in mem_cgroup_track_foreign_dirty introduces potential crash
>    </description>
>    <file>
>      include/linux/memcontrol.h
>    </file>
>    <lines>
>      1557
>    </lines>
>    <reason>
>      The check `if (unlikely(memcg && &memcg->css != wb->memcg_css))` was replaced with
>      `if (unlikely(&memcg->css != wb->memcg_css))`.
>      While folio_memcg_charged() is checked earlier, there is a race window between that check
>      and the call to folio_memcg() where the folio could be uncharged (setting memcg_data to 0).
>      If folio_memcg() returns NULL, accessing `&memcg->css` will dereference the NULL pointer
>      (since css is at offset 0). The previous code safely handled NULL memcg.
>    </reason>
>    <classification>
>      Robustness / Null Pointer Dereference
>    </classification>
> </regression>
> 
> # Commit f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
> No detailed regression report generated.
> 
> # Commit 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
> No detailed regression report generated.
> 
> # Commit bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
> No detailed regression report generated.
> 
> # Commit a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
> commit a1189dd21a568b45eaddafc17b982f6200ba28c0
> Author: Muchun Song <muchun.song@linux.dev>
> 
> mm: mglru: prevent memory cgroup release in mglru
> 
> This patch uses the rcu read lock to safeguard against the release of
> the memory cgroup in mglru, as a preparatory measure for the reparenting
> of the LRU pages.
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 354b19f7365d..814498a2c1bd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
> [ ... ]
>> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>   		}
>>   	}
>>   
>> +	rcu_read_lock();
>> +	memcg = folio_memcg(folio);
>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
>> +	gen = lru_gen_from_seq(max_seq);
>> +	mm_state = get_mm_state(lruvec);
>> +
>>   	arch_enter_lazy_mmu_mode();
>>   
>>   	pte -= (addr - start) / PAGE_SIZE;
> [ ... ]
>> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>   	if (mm_state && suitable_to_scan(i, young))
>>   		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>>   
>> +	rcu_read_unlock();
>> +
>>   	return true;
>>   }
> 
> This critical section appears to encompass calls to walk_update_folio(),
> which calls folio_activate(), which eventually calls folio_lruvec_lock_irq()
> to acquire lruvec->lru_lock.
> 
> Since lruvec->lru_lock is a spinlock_t, it becomes a sleeping lock on
> PREEMPT_RT kernels. Is it safe to acquire a sleeping lock inside an RCU
> read-side critical section on PREEMPT_RT?
> 
> 
> # Commit 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
> No detailed regression report generated.
> 
> # Commit de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
> No detailed regression report generated.
> 
> # Commit c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
> commit c0cce04fd4dca3a2b00e8846cbbeb52bd2f27625
> Author: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
> 
> This patch wraps the queue lookup in rcu_read_lock/unlock to safeguard
> against memory cgroup release during the lookup, as folios may no longer
> pin the memcg.
> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 12b46215b30c..b9e6855ec0b6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>>   
>>   static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>   {
>> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
>> +	struct deferred_split *queue;
>> +
>> +	rcu_read_lock();
>> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
>> +	rcu_read_unlock();
>> +
>> +	return queue;
>>   }
> 
> If the folio does not pin the memcg (as suggested by the commit message),
> is it safe to return the queue pointer here? The queue is embedded in the
> memcg, so if the memcg is freed after rcu_read_unlock(), wouldn't this
> pointer become invalid while the caller still holds the lock?
> 
>>   
>>   static struct deferred_split *
>>   folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>   {
>> -	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>> +	struct deferred_split *queue;
>> +
>> +	rcu_read_lock();
>> +	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>> +	rcu_read_unlock();
>> +
>> +	return queue;
>>   }
> 
> # Commit 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
> No detailed regression report generated.
> 
> # Commit 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
> commit 80bbd804adde7198f6c7c5fdf7b4349996586597
> Author: Muchun Song <muchun.song@linux.dev>
> mm: workingset: prevent lruvec release in workingset_refault()
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
>> diff --git a/mm/workingset.c b/mm/workingset.c
>> index 445fc634196d..427ca1a5625e 100644
>> --- a/mm/workingset.c
>> +++ b/mm/workingset.c
>> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>>   	 * locked to guarantee folio_memcg() stability throughout.
>>   	 */
>>   	nr = folio_nr_pages(folio);
>> +	rcu_read_lock();
>>   	lruvec = folio_lruvec(folio);
>>   	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>>   
>>   	if (!workingset_test_recent(shadow, file, &workingset, true))
>> -		return;
>> +		goto out;
> 
> Does calling workingset_test_recent() with flush=true while holding
> rcu_read_lock() risk sleeping in atomic context?
> 
> workingset_test_recent() calls mem_cgroup_flush_stats_ratelimited() when
> flush is true.  The flush path leads to css_rstat_flush(), which calls
> might_sleep() and cond_resched().
> 
>>   
>>   	folio_set_active(folio);
>>   	workingset_age_nonresident(lruvec, nr);
>> @@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
>>   		lru_note_cost_refault(folio);
>>   		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
>>   	}
>> +out:
>> +	rcu_read_unlock();
>>   }
>>   
>>   /**
> 
> # Commit 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
> No detailed regression report generated.
> 
> # Commit d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
> No detailed regression report generated.
> 
> # Commit 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
> No detailed regression report generated.
> 
> # Commit e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
> No detailed regression report generated.
> 
> # Commit d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
> commit d04921029e6dc92824024356a9ed83a4f82d68dc
> Author: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> mm: vmscan: prepare for reparenting traditional LRU folios
> 
> This commit implements a specific function to reparent LRU folios of a
> child memcg to its parent memcg during the reparenting process.
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 814498a2c1bd..5fd0f97c3719 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>>   			  lruvec_memcg(lruvec));
>>   }
>>   
>> +#ifdef CONFIG_MEMCG
>> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
>> +				enum lru_list lru)
>> +{
>> +	int zid;
>> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
>> +
>> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
>> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
>> +
>> +	if (lru != LRU_UNEVICTABLE)
>> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
> 
> Is it safe to modify src->lists and dst->lists here without holding
> src->lru_lock or dst->lru_lock?  Since dst is the parent memcg, it is
> likely live and subject to concurrent access by reclaim or allocation paths.
> 
>> +
>> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
>> +		mz_src->lru_zone_size[zid][lru] = 0;
>> +	}
> 
> Also, lru_zone_size is typically protected by lru_lock.  Is it safe to
> update these counters without locking?
> 
>> +}
>> +
>> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>> +{
>> +	int nid;
>> +
>> +	for_each_node(nid) {
>> +		enum lru_list lru;
>> +		struct lruvec *src_lruvec, *dst_lruvec;
>> +
>> +		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
>> +		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
>> +		dst_lruvec->anon_cost += src_lruvec->anon_cost;
>> +		dst_lruvec->file_cost += src_lruvec->file_cost;
>> +
>> +		for_each_lru(lru)
>> +			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
>> +	}
>> +}
>> +#endif
> 
> # Commit dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
> Here are the findings for the provided patch.
> 
> 1.  **Missing locking in `lru_gen_reparent_memcg`**
>      In `mm/vmscan.c`, the function `lru_gen_reparent_memcg` (and its helper `__lru_gen_reparent_memcg`) modifies the LRU lists and statistics of `lruvec` structures without holding the `lru_lock`.
> 
>      Specifically:
>      - `__lru_gen_reparent_memcg` calls `__update_lru_size`.
>      - `__update_lru_size` has a `lockdep_assert_held(&lruvec->lru_lock)`.
>      - `__lru_gen_reparent_memcg` calls `list_splice_tail_init` to move folios from the source lruvec to the destination lruvec.
> 
>      The destination lruvec (`dst_lruvec`) belongs to the parent memcg, which is active and shared. Modifying its lists and counters without locking will lead to data corruption (list corruption) and statistics drift, as well as triggering lockdep warnings.
> 
>      ```c
>      void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>      {
>              int nid;
> 
>              for_each_node(nid) {
>                      struct lruvec *src_lruvec, *dst_lruvec;
>                      /* ... */
>                      src_lruvec = get_lruvec(src, nid);
>                      dst_lruvec = get_lruvec(dst, nid);
> 
>                      for (zone = 0; zone < MAX_NR_ZONES; zone++)
>                              for (type = 0; type < ANON_AND_FILE; type++)
>                                      __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
>              }
>      }
>      ```
> 
>      The `lruvec` lock must be acquired for each node before calling `__lru_gen_reparent_memcg`.
> 
> # Commit 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
> No detailed regression report generated.
> 
> # Commit 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
> file: mm/memcontrol.c
> line: 224
> type: Bug
> category: Locking
> description:
> The `reparent_locks` function takes `lru_lock` for all NUMA nodes in a loop, utilizing `spin_lock_nested` with an incrementing `nest` counter. The `nest` counter increments for each lock taken (2 per node: src and dst). Since `MAX_LOCKDEP_SUBCLASSES` is 8, this code will trigger a Lockdep violation (and potential panic if `panic_on_warn` is set) on systems with more than 4 NUMA nodes (4 nodes * 2 locks = 8 subclasses). Furthermore, locking all nodes simultaneously is a scalability regression, blocking LRU operations globally during reparenting.
> 
> file: include/linux/memcontrol.h
> line: 430
> type: Risk
> category: API
> description:
> The implementation of `folio_memcg` has changed to rely on `obj_cgroup_memcg`, which enforces that `rcu_read_lock` or `cgroup_mutex` is held via a lockdep assertion. Previously, for LRU folios, the memcg pointer was directly embedded and stable under the folio lock. Existing callers (e.g., in `mm/workingset.c`) relied on the folio lock for stability. While some callers may hold RCU, others might not, leading to lockdep warnings or races where `folio_memcg` returns a pointer to a memcg that is being reparented or freed. Additionally, the return value of `folio_memcg` is no longer constant for a locked folio; it can change if reparenting occurs, potentially breaking logic that assumes identity equality over time.
> 
> # Commit 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
> No detailed regression report generated.
> 
> # Summary
> 
> | Commit                                                                                        | Regressions |
> | :-------------------------------------------------------------------------------------------- | :---------- |
> | e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")            | 0           |
> | 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")                   | 0           |
> | a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")                           | 0           |
> | d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")             | 0           |
> | 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")                                    | 1           |
> | 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")                     | 0           |
> | 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")             | 0           |
> | 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()") | 1           |
> | c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")               | 0           |
> | 65610d739afc ("writeback: prevent memory cgroup release in writeback module")                 | 2           |
> | f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")  | 0           |
> | 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")                 | 0           |
> | bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")        | 0           |
> | a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")                            | 1           |
> | 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")      | 0           |
> | de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")          | 0           |
> | c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()") | 1           |
> | 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")                 | 0           |
> | 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")               | 1           |
> | 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")                    | 0           |
> | d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")                     | 0           |
> | 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")            | 0           |
> | e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")            | 0           |
> | d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")                   | 1           |
> | dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")                             | 1           |
> | 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")                             | 0           |
> | 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")  | 2           |
> | 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")                | 0           |



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  2:44   ` Qi Zheng
@ 2025-12-30  4:20     ` Roman Gushchin
  2025-12-30  4:25       ` Qi Zheng
  0 siblings, 1 reply; 149+ messages in thread
From: Roman Gushchin @ 2025-12-30  4:20 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

Qi Zheng <qi.zheng@linux.dev> writes:

> Hi Roman,
>
> On 12/30/25 9:36 AM, Roman Gushchin wrote:
>> Qi Zheng <qi.zheng@linux.dev> writes:
>> Hey!
>> I ran this patchset through AI review and it found few regression
>> (which
>> can of course be false positives). When you'll have time, can you,
>> please, take a look and comment on which are real and which are not?
>
> Thank you for running the AI review for this patchset, but please do not
> directly send the raw data from the AI review to the community, as this
> is no different from automated review by a robot.

Hi Qi,

I don't know why you're so negative towards it. It's been great at
finding pretty tricky bugs often missed by human reviewers. In no way
it's a replacement for human reviews, but if a robot can find real
issues and make the kernel more reliable and safe, I'm in.

In my experience it finds real problems pretty often. In my measurements
at least 50% of the reported issues are real, and it matches the data
reported by others. Some subsystems (e.g. bpf) pass all patches through
the ai review.

In any case feel free to ignore it.

>
> Thanks,
> Qi
>
>> Thank you!
>> --
>> # Task
>> Date: 2025-12-29 19:55:20
>> Model: gemini-3-pro-preview
>> Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation about bpf kfunc parameter validation")
>> Commits to review:
>> - e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
>> - 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
>> - a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
>> - d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
>> - 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
>> - 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
>> - 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
>> - 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
>> - c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
>> - 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
>> - f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
>> - 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
>> - bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
>> - a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
>> - 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
>> - de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
>> - c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
>> - 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
>> - 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
>> - 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
>> - d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
>> - 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
>> - e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
>> - d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
>> - dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
>> - 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
>> - 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
>> - 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
>> # Commit e416d881eea4 ("mm: memcontrol: remove dead code of checking
>> parent memory cgroup")
>> No detailed regression report generated.
>> # Commit 8e00ae594254 ("mm: workingset: use folio_lruvec() in
>> workingset_refault()")
>> No detailed regression report generated.
>> # Commit a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its
>> variants")
>> No detailed regression report generated.
>> # Commit d57d548a3d6b ("mm: vmscan: prepare for the refactoring the
>> move_folios_to_lru()")
>> No detailed regression report generated.
>> # Commit 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
>> commit 9b02a45b6fc8a2253e2edcccb24c8485d3dca672
>> Author: Muchun Song <muchun.song@linux.dev>
>> mm: vmscan: refactor move_folios_to_lru()
>> Refactors move_folios_to_lru() to handle its own locking via
>> folio_lruvec_relock_irq(), removing the requirement for the caller to hold
>> lruvec->lru_lock. Updates call sites to release the lock before calling.
>> 
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 49e566174621..354b19f7365d 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>> [ ... ]
>> 
>>> @@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>   			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>>>   	}
>>>   -	spin_lock_irq(&lruvec->lru_lock);
>>> -
>>> -	move_folios_to_lru(lruvec, &list);
>>> +	move_folios_to_lru(&list);
>>>     	walk = current->reclaim_state->mm_walk;
>>>   	if (walk && walk->batched) {
>>>   		walk->lruvec = lruvec;
>>> +		spin_lock(&lruvec->lru_lock);
>> move_folios_to_lru() drops the lock and enables interrupts via
>> lruvec_unlock_irq() (which calls spin_unlock_irq()). Since lru_lock is
>> an IRQ-safe lock (used elsewhere with spin_lock_irq()), is it safe to
>> use spin_lock() here with interrupts enabled? If an interrupt occurs
>> and tries to acquire lru_lock, could this lead to a deadlock?
>> 
>>>   		reset_batch_size(walk);
>>> +		spin_unlock(&lruvec->lru_lock);
>>>   	}
>>>     	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD +
>>> reclaimer_offset(sc),
>> [ ... ]
>> # Commit 057fca991b78 ("mm: memcontrol: allocate object cgroup for
>> non-kmem case")
>> No detailed regression report generated.
>> # Commit 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup
>> for root memory cgroup")
>> No detailed regression report generated.
>> # Commit 8479f2eef536 ("mm: memcontrol: prevent memory cgroup
>> release in get_mem_cgroup_from_folio()")
>> commit 8479f2eef5363fb93cc593038e9bc54dd5068ef2
>> Author: Muchun Song <muchun.song@linux.dev>
>> mm: memcontrol: prevent memory cgroup release in
>> get_mem_cgroup_from_folio()
>> This patch modifies get_mem_cgroup_from_folio() to retry the lookup
>> if
>> locking the memcg fails, serving as preparation for future changes where
>> folios may not pin their memcg.
>> 
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 21b5aad34cae..431b3154c70c 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>>>    */
>>>   struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>>>   {
>>> -	struct mem_cgroup *memcg = folio_memcg(folio);
>>> +	struct mem_cgroup *memcg;
>>>     	if (mem_cgroup_disabled())
>>>   		return NULL;
>>>   +	if (!folio_memcg_charged(folio))
>>> +		return root_mem_cgroup;
>>> +
>>>   	rcu_read_lock();
>>> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
>>> -		memcg = root_mem_cgroup;
>>> +retry:
>>> +	memcg = folio_memcg(folio);
>>> +	if (unlikely(!css_tryget(&memcg->css)))
>>> +		goto retry;
>> If css_tryget() fails (e.g. refcount is 0), this loop spins
>> indefinitely
>> with the RCU read lock held. Is it guaranteed that folio_memcg() will
>> return a different, alive memcg in subsequent iterations?
>> If the folio is isolated (e.g. via migrate_misplaced_folio()), it
>> might be
>> missed by reparenting logic that iterates LRU lists. In that case, the
>> folio would continue pointing to the dying memcg, leading to a hard lockup.
>> Also, folio_memcg() calls __folio_memcg(), which reads
>> folio->memcg_data
>> without READ_ONCE(). Since this loop waits for memcg_data to be updated
>> by another CPU (reparenting), could the compiler hoist the load out of
>> the loop, preventing the update from being seen?
>> Finally, the previous code fell back to root_mem_cgroup on
>> failure. Is it
>> safe to remove that fallback? If css_tryget() fails unexpectedly, hanging
>> seems more severe than the previous behavior of warning and falling back.
>> 
>>>   	rcu_read_unlock();
>>>   	return memcg;
>>>   }
>> # Commit c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
>> No detailed regression report generated.
>> # Commit 65610d739afc ("writeback: prevent memory cgroup release in
>> writeback module")
>> <regression>
>>    <description>
>>      Tight infinite loop in get_mem_cgroup_from_folio due to missing READ_ONCE and cpu_relax
>>    </description>
>>    <file>
>>      mm/memcontrol.c
>>    </file>
>>    <lines>
>>      991-993
>>    </lines>
>>    <reason>
>>      The retry loop in get_mem_cgroup_from_folio() continuously checks folio_memcg(folio)
>>      and tries to get a css reference. There are two issues:
>>      1. folio_memcg() reads folio->memcg_data without READ_ONCE(). The compiler may
>>         hoist this load out of the loop, leading to an infinite loop if the memcg
>>         is offlined and the loop condition depends on reparenting updating memcg_data.
>>      2. The loop does not contain cpu_relax(). If the update to folio->memcg_data
>>         is delayed (e.g., reparenting takes time), this creates a tight loop that
>>         burns CPU and stalls the RCU read-side critical section, potentially causing
>>         RCU stalls or soft lockups.
>>    </reason>
>>    <classification>
>>      Performance/Stability (Soft Lockup / Infinite Loop)
>>    </classification>
>> </regression>
>> <regression>
>>    <description>
>>      Removal of NULL check in mem_cgroup_track_foreign_dirty introduces potential crash
>>    </description>
>>    <file>
>>      include/linux/memcontrol.h
>>    </file>
>>    <lines>
>>      1557
>>    </lines>
>>    <reason>
>>      The check `if (unlikely(memcg && &memcg->css != wb->memcg_css))` was replaced with
>>      `if (unlikely(&memcg->css != wb->memcg_css))`.
>>      While folio_memcg_charged() is checked earlier, there is a race window between that check
>>      and the call to folio_memcg() where the folio could be uncharged (setting memcg_data to 0).
>>      If folio_memcg() returns NULL, accessing `&memcg->css` will dereference the NULL pointer
>>      (since css is at offset 0). The previous code safely handled NULL memcg.
>>    </reason>
>>    <classification>
>>      Robustness / Null Pointer Dereference
>>    </classification>
>> </regression>
>> # Commit f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup
>> release in count_memcg_folio_events()")
>> No detailed regression report generated.
>> # Commit 91e4b3924291 ("mm: page_io: prevent memory cgroup release
>> in page_io module")
>> No detailed regression report generated.
>> # Commit bb45e352bb34 ("mm: migrate: prevent memory cgroup release
>> in folio_migrate_mapping()")
>> No detailed regression report generated.
>> # Commit a1189dd21a56 ("mm: mglru: prevent memory cgroup release in
>> mglru")
>> commit a1189dd21a568b45eaddafc17b982f6200ba28c0
>> Author: Muchun Song <muchun.song@linux.dev>
>> mm: mglru: prevent memory cgroup release in mglru
>> This patch uses the rcu read lock to safeguard against the release
>> of
>> the memory cgroup in mglru, as a preparatory measure for the reparenting
>> of the LRU pages.
>> 
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 354b19f7365d..814498a2c1bd 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>> [ ... ]
>>> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>>   		}
>>>   	}
>>>   +	rcu_read_lock();
>>> +	memcg = folio_memcg(folio);
>>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
>>> +	gen = lru_gen_from_seq(max_seq);
>>> +	mm_state = get_mm_state(lruvec);
>>> +
>>>   	arch_enter_lazy_mmu_mode();
>>>     	pte -= (addr - start) / PAGE_SIZE;
>> [ ... ]
>>> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>>   	if (mm_state && suitable_to_scan(i, young))
>>>   		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>>>   +	rcu_read_unlock();
>>> +
>>>   	return true;
>>>   }
>> This critical section appears to encompass calls to
>> walk_update_folio(),
>> which calls folio_activate(), which eventually calls folio_lruvec_lock_irq()
>> to acquire lruvec->lru_lock.
>> Since lruvec->lru_lock is a spinlock_t, it becomes a sleeping lock
>> on
>> PREEMPT_RT kernels. Is it safe to acquire a sleeping lock inside an RCU
>> read-side critical section on PREEMPT_RT?
>> # Commit 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup
>> release in mem_cgroup_swap_full()")
>> No detailed regression report generated.
>> # Commit de63e2b7a03e ("mm: workingset: prevent memory cgroup
>> release in lru_gen_eviction()")
>> No detailed regression report generated.
>> # Commit c0cce04fd4dc ("mm: thp: prevent memory cgroup release in
>> folio_split_queue_lock{_irqsave}()")
>> commit c0cce04fd4dca3a2b00e8846cbbeb52bd2f27625
>> Author: Qi Zheng <zhengqi.arch@bytedance.com>
>> mm: thp: prevent memory cgroup release in
>> folio_split_queue_lock{_irqsave}()
>> This patch wraps the queue lookup in rcu_read_lock/unlock to
>> safeguard
>> against memory cgroup release during the lookup, as folios may no longer
>> pin the memcg.
>> 
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 12b46215b30c..b9e6855ec0b6 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>>>     static struct deferred_split *folio_split_queue_lock(struct
>>> folio *folio)
>>>   {
>>> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>> +	struct deferred_split *queue;
>>> +
>>> +	rcu_read_lock();
>>> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>> +	rcu_read_unlock();
>>> +
>>> +	return queue;
>>>   }
>> If the folio does not pin the memcg (as suggested by the commit
>> message),
>> is it safe to return the queue pointer here? The queue is embedded in the
>> memcg, so if the memcg is freed after rcu_read_unlock(), wouldn't this
>> pointer become invalid while the caller still holds the lock?
>> 
>>>     static struct deferred_split *
>>>   folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>>   {
>>> -	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>>> +	struct deferred_split *queue;
>>> +
>>> +	rcu_read_lock();
>>> +	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>>> +	rcu_read_unlock();
>>> +
>>> +	return queue;
>>>   }
>> # Commit 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in
>> zswap_compress()")
>> No detailed regression report generated.
>> # Commit 80bbd804adde ("mm: workingset: prevent lruvec release in
>> workingset_refault()")
>> commit 80bbd804adde7198f6c7c5fdf7b4349996586597
>> Author: Muchun Song <muchun.song@linux.dev>
>> mm: workingset: prevent lruvec release in workingset_refault()
>> In the near future, a folio will no longer pin its corresponding
>> memory cgroup. So an lruvec returned by folio_lruvec() could be
>> released without the rcu read lock or a reference to its memory
>> cgroup.
>> 
>>> diff --git a/mm/workingset.c b/mm/workingset.c
>>> index 445fc634196d..427ca1a5625e 100644
>>> --- a/mm/workingset.c
>>> +++ b/mm/workingset.c
>>> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>>>   	 * locked to guarantee folio_memcg() stability throughout.
>>>   	 */
>>>   	nr = folio_nr_pages(folio);
>>> +	rcu_read_lock();
>>>   	lruvec = folio_lruvec(folio);
>>>   	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>>>     	if (!workingset_test_recent(shadow, file, &workingset, true))
>>> -		return;
>>> +		goto out;
>> Does calling workingset_test_recent() with flush=true while holding
>> rcu_read_lock() risk sleeping in atomic context?
>> workingset_test_recent() calls mem_cgroup_flush_stats_ratelimited()
>> when
>> flush is true.  The flush path leads to css_rstat_flush(), which calls
>> might_sleep() and cond_resched().
>> 
>>>     	folio_set_active(folio);
>>>   	workingset_age_nonresident(lruvec, nr);
>>> @@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
>>>   		lru_note_cost_refault(folio);
>>>   		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
>>>   	}
>>> +out:
>>> +	rcu_read_unlock();
>>>   }
>>>     /**
>> # Commit 9d232388a8e3 ("mm: zswap: prevent lruvec release in
>> zswap_folio_swapin()")
>> No detailed regression report generated.
>> # Commit d7cb66b9350d ("mm: swap: prevent lruvec release in
>> lru_gen_clear_refs()")
>> No detailed regression report generated.
>> # Commit 3e71e5543c8f ("mm: workingset: prevent lruvec release in
>> workingset_activation()")
>> No detailed regression report generated.
>> # Commit e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU
>> pages for lruvec lock")
>> No detailed regression report generated.
>> # Commit d04921029e6d ("mm: vmscan: prepare for reparenting
>> traditional LRU folios")
>> commit d04921029e6dc92824024356a9ed83a4f82d68dc
>> Author: Qi Zheng <zhengqi.arch@bytedance.com>
>> mm: vmscan: prepare for reparenting traditional LRU folios
>> This commit implements a specific function to reparent LRU folios of
>> a
>> child memcg to its parent memcg during the reparenting process.
>> 
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 814498a2c1bd..5fd0f97c3719 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>>>   			  lruvec_memcg(lruvec));
>>>   }
>>>   +#ifdef CONFIG_MEMCG
>>> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
>>> +				enum lru_list lru)
>>> +{
>>> +	int zid;
>>> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
>>> +
>>> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
>>> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
>>> +
>>> +	if (lru != LRU_UNEVICTABLE)
>>> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
>> Is it safe to modify src->lists and dst->lists here without holding
>> src->lru_lock or dst->lru_lock?  Since dst is the parent memcg, it is
>> likely live and subject to concurrent access by reclaim or allocation paths.
>> 
>>> +
>>> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>>> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
>>> +		mz_src->lru_zone_size[zid][lru] = 0;
>>> +	}
>> Also, lru_zone_size is typically protected by lru_lock.  Is it safe
>> to
>> update these counters without locking?
>> 
>>> +}
>>> +
>>> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>>> +{
>>> +	int nid;
>>> +
>>> +	for_each_node(nid) {
>>> +		enum lru_list lru;
>>> +		struct lruvec *src_lruvec, *dst_lruvec;
>>> +
>>> +		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
>>> +		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
>>> +		dst_lruvec->anon_cost += src_lruvec->anon_cost;
>>> +		dst_lruvec->file_cost += src_lruvec->file_cost;
>>> +
>>> +		for_each_lru(lru)
>>> +			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
>>> +	}
>>> +}
>>> +#endif
>> # Commit dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU
>> folios")
>> Here are the findings for the provided patch.
>> 1.  **Missing locking in `lru_gen_reparent_memcg`**
>> In `mm/vmscan.c`, the function `lru_gen_reparent_memcg` (and its
> helper `__lru_gen_reparent_memcg`) modifies the LRU lists and
> statistics of `lruvec` structures without holding the `lru_lock`.
>>      Specifically:
>>      - `__lru_gen_reparent_memcg` calls `__update_lru_size`.
>>      - `__update_lru_size` has a `lockdep_assert_held(&lruvec->lru_lock)`.
>>      - `__lru_gen_reparent_memcg` calls `list_splice_tail_init` to move folios from the source lruvec to the destination lruvec.
>>      The destination lruvec (`dst_lruvec`) belongs to the parent
>> memcg, which is active and shared. Modifying its lists and counters
>> without locking will lead to data corruption (list corruption) and
>> statistics drift, as well as triggering lockdep warnings.
>>      ```c
>>      void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>>      {
>>              int nid;
>>              for_each_node(nid) {
>>                      struct lruvec *src_lruvec, *dst_lruvec;
>>                      /* ... */
>>                      src_lruvec = get_lruvec(src, nid);
>>                      dst_lruvec = get_lruvec(dst, nid);
>>                      for (zone = 0; zone < MAX_NR_ZONES; zone++)
>>                              for (type = 0; type < ANON_AND_FILE; type++)
>>                                      __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
>>              }
>>      }
>>      ```
>>      The `lruvec` lock must be acquired for each node before calling
>> `__lru_gen_reparent_memcg`.
>> # Commit 18791acfe86b ("mm: memcontrol: refactor
>> memcg_reparent_objcgs()")
>> No detailed regression report generated.
>> # Commit 19f79273a812 ("mm: memcontrol: eliminate the problem of
>> dying memory cgroup for LRU folios")
>> file: mm/memcontrol.c
>> line: 224
>> type: Bug
>> category: Locking
>> description:
>> The `reparent_locks` function takes `lru_lock` for all NUMA nodes in
> a loop, utilizing `spin_lock_nested` with an incrementing `nest`
> counter. The `nest` counter increments for each lock taken (2 per
> node: src and dst). Since `MAX_LOCKDEP_SUBCLASSES` is 8, this code
> will trigger a Lockdep violation (and potential panic if
> `panic_on_warn` is set) on systems with more than 4 NUMA nodes (4
> nodes * 2 locks = 8 subclasses). Furthermore, locking all nodes
> simultaneously is a scalability regression, blocking LRU operations
> globally during reparenting.
>> file: include/linux/memcontrol.h
>> line: 430
>> type: Risk
>> category: API
>> description:
>> The implementation of `folio_memcg` has changed to rely on
> `obj_cgroup_memcg`, which enforces that `rcu_read_lock` or
> `cgroup_mutex` is held via a lockdep assertion. Previously, for LRU
> folios, the memcg pointer was directly embedded and stable under the
> folio lock. Existing callers (e.g., in `mm/workingset.c`) relied on
> the folio lock for stability. While some callers may hold RCU, others
> might not, leading to lockdep warnings or races where `folio_memcg`
> returns a pointer to a memcg that is being reparented or
> freed. Additionally, the return value of `folio_memcg` is no longer
> constant for a locked folio; it can change if reparenting occurs,
> potentially breaking logic that assumes identity equality over time.
>> # Commit 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru
>> maintenance helpers")
>> No detailed regression report generated.
>> # Summary
>> | Commit
>> | Regressions |
>> | :-------------------------------------------------------------------------------------------- | :---------- |
>> | e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")            | 0           |
>> | 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")                   | 0           |
>> | a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")                           | 0           |
>> | d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")             | 0           |
>> | 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")                                    | 1           |
>> | 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")                     | 0           |
>> | 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")             | 0           |
>> | 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()") | 1           |
>> | c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")               | 0           |
>> | 65610d739afc ("writeback: prevent memory cgroup release in writeback module")                 | 2           |
>> | f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")  | 0           |
>> | 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")                 | 0           |
>> | bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")        | 0           |
>> | a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")                            | 1           |
>> | 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")      | 0           |
>> | de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")          | 0           |
>> | c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()") | 1           |
>> | 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")                 | 0           |
>> | 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")               | 1           |
>> | 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")                    | 0           |
>> | d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")                     | 0           |
>> | 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")            | 0           |
>> | e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")            | 0           |
>> | d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")                   | 1           |
>> | dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")                             | 1           |
>> | 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")                             | 0           |
>> | 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")  | 2           |
>> | 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")                | 0           |


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  4:20     ` Roman Gushchin
@ 2025-12-30  4:25       ` Qi Zheng
  2025-12-30  4:48         ` Shakeel Butt
  0 siblings, 1 reply; 149+ messages in thread
From: Qi Zheng @ 2025-12-30  4:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, hughd, mhocko, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng



On 12/30/25 12:20 PM, Roman Gushchin wrote:
> Qi Zheng <qi.zheng@linux.dev> writes:
> 
>> Hi Roman,
>>
>> On 12/30/25 9:36 AM, Roman Gushchin wrote:
>>> Qi Zheng <qi.zheng@linux.dev> writes:
>>> Hey!
>>> I ran this patchset through AI review and it found few regression
>>> (which
>>> can of course be false positives). When you'll have time, can you,
>>> please, take a look and comment on which are real and which are not?
>>
>> Thank you for running the AI review for this patchset, but please do not
>> directly send the raw data from the AI review to the community, as this
>> is no different from automated review by a robot.
> 
> Hi Qi,
> 
> I don't know why you're so negative towards it. It's been great at

No, I don't object to having a dedicated robot to do this.

> finding pretty tricky bugs often missed by human reviewers. In no way
> it's a replacement for human reviews, but if a robot can find real
> issues and make the kernel more reliable and safe, I'm in.

I just think you should do a preliminary review of the AI review results
instead of sending them out directly. Otherwise, if everyone does this,
the community will be full of bots.

No?

> 
> In my experience it finds real problems pretty often. In my measurements
> at least 50% of the reported issues are real, and it matches the data
> reported by others. Some subsystems (e.g. bpf) pass all patches through
> the ai review.
> 
> In any case feel free to ignore it.
> 
>>
>> Thanks,
>> Qi
>>
>>> Thank you!
>>> --
>>> # Task
>>> Date: 2025-12-29 19:55:20
>>> Model: gemini-3-pro-preview
>>> Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation about bpf kfunc parameter validation")
>>> Commits to review:
>>> - e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
>>> - 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
>>> - a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
>>> - d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
>>> - 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
>>> - 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
>>> - 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
>>> - 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
>>> - c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
>>> - 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
>>> - f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
>>> - 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
>>> - bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
>>> - a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
>>> - 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
>>> - de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
>>> - c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
>>> - 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
>>> - 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
>>> - 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
>>> - d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
>>> - 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
>>> - e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
>>> - d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
>>> - dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
>>> - 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
>>> - 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
>>> - 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
>>> # Commit e416d881eea4 ("mm: memcontrol: remove dead code of checking
>>> parent memory cgroup")
>>> No detailed regression report generated.
>>> # Commit 8e00ae594254 ("mm: workingset: use folio_lruvec() in
>>> workingset_refault()")
>>> No detailed regression report generated.
>>> # Commit a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its
>>> variants")
>>> No detailed regression report generated.
>>> # Commit d57d548a3d6b ("mm: vmscan: prepare for the refactoring the
>>> move_folios_to_lru()")
>>> No detailed regression report generated.
>>> # Commit 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
>>> commit 9b02a45b6fc8a2253e2edcccb24c8485d3dca672
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: vmscan: refactor move_folios_to_lru()
>>> Refactors move_folios_to_lru() to handle its own locking via
>>> folio_lruvec_relock_irq(), removing the requirement for the caller to hold
>>> lruvec->lru_lock. Updates call sites to release the lock before calling.
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 49e566174621..354b19f7365d 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>> [ ... ]
>>>
>>>> @@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>>    			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>>>>    	}
>>>>    -	spin_lock_irq(&lruvec->lru_lock);
>>>> -
>>>> -	move_folios_to_lru(lruvec, &list);
>>>> +	move_folios_to_lru(&list);
>>>>      	walk = current->reclaim_state->mm_walk;
>>>>    	if (walk && walk->batched) {
>>>>    		walk->lruvec = lruvec;
>>>> +		spin_lock(&lruvec->lru_lock);
>>> move_folios_to_lru() drops the lock and enables interrupts via
>>> lruvec_unlock_irq() (which calls spin_unlock_irq()). Since lru_lock is
>>> an IRQ-safe lock (used elsewhere with spin_lock_irq()), is it safe to
>>> use spin_lock() here with interrupts enabled? If an interrupt occurs
>>> and tries to acquire lru_lock, could this lead to a deadlock?
>>>
>>>>    		reset_batch_size(walk);
>>>> +		spin_unlock(&lruvec->lru_lock);
>>>>    	}
>>>>      	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD +
>>>> reclaimer_offset(sc),
>>> [ ... ]
>>> # Commit 057fca991b78 ("mm: memcontrol: allocate object cgroup for
>>> non-kmem case")
>>> No detailed regression report generated.
>>> # Commit 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup
>>> for root memory cgroup")
>>> No detailed regression report generated.
>>> # Commit 8479f2eef536 ("mm: memcontrol: prevent memory cgroup
>>> release in get_mem_cgroup_from_folio()")
>>> commit 8479f2eef5363fb93cc593038e9bc54dd5068ef2
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: memcontrol: prevent memory cgroup release in
>>> get_mem_cgroup_from_folio()
>>> This patch modifies get_mem_cgroup_from_folio() to retry the lookup
>>> if
>>> locking the memcg fails, serving as preparation for future changes where
>>> folios may not pin their memcg.
>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 21b5aad34cae..431b3154c70c 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>>>>     */
>>>>    struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>>>>    {
>>>> -	struct mem_cgroup *memcg = folio_memcg(folio);
>>>> +	struct mem_cgroup *memcg;
>>>>      	if (mem_cgroup_disabled())
>>>>    		return NULL;
>>>>    +	if (!folio_memcg_charged(folio))
>>>> +		return root_mem_cgroup;
>>>> +
>>>>    	rcu_read_lock();
>>>> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
>>>> -		memcg = root_mem_cgroup;
>>>> +retry:
>>>> +	memcg = folio_memcg(folio);
>>>> +	if (unlikely(!css_tryget(&memcg->css)))
>>>> +		goto retry;
>>> If css_tryget() fails (e.g. refcount is 0), this loop spins
>>> indefinitely
>>> with the RCU read lock held. Is it guaranteed that folio_memcg() will
>>> return a different, alive memcg in subsequent iterations?
>>> If the folio is isolated (e.g. via migrate_misplaced_folio()), it
>>> might be
>>> missed by reparenting logic that iterates LRU lists. In that case, the
>>> folio would continue pointing to the dying memcg, leading to a hard lockup.
>>> Also, folio_memcg() calls __folio_memcg(), which reads
>>> folio->memcg_data
>>> without READ_ONCE(). Since this loop waits for memcg_data to be updated
>>> by another CPU (reparenting), could the compiler hoist the load out of
>>> the loop, preventing the update from being seen?
>>> Finally, the previous code fell back to root_mem_cgroup on
>>> failure. Is it
>>> safe to remove that fallback? If css_tryget() fails unexpectedly, hanging
>>> seems more severe than the previous behavior of warning and falling back.
>>>
>>>>    	rcu_read_unlock();
>>>>    	return memcg;
>>>>    }
>>> # Commit c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
>>> No detailed regression report generated.
>>> # Commit 65610d739afc ("writeback: prevent memory cgroup release in
>>> writeback module")
>>> <regression>
>>>     <description>
>>>       Tight infinite loop in get_mem_cgroup_from_folio due to missing READ_ONCE and cpu_relax
>>>     </description>
>>>     <file>
>>>       mm/memcontrol.c
>>>     </file>
>>>     <lines>
>>>       991-993
>>>     </lines>
>>>     <reason>
>>>       The retry loop in get_mem_cgroup_from_folio() continuously checks folio_memcg(folio)
>>>       and tries to get a css reference. There are two issues:
>>>       1. folio_memcg() reads folio->memcg_data without READ_ONCE(). The compiler may
>>>          hoist this load out of the loop, leading to an infinite loop if the memcg
>>>          is offlined and the loop condition depends on reparenting updating memcg_data.
>>>       2. The loop does not contain cpu_relax(). If the update to folio->memcg_data
>>>          is delayed (e.g., reparenting takes time), this creates a tight loop that
>>>          burns CPU and stalls the RCU read-side critical section, potentially causing
>>>          RCU stalls or soft lockups.
>>>     </reason>
>>>     <classification>
>>>       Performance/Stability (Soft Lockup / Infinite Loop)
>>>     </classification>
>>> </regression>
>>> <regression>
>>>     <description>
>>>       Removal of NULL check in mem_cgroup_track_foreign_dirty introduces potential crash
>>>     </description>
>>>     <file>
>>>       include/linux/memcontrol.h
>>>     </file>
>>>     <lines>
>>>       1557
>>>     </lines>
>>>     <reason>
>>>       The check `if (unlikely(memcg && &memcg->css != wb->memcg_css))` was replaced with
>>>       `if (unlikely(&memcg->css != wb->memcg_css))`.
>>>       While folio_memcg_charged() is checked earlier, there is a race window between that check
>>>       and the call to folio_memcg() where the folio could be uncharged (setting memcg_data to 0).
>>>       If folio_memcg() returns NULL, accessing `&memcg->css` will dereference the NULL pointer
>>>       (since css is at offset 0). The previous code safely handled NULL memcg.
>>>     </reason>
>>>     <classification>
>>>       Robustness / Null Pointer Dereference
>>>     </classification>
>>> </regression>
>>> # Commit f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup
>>> release in count_memcg_folio_events()")
>>> No detailed regression report generated.
>>> # Commit 91e4b3924291 ("mm: page_io: prevent memory cgroup release
>>> in page_io module")
>>> No detailed regression report generated.
>>> # Commit bb45e352bb34 ("mm: migrate: prevent memory cgroup release
>>> in folio_migrate_mapping()")
>>> No detailed regression report generated.
>>> # Commit a1189dd21a56 ("mm: mglru: prevent memory cgroup release in
>>> mglru")
>>> commit a1189dd21a568b45eaddafc17b982f6200ba28c0
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: mglru: prevent memory cgroup release in mglru
>>> This patch uses the rcu read lock to safeguard against the release
>>> of
>>> the memory cgroup in mglru, as a preparatory measure for the reparenting
>>> of the LRU pages.
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 354b19f7365d..814498a2c1bd 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>> [ ... ]
>>>> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>>>    		}
>>>>    	}
>>>>    +	rcu_read_lock();
>>>> +	memcg = folio_memcg(folio);
>>>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
>>>> +	gen = lru_gen_from_seq(max_seq);
>>>> +	mm_state = get_mm_state(lruvec);
>>>> +
>>>>    	arch_enter_lazy_mmu_mode();
>>>>      	pte -= (addr - start) / PAGE_SIZE;
>>> [ ... ]
>>>> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>>>    	if (mm_state && suitable_to_scan(i, young))
>>>>    		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>>>>    +	rcu_read_unlock();
>>>> +
>>>>    	return true;
>>>>    }
>>> This critical section appears to encompass calls to
>>> walk_update_folio(),
>>> which calls folio_activate(), which eventually calls folio_lruvec_lock_irq()
>>> to acquire lruvec->lru_lock.
>>> Since lruvec->lru_lock is a spinlock_t, it becomes a sleeping lock
>>> on
>>> PREEMPT_RT kernels. Is it safe to acquire a sleeping lock inside an RCU
>>> read-side critical section on PREEMPT_RT?
>>> # Commit 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup
>>> release in mem_cgroup_swap_full()")
>>> No detailed regression report generated.
>>> # Commit de63e2b7a03e ("mm: workingset: prevent memory cgroup
>>> release in lru_gen_eviction()")
>>> No detailed regression report generated.
>>> # Commit c0cce04fd4dc ("mm: thp: prevent memory cgroup release in
>>> folio_split_queue_lock{_irqsave}()")
>>> commit c0cce04fd4dca3a2b00e8846cbbeb52bd2f27625
>>> Author: Qi Zheng <zhengqi.arch@bytedance.com>
>>> mm: thp: prevent memory cgroup release in
>>> folio_split_queue_lock{_irqsave}()
>>> This patch wraps the queue lookup in rcu_read_lock/unlock to
>>> safeguard
>>> against memory cgroup release during the lookup, as folios may no longer
>>> pin the memcg.
>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 12b46215b30c..b9e6855ec0b6 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>>>>      static struct deferred_split *folio_split_queue_lock(struct
>>>> folio *folio)
>>>>    {
>>>> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>>> +	struct deferred_split *queue;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return queue;
>>>>    }
>>> If the folio does not pin the memcg (as suggested by the commit
>>> message),
>>> is it safe to return the queue pointer here? The queue is embedded in the
>>> memcg, so if the memcg is freed after rcu_read_unlock(), wouldn't this
>>> pointer become invalid while the caller still holds the lock?
>>>
>>>>      static struct deferred_split *
>>>>    folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>>>    {
>>>> -	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>>>> +	struct deferred_split *queue;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return queue;
>>>>    }
>>> # Commit 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in
>>> zswap_compress()")
>>> No detailed regression report generated.
>>> # Commit 80bbd804adde ("mm: workingset: prevent lruvec release in
>>> workingset_refault()")
>>> commit 80bbd804adde7198f6c7c5fdf7b4349996586597
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: workingset: prevent lruvec release in workingset_refault()
>>> In the near future, a folio will no longer pin its corresponding
>>> memory cgroup. So an lruvec returned by folio_lruvec() could be
>>> released without the rcu read lock or a reference to its memory
>>> cgroup.
>>>
>>>> diff --git a/mm/workingset.c b/mm/workingset.c
>>>> index 445fc634196d..427ca1a5625e 100644
>>>> --- a/mm/workingset.c
>>>> +++ b/mm/workingset.c
>>>> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>>>>    	 * locked to guarantee folio_memcg() stability throughout.
>>>>    	 */
>>>>    	nr = folio_nr_pages(folio);
>>>> +	rcu_read_lock();
>>>>    	lruvec = folio_lruvec(folio);
>>>>    	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>>>>      	if (!workingset_test_recent(shadow, file, &workingset, true))
>>>> -		return;
>>>> +		goto out;
>>> Does calling workingset_test_recent() with flush=true while holding
>>> rcu_read_lock() risk sleeping in atomic context?
>>> workingset_test_recent() calls mem_cgroup_flush_stats_ratelimited()
>>> when
>>> flush is true.  The flush path leads to css_rstat_flush(), which calls
>>> might_sleep() and cond_resched().
>>>
>>>>      	folio_set_active(folio);
>>>>    	workingset_age_nonresident(lruvec, nr);
>>>> @@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
>>>>    		lru_note_cost_refault(folio);
>>>>    		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
>>>>    	}
>>>> +out:
>>>> +	rcu_read_unlock();
>>>>    }
>>>>      /**
>>> # Commit 9d232388a8e3 ("mm: zswap: prevent lruvec release in
>>> zswap_folio_swapin()")
>>> No detailed regression report generated.
>>> # Commit d7cb66b9350d ("mm: swap: prevent lruvec release in
>>> lru_gen_clear_refs()")
>>> No detailed regression report generated.
>>> # Commit 3e71e5543c8f ("mm: workingset: prevent lruvec release in
>>> workingset_activation()")
>>> No detailed regression report generated.
>>> # Commit e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU
>>> pages for lruvec lock")
>>> No detailed regression report generated.
>>> # Commit d04921029e6d ("mm: vmscan: prepare for reparenting
>>> traditional LRU folios")
>>> commit d04921029e6dc92824024356a9ed83a4f82d68dc
>>> Author: Qi Zheng <zhengqi.arch@bytedance.com>
>>> mm: vmscan: prepare for reparenting traditional LRU folios
>>> This commit implements a specific function to reparent LRU folios of
>>> a
>>> child memcg to its parent memcg during the reparenting process.
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 814498a2c1bd..5fd0f97c3719 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>>>>    			  lruvec_memcg(lruvec));
>>>>    }
>>>>    +#ifdef CONFIG_MEMCG
>>>> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
>>>> +				enum lru_list lru)
>>>> +{
>>>> +	int zid;
>>>> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
>>>> +
>>>> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
>>>> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
>>>> +
>>>> +	if (lru != LRU_UNEVICTABLE)
>>>> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
>>> Is it safe to modify src->lists and dst->lists here without holding
>>> src->lru_lock or dst->lru_lock?  Since dst is the parent memcg, it is
>>> likely live and subject to concurrent access by reclaim or allocation paths.
>>>
>>>> +
>>>> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>>>> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
>>>> +		mz_src->lru_zone_size[zid][lru] = 0;
>>>> +	}
>>> Also, lru_zone_size is typically protected by lru_lock.  Is it safe
>>> to
>>> update these counters without locking?
>>>
>>>> +}
>>>> +
>>>> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>>>> +{
>>>> +	int nid;
>>>> +
>>>> +	for_each_node(nid) {
>>>> +		enum lru_list lru;
>>>> +		struct lruvec *src_lruvec, *dst_lruvec;
>>>> +
>>>> +		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
>>>> +		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
>>>> +		dst_lruvec->anon_cost += src_lruvec->anon_cost;
>>>> +		dst_lruvec->file_cost += src_lruvec->file_cost;
>>>> +
>>>> +		for_each_lru(lru)
>>>> +			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
>>>> +	}
>>>> +}
>>>> +#endif
>>> # Commit dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU
>>> folios")
>>> Here are the findings for the provided patch.
>>> 1.  **Missing locking in `lru_gen_reparent_memcg`**
>>> In `mm/vmscan.c`, the function `lru_gen_reparent_memcg` (and its
>> helper `__lru_gen_reparent_memcg`) modifies the LRU lists and
>> statistics of `lruvec` structures without holding the `lru_lock`.
>>>       Specifically:
>>>       - `__lru_gen_reparent_memcg` calls `__update_lru_size`.
>>>       - `__update_lru_size` has a `lockdep_assert_held(&lruvec->lru_lock)`.
>>>       - `__lru_gen_reparent_memcg` calls `list_splice_tail_init` to move folios from the source lruvec to the destination lruvec.
>>>       The destination lruvec (`dst_lruvec`) belongs to the parent
>>> memcg, which is active and shared. Modifying its lists and counters
>>> without locking will lead to data corruption (list corruption) and
>>> statistics drift, as well as triggering lockdep warnings.
>>>       ```c
>>>       void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>>>       {
>>>               int nid;
>>>               for_each_node(nid) {
>>>                       struct lruvec *src_lruvec, *dst_lruvec;
>>>                       /* ... */
>>>                       src_lruvec = get_lruvec(src, nid);
>>>                       dst_lruvec = get_lruvec(dst, nid);
>>>                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
>>>                               for (type = 0; type < ANON_AND_FILE; type++)
>>>                                       __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
>>>               }
>>>       }
>>>       ```
>>>       The `lruvec` lock must be acquired for each node before calling
>>> `__lru_gen_reparent_memcg`.
>>> # Commit 18791acfe86b ("mm: memcontrol: refactor
>>> memcg_reparent_objcgs()")
>>> No detailed regression report generated.
>>> # Commit 19f79273a812 ("mm: memcontrol: eliminate the problem of
>>> dying memory cgroup for LRU folios")
>>> file: mm/memcontrol.c
>>> line: 224
>>> type: Bug
>>> category: Locking
>>> description:
>>> The `reparent_locks` function takes `lru_lock` for all NUMA nodes in
>> a loop, utilizing `spin_lock_nested` with an incrementing `nest`
>> counter. The `nest` counter increments for each lock taken (2 per
>> node: src and dst). Since `MAX_LOCKDEP_SUBCLASSES` is 8, this code
>> will trigger a Lockdep violation (and potential panic if
>> `panic_on_warn` is set) on systems with more than 4 NUMA nodes (4
>> nodes * 2 locks = 8 subclasses). Furthermore, locking all nodes
>> simultaneously is a scalability regression, blocking LRU operations
>> globally during reparenting.
>>> file: include/linux/memcontrol.h
>>> line: 430
>>> type: Risk
>>> category: API
>>> description:
>>> The implementation of `folio_memcg` has changed to rely on
>> `obj_cgroup_memcg`, which enforces that `rcu_read_lock` or
>> `cgroup_mutex` is held via a lockdep assertion. Previously, for LRU
>> folios, the memcg pointer was directly embedded and stable under the
>> folio lock. Existing callers (e.g., in `mm/workingset.c`) relied on
>> the folio lock for stability. While some callers may hold RCU, others
>> might not, leading to lockdep warnings or races where `folio_memcg`
>> returns a pointer to a memcg that is being reparented or
>> freed. Additionally, the return value of `folio_memcg` is no longer
>> constant for a locked folio; it can change if reparenting occurs,
>> potentially breaking logic that assumes identity equality over time.
>>> # Commit 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru
>>> maintenance helpers")
>>> No detailed regression report generated.
>>> # Summary
>>> | Commit
>>> | Regressions |
>>> | :-------------------------------------------------------------------------------------------- | :---------- |
>>> | e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")            | 0           |
>>> | 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")                   | 0           |
>>> | a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")                           | 0           |
>>> | d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")             | 0           |
>>> | 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")                                    | 1           |
>>> | 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")                     | 0           |
>>> | 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")             | 0           |
>>> | 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()") | 1           |
>>> | c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")               | 0           |
>>> | 65610d739afc ("writeback: prevent memory cgroup release in writeback module")                 | 2           |
>>> | f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")  | 0           |
>>> | 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")                 | 0           |
>>> | bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")        | 0           |
>>> | a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")                            | 1           |
>>> | 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")      | 0           |
>>> | de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")          | 0           |
>>> | c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()") | 1           |
>>> | 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")                 | 0           |
>>> | 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")               | 1           |
>>> | 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")                    | 0           |
>>> | d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")                     | 0           |
>>> | 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")            | 0           |
>>> | e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")            | 0           |
>>> | d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")                   | 1           |
>>> | dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")                             | 1           |
>>> | 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")                             | 0           |
>>> | 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")  | 2           |
>>> | 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")                | 0           |



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  4:25       ` Qi Zheng
@ 2025-12-30  4:48         ` Shakeel Butt
  2025-12-30 16:46           ` Zi Yan
  0 siblings, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-30  4:48 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Roman Gushchin, hannes, hughd, mhocko, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
> 
> 
[...]
> > > 
> > > Thank you for running the AI review for this patchset, but please do not
> > > directly send the raw data from the AI review to the community, as this
> > > is no different from automated review by a robot.
> > 
> > Hi Qi,
> > 
> > I don't know why you're so negative towards it. It's been great at
> 
> No, I don't object to having a dedicated robot to do this.
> 
> > finding pretty tricky bugs often missed by human reviewers. In no way
> > it's a replacement for human reviews, but if a robot can find real
> > issues and make the kernel more reliable and safe, I'm in.
> 
> I just think you should do a preliminary review of the AI review results
> instead of sending them out directly. Otherwise, if everyone does this,
> the community will be full of bots.
> 
> No?
> 

We don't want too many bots but we definitely want at least one AI
review bot. Now we have precedence of BPF and networking subsystem and
the results I have seen are really good. I think the MM community needs
to come together and decide on the formalities of AI review process and
I see Roman is doing some early experimentation and result looks great.

Shakeel


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  4:48         ` Shakeel Butt
@ 2025-12-30 16:46           ` Zi Yan
  2025-12-30 18:13             ` Shakeel Butt
  2025-12-30 19:34             ` Roman Gushchin
  0 siblings, 2 replies; 149+ messages in thread
From: Zi Yan @ 2025-12-30 16:46 UTC (permalink / raw)
  To: Shakeel Butt, Roman Gushchin
  Cc: Qi Zheng, hannes, hughd, mhocko, muchun.song, david,
	lorenzo.stoakes, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On 29 Dec 2025, at 23:48, Shakeel Butt wrote:

> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>
>>
> [...]
>>>>
>>>> Thank you for running the AI review for this patchset, but please do not
>>>> directly send the raw data from the AI review to the community, as this
>>>> is no different from automated review by a robot.
>>>
>>> Hi Qi,
>>>
>>> I don't know why you're so negative towards it. It's been great at
>>
>> No, I don't object to having a dedicated robot to do this.
>>
>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>> it's a replacement for human reviews, but if a robot can find real
>>> issues and make the kernel more reliable and safe, I'm in.
>>
>> I just think you should do a preliminary review of the AI review results
>> instead of sending them out directly. Otherwise, if everyone does this,
>> the community will be full of bots.
>>
>> No?
>>
>
> We don't want too many bots but we definitely want at least one AI
> review bot. Now we have precedence of BPF and networking subsystem and
> the results I have seen are really good. I think the MM community needs
> to come together and decide on the formalities of AI review process and
> I see Roman is doing some early experimentation and result looks great.

Do you mind explaining why the result looks great? Does it mean you agree
the regressions pointed out by the AI review?

If we want to do AI reviews, the process should be improved instead of
just pasting the output from AI. In the initial stage, I think some human
intervention is needed, at least adding some comment on AI reviews would
be helpful. Otherwise, it looks like you agree completely with AI reviews.
In addition, “50% of the reported issues are real”, is the AI tossing
a coin when reporting issues?

When I am looking into the prompt part, I have the following questions:

1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
about bpf kfunc parameter validation”)”? I got the actual prompts
from irc: https://github.com/masoncl/review-prompts/tree/main, but it
should be provided along with the review for others to reproduce.

2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
	a. Page/Folio States, Large folios require per-page state tracking for
		Reference counts. I thought we want to get rid of per page refcount.
    b. Migration Invariants, NUMA balancing expects valid PTE combinations.
		PROTNONE PTEs are hardware invalid to trigger fault.
	c. TLB flushes required after PTE modifications. How about spurious fault
		handling?

3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
	but could not find any. What am I missing?



Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 16:46           ` Zi Yan
@ 2025-12-30 18:13             ` Shakeel Butt
  2025-12-30 19:18               ` Chris Mason
  2025-12-30 19:34             ` Roman Gushchin
  1 sibling, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-30 18:13 UTC (permalink / raw)
  To: Zi Yan
  Cc: Roman Gushchin, Qi Zheng, hannes, hughd, mhocko, muchun.song,
	david, lorenzo.stoakes, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng, Chris Mason

On Tue, Dec 30, 2025 at 11:46:02AM -0500, Zi Yan wrote:
> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
> 
> > On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
> >>
> >>
> > [...]
> >>>>
> >>>> Thank you for running the AI review for this patchset, but please do not
> >>>> directly send the raw data from the AI review to the community, as this
> >>>> is no different from automated review by a robot.
> >>>
> >>> Hi Qi,
> >>>
> >>> I don't know why you're so negative towards it. It's been great at
> >>
> >> No, I don't object to having a dedicated robot to do this.
> >>
> >>> finding pretty tricky bugs often missed by human reviewers. In no way
> >>> it's a replacement for human reviews, but if a robot can find real
> >>> issues and make the kernel more reliable and safe, I'm in.
> >>
> >> I just think you should do a preliminary review of the AI review results
> >> instead of sending them out directly. Otherwise, if everyone does this,
> >> the community will be full of bots.
> >>
> >> No?
> >>
> >
> > We don't want too many bots but we definitely want at least one AI
> > review bot. Now we have precedence of BPF and networking subsystem and
> > the results I have seen are really good. I think the MM community needs
> > to come together and decide on the formalities of AI review process and
> > I see Roman is doing some early experimentation and result looks great.
> 
> Do you mind explaining why the result looks great? Does it mean you agree
> the regressions pointed out by the AI review?

The result looks great because the points raised are really thought
provoking and things I have not thought about when I reviewed the
series. The lru lock without irq or the possible infinite retry loop in
get_mem_cgroup_from_folio() are two such examples. Are these real
regressions? I am not sure.

> 
> If we want to do AI reviews, the process should be improved instead of
> just pasting the output from AI. In the initial stage, I think some human
> intervention is needed, at least adding some comment on AI reviews would
> be helpful.

Yes I agree and therefore I mentioned we should discuss how should we
(MM community) should adopt the AI reviews.

> Otherwise, it looks like you agree completely with AI reviews.
> In addition, “50% of the reported issues are real”, is the AI tossing
> a coin when reporting issues?
> 
> When I am looking into the prompt part, I have the following questions:
> 
> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
> about bpf kfunc parameter validation”)”? I got the actual prompts
> from irc: https://github.com/masoncl/review-prompts/tree/main, but it
> should be provided along with the review for others to reproduce.

I agree and I didn't know that Chris's review prompts are used here.

Ccing Chris for your following questions.

> 
> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
> 	a. Page/Folio States, Large folios require per-page state tracking for
> 		Reference counts. I thought we want to get rid of per page refcount.
>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
> 		PROTNONE PTEs are hardware invalid to trigger fault.
> 	c. TLB flushes required after PTE modifications. How about spurious fault
> 		handling?
> 
> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
> 	but could not find any. What am I missing?
> 
> 


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 18:13             ` Shakeel Butt
@ 2025-12-30 19:18               ` Chris Mason
  2025-12-30 20:51                 ` Matthew Wilcox
  2025-12-30 21:07                 ` Zi Yan
  0 siblings, 2 replies; 149+ messages in thread
From: Chris Mason @ 2025-12-30 19:18 UTC (permalink / raw)
  To: Shakeel Butt, Zi Yan
  Cc: Roman Gushchin, Qi Zheng, hannes, hughd, mhocko, muchun.song,
	david, lorenzo.stoakes, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng, Chris Mason

On 12/30/25 1:13 PM, Shakeel Butt wrote:
> On Tue, Dec 30, 2025 at 11:46:02AM -0500, Zi Yan wrote:
>> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>>
>>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>>
>>>>
>>> [...]
>>>>>>
>>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>>> directly send the raw data from the AI review to the community, as this
>>>>>> is no different from automated review by a robot.
>>>>>
>>>>> Hi Qi,
>>>>>
>>>>> I don't know why you're so negative towards it. It's been great at
>>>>
>>>> No, I don't object to having a dedicated robot to do this.
>>>>
>>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>>> it's a replacement for human reviews, but if a robot can find real
>>>>> issues and make the kernel more reliable and safe, I'm in.
>>>>
>>>> I just think you should do a preliminary review of the AI review results
>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>> the community will be full of bots.

I do think it's awkward to dump the whole review output for the patch
series in a single message.  It looks like there's a sudden jump to XML?
It's better to reply to the individual patches with the comments
inline, which I think is where Roman is trying to go long term.

With BPF, it looks more like this:
https://lore.kernel.org/bpf/?q=AI+reviewed+your+patch

>>>>
>>>> No?
>>>>
>>>
>>> We don't want too many bots but we definitely want at least one AI
>>> review bot. Now we have precedence of BPF and networking subsystem and
>>> the results I have seen are really good. I think the MM community needs
>>> to come together and decide on the formalities of AI review process and
>>> I see Roman is doing some early experimentation and result looks great.
>>
>> Do you mind explaining why the result looks great? Does it mean you agree
>> the regressions pointed out by the AI review?
> 
> The result looks great because the points raised are really thought
> provoking and things I have not thought about when I reviewed the
> series. The lru lock without irq or the possible infinite retry loop in
> get_mem_cgroup_from_folio() are two such examples. Are these real
> regressions? I am not sure.
> 
>>
>> If we want to do AI reviews, the process should be improved instead of
>> just pasting the output from AI. In the initial stage, I think some human
>> intervention is needed, at least adding some comment on AI reviews would
>> be helpful.
> 
> Yes I agree and therefore I mentioned we should discuss how should we
> (MM community) should adopt the AI reviews.

What tends to happen with BPF is the patch author or bpf maintainers
point out problems with the reviews and I fix up the prompts over time.
The false positive rate is ~20% today (measured since late October), and
it's generally declining.

> 
>> Otherwise, it looks like you agree completely with AI reviews.
>> In addition, “50% of the reported issues are real”, is the AI tossing
>> a coin when reporting issues?
>>
>> When I am looking into the prompt part, I have the following questions:
>>
>> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
>> about bpf kfunc parameter validation”)”? I got the actual prompts
>> from irc: https://github.com/masoncl/review-prompts/tree/main , but it
>> should be provided along with the review for others to reproduce.
> 
> I agree and I didn't know that Chris's review prompts are used here.
> 
> Ccing Chris for your following questions.
> 
>>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
>> 	a. Page/Folio States, Large folios require per-page state tracking for
>> 		Reference counts. I thought we want to get rid of per page refcount.

Early in prompt development I hand picked a few hundred patches from
6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
catch the known bugs.  That's where that rule came from, but as you say
there's a risk this information gets old.  Do we want to get rid of per
page refcounts or have we done it?  (more on that at the bottom of the
email).

>>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
>> 		PROTNONE PTEs are hardware invalid to trigger fault.
>> 	c. TLB flushes required after PTE modifications. How about spurious fault
>> 		handling?
>>

AI generally uses them as a starting point and fills in details, but
I agree the MM bits are pretty minimal.

>> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
>> 	but could not find any. What am I missing?

I think the only cgroup specific information I've needed so far is
explaining css_get() and the section on __GFP_ACCOUNT.  I actively try
to avoid adding details unless we're missing bugs or generating false
positives.

As an example of how I'd fix the prompt if the per page state tracking
were causing problems (and if we didn't want to just remove it), I asked
claude to analyze how it is still used.  The output is below, I'd double
check things as best I could, shorten into prompt form and send to the
list for review.

Per-Page Tracking in Large Folios - Analysis
=============================================

Based on analysis of mm/*.c files and commit history, MM-004's claim is
still partially true - large folios do need per-page tracking for some
bits, though recent work has significantly reduced this.


Bits That Still Require Per-Page Tracking
------------------------------------------

1. PG_hwpoison (include/linux/page-flags.h:118)

   Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
   individual pages within a large folio when hardware memory corruption
   is detected.

   The folio_test_has_hwpoisoned() flag on the second page indicates at
   least one subpage is poisoned, but does not identify which one.

   When splitting a large folio, page_range_has_hwpoisoned() in
   mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
   for each:

       static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
       {
           for (; nr_pages; page++, nr_pages--)
               if (PageHWPoison(page))
                   return true;
           return false;
       }

   Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
   subpages when unmapping or migrating.

2. PG_anon_exclusive (include/linux/page-flags.h:146)

   Per the comment at include/linux/page-flags.h:139-145:

       "Depending on the way an anonymous folio can be mapped into a page
       table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
       THP), PG_anon_exclusive may be set only for the head page or for
       tail pages of an anonymous folio. For now, we only expect it to be
       set on tail pages for PTE-mapped THP."

   Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
   PTE-level mappings, it iterates through each page:

       for (i = 0; i < nr_pages; i++)
           SetPageAnonExclusive(page + i);

   HugeTLB stores this on head page only (see PageAnonExclusive() at
   include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
   per-page tracking.


Recent Changes - Per-Page Mapcount Removed
------------------------------------------

Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
large folios") by David Hildenbrand (March 2025) introduced
CONFIG_NO_PAGE_MAPCOUNT which:

  - Stops maintaining per-page mapcounts in tail pages of large folios
  - Tail page mapcount is now always logically 0 (-1 value)
  - Removed _nr_pages_mapped tracking

This was a significant simplification, but it does not affect the
per-page flag tracking described above.


Flags Stored in Second Page Only (Not Per-Page)
-----------------------------------------------

These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
the entire folio, not individual pages:

  - PG_has_hwpoisoned  - indicates some page in folio is poisoned
  - PG_large_rmappable - folio is rmappable
  - PG_partially_mapped - folio is partially mapped

See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.


Conclusion
----------

While per-page mapcount has been eliminated, PG_hwpoison and
PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
in large folios. MM-004's claim remains valid for these specific bits.

Key source files:
  - include/linux/page-flags.h (flag definitions and accessors)
  - mm/huge_memory.c (folio split handling)
  - mm/rmap.c (reverse mapping with per-page exclusive tracking)


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 19:18               ` Chris Mason
@ 2025-12-30 20:51                 ` Matthew Wilcox
  2025-12-30 21:10                   ` Chris Mason
  2025-12-30 22:03                   ` Roman Gushchin
  2025-12-30 21:07                 ` Zi Yan
  1 sibling, 2 replies; 149+ messages in thread
From: Matthew Wilcox @ 2025-12-30 20:51 UTC (permalink / raw)
  To: Chris Mason
  Cc: Shakeel Butt, Zi Yan, Roman Gushchin, Qi Zheng, hannes, hughd,
	mhocko, muchun.song, david, lorenzo.stoakes, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang,
	linux-mm, linux-kernel, cgroups, Qi Zheng, Chris Mason

On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
> >>>> I just think you should do a preliminary review of the AI review results
> >>>> instead of sending them out directly. Otherwise, if everyone does this,
> >>>> the community will be full of bots.
> 
> I do think it's awkward to dump the whole review output for the patch
> series in a single message.  It looks like there's a sudden jump to XML?
> It's better to reply to the individual patches with the comments
> inline, which I think is where Roman is trying to go long term.

I don't know what Roman's trying to do long-term, but his email
that started this thread was so badly written that it was offensive.
Had it been sent to me, I would have responded in the style of Arkell
v Pressdram.

> With BPF, it looks more like this:
> https://lore.kernel.org/bpf/?q=AI+reviewed+your+patch

That's actually useful.

> >>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
> >> 	a. Page/Folio States, Large folios require per-page state tracking for
> >> 		Reference counts. I thought we want to get rid of per page refcount.
> 
> Early in prompt development I hand picked a few hundred patches from
> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
> catch the known bugs.  That's where that rule came from, but as you say
> there's a risk this information gets old.  Do we want to get rid of per
> page refcounts or have we done it?  (more on that at the bottom of the
> email).

There is no such thing as a per-page reference count.  Any attempt to
access the page reference count redirects to the folio refcount.  This
has been the case since 2016 (four years before folios existed).  See
commit ddc58f27f9ee.

We do want to git rid of calls to get_page() and put_page() for a
variety of reasons that will be long and painful to write out.

> As an example of how I'd fix the prompt if the per page state tracking
> were causing problems (and if we didn't want to just remove it), I asked
> claude to analyze how it is still used.  The output is below, I'd double
> check things as best I could, shorten into prompt form and send to the
> list for review.
> 
> Per-Page Tracking in Large Folios - Analysis
> =============================================
> 
> Based on analysis of mm/*.c files and commit history, MM-004's claim is
> still partially true - large folios do need per-page tracking for some
> bits, though recent work has significantly reduced this.
> 
> 
> Bits That Still Require Per-Page Tracking
> ------------------------------------------
> 
> 1. PG_hwpoison (include/linux/page-flags.h:118)
> 
>    Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
>    individual pages within a large folio when hardware memory corruption
>    is detected.
> 
>    The folio_test_has_hwpoisoned() flag on the second page indicates at
>    least one subpage is poisoned, but does not identify which one.
> 
>    When splitting a large folio, page_range_has_hwpoisoned() in
>    mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
>    for each:
> 
>        static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>        {
>            for (; nr_pages; page++, nr_pages--)
>                if (PageHWPoison(page))
>                    return true;
>            return false;
>        }
> 
>    Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
>    subpages when unmapping or migrating.
> 
> 2. PG_anon_exclusive (include/linux/page-flags.h:146)
> 
>    Per the comment at include/linux/page-flags.h:139-145:
> 
>        "Depending on the way an anonymous folio can be mapped into a page
>        table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>        THP), PG_anon_exclusive may be set only for the head page or for
>        tail pages of an anonymous folio. For now, we only expect it to be
>        set on tail pages for PTE-mapped THP."
> 
>    Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
>    PTE-level mappings, it iterates through each page:
> 
>        for (i = 0; i < nr_pages; i++)
>            SetPageAnonExclusive(page + i);
> 
>    HugeTLB stores this on head page only (see PageAnonExclusive() at
>    include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
>    per-page tracking.
> 
> 
> Recent Changes - Per-Page Mapcount Removed
> ------------------------------------------
> 
> Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
> large folios") by David Hildenbrand (March 2025) introduced
> CONFIG_NO_PAGE_MAPCOUNT which:
> 
>   - Stops maintaining per-page mapcounts in tail pages of large folios
>   - Tail page mapcount is now always logically 0 (-1 value)
>   - Removed _nr_pages_mapped tracking
> 
> This was a significant simplification, but it does not affect the
> per-page flag tracking described above.
> 
> 
> Flags Stored in Second Page Only (Not Per-Page)
> -----------------------------------------------
> 
> These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
> the entire folio, not individual pages:
> 
>   - PG_has_hwpoisoned  - indicates some page in folio is poisoned
>   - PG_large_rmappable - folio is rmappable
>   - PG_partially_mapped - folio is partially mapped
> 
> See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.
> 
> 
> Conclusion
> ----------
> 
> While per-page mapcount has been eliminated, PG_hwpoison and
> PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
> in large folios. MM-004's claim remains valid for these specific bits.
> 
> Key source files:
>   - include/linux/page-flags.h (flag definitions and accessors)
>   - mm/huge_memory.c (folio split handling)
>   - mm/rmap.c (reverse mapping with per-page exclusive tracking)

This is pretty good and yet dangerously wrong in some missed nuances.
Which probably summarises the state of the art nicely ;-)

To start with, all flags marked as PF_ANY are set on individual pages
rather than only the folio.  So that's currently:

PAGEFLAG(Private, private, PF_ANY)
PAGEFLAG(HWPoison, hwpoison, PF_ANY)
PAGEFLAG(VmemmapSelfHosted, vmemmap_self_hosted, PF_ANY)
__SETPAGEFLAG(Head, head, PF_ANY)
        return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);

Now, PG_private is a flag we're trying to get rid of -- it should be
identical to (folio->private != NULL), so I haven't made any effort
to convert that from being PF_ANY.  I'm not too unhappy that your chatbot
doesn't talk about PG_private, but a more full answer would include
mention of this.

PG_hwpoison and PG_anon_exclusive will remain per-page state in a
memdesc world, and there's a plan to handle those, so there's no need to
eliminate them.

PG_vmemmap_self_hosted is a very, very internal flag.  It's OK to not
know about it.

PG_head has to remain per-page state for now for obvious reasons ;-)
In a memdesc word, there will be no way to ask if a page is the first
page of an allocation, so this flag will not be needed.

I believe there are some subtleties around PG_hwpoison and hugetlb that
are not fully captured above, but I'm not convinced of my ability to
state definitely what they currently are, so I'll leve that for somebody
else to do.

---

Looking through your prompts, there are definitely some conditions that
could be profitably added.  For example, pages which are mapped into
page tables must be PG_uptodate (we have various assertions in the MM
code that this is true and they occasionally trigger).


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 20:51                 ` Matthew Wilcox
@ 2025-12-30 21:10                   ` Chris Mason
  2025-12-30 22:30                     ` Roman Gushchin
  2025-12-30 22:03                   ` Roman Gushchin
  1 sibling, 1 reply; 149+ messages in thread
From: Chris Mason @ 2025-12-30 21:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shakeel Butt, Zi Yan, Roman Gushchin, Qi Zheng, hannes, hughd,
	mhocko, muchun.song, david, lorenzo.stoakes, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang,
	linux-mm, linux-kernel, cgroups, Qi Zheng, Chris Mason

On 12/30/25 3:51 PM, Matthew Wilcox wrote:
> On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
>>>>>> I just think you should do a preliminary review of the AI review results
>>>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>>>> the community will be full of bots.>>>>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md  , are you sure the patterns are all right?
>>>> 	a. Page/Folio States, Large folios require per-page state tracking for
>>>> 		Reference counts. I thought we want to get rid of per page refcount.
>>

[ ... ]

>> Early in prompt development I hand picked a few hundred patches from
>> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
>> catch the known bugs.  That's where that rule came from, but as you say
>> there's a risk this information gets old.  Do we want to get rid of per
>> page refcounts or have we done it?  (more on that at the bottom of the
>> email).
> 
> There is no such thing as a per-page reference count.  Any attempt to
> access the page reference count redirects to the folio refcount.  This
> has been the case since 2016 (four years before folios existed).  See
> commit ddc58f27f9ee.
> 
Ok, I'm half out the door to vacation, but I'll fix up the mm.md to
better reflect reality when I get back.

> We do want to git rid of calls to get_page() and put_page() for a
> variety of reasons that will be long and painful to write out.
> 
>> As an example of how I'd fix the prompt if the per page state tracking
>> were causing problems (and if we didn't want to just remove it), I asked
>> claude to analyze how it is still used.  The output is below, I'd double
>> check things as best I could, shorten into prompt form and send to the
>> list for review.
>>
>> Per-Page Tracking in Large Folios - Analysis
>> =============================================
>>
>> Based on analysis of mm/*.c files and commit history, MM-004's claim is
>> still partially true - large folios do need per-page tracking for some
>> bits, though recent work has significantly reduced this.
>>
>>
>> Bits That Still Require Per-Page Tracking
>> ------------------------------------------
>>
>> 1. PG_hwpoison (include/linux/page-flags.h:118)
>>
>>    Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
>>    individual pages within a large folio when hardware memory corruption
>>    is detected.
>>
>>    The folio_test_has_hwpoisoned() flag on the second page indicates at
>>    least one subpage is poisoned, but does not identify which one.
>>
>>    When splitting a large folio, page_range_has_hwpoisoned() in
>>    mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
>>    for each:
>>
>>        static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>        {
>>            for (; nr_pages; page++, nr_pages--)
>>                if (PageHWPoison(page))
>>                    return true;
>>            return false;
>>        }
>>
>>    Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
>>    subpages when unmapping or migrating.
>>
>> 2. PG_anon_exclusive (include/linux/page-flags.h:146)
>>
>>    Per the comment at include/linux/page-flags.h:139-145:
>>
>>        "Depending on the way an anonymous folio can be mapped into a page
>>        table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>>        THP), PG_anon_exclusive may be set only for the head page or for
>>        tail pages of an anonymous folio. For now, we only expect it to be
>>        set on tail pages for PTE-mapped THP."
>>
>>    Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
>>    PTE-level mappings, it iterates through each page:
>>
>>        for (i = 0; i < nr_pages; i++)
>>            SetPageAnonExclusive(page + i);
>>
>>    HugeTLB stores this on head page only (see PageAnonExclusive() at
>>    include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
>>    per-page tracking.
>>
>>
>> Recent Changes - Per-Page Mapcount Removed
>> ------------------------------------------
>>
>> Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
>> large folios") by David Hildenbrand (March 2025) introduced
>> CONFIG_NO_PAGE_MAPCOUNT which:
>>
>>   - Stops maintaining per-page mapcounts in tail pages of large folios
>>   - Tail page mapcount is now always logically 0 (-1 value)
>>   - Removed _nr_pages_mapped tracking
>>
>> This was a significant simplification, but it does not affect the
>> per-page flag tracking described above.
>>
>>
>> Flags Stored in Second Page Only (Not Per-Page)
>> -----------------------------------------------
>>
>> These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
>> the entire folio, not individual pages:
>>
>>   - PG_has_hwpoisoned  - indicates some page in folio is poisoned
>>   - PG_large_rmappable - folio is rmappable
>>   - PG_partially_mapped - folio is partially mapped
>>
>> See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.
>>
>>
>> Conclusion
>> ----------
>>
>> While per-page mapcount has been eliminated, PG_hwpoison and
>> PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
>> in large folios. MM-004's claim remains valid for these specific bits.
>>
>> Key source files:
>>   - include/linux/page-flags.h (flag definitions and accessors)
>>   - mm/huge_memory.c (folio split handling)
>>   - mm/rmap.c (reverse mapping with per-page exclusive tracking)
> 
> This is pretty good and yet dangerously wrong in some missed nuances.
> Which probably summarises the state of the art nicely ;-)
> 

Yeah, that's generally how it goes.  It's a good starting point, but the
details have to get verified.

> To start with, all flags marked as PF_ANY are set on individual pages
> rather than only the folio.  So that's currently:
> 
> PAGEFLAG(Private, private, PF_ANY)
> PAGEFLAG(HWPoison, hwpoison, PF_ANY)
> PAGEFLAG(VmemmapSelfHosted, vmemmap_self_hosted, PF_ANY)
> __SETPAGEFLAG(Head, head, PF_ANY)
>         return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> 
> Now, PG_private is a flag we're trying to get rid of -- it should be
> identical to (folio->private != NULL), so I haven't made any effort
> to convert that from being PF_ANY.  I'm not too unhappy that your chatbot
> doesn't talk about PG_private, but a more full answer would include
> mention of this.
> 
> PG_hwpoison and PG_anon_exclusive will remain per-page state in a
> memdesc world, and there's a plan to handle those, so there's no need to
> eliminate them.
> 
> PG_vmemmap_self_hosted is a very, very internal flag.  It's OK to not
> know about it.
> 
> PG_head has to remain per-page state for now for obvious reasons ;-)
> In a memdesc word, there will be no way to ask if a page is the first
> page of an allocation, so this flag will not be needed.
> 
> I believe there are some subtleties around PG_hwpoison and hugetlb that
> are not fully captured above, but I'm not convinced of my ability to
> state definitely what they currently are, so I'll leve that for somebody
> else to do.

Thanks for taking the time to debug the output.  I think before trying
to put this into the prompt, I'd step back and ask:

- What bugs do we want AI to catch?  I can see knowing these large folio
details really helping find bugs in the transition, or debug bug reports
down the line, so it feels like an important detail to record.  It's
definitely something AI won't know all by itself.

- What details is AI getting wrong in current reviews?  We don't really
have this answer yet, but if AI isn't getting it wrong, there's no
reason to try and teach it more.

-chris



^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 21:10                   ` Chris Mason
@ 2025-12-30 22:30                     ` Roman Gushchin
  0 siblings, 0 replies; 149+ messages in thread
From: Roman Gushchin @ 2025-12-30 22:30 UTC (permalink / raw)
  To: Chris Mason
  Cc: Matthew Wilcox, Shakeel Butt, Zi Yan, Qi Zheng, hannes, hughd,
	mhocko, muchun.song, david, lorenzo.stoakes, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang,
	linux-mm, linux-kernel, cgroups, Qi Zheng, Chris Mason

Chris Mason <clm@meta.com> writes:

> On 12/30/25 3:51 PM, Matthew Wilcox wrote:
>> On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
>>>>>>> I just think you should do a preliminary review of the AI review results
>>>>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>>>>> the community will be full of bots.>>>>>> 2. Looking at the mm
> prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md ,
> are you sure the patterns are all right?
>>>>> 	a. Page/Folio States, Large folios require per-page state tracking for
>>>>> 		Reference counts. I thought we want to get rid of per page refcount.
>>>
>
> [ ... ]
>
>>> Early in prompt development I hand picked a few hundred patches from
>>> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
>>> catch the known bugs.  That's where that rule came from, but as you say
>>> there's a risk this information gets old.  Do we want to get rid of per
>>> page refcounts or have we done it?  (more on that at the bottom of the
>>> email).
>> 
>> There is no such thing as a per-page reference count.  Any attempt to
>> access the page reference count redirects to the folio refcount.  This
>> has been the case since 2016 (four years before folios existed).  See
>> commit ddc58f27f9ee.
>> 
> Ok, I'm half out the door to vacation, but I'll fix up the mm.md to
> better reflect reality when I get back.
>
>> We do want to git rid of calls to get_page() and put_page() for a
>> variety of reasons that will be long and painful to write out.
>> 
>>> As an example of how I'd fix the prompt if the per page state tracking
>>> were causing problems (and if we didn't want to just remove it), I asked
>>> claude to analyze how it is still used.  The output is below, I'd double
>>> check things as best I could, shorten into prompt form and send to the
>>> list for review.
>>>
>>> Per-Page Tracking in Large Folios - Analysis
>>> =============================================
>>>
>>> Based on analysis of mm/*.c files and commit history, MM-004's claim is
>>> still partially true - large folios do need per-page tracking for some
>>> bits, though recent work has significantly reduced this.
>>>
>>>
>>> Bits That Still Require Per-Page Tracking
>>> ------------------------------------------
>>>
>>> 1. PG_hwpoison (include/linux/page-flags.h:118)
>>>
>>>    Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
>>>    individual pages within a large folio when hardware memory corruption
>>>    is detected.
>>>
>>>    The folio_test_has_hwpoisoned() flag on the second page indicates at
>>>    least one subpage is poisoned, but does not identify which one.
>>>
>>>    When splitting a large folio, page_range_has_hwpoisoned() in
>>>    mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
>>>    for each:
>>>
>>>        static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>>        {
>>>            for (; nr_pages; page++, nr_pages--)
>>>                if (PageHWPoison(page))
>>>                    return true;
>>>            return false;
>>>        }
>>>
>>>    Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
>>>    subpages when unmapping or migrating.
>>>
>>> 2. PG_anon_exclusive (include/linux/page-flags.h:146)
>>>
>>>    Per the comment at include/linux/page-flags.h:139-145:
>>>
>>>        "Depending on the way an anonymous folio can be mapped into a page
>>>        table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>>>        THP), PG_anon_exclusive may be set only for the head page or for
>>>        tail pages of an anonymous folio. For now, we only expect it to be
>>>        set on tail pages for PTE-mapped THP."
>>>
>>>    Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
>>>    PTE-level mappings, it iterates through each page:
>>>
>>>        for (i = 0; i < nr_pages; i++)
>>>            SetPageAnonExclusive(page + i);
>>>
>>>    HugeTLB stores this on head page only (see PageAnonExclusive() at
>>>    include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
>>>    per-page tracking.
>>>
>>>
>>> Recent Changes - Per-Page Mapcount Removed
>>> ------------------------------------------
>>>
>>> Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
>>> large folios") by David Hildenbrand (March 2025) introduced
>>> CONFIG_NO_PAGE_MAPCOUNT which:
>>>
>>>   - Stops maintaining per-page mapcounts in tail pages of large folios
>>>   - Tail page mapcount is now always logically 0 (-1 value)
>>>   - Removed _nr_pages_mapped tracking
>>>
>>> This was a significant simplification, but it does not affect the
>>> per-page flag tracking described above.
>>>
>>>
>>> Flags Stored in Second Page Only (Not Per-Page)
>>> -----------------------------------------------
>>>
>>> These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
>>> the entire folio, not individual pages:
>>>
>>>   - PG_has_hwpoisoned  - indicates some page in folio is poisoned
>>>   - PG_large_rmappable - folio is rmappable
>>>   - PG_partially_mapped - folio is partially mapped
>>>
>>> See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.
>>>
>>>
>>> Conclusion
>>> ----------
>>>
>>> While per-page mapcount has been eliminated, PG_hwpoison and
>>> PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
>>> in large folios. MM-004's claim remains valid for these specific bits.
>>>
>>> Key source files:
>>>   - include/linux/page-flags.h (flag definitions and accessors)
>>>   - mm/huge_memory.c (folio split handling)
>>>   - mm/rmap.c (reverse mapping with per-page exclusive tracking)
>> 
>> This is pretty good and yet dangerously wrong in some missed nuances.
>> Which probably summarises the state of the art nicely ;-)
>> 
>
> Yeah, that's generally how it goes.  It's a good starting point, but the
> details have to get verified.
>
>> To start with, all flags marked as PF_ANY are set on individual pages
>> rather than only the folio.  So that's currently:
>> 
>> PAGEFLAG(Private, private, PF_ANY)
>> PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>> PAGEFLAG(VmemmapSelfHosted, vmemmap_self_hosted, PF_ANY)
>> __SETPAGEFLAG(Head, head, PF_ANY)
>>         return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>> 
>> Now, PG_private is a flag we're trying to get rid of -- it should be
>> identical to (folio->private != NULL), so I haven't made any effort
>> to convert that from being PF_ANY.  I'm not too unhappy that your chatbot
>> doesn't talk about PG_private, but a more full answer would include
>> mention of this.
>> 
>> PG_hwpoison and PG_anon_exclusive will remain per-page state in a
>> memdesc world, and there's a plan to handle those, so there's no need to
>> eliminate them.
>> 
>> PG_vmemmap_self_hosted is a very, very internal flag.  It's OK to not
>> know about it.
>> 
>> PG_head has to remain per-page state for now for obvious reasons ;-)
>> In a memdesc word, there will be no way to ask if a page is the first
>> page of an allocation, so this flag will not be needed.
>> 
>> I believe there are some subtleties around PG_hwpoison and hugetlb that
>> are not fully captured above, but I'm not convinced of my ability to
>> state definitely what they currently are, so I'll leve that for somebody
>> else to do.
>
> Thanks for taking the time to debug the output.  I think before trying
> to put this into the prompt, I'd step back and ask:
>
> - What bugs do we want AI to catch?  I can see knowing these large folio
> details really helping find bugs in the transition, or debug bug reports
> down the line, so it feels like an important detail to record.  It's
> definitely something AI won't know all by itself.
>
> - What details is AI getting wrong in current reviews?  We don't really
> have this answer yet, but if AI isn't getting it wrong, there's no
> reason to try and teach it more.

Also, we probably don't want to hard-code the current state of the art,
as ideally we should be able to review old patches as well (e.g. on top
of LTS or custom private trees). So in the perfect world we want to provide
some meta-ideas on how the LLM can decode the rules from the code itself.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 20:51                 ` Matthew Wilcox
  2025-12-30 21:10                   ` Chris Mason
@ 2025-12-30 22:03                   ` Roman Gushchin
  1 sibling, 0 replies; 149+ messages in thread
From: Roman Gushchin @ 2025-12-30 22:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chris Mason, Shakeel Butt, Zi Yan, Qi Zheng, hannes, hughd,
	mhocko, muchun.song, david, lorenzo.stoakes, harry.yoo,
	imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
	chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang,
	linux-mm, linux-kernel, cgroups, Qi Zheng, Chris Mason

Matthew Wilcox <willy@infradead.org> writes:

> On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
>> >>>> I just think you should do a preliminary review of the AI review results
>> >>>> instead of sending them out directly. Otherwise, if everyone does this,
>> >>>> the community will be full of bots.
>> 
>> I do think it's awkward to dump the whole review output for the patch
>> series in a single message.  It looks like there's a sudden jump to
>> XML?
>> It's better to reply to the individual patches with the comments
>> inline, which I think is where Roman is trying to go long term.
>
> I don't know what Roman's trying to do long-term, but his email
> that started this thread was so badly written that it was offensive.
> Had it been sent to me, I would have responded in the style of Arkell
> v Pressdram.

It felt awkward to send a bunch of emails from myself, all beginning
with the "I ran the ai review and here is the output" header.
Once we have a bot, obviously it's better to answer individual emails,
as the bpf subsystem does.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 19:18               ` Chris Mason
  2025-12-30 20:51                 ` Matthew Wilcox
@ 2025-12-30 21:07                 ` Zi Yan
  1 sibling, 0 replies; 149+ messages in thread
From: Zi Yan @ 2025-12-30 21:07 UTC (permalink / raw)
  To: Chris Mason, Roman Gushchin
  Cc: Shakeel Butt, Qi Zheng, hannes, hughd, mhocko, muchun.song,
	david, lorenzo.stoakes, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng, Chris Mason

On 30 Dec 2025, at 14:18, Chris Mason wrote:

> On 12/30/25 1:13 PM, Shakeel Butt wrote:
>> On Tue, Dec 30, 2025 at 11:46:02AM -0500, Zi Yan wrote:
>>> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>>>
>>>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>>>
>>>>>
>>>> [...]
>>>>>>>
>>>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>>>> directly send the raw data from the AI review to the community, as this
>>>>>>> is no different from automated review by a robot.
>>>>>>
>>>>>> Hi Qi,
>>>>>>
>>>>>> I don't know why you're so negative towards it. It's been great at
>>>>>
>>>>> No, I don't object to having a dedicated robot to do this.
>>>>>
>>>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>>>> it's a replacement for human reviews, but if a robot can find real
>>>>>> issues and make the kernel more reliable and safe, I'm in.
>>>>>
>>>>> I just think you should do a preliminary review of the AI review results
>>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>>> the community will be full of bots.
>
> I do think it's awkward to dump the whole review output for the patch
> series in a single message.  It looks like there's a sudden jump to XML?
> It's better to reply to the individual patches with the comments
> inline, which I think is where Roman is trying to go long term.
>
> With BPF, it looks more like this:
> https://lore.kernel.org/bpf/?q=AI+reviewed+your+patch

These look really good. At least the patch author can easily see the
feedback.

>
>>>>>
>>>>> No?
>>>>>
>>>>
>>>> We don't want too many bots but we definitely want at least one AI
>>>> review bot. Now we have precedence of BPF and networking subsystem and
>>>> the results I have seen are really good. I think the MM community needs
>>>> to come together and decide on the formalities of AI review process and
>>>> I see Roman is doing some early experimentation and result looks great.
>>>
>>> Do you mind explaining why the result looks great? Does it mean you agree
>>> the regressions pointed out by the AI review?
>>
>> The result looks great because the points raised are really thought
>> provoking and things I have not thought about when I reviewed the
>> series. The lru lock without irq or the possible infinite retry loop in
>> get_mem_cgroup_from_folio() are two such examples. Are these real
>> regressions? I am not sure.
>>
>>>
>>> If we want to do AI reviews, the process should be improved instead of
>>> just pasting the output from AI. In the initial stage, I think some human
>>> intervention is needed, at least adding some comment on AI reviews would
>>> be helpful.
>>
>> Yes I agree and therefore I mentioned we should discuss how should we
>> (MM community) should adopt the AI reviews.
>
> What tends to happen with BPF is the patch author or bpf maintainers
> point out problems with the reviews and I fix up the prompts over time.
> The false positive rate is ~20% today (measured since late October), and
> it's generally declining.

Yeah, I can see bpf.md contains more detailed rules compared to mm.md.

>
>>
>>> Otherwise, it looks like you agree completely with AI reviews.
>>> In addition, “50% of the reported issues are real”, is the AI tossing
>>> a coin when reporting issues?
>>>
>>> When I am looking into the prompt part, I have the following questions:
>>>
>>> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
>>> about bpf kfunc parameter validation”)”? I got the actual prompts
>>> from irc: https://github.com/masoncl/review-prompts/tree/main , but it
>>> should be provided along with the review for others to reproduce.
>>
>> I agree and I didn't know that Chris's review prompts are used here.
>>
>> Ccing Chris for your following questions.
>>
>>>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
>>> 	a. Page/Folio States, Large folios require per-page state tracking for
>>> 		Reference counts. I thought we want to get rid of per page refcount.
>
> Early in prompt development I hand picked a few hundred patches from
> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
> catch the known bugs.  That's where that rule came from, but as you say
> there's a risk this information gets old.  Do we want to get rid of per
> page refcounts or have we done it?  (more on that at the bottom of the
> email).

willy has covered this part in another email.

>
>>>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
>>> 		PROTNONE PTEs are hardware invalid to trigger fault.
>>> 	c. TLB flushes required after PTE modifications. How about spurious fault
>>> 		handling?
>>>
>
> AI generally uses them as a starting point and fills in details, but
> I agree the MM bits are pretty minimal.
>
>>> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
>>> 	but could not find any. What am I missing?
>
> I think the only cgroup specific information I've needed so far is
> explaining css_get() and the section on __GFP_ACCOUNT.  I actively try
> to avoid adding details unless we're missing bugs or generating false
> positives.

I assume your review prompts are mainly used for BPF code, so it is understandable
that there are not many MM rules. My above concerns are mainly on the prompts
are directly used on MM patches without adding more MM specific rules.

Ideally, each subsystem could maintain its own rules in the corresponding
file to get a better outcome. :)

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 16:46           ` Zi Yan
  2025-12-30 18:13             ` Shakeel Butt
@ 2025-12-30 19:34             ` Roman Gushchin
  2025-12-30 21:13               ` Zi Yan
  1 sibling, 1 reply; 149+ messages in thread
From: Roman Gushchin @ 2025-12-30 19:34 UTC (permalink / raw)
  To: Zi Yan
  Cc: Shakeel Butt, Qi Zheng, hannes, hughd, mhocko, muchun.song,
	david, lorenzo.stoakes, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

Zi Yan <ziy@nvidia.com> writes:

> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>
>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>
>>>
>> [...]
>>>>>
>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>> directly send the raw data from the AI review to the community, as this
>>>>> is no different from automated review by a robot.
>>>>
>>>> Hi Qi,
>>>>
>>>> I don't know why you're so negative towards it. It's been great at
>>>
>>> No, I don't object to having a dedicated robot to do this.
>>>
>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>> it's a replacement for human reviews, but if a robot can find real
>>>> issues and make the kernel more reliable and safe, I'm in.
>>>
>>> I just think you should do a preliminary review of the AI review results
>>> instead of sending them out directly. Otherwise, if everyone does this,
>>> the community will be full of bots.
>>>
>>> No?

The problem is that it works only when AI is obviously wrong,
which is not a large percentage of cases with latest models.
In my practice with Gemini 3 and Chris Mason's prompts, it almost
never dead wrong: it's either a real issue or some gray zone.
And you really often need a deep expertise and a significant amount
of time to decide if it's real or not, so it's not like you can
assign a single person who can review all ai reviews.

>>>
>>
>> We don't want too many bots but we definitely want at least one AI
>> review bot. Now we have precedence of BPF and networking subsystem and
>> the results I have seen are really good. I think the MM community needs
>> to come together and decide on the formalities of AI review process and
>> I see Roman is doing some early experimentation and result looks great.
>
> Do you mind explaining why the result looks great? Does it mean you agree
> the regressions pointed out by the AI review?
>
> If we want to do AI reviews, the process should be improved instead of
> just pasting the output from AI. In the initial stage, I think some human
> intervention is needed, at least adding some comment on AI reviews would
> be helpful. Otherwise, it looks like you agree completely with AI reviews.
> In addition, “50% of the reported issues are real”, is the AI tossing
> a coin when reporting issues?

I said at least 50% in my experience. If there is a 50% chance that
someone is pointing at a real issue in my code, I'd rather look into it
and fix or explain why it's not an issue. Btw, this is exactly how I
learned about this stuff - sent some bpf patches (bpf oom) and got
excited about a number of real issues discovered by ai review.

I agree though that we should not pollute email threads with a number of
AI-generated reports with a similar context.

> When I am looking into the prompt part, I have the following questions:
>
> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
> about bpf kfunc parameter validation”)”? I got the actual prompts
> from irc: https://github.com/masoncl/review-prompts/tree/main, but it
> should be provided along with the review for others to reproduce.

It's a significant amount of text, way too much to directly include into
emails. SHA from the prompts git should be enough, no?

> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
> 	a. Page/Folio States, Large folios require per-page state tracking for
> 		Reference counts. I thought we want to get rid of per page refcount.
>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
> 		PROTNONE PTEs are hardware invalid to trigger fault.
> 	c. TLB flushes required after PTE modifications. How about spurious fault
> 		handling?
>
> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
> 	but could not find any. What am I missing?

MM and cgroups-specific prompts are definitely in a very early stage.
But to develop/improve them we need data.

Thanks!


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 19:34             ` Roman Gushchin
@ 2025-12-30 21:13               ` Zi Yan
  0 siblings, 0 replies; 149+ messages in thread
From: Zi Yan @ 2025-12-30 21:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Qi Zheng, hannes, hughd, mhocko, muchun.song,
	david, lorenzo.stoakes, harry.yoo, imran.f.khan,
	kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
	mkoutny, akpm, hamzamahfooz, apais, lance.yang, linux-mm,
	linux-kernel, cgroups, Qi Zheng

On 30 Dec 2025, at 14:34, Roman Gushchin wrote:

> Zi Yan <ziy@nvidia.com> writes:
>
>> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>>
>>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>>
>>>>
>>> [...]
>>>>>>
>>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>>> directly send the raw data from the AI review to the community, as this
>>>>>> is no different from automated review by a robot.
>>>>>
>>>>> Hi Qi,
>>>>>
>>>>> I don't know why you're so negative towards it. It's been great at
>>>>
>>>> No, I don't object to having a dedicated robot to do this.
>>>>
>>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>>> it's a replacement for human reviews, but if a robot can find real
>>>>> issues and make the kernel more reliable and safe, I'm in.
>>>>
>>>> I just think you should do a preliminary review of the AI review results
>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>> the community will be full of bots.
>>>>
>>>> No?
>
> The problem is that it works only when AI is obviously wrong,
> which is not a large percentage of cases with latest models.
> In my practice with Gemini 3 and Chris Mason's prompts, it almost
> never dead wrong: it's either a real issue or some gray zone.
> And you really often need a deep expertise and a significant amount
> of time to decide if it's real or not, so it's not like you can
> assign a single person who can review all ai reviews.
>
>>>>
>>>
>>> We don't want too many bots but we definitely want at least one AI
>>> review bot. Now we have precedence of BPF and networking subsystem and
>>> the results I have seen are really good. I think the MM community needs
>>> to come together and decide on the formalities of AI review process and
>>> I see Roman is doing some early experimentation and result looks great.
>>
>> Do you mind explaining why the result looks great? Does it mean you agree
>> the regressions pointed out by the AI review?
>>
>> If we want to do AI reviews, the process should be improved instead of
>> just pasting the output from AI. In the initial stage, I think some human
>> intervention is needed, at least adding some comment on AI reviews would
>> be helpful. Otherwise, it looks like you agree completely with AI reviews.
>> In addition, “50% of the reported issues are real”, is the AI tossing
>> a coin when reporting issues?
>
> I said at least 50% in my experience. If there is a 50% chance that
> someone is pointing at a real issue in my code, I'd rather look into it
> and fix or explain why it's not an issue. Btw, this is exactly how I
> learned about this stuff - sent some bpf patches (bpf oom) and got
> excited about a number of real issues discovered by ai review.
>
> I agree though that we should not pollute email threads with a number of
> AI-generated reports with a similar context.
>
>> When I am looking into the prompt part, I have the following questions:
>>
>> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
>> about bpf kfunc parameter validation”)”? I got the actual prompts
>> from irc: https://github.com/masoncl/review-prompts/tree/main, but it
>> should be provided along with the review for others to reproduce.
>
> It's a significant amount of text, way too much to directly include into
> emails. SHA from the prompts git should be enough, no?

I mean at least the GitHub link should be provided, otherwise, how can people
know the exact prompts?

>
>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
>> 	a. Page/Folio States, Large folios require per-page state tracking for
>> 		Reference counts. I thought we want to get rid of per page refcount.
>>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
>> 		PROTNONE PTEs are hardware invalid to trigger fault.
>> 	c. TLB flushes required after PTE modifications. How about spurious fault
>> 		handling?
>>
>> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
>> 	but could not find any. What am I missing?
>
> MM and cgroups-specific prompts are definitely in a very early stage.
> But to develop/improve them we need data.

Not just data. You are a maintainer of cgroup, so at least you could add
more cgroup specific rules to improve the quality of AI reviews.


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  1:36 ` Roman Gushchin
  2025-12-30  2:44   ` Qi Zheng
@ 2025-12-30  4:01   ` Shakeel Butt
  2025-12-30  4:11     ` Roman Gushchin
  1 sibling, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-30  4:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Qi Zheng, hannes, hughd, mhocko, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
> Qi Zheng <qi.zheng@linux.dev> writes:
> 
> Hey!
> 
> I ran this patchset through AI review and it found few regression (which
> can of course be false positives). When you'll have time, can you,
> please, take a look and comment on which are real and which are not?
> 
> Thank you!

Hi Roman, this is really good. I assume this is Gemini model. I see BPF
and networking folks have automated the AI review process which is
really good. I think MM should also adopt that model. Are you looking
into automating MM review bot?

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  4:01   ` Shakeel Butt
@ 2025-12-30  4:11     ` Roman Gushchin
  2025-12-30 18:36       ` Shakeel Butt
  0 siblings, 1 reply; 149+ messages in thread
From: Roman Gushchin @ 2025-12-30  4:11 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Qi Zheng, hannes, hughd, mhocko, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

Shakeel Butt <shakeel.butt@linux.dev> writes:

> On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
>> Qi Zheng <qi.zheng@linux.dev> writes:
>> 
>> Hey!
>> 
>> I ran this patchset through AI review and it found few regression (which
>> can of course be false positives). When you'll have time, can you,
>> please, take a look and comment on which are real and which are not?
>> 
>> Thank you!
>
> Hi Roman, this is really good. I assume this is Gemini model. I see BPF
> and networking folks have automated the AI review process which is
> really good. I think MM should also adopt that model. Are you looking
> into automating MM review bot?

Yes, absolutely. We're working on it, hopefully in January/February we
can have something reasonably solid.

Thanks


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30  4:11     ` Roman Gushchin
@ 2025-12-30 18:36       ` Shakeel Butt
  2025-12-30 20:47         ` Roman Gushchin
  0 siblings, 1 reply; 149+ messages in thread
From: Shakeel Butt @ 2025-12-30 18:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Qi Zheng, hannes, hughd, mhocko, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

On Mon, Dec 29, 2025 at 08:11:22PM -0800, Roman Gushchin wrote:
> Shakeel Butt <shakeel.butt@linux.dev> writes:
> 
> > On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
> >> Qi Zheng <qi.zheng@linux.dev> writes:
> >> 
> >> Hey!
> >> 
> >> I ran this patchset through AI review and it found few regression (which
> >> can of course be false positives). When you'll have time, can you,
> >> please, take a look and comment on which are real and which are not?
> >> 
> >> Thank you!
> >
> > Hi Roman, this is really good. I assume this is Gemini model. I see BPF
> > and networking folks have automated the AI review process which is
> > really good. I think MM should also adopt that model. Are you looking
> > into automating MM review bot?
> 
> Yes, absolutely. We're working on it, hopefully in January/February we
> can have something reasonably solid.

Can you share a bit more about the plan? Are you working more on the
infra side of things or also iterating on the prompts? (I assume you are
using Chris Mason's review prompts)

On the infra side, one thing I would love to have is to get early
feedback/review on my patches before posting on the list. Will it be
possible to support such a scenario?

On the prompt side, what kind of experiments are you doing to reduce the
false positives? I wonder if we can comeup with some recommendations
which help maintainers to describe relevant prompts for their sub-area
and be helpful for AI reviews.

On the automation side, I assume we will start with some manual process.
I definitely see back-and-forth to improve prompts for MM and someone
needs to manually review the results generated by AI and may have to
update prompts for better results. Also we can start with opt-in
approach i.e. someone adds a tag to the subject for AI to review the
series.

Anyways a good topic to start a separete email for discussion.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
  2025-12-30 18:36       ` Shakeel Butt
@ 2025-12-30 20:47         ` Roman Gushchin
  0 siblings, 0 replies; 149+ messages in thread
From: Roman Gushchin @ 2025-12-30 20:47 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Qi Zheng, hannes, hughd, mhocko, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
	axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
	hamzamahfooz, apais, lance.yang, linux-mm, linux-kernel, cgroups,
	Qi Zheng

Shakeel Butt <shakeel.butt@linux.dev> writes:

> On Mon, Dec 29, 2025 at 08:11:22PM -0800, Roman Gushchin wrote:
>> Shakeel Butt <shakeel.butt@linux.dev> writes:
>> 
>> > On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
>> >> Qi Zheng <qi.zheng@linux.dev> writes:
>> >> 
>> >> Hey!
>> >> 
>> >> I ran this patchset through AI review and it found few regression (which
>> >> can of course be false positives). When you'll have time, can you,
>> >> please, take a look and comment on which are real and which are not?
>> >> 
>> >> Thank you!
>> >
>> > Hi Roman, this is really good. I assume this is Gemini model. I see BPF
>> > and networking folks have automated the AI review process which is
>> > really good. I think MM should also adopt that model. Are you looking
>> > into automating MM review bot?
>> 
>> Yes, absolutely. We're working on it, hopefully in January/February we
>> can have something reasonably solid.
>
> Can you share a bit more about the plan? Are you working more on the
> infra side of things or also iterating on the prompts? (I assume you are
> using Chris Mason's review prompts)

Mostly on the infra side. I want to look closer into mm and cgroups
patches, but I need a bit more data and infra to do this.

As of now we're building some basic infra to run it on scale.

> On the infra side, one thing I would love to have is to get early
> feedback/review on my patches before posting on the list. Will it be
> possible to support such a scenario?

Absolutely.
You can do it already using a local setup, assuming you have an access
to some decent LLM. In my experience, a standard entry-level pro
subscription which is like $20/month these days is good enough to review
a number of patchsets per day. This works well for reviewing personal
patches and/or some targeted upstream reviews (like this one), but for
covering entire subsystems we need more infra and more token budgets
(and this is what we're building).

Btw, I mastered a pre-configured environment which saves time on setting
things up: https://github.com/rgushchin/kengp .
I've been testing it with Gemini CLI, but it should work with other
similar tools with some minimal changes (and I'd love to accept patches
adding this support).

Actually I think there will be several rounds of ai-reviews:
1) Developers can review their patches while working on the code and
before sending anything upstream
2) AI bots can send feedback to proposed patchset, as it works for bpf
currently
3) Maintainers and/or ci systems will run ai reviews to ensure the
quality of changes (e.g. eventually we can re-review mm-next nightly
to make sure it's not creating new regressions with other changes in
linux-next).

> On the prompt side, what kind of experiments are you doing to reduce the
> false positives? I wonder if we can comeup with some recommendations
> which help maintainers to describe relevant prompts for their sub-area
> and be helpful for AI reviews.

I like Chris's approach here: let's look into specific examples of
false positives and missed bugs and tailor our prompt systems to handle
these cases correctly. The smarter LLMs are, the fewer tricks we really
need, so it's better to keep prompts minimal. We don't really need to
write AI-specific long texts about subsystems, but sometimes codify some
non-obvious design principles and limitations.

> On the automation side, I assume we will start with some manual process.
> I definitely see back-and-forth to improve prompts for MM and someone
> needs to manually review the results generated by AI and may have to
> update prompts for better results. Also we can start with opt-in
> approach i.e. someone adds a tag to the subject for AI to review the
> series.

Yeah, there are some things to figure out on how to make sure we're not
creating a lot of noise and not repeating the same feedback if it's not
useful.

Thanks!

^ permalink raw reply	[flat|nested] 149+ messages in thread

end of thread, other threads:[~2025-12-30 22:30 UTC | newest]

Thread overview: 149+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-17  7:27 [PATCH v2 00/28] Eliminate Dying Memory Cgroup Qi Zheng
2025-12-17  7:27 ` [PATCH v2 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Qi Zheng
2025-12-18 23:31   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 02/28] mm: workingset: use folio_lruvec() in workingset_refault() Qi Zheng
2025-12-18 23:32   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 03/28] mm: rename unlock_page_lruvec_irq and its variants Qi Zheng
2025-12-18  9:00   ` David Hildenbrand (Red Hat)
2025-12-18 23:34   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 04/28] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
2025-12-17 21:13   ` Johannes Weiner
2025-12-18  9:04   ` David Hildenbrand (Red Hat)
2025-12-18  9:31     ` Qi Zheng
2025-12-18 23:39   ` Shakeel Butt
2025-12-25  3:45   ` Chen Ridong
2025-12-17  7:27 ` [PATCH v2 05/28] mm: vmscan: refactor move_folios_to_lru() Qi Zheng
2025-12-19  0:04   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 06/28] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
2025-12-17 21:22   ` Johannes Weiner
2025-12-18  6:25     ` Qi Zheng
2025-12-19  0:23   ` Shakeel Butt
2025-12-25  6:23   ` Chen Ridong
2025-12-17  7:27 ` [PATCH v2 07/28] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
2025-12-17 21:28   ` Johannes Weiner
2025-12-19  0:39   ` Shakeel Butt
2025-12-26  1:03   ` Chen Ridong
2025-12-26  3:10     ` Muchun Song
2025-12-26  3:50       ` Chen Ridong
2025-12-26  3:58         ` Chen Ridong
2025-12-17  7:27 ` [PATCH v2 08/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Qi Zheng
2025-12-17 21:45   ` Johannes Weiner
2025-12-18  6:31     ` Qi Zheng
2025-12-19  2:09     ` Shakeel Butt
2025-12-19  3:53       ` Johannes Weiner
2025-12-19  3:56         ` Johannes Weiner
2025-12-17  7:27 ` [PATCH v2 09/28] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
2025-12-17 21:45   ` Johannes Weiner
2025-12-19  2:14   ` Shakeel Butt
2025-12-26  2:01     ` Chen Ridong
2025-12-17  7:27 ` [PATCH v2 10/28] writeback: prevent memory cgroup release in writeback module Qi Zheng
2025-12-17 22:08   ` Johannes Weiner
2025-12-19  2:30   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 11/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
2025-12-17 22:11   ` Johannes Weiner
2025-12-19 23:31   ` Shakeel Butt
2025-12-26  2:12   ` Chen Ridong
2025-12-17  7:27 ` [PATCH v2 12/28] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
2025-12-17 22:12   ` Johannes Weiner
2025-12-19 23:44   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 13/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
2025-12-17 22:14   ` Johannes Weiner
2025-12-18  9:09   ` David Hildenbrand (Red Hat)
2025-12-18  9:36     ` Qi Zheng
2025-12-18  9:43       ` David Hildenbrand (Red Hat)
2025-12-18 11:40         ` Qi Zheng
2025-12-18 11:56           ` David Hildenbrand (Red Hat)
2025-12-18 13:00             ` Qi Zheng
2025-12-18 13:04               ` David Hildenbrand (Red Hat)
2025-12-18 13:16                 ` Qi Zheng
2025-12-19  4:12                   ` Harry Yoo
2025-12-19  6:18                     ` David Hildenbrand (Red Hat)
2025-12-18 14:26     ` Johannes Weiner
2025-12-22  3:42       ` Qi Zheng
2025-12-30 20:07       ` David Hildenbrand (Red Hat)
2025-12-19 23:51   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 14/28] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
2025-12-17 22:18   ` Johannes Weiner
2025-12-18  6:50     ` Qi Zheng
2025-12-20  0:58     ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 15/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
2025-12-17 22:21   ` Johannes Weiner
2025-12-20  1:05   ` Shakeel Butt
2025-12-22  4:02     ` Qi Zheng
2025-12-26  2:29     ` Chen Ridong
2025-12-17  7:27 ` [PATCH v2 16/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
2025-12-17 22:23   ` Johannes Weiner
2025-12-20  1:06   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 17/28] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
2025-12-17 22:27   ` Johannes Weiner
2025-12-20  1:11     ` Shakeel Butt
2025-12-22  3:33       ` Qi Zheng
2025-12-18  9:10   ` David Hildenbrand (Red Hat)
2025-12-17  7:27 ` [PATCH v2 18/28] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
2025-12-17 22:27   ` Johannes Weiner
2025-12-20  1:14   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 19/28] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
2025-12-17 22:30   ` Johannes Weiner
2025-12-18  6:57     ` Qi Zheng
2025-12-17  7:27 ` [PATCH v2 20/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
2025-12-17 22:33   ` Johannes Weiner
2025-12-18  7:09     ` Qi Zheng
2025-12-18 13:02       ` Johannes Weiner
2025-12-20  1:23   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 21/28] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
2025-12-17 22:34   ` Johannes Weiner
2025-12-20  1:24   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 22/28] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
2025-12-17 22:36   ` Johannes Weiner
2025-12-20  1:25   ` Shakeel Butt
2025-12-17  7:27 ` [PATCH v2 23/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
2025-12-18 13:00   ` Johannes Weiner
2025-12-18 13:17     ` Qi Zheng
2025-12-20  2:03   ` Shakeel Butt
2025-12-23  6:14     ` Qi Zheng
2025-12-17  7:27 ` [PATCH v2 24/28] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
2025-12-18 13:32   ` Johannes Weiner
2025-12-22  3:55     ` Qi Zheng
2025-12-17  7:27 ` [PATCH v2 25/28] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
2025-12-17  7:27 ` [PATCH v2 26/28] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
2025-12-18 13:45   ` Johannes Weiner
2025-12-22  3:56     ` Qi Zheng
2025-12-17  7:27 ` [PATCH v2 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
2025-12-18 14:06   ` Johannes Weiner
2025-12-22  3:59     ` Qi Zheng
2025-12-17  7:27 ` [PATCH v2 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
2025-12-18 14:07   ` Johannes Weiner
2025-12-23 20:04 ` [PATCH v2 00/28] Eliminate Dying Memory Cgroup Yosry Ahmed
2025-12-23 23:20   ` Shakeel Butt
2025-12-24  0:07     ` Yosry Ahmed
2025-12-24  0:36       ` Shakeel Butt
2025-12-24  0:43         ` Yosry Ahmed
2025-12-24  0:58           ` Shakeel Butt
2025-12-29  9:42             ` Qi Zheng
2025-12-29 10:52               ` Michal Koutný
2025-12-29  7:48     ` Qi Zheng
2025-12-29  9:35       ` Harry Yoo
2025-12-29  9:46         ` Qi Zheng
2025-12-29 10:53         ` Michal Koutný
2025-12-24  8:43   ` Harry Yoo
2025-12-24 14:51     ` Yosry Ahmed
2025-12-26 11:24       ` Harry Yoo
2025-12-30  1:36 ` Roman Gushchin
2025-12-30  2:44   ` Qi Zheng
2025-12-30  4:20     ` Roman Gushchin
2025-12-30  4:25       ` Qi Zheng
2025-12-30  4:48         ` Shakeel Butt
2025-12-30 16:46           ` Zi Yan
2025-12-30 18:13             ` Shakeel Butt
2025-12-30 19:18               ` Chris Mason
2025-12-30 20:51                 ` Matthew Wilcox
2025-12-30 21:10                   ` Chris Mason
2025-12-30 22:30                     ` Roman Gushchin
2025-12-30 22:03                   ` Roman Gushchin
2025-12-30 21:07                 ` Zi Yan
2025-12-30 19:34             ` Roman Gushchin
2025-12-30 21:13               ` Zi Yan
2025-12-30  4:01   ` Shakeel Butt
2025-12-30  4:11     ` Roman Gushchin
2025-12-30 18:36       ` Shakeel Butt
2025-12-30 20:47         ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox