* [PATCH v4 00/31] Eliminate Dying Memory Cgroup
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
Changes in v4:
- fix commit message in [PATCH v3 23/30] (pointed out by Baoquan He)
- move lruvec_lock_irq() and friends to mm/memcontrol.c to fix the compilation
  error in [PATCH v4 24/31] (reported by LKP)
- include parent_lruvec() within the RCU lock in lru_note_cost_unlock_irq() in
  [PATCH v4 24/31] (pointed out by Harry Yoo)
- move the declaration of lru_reparent_memcg() to swap.h
  (suggested by Muchun Song)
- fix lru size update logic in lru_gen_reparent_memcg() in [PATCH v4 26/31]
  (pointed out and suggested by Harry Yoo)
- add [PATCH v4 28/31] to use lruvec_lru_size() to get the number of lru pages
  in count_shadow_nodes() (suggested by Shakeel Butt)
- fix the reparenting logic of lruvec_stats->state_local in [PATCH v4 29/31]
  (pointed out by Shakeel Butt)
- change these non-hierarchical stats to the atomic_long_t type to avoid races
  between mem_cgroup_stat_aggregate() and reparent_state_local() in
  [PATCH v4 29/31]
- make css_killed_work_fn() run as RCU work, and use the RCU lock plus a
  CSS_IS_DYING check to avoid racing with
  mod_memcg_state()/mod_memcg_lruvec_state()
  (suggested by Shakeel Butt)
- collect Acked-bys and Reviewed-bys
- rebase onto the next-20260128
Changes in v3:
- modify the commit message in [PATCH v2 04/28], [PATCH v2 06/28],
[PATCH v2 13/28], [PATCH v2 24/28] and [PATCH v2 27/28]
(suggested by David Hildenbrand, Chen Ridong and Johannes Weiner)
- change code style in [PATCH v3 8/30], [PATCH v3 15/30] and [PATCH v3 27/30]
(suggested by Johannes Weiner and Shakeel Butt)
- use get_mem_cgroup_from_folio() + mem_cgroup_put() to replace holding the
  rcu lock in [PATCH v3 14/30] and [PATCH v3 19/30]
  (pointed out by Johannes Weiner)
- add a comment to folio_split_queue_lock() in [PATCH v3 17/30]
(suggested by Shakeel Butt)
- modify the comment above folio_lruvec() in [PATCH v3 24/30]
(suggested by Johannes Weiner)
- fix an rcu lock issue in lru_note_cost_refault()
  (pointed out by Shakeel Butt)
- add [PATCH v3 28/30] to fix non-hierarchical memcg1_stats issues
  (pointed out by Yosry Ahmed)
- fix lru_zone_size issue in [PATCH v2 24/28] and [PATCH v2 25/28]
- collect Acked-bys and Reviewed-bys
- rebase onto the next-20260114
Changes in v2:
- add [PATCH v2 04/28] and remove local_irq_disable() in evict_folios()
  (pointed out by Harry Yoo)
- recheck objcg in [PATCH v2 07/28] (pointed out by Harry Yoo)
- modify the commit message in [PATCH v2 12/28] and [PATCH v2 21/28]
  (pointed out by Harry Yoo)
- use rcu lock to protect mm_state in [PATCH v2 14/28] (pointed out by Harry Yoo)
- fix a bad unlock balance warning in [PATCH v2 23/28]
- change nr_pages type to long in [PATCH v2 25/28] (pointed out by Harry Yoo)
- increase mm_state->seq during reparenting to make the mm walker work properly
  in [PATCH v2 25/28] (pointed out by Harry Yoo)
- add [PATCH v2 18/28] to fix a WARNING in folio_memcg() (pointed out by Harry Yoo)
- collect Reviewed-bys
- rebase onto the next-20251216
Changes in v1:
- drop [PATCH RFC 02/28]
- drop THP split queue related part, which has been merged as a separate
patchset[2]
- prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
[PATCH v1 16/26]
- separate the reparenting of traditional LRU folios into [PATCH v1 22/26]
- adapt to the MGLRU scenarios in [PATCH v1 23/26]
- refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
- collect Acked-bys and Reviewed-bys
- rebase onto the next-20251028
Hi all,
Introduction
============
This patchset transfers LRU pages to the object cgroup, without holding a
reference to the original memory cgroup, in order to address the issue of
dying memory cgroups. A consensus has recently been reached regarding this
approach [1].
Background
==========
The dying memory cgroup issue refers to a situation where a memory cgroup
is no longer in use from the users' point of view, but the memory (and the
metadata associated with the memory cgroup) remains allocated to it. This
can result in memory leaks and inefficient memory reclamation, and has
persisted as a problem for several years. Any memory allocation that
outlives the lifespan (from the users' perspective) of a memory cgroup can
lead to a dying memory cgroup. Considerable effort has already gone into
tackling this problem by introducing the object cgroup (objcg)
infrastructure [2].
Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. LRU pages (anonymous
pages and file pages), however, are still charged at allocation time and
continue to hold a reference to the original memory cgroup until they are
reclaimed.
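Schematically, the lifetime rule that this series relaxes looks like the
following (a simplified sketch of charge/uncharge, not the exact kernel
code):

        /* charge time: the folio pins its memory cgroup */
        css_get(&memcg->css);
        folio->memcg_data = (unsigned long)memcg;

        /* uncharge time (at reclaim): the pin is finally dropped */
        folio->memcg_data = 0;
        css_put(&memcg->css);   /* only now can a dying memcg be freed */

Any folio that stays resident therefore keeps its (possibly dying) memory
cgroup alive.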
File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of
the memory cgroup. The long-term pinning of file pages to memory cgroups
is a widespread issue that causes recurring problems in practical
scenarios [3]. File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate,
leading to memory wastage and significantly reducing the efficiency
of page reclamation.
Fundamentals
============
A folio will no longer pin its corresponding memory cgroup. It is therefore
necessary to ensure that the memory cgroup, or the lruvec associated with
it, is not released while a user holds a pointer returned by folio_memcg()
or folio_lruvec(). Users who are not concerned about the stability of the
binding between the folio and its memory cgroup only need to hold the RCU
read lock, or acquire a reference to the folio's memory cgroup, to prevent
its release. However, some users of folio_lruvec() (i.e., of the lruvec
lock) require a stable binding between the folio and its memory cgroup. An
approach is therefore needed to guarantee the stability of the binding
while the lruvec lock is held, and to detect the case where the wrong
lruvec lock is held due to a race with memory cgroup reparenting. The
following four steps achieve these goals.
1. First, identify all users of both functions (folio_memcg() and
folio_lruvec()) that are not concerned about binding stability, and apply
appropriate measures (such as holding an RCU read lock, or briefly taking a
reference to the memory cgroup) to prevent the memory cgroup from being
released.
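For example, a user that only needs the memcg pointer for a short read
could follow this pattern (an illustrative sketch of the rule in step 1,
not a specific call site; the event is just an example):

        rcu_read_lock();
        memcg = folio_memcg(folio);
        /* memcg cannot be freed while the RCU read lock is held */
        count_memcg_events(memcg, PGACTIVATE, 1);
        rcu_read_unlock();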
2. Second, the following refactoring of folio_lruvec_lock() demonstrates
how binding stability is guaranteed from the perspective of folio_lruvec()
users.
struct lruvec *folio_lruvec_lock(struct folio *folio)
{
        struct lruvec *lruvec;

        rcu_read_lock();
retry:
        lruvec = folio_lruvec(folio);
        spin_lock(&lruvec->lru_lock);
        if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                /* The folio was reparented under us; lock the new lruvec. */
                spin_unlock(&lruvec->lru_lock);
                goto retry;
        }

        return lruvec;
}
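Note that a failed recheck can only be caused by a racing reparenting, and
reparenting only ever moves a folio's binding towards an ancestor memory
cgroup (ultimately the root, which is never removed), so the retry loop is
guaranteed to terminate.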
3. Third, from the perspective of memory cgroup removal, the entire
reparenting process (changing the binding between a folio and its memory
cgroup, and moving the LRU lists to the parent memory cgroup) must be
carried out while holding both the lruvec lock of the memory cgroup being
removed and the lruvec lock of its parent.
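Schematically, the removal side then takes both locks before rebinding
anything (a hedged sketch only; the helper names in the actual patches may
differ):

        /* the dying memcg's lruvec first, then its parent's */
        spin_lock(&lruvec->lru_lock);
        spin_lock_nested(&parent_lruvec->lru_lock, SINGLE_DEPTH_NESTING);

        /* rebind folios to the parent and splice the LRU lists */
        ...

        spin_unlock(&parent_lruvec->lru_lock);
        spin_unlock(&lruvec->lru_lock);

Holding both locks ensures that the recheck in folio_lruvec_lock() above
can never observe a half-finished move.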
4. Finally, transfer the LRU pages to the object cgroup without holding a
reference to the original memory cgroup.
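The effect of this final step can be sketched as follows (simplified and
with hypothetical accessor spelling; the real helpers are introduced by the
later patches in this series):

        /* folio->memcg_data now refers to an objcg, not a memcg */
        memcg = rcu_dereference(folio_objcg(folio)->memcg);

        /* so reparenting only has to repoint the objcg ... */
        xchg(&objcg->memcg, parent);

        /* ... and no per-folio reference or LRU walk is needed */

The folio itself never holds a direct reference on a (possibly dying)
memory cgroup anymore.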
Effect
======
With this series applied, the number of dying memory cgroups no longer
grows significantly when the following test script, which reproduces the
issue, is executed.
```bash
#!/bin/bash
# Create a temporary file 'temp' filled with zero bytes
dd if=/dev/zero of=temp bs=4096 count=1
# Display memory-cgroup info from /proc/cgroups
cat /proc/cgroups | grep memory
for i in {0..2000}
do
mkdir /sys/fs/cgroup/memory/test$i
echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
# Append 'temp' file content to 'log'
cat temp >> log
echo $$ > /sys/fs/cgroup/memory/cgroup.procs
# Potentially create a dying memory cgroup
rmdir /sys/fs/cgroup/memory/test$i
done
# Display memory-cgroup info after test
cat /proc/cgroups | grep memory
rm -f temp log
```
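For reference, the accumulation is visible in the num_cgroups column (the
third field) of the "memory" line in /proc/cgroups, which the script prints
before and after the loop: if dying memory cgroups pile up, that value
stays high even after all the test directories have been removed (this
assumes cgroup v1, as used by the script).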
Comments and suggestions are welcome!
Thanks,
Qi
[1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
[2] https://lwn.net/Articles/895431/
[3] https://github.com/systemd/systemd/pull/36827
Muchun Song (22):
mm: memcontrol: remove dead code of checking parent memory cgroup
mm: workingset: use folio_lruvec() in workingset_refault()
mm: rename unlock_page_lruvec_irq and its variants
mm: vmscan: refactor move_folios_to_lru()
mm: memcontrol: allocate object cgroup for non-kmem case
mm: memcontrol: return root object cgroup for root memory cgroup
mm: memcontrol: prevent memory cgroup release in
get_mem_cgroup_from_folio()
buffer: prevent memory cgroup release in folio_alloc_buffers()
writeback: prevent memory cgroup release in writeback module
mm: memcontrol: prevent memory cgroup release in
count_memcg_folio_events()
mm: page_io: prevent memory cgroup release in page_io module
mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
mm: mglru: prevent memory cgroup release in mglru
mm: memcontrol: prevent memory cgroup release in
mem_cgroup_swap_full()
mm: workingset: prevent memory cgroup release in lru_gen_eviction()
mm: workingset: prevent lruvec release in workingset_refault()
mm: zswap: prevent lruvec release in zswap_folio_swapin()
mm: swap: prevent lruvec release in lru_gen_clear_refs()
mm: workingset: prevent lruvec release in workingset_activation()
mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
folios
mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
Qi Zheng (9):
mm: vmscan: prepare for the refactoring of move_folios_to_lru()
mm: thp: prevent memory cgroup release in
folio_split_queue_lock{_irqsave}()
mm: zswap: prevent memory cgroup release in zswap_compress()
mm: do not open-code lruvec lock
mm: vmscan: prepare for reparenting traditional LRU folios
mm: vmscan: prepare for reparenting MGLRU folios
mm: memcontrol: refactor memcg_reparent_objcgs()
mm: workingset: use lruvec_lru_size() to get the number of lru pages
mm: memcontrol: prepare for reparenting non-hierarchical stats
fs/buffer.c | 4 +-
fs/fs-writeback.c | 22 +-
include/linux/memcontrol.h | 177 +++++-----
include/linux/mm_inline.h | 6 +
include/linux/mmzone.h | 16 +
include/linux/swap.h | 25 +-
include/trace/events/writeback.h | 3 +
kernel/cgroup/cgroup.c | 8 +-
mm/compaction.c | 43 ++-
mm/huge_memory.c | 22 +-
mm/memcontrol-v1.c | 31 +-
mm/memcontrol-v1.h | 3 +
mm/memcontrol.c | 554 +++++++++++++++++++++----------
mm/migrate.c | 2 +
mm/mlock.c | 2 +-
mm/page_io.c | 8 +-
mm/percpu.c | 2 +-
mm/shrinker.c | 6 +-
mm/swap.c | 63 +++-
mm/vmscan.c | 293 +++++++++++-----
mm/workingset.c | 30 +-
mm/zswap.c | 5 +
22 files changed, 909 insertions(+), 416 deletions(-)
--
2.20.1
* [PATCH v4 01/31] mm: memcontrol: remove dead code of checking parent memory cgroup
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Chen Ridong
From: Muchun Song <songmuchun@bytedance.com>
The no-hierarchy mode was deprecated by commit bef8620cd8e0 ("mm: memcg:
deprecate the non-hierarchical mode"). As a result, parent_mem_cgroup()
cannot return NULL except when passed the root memcg, and the root memcg
cannot be offlined. Hence, it is safe to remove the check on the return
value of parent_mem_cgroup(). Remove the corresponding dead code.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/memcontrol.c | 5 -----
mm/shrinker.c | 6 +-----
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f2b87e02574e0..43dbd18150e97 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3352,9 +3352,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
return;
parent = parent_mem_cgroup(memcg);
- if (!parent)
- parent = root_mem_cgroup;
-
memcg_reparent_list_lrus(memcg, parent);
/*
@@ -3645,8 +3642,6 @@ struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg)
break;
}
memcg = parent_mem_cgroup(memcg);
- if (!memcg)
- memcg = root_mem_cgroup;
}
return memcg;
}
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 4a93fd433689a..e8e092a2f7f41 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -286,14 +286,10 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
{
int nid, index, offset;
long nr;
- struct mem_cgroup *parent;
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct shrinker_info *child_info, *parent_info;
struct shrinker_info_unit *child_unit, *parent_unit;
- parent = parent_mem_cgroup(memcg);
- if (!parent)
- parent = root_mem_cgroup;
-
/* Prevent from concurrent shrinker_info expand */
mutex_lock(&shrinker_mutex);
for_each_node(nid) {
--
2.20.1
* [PATCH v4 02/31] mm: workingset: use folio_lruvec() in workingset_refault()
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
Use folio_lruvec() to simplify the code.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/workingset.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 13422d3047158..ed12744d93a29 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -534,8 +534,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
void workingset_refault(struct folio *folio, void *shadow)
{
bool file = folio_is_file_lru(folio);
- struct pglist_data *pgdat;
- struct mem_cgroup *memcg;
struct lruvec *lruvec;
bool workingset;
long nr;
@@ -557,10 +555,7 @@ void workingset_refault(struct folio *folio, void *shadow)
* locked to guarantee folio_memcg() stability throughout.
*/
nr = folio_nr_pages(folio);
- memcg = folio_memcg(folio);
- pgdat = folio_pgdat(folio);
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
+ lruvec = folio_lruvec(folio);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
--
2.20.1
* [PATCH v4 03/31] mm: rename unlock_page_lruvec_irq and its variants
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Chen Ridong
From: Muchun Song <songmuchun@bytedance.com>
It is inappropriate to use folio_lruvec_lock() variants in conjunction
with unlock_page_lruvec() variants, as this pairs the inconsistent
operations of locking a folio and unlocking a page. To rectify this,
rename unlock_page_lruvec{_irq,_irqrestore} to
lruvec_unlock{_irq,_irqrestore}.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/memcontrol.h | 10 +++++-----
mm/compaction.c | 14 +++++++-------
mm/huge_memory.c | 2 +-
mm/mlock.c | 2 +-
mm/swap.c | 12 ++++++------
mm/vmscan.c | 4 ++--
6 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index af352cabedbae..6a44e79a8bd23 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1480,17 +1480,17 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}
-static inline void unlock_page_lruvec(struct lruvec *lruvec)
+static inline void lruvec_unlock(struct lruvec *lruvec)
{
spin_unlock(&lruvec->lru_lock);
}
-static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+static inline void lruvec_unlock_irq(struct lruvec *lruvec)
{
spin_unlock_irq(&lruvec->lru_lock);
}
-static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
unsigned long flags)
{
spin_unlock_irqrestore(&lruvec->lru_lock, flags);
@@ -1512,7 +1512,7 @@ static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio,
if (folio_matches_lruvec(folio, locked_lruvec))
return locked_lruvec;
- unlock_page_lruvec_irq(locked_lruvec);
+ lruvec_unlock_irq(locked_lruvec);
}
return folio_lruvec_lock_irq(folio);
@@ -1526,7 +1526,7 @@ static inline void folio_lruvec_relock_irqsave(struct folio *folio,
if (folio_matches_lruvec(folio, *lruvecp))
return;
- unlock_page_lruvec_irqrestore(*lruvecp, *flags);
+ lruvec_unlock_irqrestore(*lruvecp, *flags);
}
*lruvecp = folio_lruvec_lock_irqsave(folio, flags);
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c6..c3e338aaa0ffb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -913,7 +913,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (!(low_pfn % COMPACT_CLUSTER_MAX)) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -964,7 +964,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
}
/* for alloc_contig case */
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -1053,7 +1053,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (unlikely(page_has_movable_ops(page)) &&
!PageMovableOpsIsolated(page)) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -1158,7 +1158,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
/* If we already hold the lock, we can skip some rechecking */
if (lruvec != locked) {
if (locked)
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
locked = lruvec;
@@ -1226,7 +1226,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_fail_put:
/* Avoid potential deadlock in freeing page under lru_lock */
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
folio_put(folio);
@@ -1242,7 +1242,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (nr_isolated) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
putback_movable_pages(&cc->migratepages);
@@ -1274,7 +1274,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_abort:
if (locked)
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
if (folio) {
folio_set_lru(folio);
folio_put(folio);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0d487649e4ded..580376323124c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3899,7 +3899,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
folio_ref_unfreeze(folio, folio_cache_ref_count(folio) + 1);
if (do_lru)
- unlock_page_lruvec(lruvec);
+ lruvec_unlock(lruvec);
if (ci)
swap_cluster_unlock(ci);
diff --git a/mm/mlock.c b/mm/mlock.c
index 2f699c3497a57..66740e16679c3 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -205,7 +205,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
}
if (lruvec)
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
folios_put(fbatch);
}
diff --git a/mm/swap.c b/mm/swap.c
index bb19ccbece464..245ba159e01d7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -91,7 +91,7 @@ static void page_cache_release(struct folio *folio)
__page_cache_release(folio, &lruvec, &flags);
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
}
void __folio_put(struct folio *folio)
@@ -175,7 +175,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
folios_put(fbatch);
}
@@ -349,7 +349,7 @@ void folio_activate(struct folio *folio)
lruvec = folio_lruvec_lock_irq(folio);
lru_activate(lruvec, folio);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
folio_set_lru(folio);
}
#endif
@@ -963,7 +963,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (folio_is_zone_device(folio)) {
if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
lruvec = NULL;
}
if (folio_ref_sub_and_test(folio, nr_refs))
@@ -977,7 +977,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
/* hugetlb has its own memcg */
if (folio_test_hugetlb(folio)) {
if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
lruvec = NULL;
}
free_huge_folio(folio);
@@ -991,7 +991,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
j++;
}
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
if (!j) {
folio_batch_reinit(folios);
return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01d3364fe506f..a4892f5e3d347 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1836,7 +1836,7 @@ bool folio_isolate_lru(struct folio *folio)
folio_get(folio);
lruvec = folio_lruvec_lock_irq(folio);
lruvec_del_folio(lruvec, folio);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
ret = true;
}
@@ -7878,7 +7878,7 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
if (lruvec) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
} else if (pgscanned) {
count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
}
--
2.20.1
* [PATCH v4 04/31] mm: vmscan: prepare for the refactoring of move_folios_to_lru()
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng, Chen Ridong
From: Qi Zheng <zhengqi.arch@bytedance.com>
Once move_folios_to_lru() is refactored, its callers will no longer have to
hold the lruvec lock. For shrink_inactive_list(), shrink_active_list() and
evict_folios(), disabling IRQs is then only needed for __count_vm_events()
and __mod_node_page_state().

To avoid using local_irq_disable() on PREEMPT_RT kernels, make all callers
of move_folios_to_lru() use the IRQ-safe count_vm_events() and
mod_node_page_state() instead.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
---
mm/vmscan.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a4892f5e3d347..f595e063faeec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2028,12 +2028,12 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
stat.nr_demoted);
- __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+ mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
if (!cgroup_reclaim(sc))
- __count_vm_events(item, nr_reclaimed);
+ count_vm_events(item, nr_reclaimed);
count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
- __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
+ count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
@@ -2178,10 +2178,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_activate = move_folios_to_lru(lruvec, &l_active);
nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
- __count_vm_events(PGDEACTIVATE, nr_deactivate);
+ count_vm_events(PGDEACTIVATE, nr_deactivate);
count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
- __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+ mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
@@ -4757,9 +4757,9 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
if (!cgroup_reclaim(sc))
- __count_vm_events(item, reclaimed);
+ count_vm_events(item, reclaimed);
count_memcg_events(memcg, item, reclaimed);
- __count_vm_events(PGSTEAL_ANON + type, reclaimed);
+ count_vm_events(PGSTEAL_ANON + type, reclaimed);
spin_unlock_irq(&lruvec->lru_lock);
--
2.20.1
* [PATCH v4 05/31] mm: vmscan: refactor move_folios_to_lru()
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In a subsequent patch, we'll reparent the LRU folios. Folios can undergo
reparenting while move_folios_to_lru() is moving them to the appropriate
LRU list, so the caller can no longer hold a single lruvec lock across the
whole operation. Instead, use the more general folio_lruvec_relock_irq()
to obtain the correct lruvec lock for each folio.

This patch involves only code refactoring and doesn't introduce any
functional changes.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
mm/vmscan.c | 46 +++++++++++++++++++++-------------------------
1 file changed, 21 insertions(+), 25 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f595e063faeec..8039df1c9fca5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1890,24 +1890,27 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
/*
* move_folios_to_lru() moves folios from private @list to appropriate LRU list.
*
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate lruvec.
+ *
+ * Note: The caller must not hold any lruvec lock.
*/
-static unsigned int move_folios_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+static unsigned int move_folios_to_lru(struct list_head *list)
{
int nr_pages, nr_moved = 0;
+ struct lruvec *lruvec = NULL;
struct folio_batch free_folios;
folio_batch_init(&free_folios);
while (!list_empty(list)) {
struct folio *folio = lru_to_folio(list);
+ lruvec = folio_lruvec_relock_irq(folio, lruvec);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
list_del(&folio->lru);
if (unlikely(!folio_evictable(folio))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
folio_putback_lru(folio);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
continue;
}
@@ -1929,19 +1932,15 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
}
continue;
}
- /*
- * All pages were isolated from the same lruvec (and isolation
- * inhibits memcg migration).
- */
VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
@@ -1950,11 +1949,12 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
workingset_age_nonresident(lruvec, nr_pages);
}
+ if (lruvec)
+ lruvec_unlock_irq(lruvec);
+
if (free_folios.nr) {
- spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
- spin_lock_irq(&lruvec->lru_lock);
}
return nr_moved;
@@ -2023,8 +2023,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
lruvec_memcg(lruvec));
- spin_lock_irq(&lruvec->lru_lock);
- move_folios_to_lru(lruvec, &folio_list);
+ move_folios_to_lru(&folio_list);
mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
stat.nr_demoted);
@@ -2035,6 +2034,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
+ spin_lock_irq(&lruvec->lru_lock);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
@@ -2173,16 +2173,14 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move folios back to the lru list.
*/
- spin_lock_irq(&lruvec->lru_lock);
-
- nr_activate = move_folios_to_lru(lruvec, &l_active);
- nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
+ nr_activate = move_folios_to_lru(&l_active);
+ nr_deactivate = move_folios_to_lru(&l_inactive);
count_vm_events(PGDEACTIVATE, nr_deactivate);
count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
-
mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+ spin_lock_irq(&lruvec->lru_lock);
lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
nr_deactivate, nr_rotated, sc->priority, file);
@@ -4742,14 +4740,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
}
- spin_lock_irq(&lruvec->lru_lock);
-
- move_folios_to_lru(lruvec, &list);
+ move_folios_to_lru(&list);
walk = current->reclaim_state->mm_walk;
if (walk && walk->batched) {
walk->lruvec = lruvec;
+ spin_lock_irq(&lruvec->lru_lock);
reset_batch_size(walk);
+ spin_unlock_irq(&lruvec->lru_lock);
}
mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
@@ -4761,8 +4759,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
count_memcg_events(memcg, item, reclaimed);
count_vm_events(PGSTEAL_ANON + type, reclaimed);
- spin_unlock_irq(&lruvec->lru_lock);
-
list_splice_init(&clean, &list);
if (!list_empty(&list)) {
--
2.20.1
* [PATCH v4 06/31] mm: memcontrol: allocate object cgroup for non-kmem case
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng, Chen Ridong
From: Muchun Song <songmuchun@bytedance.com>
To allow LRU page reparenting, extend the objcg infrastructure beyond the
kmem case so that LRU folios can reuse it for folio charging.

Note that LRU folios are not accounted at the root level, yet their
folio->memcg_data points to the root_mem_cgroup, so folio->memcg_data of an
LRU folio is always a valid pointer. However, the root_mem_cgroup does not
possess an object cgroup. Therefore, also allocate an object cgroup for the
root_mem_cgroup.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
---
mm/memcontrol.c | 51 +++++++++++++++++++++++--------------------------
1 file changed, 24 insertions(+), 27 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 43dbd18150e97..5625955bcd23b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -206,10 +206,10 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
- struct mem_cgroup *parent)
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
@@ -3315,30 +3315,17 @@ void folio_split_memcg_refs(struct folio *folio, unsigned old_order,
css_get_many(&__folio_memcg(folio)->css, new_refs);
}
-static int memcg_online_kmem(struct mem_cgroup *memcg)
+static void memcg_online_kmem(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg;
-
if (mem_cgroup_kmem_disabled())
- return 0;
+ return;
if (unlikely(mem_cgroup_is_root(memcg)))
- return 0;
-
- objcg = obj_cgroup_alloc();
- if (!objcg)
- return -ENOMEM;
-
- objcg->memcg = memcg;
- rcu_assign_pointer(memcg->objcg, objcg);
- obj_cgroup_get(objcg);
- memcg->orig_objcg = objcg;
+ return;
static_branch_enable(&memcg_kmem_online_key);
memcg->kmemcg_id = memcg->id.id;
-
- return 0;
}
static void memcg_offline_kmem(struct mem_cgroup *memcg)
@@ -3353,12 +3340,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
parent = parent_mem_cgroup(memcg);
memcg_reparent_list_lrus(memcg, parent);
-
- /*
- * Objcg's reparenting must be after list_lru's, make sure list_lru
- * helpers won't use parent's list_lru until child is drained.
- */
- memcg_reparent_objcgs(memcg, parent);
}
#ifdef CONFIG_CGROUP_WRITEBACK
@@ -3871,9 +3852,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct obj_cgroup *objcg;
- if (memcg_online_kmem(memcg))
- goto remove_id;
+ memcg_online_kmem(memcg);
/*
* A memcg must be visible for expand_shrinker_info()
@@ -3883,6 +3864,15 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (alloc_shrinker_info(memcg))
goto offline_kmem;
+ objcg = obj_cgroup_alloc();
+ if (!objcg)
+ goto free_shrinker;
+
+ objcg->memcg = memcg;
+ rcu_assign_pointer(memcg->objcg, objcg);
+ obj_cgroup_get(objcg);
+ memcg->orig_objcg = objcg;
+
if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
queue_delayed_work(system_dfl_wq, &stats_flush_dwork,
FLUSH_TIME);
@@ -3905,9 +3895,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
xa_store(&mem_cgroup_private_ids, memcg->id.id, memcg, GFP_KERNEL);
return 0;
+free_shrinker:
+ free_shrinker_info(memcg);
offline_kmem:
memcg_offline_kmem(memcg);
-remove_id:
mem_cgroup_private_id_remove(memcg);
return -ENOMEM;
}
@@ -3925,6 +3916,12 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
memcg_offline_kmem(memcg);
reparent_deferred_split_queue(memcg);
+ /*
+ * The reparenting of objcg must be after the reparenting of the
+ * list_lru and deferred_split_queue above, which ensures that they will
+ * not mistakenly get the parent list_lru and deferred_split_queue.
+ */
+ memcg_reparent_objcgs(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
lru_gen_offline_memcg(memcg);
--
2.20.1
* [PATCH v4 07/31] mm: memcontrol: return root object cgroup for root memory cgroup
From: Qi Zheng @ 2026-02-05 8:54 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
Memory cgroup functions such as get_mem_cgroup_from_folio() and
get_mem_cgroup_from_mm() return a valid memory cgroup pointer even for the
root memory cgroup. The situation for object cgroups has been different:
the root object cgroup could not be returned because it did not exist. Now
that a valid root object cgroup exists, align the behavior of the object
cgroup APIs with that of the memory cgroup APIs for the sake of
consistency.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
include/linux/memcontrol.h | 26 +++++++++++++++++-----
mm/memcontrol.c | 45 ++++++++++++++++++++------------------
mm/percpu.c | 2 +-
3 files changed, 45 insertions(+), 28 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6a44e79a8bd23..7b3d8f341ff10 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -332,6 +332,7 @@ struct mem_cgroup {
#define MEMCG_CHARGE_BATCH 64U
extern struct mem_cgroup *root_mem_cgroup;
+extern struct obj_cgroup *root_obj_cgroup;
enum page_memcg_data_flags {
/* page->memcg_data is a pointer to an slabobj_ext vector */
@@ -549,6 +550,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
}
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+ return objcg == root_obj_cgroup;
+}
+
static inline bool mem_cgroup_disabled(void)
{
return !cgroup_subsys_enabled(memory_cgrp_subsys);
@@ -775,23 +781,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
{
+ if (obj_cgroup_is_root(objcg))
+ return true;
return percpu_ref_tryget(&objcg->refcnt);
}
-static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
+ unsigned long nr)
{
- percpu_ref_get(&objcg->refcnt);
+ if (!obj_cgroup_is_root(objcg))
+ percpu_ref_get_many(&objcg->refcnt, nr);
}
-static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
- unsigned long nr)
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
{
- percpu_ref_get_many(&objcg->refcnt, nr);
+ obj_cgroup_get_many(objcg, 1);
}
static inline void obj_cgroup_put(struct obj_cgroup *objcg)
{
- if (objcg)
+ if (objcg && !obj_cgroup_is_root(objcg))
percpu_ref_put(&objcg->refcnt);
}
@@ -1088,6 +1097,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return true;
}
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+ return true;
+}
+
static inline bool mem_cgroup_disabled(void)
{
return true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5625955bcd23b..ac75e48f144c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
struct mem_cgroup *root_mem_cgroup __read_mostly;
EXPORT_SYMBOL(root_mem_cgroup);
+struct obj_cgroup *root_obj_cgroup __read_mostly;
+
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
@@ -2668,15 +2670,14 @@ struct mem_cgroup *mem_cgroup_from_virt(void *p)
static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg = NULL;
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
- for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
- objcg = rcu_dereference(memcg->objcg);
if (likely(objcg && obj_cgroup_tryget(objcg)))
- break;
- objcg = NULL;
+ return objcg;
}
- return objcg;
+
+ return NULL;
}
static struct obj_cgroup *current_objcg_update(void)
@@ -2750,18 +2751,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
* Objcg reference is kept by the task, so it's safe
* to use the objcg by the current task.
*/
- return objcg;
+ return objcg ? : root_obj_cgroup;
}
memcg = this_cpu_read(int_active_memcg);
if (unlikely(memcg))
goto from_memcg;
- return NULL;
+ return root_obj_cgroup;
from_memcg:
- objcg = NULL;
- for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
/*
* Memcg pointer is protected by scope (see set_active_memcg())
* and is pinning the corresponding objcg, so objcg can't go
@@ -2770,10 +2770,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
*/
objcg = rcu_dereference_check(memcg->objcg, 1);
if (likely(objcg))
- break;
+ return objcg;
}
- return objcg;
+ return root_obj_cgroup;
}
struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
@@ -2787,14 +2787,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
objcg = __folio_objcg(folio);
obj_cgroup_get(objcg);
} else {
- struct mem_cgroup *memcg;
-
rcu_read_lock();
- memcg = __folio_memcg(folio);
- if (memcg)
- objcg = __get_obj_cgroup_from_memcg(memcg);
- else
- objcg = NULL;
+ objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
rcu_read_unlock();
}
return objcg;
@@ -2897,7 +2891,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
int ret = 0;
objcg = current_obj_cgroup();
- if (objcg) {
+ if (objcg && !obj_cgroup_is_root(objcg)) {
ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
if (!ret) {
obj_cgroup_get(objcg);
@@ -3198,7 +3192,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
* obj_cgroup_get() is used to get a permanent reference.
*/
objcg = current_obj_cgroup();
- if (!objcg)
+ if (!objcg || obj_cgroup_is_root(objcg))
return true;
/*
@@ -3868,6 +3862,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (!objcg)
goto free_shrinker;
+ if (unlikely(mem_cgroup_is_root(memcg)))
+ root_obj_cgroup = objcg;
+
objcg->memcg = memcg;
rcu_assign_pointer(memcg->objcg, objcg);
obj_cgroup_get(objcg);
@@ -5496,6 +5493,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ if (obj_cgroup_is_root(objcg))
+ return;
+
VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
/* PF_MEMALLOC context, charging must succeed */
@@ -5523,6 +5523,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ if (obj_cgroup_is_root(objcg))
+ return;
+
obj_cgroup_uncharge(objcg, size);
rcu_read_lock();
diff --git a/mm/percpu.c b/mm/percpu.c
index a2107bdebf0b5..b0676b8054ed0 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1622,7 +1622,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
return true;
objcg = current_obj_cgroup();
- if (!objcg)
+ if (!objcg || obj_cgroup_is_root(objcg))
return true;
if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
--
2.20.1
* [PATCH v4 08/31] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in get_mem_cgroup_from_folio().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/memcontrol.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac75e48f144c1..9e7b00f1450e7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -991,14 +991,18 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
*/
struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
if (mem_cgroup_disabled())
return NULL;
+ if (!folio_memcg_charged(folio))
+ return root_mem_cgroup;
+
rcu_read_lock();
- if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
- memcg = root_mem_cgroup;
+ do {
+ memcg = folio_memcg(folio);
+ } while (unlikely(!css_tryget(&memcg->css)));
rcu_read_unlock();
return memcg;
}
--
2.20.1
* [PATCH v4 09/31] buffer: prevent memory cgroup release in folio_alloc_buffers()
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the function get_mem_cgroup_from_folio() is
employed to safeguard against the release of the memory cgroup.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
fs/buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index b67791690ed33..d80b635cff162 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -926,8 +926,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
long offset;
struct mem_cgroup *memcg, *old_memcg;
- /* The folio lock pins the memcg */
- memcg = folio_memcg(folio);
+ memcg = get_mem_cgroup_from_folio(folio);
old_memcg = set_active_memcg(memcg);
head = NULL;
@@ -948,6 +947,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
}
out:
set_active_memcg(old_memcg);
+ mem_cgroup_put(memcg);
return head;
/*
* In case anything failed, we just free everything we got.
--
2.20.1
* [PATCH v4 10/31] writeback: prevent memory cgroup release in writeback module
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the function get_mem_cgroup_css_from_folio()
and the rcu read lock are employed to safeguard against the release
of the memory cgroup.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
fs/fs-writeback.c | 22 +++++++++++-----------
include/linux/memcontrol.h | 9 +++++++--
include/trace/events/writeback.h | 3 +++
mm/memcontrol.c | 14 ++++++++------
4 files changed, 29 insertions(+), 19 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 68228bf89b82e..1a527ce28514d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -279,15 +279,13 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio)
if (inode_cgwb_enabled(inode)) {
struct cgroup_subsys_state *memcg_css;
- if (folio) {
- memcg_css = mem_cgroup_css_from_folio(folio);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- } else {
- /* must pin memcg_css, see wb_get_create() */
+ /* must pin memcg_css, see wb_get_create() */
+ if (folio)
+ memcg_css = get_mem_cgroup_css_from_folio(folio);
+ else
memcg_css = task_get_css(current, memory_cgrp_id);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- css_put(memcg_css);
- }
+ wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+ css_put(memcg_css);
}
if (!wb)
@@ -979,16 +977,16 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
if (!wbc->wb || wbc->no_cgroup_owner)
return;
- css = mem_cgroup_css_from_folio(folio);
+ css = get_mem_cgroup_css_from_folio(folio);
/* dead cgroups shouldn't contribute to inode ownership arbitration */
if (!css_is_online(css))
- return;
+ goto out;
id = css->id;
if (id == wbc->wb_id) {
wbc->wb_bytes += bytes;
- return;
+ goto out;
}
if (id == wbc->wb_lcand_id)
@@ -1001,6 +999,8 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
wbc->wb_tcand_bytes += bytes;
else
wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
+out:
+ css_put(css);
}
EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b3d8f341ff10..6b987f7089ca4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -895,7 +895,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
return match;
}
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio);
ino_t page_cgroup_ino(struct page *page);
static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
@@ -1564,9 +1564,14 @@ static inline void mem_cgroup_track_foreign_dirty(struct folio *folio,
if (mem_cgroup_disabled())
return;
+ if (!folio_memcg_charged(folio))
+ return;
+
+ rcu_read_lock();
memcg = folio_memcg(folio);
- if (unlikely(memcg && &memcg->css != wb->memcg_css))
+ if (unlikely(&memcg->css != wb->memcg_css))
mem_cgroup_track_foreign_dirty_slowpath(folio, wb);
+ rcu_read_unlock();
}
void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 4d3d8c8f3a1bc..b849b8cc96b1e 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -294,7 +294,10 @@ TRACE_EVENT(track_foreign_dirty,
__entry->ino = inode ? inode->i_ino : 0;
__entry->memcg_id = wb->memcg_css->id;
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
+
+ rcu_read_lock();
__entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);
+ rcu_read_unlock();
),
TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e7b00f1450e7..5508a4aced0cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -243,7 +243,7 @@ DEFINE_STATIC_KEY_FALSE(memcg_bpf_enabled_key);
EXPORT_SYMBOL(memcg_bpf_enabled_key);
/**
- * mem_cgroup_css_from_folio - css of the memcg associated with a folio
+ * get_mem_cgroup_css_from_folio - acquire a css of the memcg associated with a folio
* @folio: folio of interest
*
* If memcg is bound to the default hierarchy, css of the memcg associated
@@ -253,14 +253,16 @@ EXPORT_SYMBOL(memcg_bpf_enabled_key);
* If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
* is returned.
*/
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio)
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
- if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
- memcg = root_mem_cgroup;
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return &root_mem_cgroup->css;
- return &memcg->css;
+ memcg = get_mem_cgroup_from_folio(folio);
+
+ return memcg ? &memcg->css : &root_mem_cgroup->css;
}
/**
--
2.20.1

* [PATCH v4 11/31] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (9 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 10/31] writeback: prevent memory cgroup release in writeback module Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 12/31] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
` (19 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in count_memcg_folio_events().
This serves as a preparatory measure for the reparenting of the
LRU pages.
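More generally, the series converges on two safe access idioms; a rough
sketch (get_mem_cgroup_from_folio() comes from an earlier patch in the
series):
```
struct mem_cgroup *memcg;

/* Idiom 1: short, non-sleeping access - the rcu read lock suffices. */
rcu_read_lock();
memcg = folio_memcg(folio);
/* ... brief use of memcg, no sleeping ... */
rcu_read_unlock();

/* Idiom 2: longer-lived use - pin the memcg with a reference instead. */
memcg = get_mem_cgroup_from_folio(folio);
/* ... memcg remains valid across sleeps or long loops ... */
mem_cgroup_put(memcg);
```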
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/memcontrol.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6b987f7089ca4..f1556759d0d3f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -976,10 +976,15 @@ void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
static inline void count_memcg_folio_events(struct folio *folio,
enum vm_event_item idx, unsigned long nr)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
- if (memcg)
- count_memcg_events(memcg, idx, nr);
+ if (!folio_memcg_charged(folio))
+ return;
+
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
+ count_memcg_events(memcg, idx, nr);
+ rcu_read_unlock();
}
static inline void count_memcg_events_mm(struct mm_struct *mm,
--
2.20.1
* [PATCH v4 12/31] mm: page_io: prevent memory cgroup release in page_io module
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (10 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 11/31] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 13/31] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
` (18 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in swap_writeout() and
bio_associate_blkg_from_page().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/page_io.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index a2c034660c805..63b262f4c5a9b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -276,10 +276,14 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
goto out_unlock;
}
+
+ rcu_read_lock();
if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+ rcu_read_unlock();
folio_mark_dirty(folio);
return AOP_WRITEPAGE_ACTIVATE;
}
+ rcu_read_unlock();
__swap_writepage(folio, swap_plug);
return 0;
@@ -307,11 +311,11 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg;
- memcg = folio_memcg(folio);
- if (!memcg)
+ if (!folio_memcg_charged(folio))
return;
rcu_read_lock();
+ memcg = folio_memcg(folio);
css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
--
2.20.1
* [PATCH v4 13/31] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (11 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 12/31] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 14/31] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
` (17 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in __folio_migrate_mapping().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/migrate.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/migrate.c b/mm/migrate.c
index 1bf2cf8c44dd4..45ba49af2136c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -672,6 +672,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
struct lruvec *old_lruvec, *new_lruvec;
struct mem_cgroup *memcg;
+ rcu_read_lock();
memcg = folio_memcg(folio);
old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
@@ -699,6 +700,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
}
+ rcu_read_unlock();
}
local_irq_enable();
--
2.20.1
* [PATCH v4 14/31] mm: mglru: prevent memory cgroup release in mglru
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (12 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 13/31] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 15/31] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
` (16 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mglru.
This serves as a preparatory measure for the reparenting of the
LRU pages.
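A condensed view of the restructuring (sketch; it mirrors the
lru_gen_look_around() hunks below):
```
/* Pin the memcg for the whole PTE scan rather than a brief rcu section. */
memcg = get_mem_cgroup_from_folio(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
/* ... scan nearby PTEs and update generations via lruvec ... */
mem_cgroup_put(memcg);	/* drop only after the last lruvec access */
```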
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
mm/vmscan.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8039df1c9fca5..6a7eacd39bc5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3450,8 +3450,10 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
if (folio_nid(folio) != pgdat->node_id)
return NULL;
+ rcu_read_lock();
if (folio_memcg(folio) != memcg)
- return NULL;
+ folio = NULL;
+ rcu_read_unlock();
return folio;
}
@@ -4208,12 +4210,12 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long addr = pvmw->address;
struct vm_area_struct *vma = pvmw->vma;
struct folio *folio = pfn_folio(pvmw->pfn);
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
struct pglist_data *pgdat = folio_pgdat(folio);
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
- DEFINE_MAX_SEQ(lruvec);
- int gen = lru_gen_from_seq(max_seq);
+ struct lruvec *lruvec;
+ struct lru_gen_mm_state *mm_state;
+ unsigned long max_seq;
+ int gen;
lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
@@ -4248,6 +4250,12 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
}
}
+ memcg = get_mem_cgroup_from_folio(folio);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+ gen = lru_gen_from_seq(max_seq);
+ mm_state = get_mm_state(lruvec);
+
lazy_mmu_mode_enable();
pte -= (addr - start) / PAGE_SIZE;
@@ -4288,6 +4296,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+ mem_cgroup_put(memcg);
+
return true;
}
--
2.20.1
* [PATCH v4 15/31] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (13 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 14/31] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 16/31] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
` (15 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mem_cgroup_swap_full().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/memcontrol.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5508a4aced0cc..641c2fa077ccf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5283,27 +5283,29 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
bool mem_cgroup_swap_full(struct folio *folio)
{
struct mem_cgroup *memcg;
+ bool ret = false;
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
if (vm_swap_full())
return true;
- if (do_memsw_account())
- return false;
+ if (do_memsw_account() || !folio_memcg_charged(folio))
+ return ret;
+ rcu_read_lock();
memcg = folio_memcg(folio);
- if (!memcg)
- return false;
-
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
unsigned long usage = page_counter_read(&memcg->swap);
if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
- usage * 2 >= READ_ONCE(memcg->swap.max))
- return true;
+ usage * 2 >= READ_ONCE(memcg->swap.max)) {
+ ret = true;
+ break;
+ }
}
+ rcu_read_unlock();
- return false;
+ return ret;
}
static int __init setup_swap_account(char *s)
--
2.20.1
* [PATCH v4 16/31] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (14 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 15/31] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 17/31] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
` (14 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in lru_gen_eviction().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/workingset.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index ed12744d93a29..79b40f058cd48 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -241,11 +241,14 @@ static void *lru_gen_eviction(struct folio *folio)
int refs = folio_lru_refs(folio);
bool workingset = folio_test_workingset(folio);
int tier = lru_tier_from_refs(refs, workingset);
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
struct pglist_data *pgdat = folio_pgdat(folio);
+ unsigned short memcg_id;
BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
min_seq = READ_ONCE(lrugen->min_seq[type]);
@@ -253,8 +256,10 @@ static void *lru_gen_eviction(struct folio *folio)
hist = lru_hist_from_seq(min_seq);
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+ memcg_id = mem_cgroup_private_id(memcg);
+ rcu_read_unlock();
- return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset);
+ return pack_shadow(memcg_id, pgdat, token, workingset);
}
/*
--
2.20.1
* [PATCH v4 17/31] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (15 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 16/31] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 18/31] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
` (13 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
In the near future, a folio will no longer pin its corresponding memory
cgroup. To ensure safety, it will only be appropriate to hold the rcu read
lock or acquire a reference to the memory cgroup returned by
folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard against
the release of the memory cgroup in folio_split_queue_lock{_irqsave}().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Muchun Song <muchun.song@linux.dev>
---
mm/huge_memory.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 580376323124c..3f6675240f9dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1154,13 +1154,29 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
static struct deferred_split *folio_split_queue_lock(struct folio *folio)
{
- return split_queue_lock(folio_nid(folio), folio_memcg(folio));
+ struct deferred_split *queue;
+
+ rcu_read_lock();
+ queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
+ /*
+ * The memcg destruction path is acquiring the split queue lock for
+ * reparenting. Once you have it locked, it's safe to drop the rcu lock.
+ */
+ rcu_read_unlock();
+
+ return queue;
}
static struct deferred_split *
folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
{
- return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+ struct deferred_split *queue;
+
+ rcu_read_lock();
+ queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+ rcu_read_unlock();
+
+ return queue;
}
static inline void split_queue_unlock(struct deferred_split *queue)
--
2.20.1
* [PATCH v4 18/31] mm: zswap: prevent memory cgroup release in zswap_compress()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (16 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 17/31] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 19/31] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
` (12 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
In the near future, a folio will no longer pin its corresponding memory
cgroup. To ensure safety, it will only be appropriate to hold the rcu read
lock or acquire a reference to the memory cgroup returned by
folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard against
the release of the memory cgroup in zswap_compress().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
mm/zswap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/zswap.c b/mm/zswap.c
index 3d2d59ac3f9c2..a9319ecd92b4b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -893,11 +893,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
* to the active LRU list in the case.
*/
if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
+ rcu_read_lock();
if (!mem_cgroup_zswap_writeback_enabled(
folio_memcg(page_folio(page)))) {
+ rcu_read_unlock();
comp_ret = comp_ret ? comp_ret : -EINVAL;
goto unlock;
}
+ rcu_read_unlock();
comp_ret = 0;
dlen = PAGE_SIZE;
dst = kmap_local_page(page);
--
2.20.1
* [PATCH v4 19/31] mm: workingset: prevent lruvec release in workingset_refault()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (17 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 18/31] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 20/31] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
` (11 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_refault().
This serves as a preparatory measure for the reparenting of the
LRU pages.
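Since the per-node lruvec is embedded in the memcg's per-node structure,
pinning the memcg also keeps the lruvec alive; a sketch of the pattern
used below:
```
/* Pinning the memcg keeps its lruvecs valid until mem_cgroup_put(). */
memcg = get_mem_cgroup_from_folio(folio);
lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
/* ... the lruvec may be used throughout the refault path ... */
mem_cgroup_put(memcg);
```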
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/workingset.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 79b40f058cd48..5fbf316fb0e71 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -539,6 +539,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
void workingset_refault(struct folio *folio, void *shadow)
{
bool file = folio_is_file_lru(folio);
+ struct mem_cgroup *memcg;
struct lruvec *lruvec;
bool workingset;
long nr;
@@ -560,11 +561,12 @@ void workingset_refault(struct folio *folio, void *shadow)
* locked to guarantee folio_memcg() stability throughout.
*/
nr = folio_nr_pages(folio);
- lruvec = folio_lruvec(folio);
+ memcg = get_mem_cgroup_from_folio(folio);
+ lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
- return;
+ goto out;
folio_set_active(folio);
workingset_age_nonresident(lruvec, nr);
@@ -580,6 +582,8 @@ void workingset_refault(struct folio *folio, void *shadow)
lru_note_cost_refault(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
+out:
+ mem_cgroup_put(memcg);
}
/**
--
2.20.1
* [PATCH v4 20/31] mm: zswap: prevent lruvec release in zswap_folio_swapin()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (18 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 19/31] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 21/31] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
` (10 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Nhat Pham,
Chengming Zhou, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in zswap_folio_swapin().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/zswap.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/zswap.c b/mm/zswap.c
index a9319ecd92b4b..aea3267c5a967 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -664,8 +664,10 @@ void zswap_folio_swapin(struct folio *folio)
struct lruvec *lruvec;
if (folio) {
+ rcu_read_lock();
lruvec = folio_lruvec(folio);
atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins);
+ rcu_read_unlock();
}
}
--
2.20.1
* [PATCH v4 21/31] mm: swap: prevent lruvec release in lru_gen_clear_refs()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (19 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 20/31] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 22/31] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
` (9 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in lru_gen_clear_refs().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/swap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 245ba159e01d7..cb1148a92d8ec 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -412,18 +412,20 @@ static void lru_gen_inc_refs(struct folio *folio)
static bool lru_gen_clear_refs(struct folio *folio)
{
- struct lru_gen_folio *lrugen;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
+ unsigned long seq;
if (gen < 0)
return true;
set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS | BIT(PG_workingset), 0);
- lrugen = &folio_lruvec(folio)->lrugen;
+ rcu_read_lock();
+ seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]);
+ rcu_read_unlock();
/* whether can do without shuffling under the LRU lock */
- return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
+ return gen == lru_gen_from_seq(seq);
}
#else /* !CONFIG_LRU_GEN */
--
2.20.1
* [PATCH v4 22/31] mm: workingset: prevent lruvec release in workingset_activation()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (20 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 21/31] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 23/31] mm: do not open-code lruvec lock Qi Zheng
` (8 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_activation().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/workingset.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 5fbf316fb0e71..2be53098f6282 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -596,8 +596,11 @@ void workingset_activation(struct folio *folio)
* Filter non-memcg pages here, e.g. unmap can call
* mark_page_accessed() on VDSO pages.
*/
- if (mem_cgroup_disabled() || folio_memcg_charged(folio))
+ if (mem_cgroup_disabled() || folio_memcg_charged(folio)) {
+ rcu_read_lock();
workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
+ rcu_read_unlock();
+ }
}
/*
--
2.20.1
* [PATCH v4 23/31] mm: do not open-code lruvec lock
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (21 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 22/31] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for " Qi Zheng
` (7 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
Now we have lruvec_unlock(), lruvec_unlock_irq() and
lruvec_unlock_irqrestore(), but not the paired lruvec_lock(),
lruvec_lock_irq() and lruvec_lock_irqsave().
There is currently no use case for lruvec_lock_irqsave(), so only
introduce lruvec_lock_irq(), and convert all open-coded call sites to
use this helper function. This looks cleaner and prepares for
reparenting LRU pages by preventing users from missing the required
RCU lock calls when open-coding the lruvec lock.
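The intended pairing then looks as follows (sketch; a later patch in
this series additionally makes these helpers take and drop the rcu read
lock):
```
lruvec_lock_irq(lruvec);	/* instead of spin_lock_irq(&lruvec->lru_lock) */
/* ... isolate folios, update LRU state ... */
lruvec_unlock_irq(lruvec);	/* instead of spin_unlock_irq(&lruvec->lru_lock) */
```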
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
include/linux/memcontrol.h | 5 +++++
mm/vmscan.c | 38 +++++++++++++++++++-------------------
2 files changed, 24 insertions(+), 19 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f1556759d0d3f..4b6f20dc694ba 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1499,6 +1499,11 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}
+static inline void lruvec_lock_irq(struct lruvec *lruvec)
+{
+ spin_lock_irq(&lruvec->lru_lock);
+}
+
static inline void lruvec_unlock(struct lruvec *lruvec)
{
spin_unlock(&lruvec->lru_lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a7eacd39bc5f..f904231e33ec0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2003,7 +2003,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
lru_add_drain();
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &folio_list,
&nr_scanned, sc, lru);
@@ -2015,7 +2015,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
__count_vm_events(PGSCAN_ANON + file, nr_scanned);
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
if (nr_taken == 0)
return 0;
@@ -2034,7 +2034,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
@@ -2113,7 +2113,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
lru_add_drain();
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, lru);
@@ -2124,7 +2124,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGREFILL, nr_scanned);
count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
while (!list_empty(&l_hold)) {
struct folio *folio;
@@ -2180,7 +2180,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
nr_deactivate, nr_rotated, sc->priority, file);
@@ -3801,9 +3801,9 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
}
if (walk->batched) {
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
reset_batch_size(walk);
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
}
cond_resched();
@@ -3962,7 +3962,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
if (seq < READ_ONCE(lrugen->max_seq))
return false;
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
@@ -3977,7 +3977,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
if (inc_min_seq(lruvec, type, swappiness))
continue;
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
cond_resched();
goto restart;
}
@@ -4012,7 +4012,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
/* make sure preceding modifications appear */
smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
unlock:
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
return success;
}
@@ -4708,7 +4708,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
@@ -4717,7 +4717,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
scanned = 0;
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
if (list_empty(&list))
return scanned;
@@ -4755,9 +4755,9 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
walk = current->reclaim_state->mm_walk;
if (walk && walk->batched) {
walk->lruvec = lruvec;
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
reset_batch_size(walk);
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
}
mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
@@ -5195,7 +5195,7 @@ static void lru_gen_change_state(bool enabled)
for_each_node(nid) {
struct lruvec *lruvec = get_lruvec(memcg, nid);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
VM_WARN_ON_ONCE(!state_is_valid(lruvec));
@@ -5203,12 +5203,12 @@ static void lru_gen_change_state(bool enabled)
lruvec->lrugen.enabled = enabled;
while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
cond_resched();
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec_lock_irq(lruvec);
}
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
}
cond_resched();
--
2.20.1
* [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (22 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 23/31] mm: do not open-code lruvec lock Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 15:02 ` kernel test robot
2026-02-05 15:02 ` kernel test robot
2026-02-05 9:01 ` [PATCH v4 25/31] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
` (6 subsequent siblings)
30 siblings, 2 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
The following pseudocode illustrates how the safety of the folio
lruvec lock is ensured when LRU folios undergo reparenting.
In the folio_lruvec_lock(folio) function:
```
rcu_read_lock();
retry:
lruvec = folio_lruvec(folio);
/* There is a possibility of folio reparenting at this point. */
spin_lock(&lruvec->lru_lock);
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
/*
* The wrong lruvec lock was acquired, and a retry is required.
* This is because the folio resides on the parent memcg lruvec
* list.
*/
spin_unlock(&lruvec->lru_lock);
goto retry;
}
/* Reaching here indicates that folio_memcg() is stable. */
```
In the memcg_reparent_objcgs(memcg) function:
```
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);
/* Transfer folios from the lruvec list to the parent's. */
spin_unlock(&lruvec_parent->lru_lock);
spin_unlock(&lruvec->lru_lock);
```
After acquiring the lruvec lock, it is necessary to verify whether
the folio has been reparented. If reparenting has occurred, the new
lruvec lock must be reacquired. The retry loop is bounded, since a
folio is only ever reparented toward an ancestor and the root memcg
is never destroyed. During the LRU folio reparenting process, the
lruvec lock will also be acquired (this will be implemented in a
subsequent patch). Therefore, folio_memcg() remains unchanged while
the lruvec lock is held.
Given that lruvec_memcg(lruvec) is always equal to folio_memcg(folio)
after the lruvec lock is acquired, the lruvec_memcg_debug() check is
redundant. Hence, it is removed.
This patch serves as a preparation for the reparenting of LRU folios.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/memcontrol.h | 51 ++++++++----------------
include/linux/swap.h | 3 +-
mm/compaction.c | 29 +++++++++++---
mm/memcontrol.c | 79 ++++++++++++++++++++++++++++----------
mm/swap.c | 6 ++-
5 files changed, 104 insertions(+), 64 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4b6f20dc694ba..3970c102fe741 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -742,7 +742,15 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
* folio_lruvec - return lruvec for isolating/putting an LRU folio
* @folio: Pointer to the folio.
*
- * This function relies on folio->mem_cgroup being stable.
+ * Call with rcu_read_lock() held to ensure the lifetime of the returned lruvec.
+ * Note that this alone will NOT guarantee the stability of the folio->lruvec
+ * association; the folio can be reparented to an ancestor if this races with
+ * cgroup deletion.
+ *
+ * Use folio_lruvec_lock() to ensure both lifetime and stability of the binding.
+ * Once a lruvec is locked, folio_lruvec() can be called on other folios, and
+ * their binding is stable if the returned lruvec matches the one the caller has
+ * locked. Useful for lock batching.
*/
static inline struct lruvec *folio_lruvec(struct folio *folio)
{
@@ -765,15 +773,6 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
unsigned long *flags);
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio);
-#else
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-#endif
-
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1199,11 +1198,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio)
return &pgdat->__lruvec;
}
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-
static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
return NULL;
@@ -1262,6 +1256,7 @@ static inline struct lruvec *folio_lruvec_lock(struct folio *folio)
{
struct pglist_data *pgdat = folio_pgdat(folio);
+ rcu_read_lock();
spin_lock(&pgdat->__lruvec.lru_lock);
return &pgdat->__lruvec;
}
@@ -1270,6 +1265,7 @@ static inline struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
{
struct pglist_data *pgdat = folio_pgdat(folio);
+ rcu_read_lock();
spin_lock_irq(&pgdat->__lruvec.lru_lock);
return &pgdat->__lruvec;
}
@@ -1279,6 +1275,7 @@ static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
{
struct pglist_data *pgdat = folio_pgdat(folio);
+ rcu_read_lock();
spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
return &pgdat->__lruvec;
}
@@ -1499,26 +1496,10 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}
-static inline void lruvec_lock_irq(struct lruvec *lruvec)
-{
- spin_lock_irq(&lruvec->lru_lock);
-}
-
-static inline void lruvec_unlock(struct lruvec *lruvec)
-{
- spin_unlock(&lruvec->lru_lock);
-}
-
-static inline void lruvec_unlock_irq(struct lruvec *lruvec)
-{
- spin_unlock_irq(&lruvec->lru_lock);
-}
-
-static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
- unsigned long flags)
-{
- spin_unlock_irqrestore(&lruvec->lru_lock, flags);
-}
+void lruvec_lock_irq(struct lruvec *lruvec);
+void lruvec_unlock(struct lruvec *lruvec);
+void lruvec_unlock_irq(struct lruvec *lruvec);
+void lruvec_unlock_irqrestore(struct lruvec *lruvec, unsigned long flags);
/* Test requires a stable folio->memcg binding, see folio_memcg() */
static inline bool folio_matches_lruvec(struct folio *folio,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b4089..39ecd25217178 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -328,8 +328,7 @@ extern unsigned long totalreserve_pages;
/* linux/mm/swap.c */
void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
- unsigned int nr_io, unsigned int nr_rotated)
- __releases(lruvec->lru_lock);
+ unsigned int nr_io, unsigned int nr_rotated);
void lru_note_cost_refault(struct folio *);
void folio_add_lru(struct folio *);
void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
diff --git a/mm/compaction.c b/mm/compaction.c
index c3e338aaa0ffb..3648ce22c8072 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -518,6 +518,24 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
return true;
}
+static struct lruvec *
+compact_folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags,
+ struct compact_control *cc)
+{
+ struct lruvec *lruvec;
+
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
+ compact_lock_irqsave(&lruvec->lru_lock, flags, cc);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
+
+ return lruvec;
+}
+
/*
* Compaction requires the taking of some coarse locks that are potentially
* very heavily contended. The lock should be periodically unlocked to avoid
@@ -839,7 +857,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
{
pg_data_t *pgdat = cc->zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long flags = 0;
struct lruvec *locked = NULL;
struct folio *folio = NULL;
@@ -1153,18 +1171,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!folio_test_clear_lru(folio))
goto isolate_fail_put;
- lruvec = folio_lruvec(folio);
+ if (locked)
+ lruvec = folio_lruvec(folio);
/* If we already hold the lock, we can skip some rechecking */
- if (lruvec != locked) {
+ if (lruvec != locked || !locked) {
if (locked)
lruvec_unlock_irqrestore(locked, flags);
- compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+ lruvec = compact_folio_lruvec_lock_irqsave(folio, &flags, cc);
locked = lruvec;
- lruvec_memcg_debug(lruvec, folio);
-
/*
* Try get exclusive access under lock. If marked for
* skip, the scan is aborted unless the current context
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 641c2fa077ccf..115a1f34bcef9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1201,22 +1201,37 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
}
}
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
+void lruvec_lock_irq(struct lruvec *lruvec)
+ __acquires(&lruvec->lru_lock)
+ __acquires(rcu)
{
- struct mem_cgroup *memcg;
+ rcu_read_lock();
+ spin_lock_irq(&lruvec->lru_lock);
+}
- if (mem_cgroup_disabled())
- return;
+void lruvec_unlock(struct lruvec *lruvec)
+ __releases(&lruvec->lru_lock)
+ __releases(rcu)
+{
+ spin_unlock(&lruvec->lru_lock);
+ rcu_read_unlock();
+}
- memcg = folio_memcg(folio);
+void lruvec_unlock_irq(struct lruvec *lruvec)
+ __releases(&lruvec->lru_lock)
+ __releases(rcu)
+{
+ spin_unlock_irq(&lruvec->lru_lock);
+ rcu_read_unlock();
+}
- if (!memcg)
- VM_BUG_ON_FOLIO(!mem_cgroup_is_root(lruvec_memcg(lruvec)), folio);
- else
- VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio);
+void lruvec_unlock_irqrestore(struct lruvec *lruvec, unsigned long flags)
+ __releases(&lruvec->lru_lock)
+ __releases(rcu)
+{
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ rcu_read_unlock();
}
-#endif
/**
* folio_lruvec_lock - Lock the lruvec for a folio.
@@ -1227,14 +1242,22 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
* - folio_test_lru false
* - folio frozen (refcount of 0)
*
- * Return: The lruvec this folio is on with its lock held.
+ * Return: The lruvec this folio is on with its lock held and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock(struct folio *folio)
+ __acquires(&lruvec->lru_lock)
+ __acquires(rcu)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock(&lruvec->lru_lock);
+ goto retry;
+ }
return lruvec;
}
@@ -1249,14 +1272,22 @@ struct lruvec *folio_lruvec_lock(struct folio *folio)
* - folio frozen (refcount of 0)
*
* Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
+ __acquires(&lruvec->lru_lock)
+ __acquires(rcu)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock_irq(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ goto retry;
+ }
return lruvec;
}
@@ -1272,15 +1303,23 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
* - folio frozen (refcount of 0)
*
* Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
unsigned long *flags)
+ __acquires(&lruvec->lru_lock)
+ __acquires(rcu)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
return lruvec;
}
diff --git a/mm/swap.c b/mm/swap.c
index cb1148a92d8ec..d5bfe6a76ca45 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -240,6 +240,7 @@ void folio_rotate_reclaimable(struct folio *folio)
void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
unsigned int nr_io, unsigned int nr_rotated)
__releases(lruvec->lru_lock)
+ __releases(rcu)
{
unsigned long cost;
@@ -253,6 +254,7 @@ void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated;
if (!cost) {
spin_unlock_irq(&lruvec->lru_lock);
+ rcu_read_unlock();
return;
}
@@ -285,8 +287,10 @@ void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
spin_unlock_irq(&lruvec->lru_lock);
lruvec = parent_lruvec(lruvec);
- if (!lruvec)
+ if (!lruvec) {
+ rcu_read_unlock();
break;
+ }
spin_lock_irq(&lruvec->lru_lock);
}
}
--
2.20.1
* [PATCH v4 25/31] mm: vmscan: prepare for reparenting traditional LRU folios
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (23 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for " Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-07 1:28 ` Shakeel Butt
2026-02-05 9:01 ` [PATCH v4 26/31] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
` (5 subsequent siblings)
30 siblings, 1 reply; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
To resolve the dying memcg issue, we need to reparent the LRU folios of
a child memcg to its parent memcg. For the traditional LRU, each memcg's
per-node lruvec comprises the same fixed set of LRU lists. Due to this
symmetry, it is feasible to transfer the LRU lists from a memcg to its
parent memcg during the reparenting process.
This commit implements the specific function, which will be used during
the reparenting process.
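At its core, each list transfer relies on list_splice_tail_init(), which
moves a whole list in O(1) and reinitializes the (now empty) source; a
sketch of one list's move, with the matching size accounting (see the
full helper in the diff below):
```
/* O(1): append the child's LRU list to the tail of the parent's. */
list_splice_tail_init(&child_lruvec->lists[lru], &parent_lruvec->lists[lru]);
/* The per-zone LRU size accounting must be transferred as well. */
size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
```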
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Muchun Song <muchun.song@linux.dev>
---
include/linux/swap.h | 21 +++++++++++++++++++++
mm/swap.c | 37 +++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 19 -------------------
3 files changed, 58 insertions(+), 19 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 39ecd25217178..62e124ec6b75a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -570,6 +570,8 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
return READ_ONCE(memcg->swappiness);
}
+
+void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent);
#else
static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
{
@@ -634,5 +636,24 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+/* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to
+ * and including the specified highidx
+ * @zone: The current zone in the iterator
+ * @pgdat: The pgdat which node_zones are being iterated
+ * @idx: The index variable
+ * @highidx: The index of the highest zone to return
+ *
+ * This macro iterates through all managed zones up to and including the specified highidx.
+ * The zone iterator enters an invalid state after macro call and must be reinitialized
+ * before it can be used again.
+ */
+#define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx) \
+ for ((idx) = 0, (zone) = (pgdat)->node_zones; \
+ (idx) <= (highidx); \
+ (idx)++, (zone)++) \
+ if (!managed_zone(zone)) \
+ continue; \
+ else
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/swap.c b/mm/swap.c
index d5bfe6a76ca45..25f39d4263fb5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1090,6 +1090,43 @@ void folio_batch_remove_exceptionals(struct folio_batch *fbatch)
fbatch->nr = j;
}
+#ifdef CONFIG_MEMCG
+static void lruvec_reparent_lru(struct lruvec *child_lruvec,
+ struct lruvec *parent_lruvec,
+ enum lru_list lru, int nid)
+{
+ int zid;
+ struct zone *zone;
+
+ if (lru != LRU_UNEVICTABLE)
+ list_splice_tail_init(&child_lruvec->lists[lru], &parent_lruvec->lists[lru]);
+
+ for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+ unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
+
+ mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
+ }
+}
+
+void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+ int nid;
+
+ for_each_node(nid) {
+ enum lru_list lru;
+ struct lruvec *child_lruvec, *parent_lruvec;
+
+ child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
+ parent_lruvec->anon_cost += child_lruvec->anon_cost;
+ parent_lruvec->file_cost += child_lruvec->file_cost;
+
+ for_each_lru(lru)
+ lruvec_reparent_lru(child_lruvec, parent_lruvec, lru, nid);
+ }
+}
+#endif
+
static const struct ctl_table swap_sysctl_table[] = {
{
.procname = "page-cluster",
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f904231e33ec0..e2d9ef9a5dedc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -269,25 +269,6 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
}
#endif
-/* for_each_managed_zone_pgdat - helper macro to iterate over all managed zones in a pgdat up to
- * and including the specified highidx
- * @zone: The current zone in the iterator
- * @pgdat: The pgdat which node_zones are being iterated
- * @idx: The index variable
- * @highidx: The index of the highest zone to return
- *
- * This macro iterates through all managed zones up to and including the specified highidx.
- * The zone iterator enters an invalid state after macro call and must be reinitialized
- * before it can be used again.
- */
-#define for_each_managed_zone_pgdat(zone, pgdat, idx, highidx) \
- for ((idx) = 0, (zone) = (pgdat)->node_zones; \
- (idx) <= (highidx); \
- (idx)++, (zone)++) \
- if (!managed_zone(zone)) \
- continue; \
- else
-
static void set_task_reclaim_state(struct task_struct *task,
struct reclaim_state *rs)
{
--
2.20.1
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v4 26/31] mm: vmscan: prepare for reparenting MGLRU folios
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (24 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 25/31] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-12 8:46 ` Harry Yoo
2026-02-05 9:01 ` [PATCH v4 27/31] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
` (4 subsequent siblings)
30 siblings, 1 reply; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
Similar to traditional LRU folios, in order to solve the dying memcg
problem, we also need to reparent MGLRU folios to the parent memcg when
a memcg is offlined.
However, there are the following challenges:
1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations; the
   number of generations in the parent and child memcg may differ, so we
   cannot simply transfer MGLRU folios from the child memcg to the parent
   memcg as we do for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
   traverse these folios while holding the lru lock; doing so may cause a
   softlockup.
3. In walk_update_folio(), the gen of a folio and the corresponding lru
   size may be updated without the folio immediately moving to the
   matching lru list. Therefore, folios of different generations may
   coexist on one LRU list.
4. In lru_gen_del_folio(), the generation a folio belongs to is derived
   from the generation information in folio->flags, and the corresponding
   LRU size is updated. Therefore, the lru size must be kept correct
   during reparenting, or lru_gen_del_folio() will update it incorrectly.
Finally, this patch chooses a compromise: splice each lru list in the
child memcg onto the lru list of the same generation in the parent memcg
during reparenting. To ensure that the parent memcg has matching
generations, the number of generations in the parent memcg is first
increased to MAX_NR_GENS.
Of course, the same generation has different meanings in the parent and
child memcg, so this blurs the hot/cold information of the spliced
folios. But other than that, the method is simple, keeps the lru sizes
correct, and avoids a class of concurrency issues (such as those around
lru_gen_del_folio()).
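As a sketch, the splice relies on both memcgs using the same seq-to-gen
mapping (this mirrors lru_gen_from_seq() in mm/vmscan.c; shown here only
for illustration):

	/*
	 * Both memcgs index lrugen->folios[gen] with seq % MAX_NR_GENS, so
	 * once the parent also has MAX_NR_GENS generations, every child
	 * slot has a parent slot to splice into -- even though the same
	 * slot may encode a different age in each memcg.
	 */
	static int gen_from_seq(unsigned long seq)
	{
		return seq % MAX_NR_GENS;
	}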
To prepare for the above work, this commit implements the specific
functions, which will be used during reparenting.
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Suggested-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/mmzone.h | 16 +++++
mm/vmscan.c | 154 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 170 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4c..0c18b17f0fe2e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -624,6 +624,9 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg);
void lru_gen_offline_memcg(struct mem_cgroup *memcg);
void lru_gen_release_memcg(struct mem_cgroup *memcg);
void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg);
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent);
#else /* !CONFIG_LRU_GEN */
@@ -664,6 +667,19 @@ static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
{
}
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+ return true;
+}
+
+static inline void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+}
+
#endif /* CONFIG_LRU_GEN */
struct lruvec {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e2d9ef9a5dedc..8c6f8f0df24b1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4414,6 +4414,160 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
}
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+ int type;
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
+ struct lruvec *lruvec)
+{
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
+ int swappiness = mem_cgroup_swappiness(memcg);
+ DEFINE_MAX_SEQ(lruvec);
+ bool success = false;
+
+ /*
+ * We are not iterating the mm_list here; updating mm_state->seq is just
+ * to make mm walkers work properly.
+ */
+ if (mm_state) {
+ spin_lock(&mm_list->lock);
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+ if (max_seq > mm_state->seq) {
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+ success = true;
+ }
+ spin_unlock(&mm_list->lock);
+ } else {
+ success = true;
+ }
+
+ if (success)
+ inc_max_seq(lruvec, max_seq, swappiness);
+}
+
+/*
+ * We need to ensure that the folios of a child memcg can be reparented to
+ * the same gen of the parent memcg, so the gens of the parent memcg need
+ * to be incremented to MAX_NR_GENS before reparenting.
+ */
+void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+ int type;
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
+ try_to_inc_max_seq_nowalk(memcg, lruvec);
+ cond_resched();
+ }
+ }
+ }
+}
+
+/*
+ * Compared to the traditional LRU, MGLRU faces the following challenges:
+ *
+ * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations; the
+ *    number of generations in the parent and child memcg may differ, so we
+ *    cannot simply transfer MGLRU folios from the child memcg to the
+ *    parent memcg as we do for traditional LRU folios.
+ * 2. The generation information is stored in folio->flags, but we cannot
+ *    traverse these folios while holding the lru lock; doing so may cause
+ *    a softlockup.
+ * 3. In walk_update_folio(), the gen of a folio and the corresponding lru
+ *    size may be updated without the folio immediately moving to the
+ *    matching lru list. Therefore, folios of different generations may
+ *    coexist on one LRU list.
+ * 4. In lru_gen_del_folio(), the generation a folio belongs to is derived
+ *    from the generation information in folio->flags, and the
+ *    corresponding LRU size is updated. Therefore, the lru size must be
+ *    kept correct during reparenting, or lru_gen_del_folio() will update
+ *    it incorrectly.
+ *
+ * Finally, we choose a compromise: splice each lru list in the child memcg
+ * onto the lru list of the same generation in the parent memcg during
+ * reparenting.
+ *
+ * The same generation has different meanings in the parent and child memcg,
+ * so this compromise causes LRU inversion. But as the system runs, the
+ * inversion corrects itself.
+ */
+static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec,
+ int zone, int type)
+{
+ struct lru_gen_folio *child_lrugen, *parent_lrugen;
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
+ int i;
+
+ child_lrugen = &child_lruvec->lrugen;
+ parent_lrugen = &parent_lruvec->lrugen;
+
+ for (i = 0; i < get_nr_gens(child_lruvec, type); i++) {
+ int gen = lru_gen_from_seq(child_lrugen->max_seq - i);
+ long nr_pages = child_lrugen->nr_pages[gen][type][zone];
+ int child_lru_active = lru_gen_is_active(child_lruvec, gen) ? LRU_ACTIVE : 0;
+ int parent_lru_active = lru_gen_is_active(parent_lruvec, gen) ? LRU_ACTIVE : 0;
+
+ /* Assuming that child pages are colder than parent pages */
+ list_splice_init(&child_lrugen->folios[gen][type][zone],
+ &parent_lrugen->folios[gen][type][zone]);
+
+ WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0);
+ WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone],
+ parent_lrugen->nr_pages[gen][type][zone] + nr_pages);
+
+ if (lru_gen_is_active(child_lruvec, gen) != lru_gen_is_active(parent_lruvec, gen)) {
+ __update_lru_size(child_lruvec, lru + child_lru_active, zone, -nr_pages);
+ __update_lru_size(parent_lruvec, lru + parent_lru_active, zone, nr_pages);
+ }
+ }
+}
+
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *child_lruvec, *parent_lruvec;
+ int type, zid;
+ struct zone *zone;
+ enum lru_list lru;
+
+ child_lruvec = get_lruvec(memcg, nid);
+ parent_lruvec = get_lruvec(parent, nid);
+
+ for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1)
+ for (type = 0; type < ANON_AND_FILE; type++)
+ __lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type);
+
+ for_each_lru(lru) {
+ for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+ unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
+
+ mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
+ }
+ }
+ }
+}
+
#endif /* CONFIG_MEMCG */
/******************************************************************************
--
2.20.1
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v4 27/31] mm: memcontrol: refactor memcg_reparent_objcgs()
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (25 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 26/31] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages Qi Zheng
` (3 subsequent siblings)
30 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
Refactor memcg_reparent_objcgs() to facilitate the subsequent reparenting
of LRU folios here.
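After the refactor, the flow reads as below (condensed from this patch;
the comments are editorial, not part of the patch):

	reparent_locks(memcg, parent);	/* for now, just objcg_lock */
	objcg = __memcg_reparent_objcgs(memcg, parent);
	reparent_unlocks(memcg, parent);	/* later patches also take and
						   drop the lruvec locks here */
	percpu_ref_kill(&objcg->refcnt);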
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
mm/memcontrol.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 115a1f34bcef9..c9b5dfd822d0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -208,15 +208,12 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
+static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
{
struct obj_cgroup *objcg, *iter;
- struct mem_cgroup *parent = parent_mem_cgroup(memcg);
objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
-
- spin_lock_irq(&objcg_lock);
-
/* 1) Ready to reparent active objcg. */
list_add(&objcg->list, &memcg->objcg_list);
/* 2) Reparent active objcg and already reparented objcgs to parent. */
@@ -225,7 +222,29 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
/* 3) Move already reparented objcgs to the parent's list */
list_splice(&memcg->objcg_list, &parent->objcg_list);
+ return objcg;
+}
+
+static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+ spin_lock_irq(&objcg_lock);
+}
+
+static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
spin_unlock_irq(&objcg_lock);
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
+{
+ struct obj_cgroup *objcg;
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+
+ reparent_locks(memcg, parent);
+
+ objcg = __memcg_reparent_objcgs(memcg, parent);
+
+ reparent_unlocks(memcg, parent);
percpu_ref_kill(&objcg->refcnt);
}
--
2.20.1
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (26 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 27/31] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-07 1:48 ` Shakeel Butt
2026-02-07 3:59 ` Muchun Song
2026-02-05 9:01 ` [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats Qi Zheng
` (2 subsequent siblings)
30 siblings, 2 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
For cgroup v2, count_shadow_nodes() is the only place that reads
non-hierarchical stats (lruvec_stats->state_local). To avoid having to
consider cgroup v2 during the subsequent reparenting of non-hierarchical
stats, use lruvec_lru_size() instead of lruvec_page_state_local() to get
the number of lru pages.
As for the NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B cases, those
statistics have already been problematic for a while, since slab pages
are reparented. So just ignore them for now.
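As a sketch, lruvec_lru_size() sums the per-zone LRU counters, so it
reflects folios on the lists themselves rather than the non-hierarchical
stat counters (condensed from mm/vmscan.c, memcg-enabled case only):

	unsigned long size = 0;
	int zid;

	for (zid = 0; zid <= zone_idx; zid++)
		size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);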
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/swap.h | 1 +
mm/vmscan.c | 3 +--
mm/workingset.c | 5 +++--
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62e124ec6b75a..2ee736438b9d6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -370,6 +370,7 @@ extern void swap_setup(void);
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx);
#define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
#define MEMCG_RECLAIM_PROACTIVE (1 << 2)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8c6f8f0df24b1..4dc52b4b4af50 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -390,8 +390,7 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
* @lru: lru to use
* @zone_idx: zones to consider (use MAX_NR_ZONES - 1 for the whole LRU list)
*/
-static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
- int zone_idx)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
unsigned long size = 0;
int zid;
diff --git a/mm/workingset.c b/mm/workingset.c
index 2be53098f6282..ecdc293cd06fb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -684,9 +684,10 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
mem_cgroup_flush_stats_ratelimited(sc->memcg);
lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
+
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
- pages += lruvec_page_state_local(lruvec,
- NR_LRU_BASE + i);
+ pages += lruvec_lru_size(lruvec, i, MAX_NR_ZONES - 1);
+
pages += lruvec_page_state_local(
lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
pages += lruvec_page_state_local(
--
2.20.1
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (27 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-07 2:19 ` Shakeel Butt
2026-02-05 9:01 ` [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
2026-02-05 9:01 ` [PATCH v4 31/31] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
30 siblings, 1 reply; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
To resolve the dying memcg issue, we need to reparent the LRU folios of a
child memcg to its parent memcg. This could corrupt non-hierarchical
stats.
As Yosry Ahmed pointed out:
```
In short, if memory is charged to a dying cgroup at the time of
reparenting, when the memory gets uncharged the stats updates will occur
at the parent. This will update both hierarchical and non-hierarchical
stats of the parent, which would corrupt the parent's non-hierarchical
stats (because those counters were never incremented when the memory was
charged).
```
Now we have the following two types of non-hierarchical stats, and they
are only used in CONFIG_MEMCG_V1:
a. memcg->vmstats->state_local[i]
b. pn->lruvec_stats->state_local[i]
To keep these non-hierarchical stats working properly, we need to
reparent them after reparenting LRU folios. To this end, this commit
makes the following preparations:
1. implement reparent_state_local() to reparent non-hierarchical stats
2. make css_killed_work_fn() be called from an rcu work, and implement
   get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid a
   race between mod_memcg_state()/mod_memcg_lruvec_state() and
   reparent_state_local() (see the sketch after this list)
3. change these non-hierarchical stats to the atomic_long_t type to avoid
   a race between mem_cgroup_stat_aggregate() and reparent_state_local()
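A minimal sketch of the update-side pattern from item 2 (helper names are
from this patch; the body is condensed for illustration):

	rcu_read_lock();
	/*
	 * css_killed_work_fn() now runs as an rcu work, so the reparenting
	 * of state_local cannot start until updaters that saw the memcg as
	 * non-dying have left their RCU read-side sections; a memcg seen
	 * as dying here must be skipped in favor of a live ancestor.
	 */
	while (memcg_is_dying(memcg))
		memcg = parent_mem_cgroup(memcg);
	/* ... update vmstats_percpu / state_local on the live memcg ... */
	rcu_read_unlock();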
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/memcontrol.h | 4 ++
kernel/cgroup/cgroup.c | 8 +--
mm/memcontrol-v1.c | 16 ++++++
mm/memcontrol-v1.h | 3 +
mm/memcontrol.c | 113 ++++++++++++++++++++++++++++++++++---
5 files changed, 132 insertions(+), 12 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3970c102fe741..a4f6ab7eb98d6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -957,12 +957,16 @@ static inline void mod_memcg_page_state(struct page *page,
unsigned long memcg_events(struct mem_cgroup *memcg, int event);
unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
+void reparent_memcg_state_local(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent, int idx);
unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
bool memcg_stat_item_valid(int idx);
bool memcg_vm_event_item_valid(enum vm_event_item idx);
unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx);
+void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent, int idx);
void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 94788bd1fdf0e..dbf94a77018e6 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6043,8 +6043,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
*/
static void css_killed_work_fn(struct work_struct *work)
{
- struct cgroup_subsys_state *css =
- container_of(work, struct cgroup_subsys_state, destroy_work);
+ struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
+ struct cgroup_subsys_state, destroy_rwork);
cgroup_lock();
@@ -6065,8 +6065,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);
if (atomic_dec_and_test(&css->online_cnt)) {
- INIT_WORK(&css->destroy_work, css_killed_work_fn);
- queue_work(cgroup_offline_wq, &css->destroy_work);
+ INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn);
+ queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork);
}
}
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index c6078cd7f7e53..a427bb205763b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1887,6 +1887,22 @@ static const unsigned int memcg1_events[] = {
PGMAJFAULT,
};
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++)
+ reparent_memcg_state_local(memcg, parent, memcg1_stats[i]);
+}
+
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+ int i;
+
+ for (i = 0; i < NR_LRU_LISTS; i++)
+ reparent_memcg_lruvec_state_local(memcg, parent, i);
+}
+
void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
{
unsigned long memory, memsw;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index eb3c3c1056574..45528195d3578 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -41,6 +41,7 @@ static inline bool do_memsw_account(void)
unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
+void mod_memcg_page_state_local(struct mem_cgroup *memcg, int idx, unsigned long val);
unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
bool memcg1_alloc_events(struct mem_cgroup *memcg);
void memcg1_free_events(struct mem_cgroup *memcg);
@@ -73,6 +74,8 @@ void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
unsigned long nr_memory, int nid);
void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages);
static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c9b5dfd822d0a..e7d4e4ff411b6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -225,6 +225,26 @@ static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memc
return objcg;
}
+#ifdef CONFIG_MEMCG_V1
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force);
+
+static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return;
+
+ __mem_cgroup_flush_stats(memcg, true);
+
+ /* The following counts are all non-hierarchical and need to be reparented. */
+ reparent_memcg1_state_local(memcg, parent);
+ reparent_memcg1_lruvec_state_local(memcg, parent);
+}
+#else
+static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+}
+#endif
+
static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
{
spin_lock_irq(&objcg_lock);
@@ -407,7 +427,7 @@ struct lruvec_stats {
long state[NR_MEMCG_NODE_STAT_ITEMS];
/* Non-hierarchical (CPU aggregated) state */
- long state_local[NR_MEMCG_NODE_STAT_ITEMS];
+ atomic_long_t state_local[NR_MEMCG_NODE_STAT_ITEMS];
/* Pending child counts during tree propagation */
long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
@@ -450,7 +470,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return 0;
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = READ_ONCE(pn->lruvec_stats->state_local[i]);
+ x = atomic_long_read(&(pn->lruvec_stats->state_local[i]));
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -458,6 +478,27 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return x;
}
+void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent, int idx)
+{
+ int i = memcg_stats_index(idx);
+ int nid;
+
+ if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
+ return;
+
+ for_each_node(nid) {
+ struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
+ struct mem_cgroup_per_node *parent_pn;
+ unsigned long value = lruvec_page_state_local(child_lruvec, idx);
+
+ parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec);
+
+ atomic_long_add(value, &(parent_pn->lruvec_stats->state_local[i]));
+ }
+}
+
/* Subset of vm_event_item to report for memcg event stats */
static const unsigned int memcg_vm_event_stat[] = {
#ifdef CONFIG_MEMCG_V1
@@ -549,7 +590,7 @@ struct memcg_vmstats {
unsigned long events[NR_MEMCG_EVENTS];
/* Non-hierarchical (CPU aggregated) page state & events */
- long state_local[MEMCG_VMSTAT_SIZE];
+ atomic_long_t state_local[MEMCG_VMSTAT_SIZE];
unsigned long events_local[NR_MEMCG_EVENTS];
/* Pending child counts during tree propagation */
@@ -712,6 +753,42 @@ static int memcg_state_val_in_pages(int idx, int val)
return max(val * unit / PAGE_SIZE, 1UL);
}
+#ifdef CONFIG_MEMCG_V1
+/*
+ * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid a race
+ * with the reparenting of non-hierarchical state_local counters.
+ */
+static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
+{
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return memcg;
+
+ rcu_read_lock();
+
+ while (memcg_is_dying(memcg))
+ memcg = parent_mem_cgroup(memcg);
+
+ return memcg;
+}
+
+static inline void get_non_dying_memcg_end(void)
+{
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return;
+
+ rcu_read_unlock();
+}
+#else
+static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
+{
+ return memcg;
+}
+
+static inline void get_non_dying_memcg_end(void)
+{
+}
+#endif
+
/**
* mod_memcg_state - update cgroup memory statistics
* @memcg: the memory cgroup
@@ -750,13 +827,25 @@ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
return 0;
- x = READ_ONCE(memcg->vmstats->state_local[i]);
+ x = atomic_long_read(&(memcg->vmstats->state_local[i]));
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
#endif
return x;
}
+
+void reparent_memcg_state_local(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent, int idx)
+{
+ int i = memcg_stats_index(idx);
+ unsigned long value = memcg_page_state_local(memcg, idx);
+
+ if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
+ return;
+
+ atomic_long_add(value, &(parent->vmstats->state_local[i]));
+}
#endif
static void mod_memcg_lruvec_state(struct lruvec *lruvec,
@@ -4061,6 +4150,8 @@ struct aggregate_control {
long *aggregate;
/* pointer to the non-hierarchichal (CPU aggregated) counters */
long *local;
+ /* pointer to the atomic non-hierarchical (CPU aggregated) counters */
+ atomic_long_t *alocal;
/* pointer to the pending child counters during tree propagation */
long *pending;
/* pointer to the parent's pending counters, could be NULL */
@@ -4098,8 +4189,12 @@ static void mem_cgroup_stat_aggregate(struct aggregate_control *ac)
}
/* Aggregate counts on this level and propagate upwards */
- if (delta_cpu)
- ac->local[i] += delta_cpu;
+ if (delta_cpu) {
+ if (ac->local)
+ ac->local[i] += delta_cpu;
+ else if (ac->alocal)
+ atomic_long_add(delta_cpu, &(ac->alocal[i]));
+ }
if (delta) {
ac->aggregate[i] += delta;
@@ -4170,7 +4265,8 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
ac = (struct aggregate_control) {
.aggregate = memcg->vmstats->state,
- .local = memcg->vmstats->state_local,
+ .local = NULL,
+ .alocal = memcg->vmstats->state_local,
.pending = memcg->vmstats->state_pending,
.ppending = parent ? parent->vmstats->state_pending : NULL,
.cstat = statc->state,
@@ -4203,7 +4299,8 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
ac = (struct aggregate_control) {
.aggregate = lstats->state,
- .local = lstats->state_local,
+ .local = NULL,
+ .alocal = lstats->state_local,
.pending = lstats->state_pending,
.ppending = plstats ? plstats->state_pending : NULL,
.cstat = lstatc->state,
--
2.20.1
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (28 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-07 19:59 ` Usama Arif
2026-02-07 22:25 ` Shakeel Butt
2026-02-05 9:01 ` [PATCH v4 31/31] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
30 siblings, 2 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
Now that everything is set up, switch folio->memcg_data pointers to
objcgs, update the accessors, and execute reparenting on cgroup death.
Finally, folio->memcg_data of LRU folios and kmem folios will always
point to an object cgroup. The folio->memcg_data of slab folios will
point to a vector of object cgroups.
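As a sketch, decoding folio->memcg_data after this patch looks as follows
(this mirrors the reworked folio_memcg_check() in the diff below;
illustration only):

	unsigned long data = READ_ONCE(folio->memcg_data);
	struct obj_cgroup *objcg;

	if (data & MEMCG_DATA_OBJEXTS)
		return NULL;	/* slab: a vector of objcgs, not one objcg */

	/* LRU and kmem folios alike now store an objcg pointer. */
	objcg = (struct obj_cgroup *)(data & ~OBJEXTS_FLAGS_MASK);
	return objcg ? obj_cgroup_memcg(objcg) : NULL;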
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/memcontrol.h | 77 +++++---------
mm/memcontrol-v1.c | 15 +--
mm/memcontrol.c | 200 +++++++++++++++++++++++--------------
3 files changed, 159 insertions(+), 133 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a4f6ab7eb98d6..15eec4ee00c29 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -369,9 +369,6 @@ enum objext_flags {
#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
#ifdef CONFIG_MEMCG
-
-static inline bool folio_memcg_kmem(struct folio *folio);
-
/*
* After the initialization objcg->memcg is always pointing at
* a valid memcg, but can be atomically swapped to the parent memcg.
@@ -385,43 +382,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
}
/*
- * __folio_memcg - Get the memory cgroup associated with a non-kmem folio
- * @folio: Pointer to the folio.
- *
- * Returns a pointer to the memory cgroup associated with the folio,
- * or NULL. This function assumes that the folio is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * kmem folios.
- */
-static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
-{
- unsigned long memcg_data = folio->memcg_data;
-
- VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
-
- return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
-}
-
-/*
- * __folio_objcg - get the object cgroup associated with a kmem folio.
+ * folio_objcg - get the object cgroup associated with a folio.
* @folio: Pointer to the folio.
*
* Returns a pointer to the object cgroup associated with the folio,
* or NULL. This function assumes that the folio is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * LRU folios.
+ * proper object cgroup pointer.
*/
-static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
+static inline struct obj_cgroup *folio_objcg(struct folio *folio)
{
unsigned long memcg_data = folio->memcg_data;
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
- VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
@@ -435,21 +408,30 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
* proper memory cgroup pointer. It's not safe to call this function
* against some type of folios, e.g. slab folios or ex-slab folios.
*
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * For a folio any of the following ensures folio and objcg binding stability:
*
* - the folio lock
* - LRU isolation
* - exclusive reference
*
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * Based on the stable binding of folio and objcg, for a folio any of the
+ * following ensures folio and memcg binding stability:
+ *
+ * - cgroup_mutex
+ * - the lruvec lock
+ *
+ * If the caller only wants to ensure that the page counters of the memcg
+ * are updated correctly, the binding stability of folio and objcg is
+ * sufficient.
+ *
+ * Note: The caller should hold an rcu read lock or cgroup_mutex to protect
+ * memcg associated with a folio from being released.
*/
static inline struct mem_cgroup *folio_memcg(struct folio *folio)
{
- if (folio_memcg_kmem(folio))
- return obj_cgroup_memcg(__folio_objcg(folio));
- return __folio_memcg(folio);
+ struct obj_cgroup *objcg = folio_objcg(folio);
+
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
/*
@@ -473,15 +455,10 @@ static inline bool folio_memcg_charged(struct folio *folio)
* has an associated memory cgroup pointer or an object cgroups vector or
* an object cgroup.
*
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * See folio_memcg() for the folio->objcg/memcg binding rules.
*
- * - the folio lock
- * - LRU isolation
- * - exclusive reference
- *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * A caller should hold an rcu read lock to protect the memcg associated
+ * with a page from being released.
*/
static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
{
@@ -490,18 +467,14 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
* for slabs, READ_ONCE() should be used here.
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);
+ struct obj_cgroup *objcg;
if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;
- if (memcg_data & MEMCG_DATA_KMEM) {
- struct obj_cgroup *objcg;
-
- objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
- return obj_cgroup_memcg(objcg);
- }
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
- return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index a427bb205763b..401ba65470410 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -613,6 +613,7 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
void memcg1_swapout(struct folio *folio, swp_entry_t entry)
{
struct mem_cgroup *memcg, *swap_memcg;
+ struct obj_cgroup *objcg;
unsigned int nr_entries;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
@@ -624,12 +625,13 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
if (!do_memsw_account())
return;
- memcg = folio_memcg(folio);
-
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
+ objcg = folio_objcg(folio);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+ if (!objcg)
return;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
/*
* In case the memcg owning these pages has been offlined and doesn't
* have an ID allocated to it anymore, charge the closest online
@@ -647,7 +649,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
folio_unqueue_deferred_split(folio);
folio->memcg_data = 0;
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
page_counter_uncharge(&memcg->memory, nr_entries);
if (memcg != swap_memcg) {
@@ -668,7 +670,8 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
preempt_enable_nested();
memcg1_check_events(memcg, folio_nid(folio));
- css_put(&memcg->css);
+ rcu_read_unlock();
+ obj_cgroup_put(objcg);
}
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e7d4e4ff411b6..0e0efaa511d3d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -247,11 +247,25 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgr
static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
{
+ int nid, nest = 0;
+
spin_lock_irq(&objcg_lock);
+ for_each_node(nid) {
+ spin_lock_nested(&mem_cgroup_lruvec(memcg,
+ NODE_DATA(nid))->lru_lock, nest++);
+ spin_lock_nested(&mem_cgroup_lruvec(parent,
+ NODE_DATA(nid))->lru_lock, nest++);
+ }
}
static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
{
+ int nid;
+
+ for_each_node(nid) {
+ spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock);
+ spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock);
+ }
spin_unlock_irq(&objcg_lock);
}
@@ -260,12 +274,28 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
struct obj_cgroup *objcg;
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+retry:
+ if (lru_gen_enabled())
+ max_lru_gen_memcg(parent);
+
reparent_locks(memcg, parent);
+ if (lru_gen_enabled()) {
+ if (!recheck_lru_gen_max_memcg(parent)) {
+ reparent_unlocks(memcg, parent);
+ cond_resched();
+ goto retry;
+ }
+ lru_gen_reparent_memcg(memcg, parent);
+ } else {
+ lru_reparent_memcg(memcg, parent);
+ }
objcg = __memcg_reparent_objcgs(memcg, parent);
reparent_unlocks(memcg, parent);
+ reparent_state_local(memcg, parent);
+
percpu_ref_kill(&objcg->refcnt);
}
@@ -809,9 +839,14 @@ void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
cpu = get_cpu();
+ memcg = get_non_dying_memcg_start(memcg);
+
this_cpu_add(memcg->vmstats_percpu->state[i], val);
val = memcg_state_val_in_pages(idx, val);
memcg_rstat_updated(memcg, val, cpu);
+
+ get_non_dying_memcg_end();
+
trace_mod_memcg_state(memcg, idx, val);
put_cpu();
@@ -852,6 +887,7 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx,
int val)
{
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup_per_node *pn;
struct mem_cgroup *memcg;
int i = memcg_stats_index(idx);
@@ -865,14 +901,18 @@ static void mod_memcg_lruvec_state(struct lruvec *lruvec,
cpu = get_cpu();
+ memcg = get_non_dying_memcg_start(memcg);
+ pn = memcg->nodeinfo[pgdat->node_id];
+
/* Update memcg */
this_cpu_add(memcg->vmstats_percpu->state[i], val);
-
/* Update lruvec */
this_cpu_add(pn->lruvec_stats_percpu->state[i], val);
-
val = memcg_state_val_in_pages(idx, val);
memcg_rstat_updated(memcg, val, cpu);
+
+ get_non_dying_memcg_end();
+
trace_mod_memcg_lruvec_state(memcg, idx, val);
put_cpu();
@@ -1098,6 +1138,8 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
/**
* get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
* @folio: folio from which memcg should be extracted.
+ *
+ * See folio_memcg() for folio->objcg/memcg binding rules.
*/
struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
{
@@ -2711,17 +2753,17 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return try_charge_memcg(memcg, gfp_mask, nr_pages);
}
-static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
+static void commit_charge(struct folio *folio, struct obj_cgroup *objcg)
{
VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio);
/*
- * Any of the following ensures page's memcg stability:
+ * Any of the following ensures folio's objcg stability:
*
* - the page lock
* - LRU isolation
* - exclusive reference
*/
- folio->memcg_data = (unsigned long)memcg;
+ folio->memcg_data = (unsigned long)objcg;
}
#ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
@@ -2833,6 +2875,17 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
return NULL;
}
+static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+ struct obj_cgroup *objcg;
+
+ rcu_read_lock();
+ objcg = __get_obj_cgroup_from_memcg(memcg);
+ rcu_read_unlock();
+
+ return objcg;
+}
+
static struct obj_cgroup *current_objcg_update(void)
{
struct mem_cgroup *memcg;
@@ -2933,17 +2986,10 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
{
struct obj_cgroup *objcg;
- if (!memcg_kmem_online())
- return NULL;
-
- if (folio_memcg_kmem(folio)) {
- objcg = __folio_objcg(folio);
+ objcg = folio_objcg(folio);
+ if (objcg)
obj_cgroup_get(objcg);
- } else {
- rcu_read_lock();
- objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
- rcu_read_unlock();
- }
+
return objcg;
}
@@ -3459,7 +3505,7 @@ void folio_split_memcg_refs(struct folio *folio, unsigned old_order,
return;
new_refs = (1 << (old_order - new_order)) - 1;
- css_get_many(&__folio_memcg(folio)->css, new_refs);
+ obj_cgroup_get_many(folio_objcg(folio), new_refs);
}
static void memcg_online_kmem(struct mem_cgroup *memcg)
@@ -4890,16 +4936,20 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
- int ret;
-
- ret = try_charge(memcg, gfp, folio_nr_pages(folio));
- if (ret)
- goto out;
+ int ret = 0;
+ struct obj_cgroup *objcg;
- css_get(&memcg->css);
- commit_charge(folio, memcg);
+ objcg = get_obj_cgroup_from_memcg(memcg);
+ /* Do not account at the root objcg level. */
+ if (!obj_cgroup_is_root(objcg))
+ ret = try_charge(memcg, gfp, folio_nr_pages(folio));
+ if (ret) {
+ obj_cgroup_put(objcg);
+ return ret;
+ }
+ commit_charge(folio, objcg);
memcg1_commit_charge(folio, memcg);
-out:
+
return ret;
}
@@ -4985,7 +5035,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
}
struct uncharge_gather {
- struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
unsigned long nr_memory;
unsigned long pgpgout;
unsigned long nr_kmem;
@@ -4999,58 +5049,52 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)
static void uncharge_batch(const struct uncharge_gather *ug)
{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(ug->objcg);
if (ug->nr_memory) {
- memcg_uncharge(ug->memcg, ug->nr_memory);
+ memcg_uncharge(memcg, ug->nr_memory);
if (ug->nr_kmem) {
- mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem);
- memcg1_account_kmem(ug->memcg, -ug->nr_kmem);
+ mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem);
+ memcg1_account_kmem(memcg, -ug->nr_kmem);
}
- memcg1_oom_recover(ug->memcg);
+ memcg1_oom_recover(memcg);
}
- memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+ memcg1_uncharge_batch(memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+ rcu_read_unlock();
/* drop reference from uncharge_folio */
- css_put(&ug->memcg->css);
+ obj_cgroup_put(ug->objcg);
}
static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
{
long nr_pages;
- struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
/*
* Nobody should be changing or seriously looking at
- * folio memcg or objcg at this point, we have fully
- * exclusive access to the folio.
+ * folio objcg at this point, we have fully exclusive
+ * access to the folio.
*/
- if (folio_memcg_kmem(folio)) {
- objcg = __folio_objcg(folio);
- /*
- * This get matches the put at the end of the function and
- * kmem pages do not hold memcg references anymore.
- */
- memcg = get_mem_cgroup_from_objcg(objcg);
- } else {
- memcg = __folio_memcg(folio);
- }
-
- if (!memcg)
+ objcg = folio_objcg(folio);
+ if (!objcg)
return;
- if (ug->memcg != memcg) {
- if (ug->memcg) {
+ if (ug->objcg != objcg) {
+ if (ug->objcg) {
uncharge_batch(ug);
uncharge_gather_clear(ug);
}
- ug->memcg = memcg;
+ ug->objcg = objcg;
ug->nid = folio_nid(folio);
- /* pairs with css_put in uncharge_batch */
- css_get(&memcg->css);
+ /* pairs with obj_cgroup_put in uncharge_batch */
+ obj_cgroup_get(objcg);
}
nr_pages = folio_nr_pages(folio);
@@ -5058,20 +5102,17 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
if (folio_memcg_kmem(folio)) {
ug->nr_memory += nr_pages;
ug->nr_kmem += nr_pages;
-
- folio->memcg_data = 0;
- obj_cgroup_put(objcg);
} else {
/* LRU pages aren't accounted at the root level */
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
ug->nr_memory += nr_pages;
ug->pgpgout++;
WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
- folio->memcg_data = 0;
}
- css_put(&memcg->css);
+ folio->memcg_data = 0;
+ obj_cgroup_put(objcg);
}
void __mem_cgroup_uncharge(struct folio *folio)
@@ -5095,7 +5136,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
uncharge_gather_clear(&ug);
for (i = 0; i < folios->nr; i++)
uncharge_folio(folios->folios[i], &ug);
- if (ug.memcg)
+ if (ug.objcg)
uncharge_batch(&ug);
}
@@ -5112,6 +5153,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
{
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
long nr_pages = folio_nr_pages(new);
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
@@ -5126,21 +5168,24 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
if (folio_memcg_charged(new))
return;
- memcg = folio_memcg(old);
- VM_WARN_ON_ONCE_FOLIO(!memcg, old);
- if (!memcg)
+ objcg = folio_objcg(old);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, old);
+ if (!objcg)
return;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
/* Force-charge the new page. The old one will be freed soon */
- if (!mem_cgroup_is_root(memcg)) {
+ if (!obj_cgroup_is_root(objcg)) {
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
}
- css_get(&memcg->css);
- commit_charge(new, memcg);
+ obj_cgroup_get(objcg);
+ commit_charge(new, objcg);
memcg1_commit_charge(new, memcg);
+ rcu_read_unlock();
}
/**
@@ -5156,7 +5201,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
*/
void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
- struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
@@ -5167,18 +5212,18 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
if (mem_cgroup_disabled())
return;
- memcg = folio_memcg(old);
+ objcg = folio_objcg(old);
/*
- * Note that it is normal to see !memcg for a hugetlb folio.
+ * Note that it is normal to see !objcg for a hugetlb folio.
* For e.g, it could have been allocated when memory_hugetlb_accounting
* was not selected.
*/
- VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
- if (!memcg)
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !objcg, old);
+ if (!objcg)
return;
- /* Transfer the charge and the css ref */
- commit_charge(new, memcg);
+ /* Transfer the charge and the objcg ref */
+ commit_charge(new, objcg);
/* Warning should never happen, so don't worry about refcount non-0 */
WARN_ON_ONCE(folio_unqueue_deferred_split(old));
@@ -5361,22 +5406,27 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
unsigned int nr_pages = folio_nr_pages(folio);
struct page_counter *counter;
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
if (do_memsw_account())
return 0;
- memcg = folio_memcg(folio);
-
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
+ objcg = folio_objcg(folio);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+ if (!objcg)
return 0;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
if (!entry.val) {
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
+ rcu_read_unlock();
return 0;
}
memcg = mem_cgroup_private_id_get_online(memcg);
+ /* memcg is pinned by the memcg ID. */
+ rcu_read_unlock();
if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
--
2.20.1
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v4 31/31] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
` (29 preceding siblings ...)
2026-02-05 9:01 ` [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
@ 2026-02-05 9:01 ` Qi Zheng
2026-02-07 22:26 ` Shakeel Butt
30 siblings, 1 reply; 50+ messages in thread
From: Qi Zheng @ 2026-02-05 9:01 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
We must ensure the folio is deleted from or added to the correct lruvec
list, so add VM_WARN_ON_ONCE_FOLIO() to the lru maintenance helpers to
catch invalid users. The VM_BUG_ON_FOLIO() in move_folios_to_lru() can be
removed, as lruvec_add_folio() now performs the necessary check.
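In sketch form, the condition the new assertions enforce (this mirrors
folio_matches_lruvec(); illustration only):

	/* The folio must belong to this lruvec's node and memcg. */
	return lruvec_pgdat(lruvec) == folio_pgdat(folio) &&
	       lruvec_memcg(lruvec) == folio_memcg(folio);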
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/mm_inline.h | 6 ++++++
mm/vmscan.c | 1 -
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fa2d6ba811b53..ad50688d89dba 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -342,6 +342,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_add_folio(lruvec, folio, false))
return;
@@ -356,6 +358,8 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_add_folio(lruvec, folio, true))
return;
@@ -370,6 +374,8 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_del_folio(lruvec, folio, false))
return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4dc52b4b4af50..5f69b252b403e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1921,7 +1921,6 @@ static unsigned int move_folios_to_lru(struct list_head *list)
continue;
}
- VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
nr_moved += nr_pages;
--
2.20.1
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2026-02-05 9:01 ` [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for " Qi Zheng
@ 2026-02-05 15:02 ` kernel test robot
2026-02-05 15:02 ` kernel test robot
1 sibling, 0 replies; 50+ messages in thread
From: kernel test robot @ 2026-02-05 15:02 UTC (permalink / raw)
To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
muchun.song, david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: oe-kbuild-all, linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
Hi Qi,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20260204]
[cannot apply to akpm-mm/mm-everything brauner-vfs/vfs.all trace/for-next tj-cgroup/for-next linus/master dennis-percpu/for-next v6.19-rc8 v6.19-rc7 v6.19-rc6 v6.19-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-memcontrol-remove-dead-code-of-checking-parent-memory-cgroup/20260205-170812
base: next-20260204
patch link: https://lore.kernel.org/r/e27edb311dda624751cb41860237f290de8c16ae.1770279888.git.zhengqi.arch%40bytedance.com
patch subject: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
config: xtensa-allnoconfig (https://download.01.org/0day-ci/archive/20260205/202602052247.VAoEwR9g-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260205/202602052247.VAoEwR9g-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602052247.VAoEwR9g-lkp@intel.com/
All errors (new ones prefixed by >>):
>> xtensa-linux-ld: mm/swap.o:(.text+0x40): undefined reference to `lruvec_unlock_irqrestore'
xtensa-linux-ld: mm/swap.o: in function `__page_cache_release.part.0':
swap.c:(.text+0x60): undefined reference to `lruvec_unlock_irqrestore'
xtensa-linux-ld: mm/swap.o: in function `__folio_put':
swap.c:(.text+0x1a6): undefined reference to `lruvec_unlock_irqrestore'
xtensa-linux-ld: mm/swap.o: in function `folios_put_refs':
swap.c:(.text+0x276): undefined reference to `lruvec_unlock_irqrestore'
xtensa-linux-ld: mm/swap.o: in function `folio_batch_move_lru':
swap.c:(.text+0x31a): undefined reference to `lruvec_unlock_irqrestore'
xtensa-linux-ld: mm/swap.o:swap.c:(.text+0x354): more undefined references to `lruvec_unlock_irqrestore' follow
xtensa-linux-ld: mm/swap.o: in function `lru_note_cost_refault':
swap.c:(.text+0x1340): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: mm/swap.o: in function `folio_activate':
swap.c:(.text+0x136c): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: mm/vmscan.o: in function `check_move_unevictable_folios':
vmscan.c:(.text+0x762): undefined reference to `lruvec_unlock_irq'
>> xtensa-linux-ld: vmscan.c:(.text+0x972): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: mm/vmscan.o: in function `move_folios_to_lru':
vmscan.c:(.text+0xa22): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: mm/vmscan.o:vmscan.c:(.text+0xa58): more undefined references to `lruvec_unlock_irq' follow
xtensa-linux-ld: mm/vmscan.o: in function `move_folios_to_lru':
vmscan.c:(.text+0xc40): undefined reference to `lruvec_lock_irq'
xtensa-linux-ld: mm/vmscan.o: in function `shrink_active_list':
vmscan.c:(.text+0xc9b): undefined reference to `lruvec_lock_irq'
xtensa-linux-ld: vmscan.c:(.text+0xcfc): undefined reference to `lruvec_unlock_irq'
>> xtensa-linux-ld: vmscan.c:(.text+0xe39): undefined reference to `lruvec_lock_irq'
xtensa-linux-ld: mm/vmscan.o: in function `shrink_inactive_list':
vmscan.c:(.text+0x1d32): undefined reference to `lruvec_lock_irq'
xtensa-linux-ld: vmscan.c:(.text+0x1e46): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: vmscan.c:(.text+0x1f3a): undefined reference to `lruvec_lock_irq'
xtensa-linux-ld: mm/vmscan.o: in function `folio_isolate_lru':
vmscan.c:(.text+0x2eb4): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: mm/mlock.o: in function `__munlock_folio':
mlock.c:(.text+0x5cc): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: mm/mlock.o: in function `__mlock_folio':
mlock.c:(.text+0x8b3): undefined reference to `lruvec_unlock_irq'
xtensa-linux-ld: mm/mlock.o: in function `mlock_folio_batch.constprop.0':
mlock.c:(.text+0xd15): undefined reference to `lruvec_unlock_irq'
>> xtensa-linux-ld: mlock.c:(.text+0xe70): undefined reference to `lruvec_unlock_irq'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2026-02-05 9:01 ` [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for " Qi Zheng
2026-02-05 15:02 ` kernel test robot
@ 2026-02-05 15:02 ` kernel test robot
2026-02-06 6:13 ` Qi Zheng
1 sibling, 1 reply; 50+ messages in thread
From: kernel test robot @ 2026-02-05 15:02 UTC (permalink / raw)
To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
muchun.song, david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: oe-kbuild-all, linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
Hi Qi,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20260204]
[cannot apply to akpm-mm/mm-everything brauner-vfs/vfs.all trace/for-next tj-cgroup/for-next linus/master dennis-percpu/for-next v6.19-rc8 v6.19-rc7 v6.19-rc6 v6.19-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-memcontrol-remove-dead-code-of-checking-parent-memory-cgroup/20260205-170812
base: next-20260204
patch link: https://lore.kernel.org/r/e27edb311dda624751cb41860237f290de8c16ae.1770279888.git.zhengqi.arch%40bytedance.com
patch subject: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260205/202602052203.U8hxsh2N-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260205/202602052203.U8hxsh2N-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602052203.U8hxsh2N-lkp@intel.com/
All errors (new ones prefixed by >>):
nios2-linux-ld: mm/swap.o: in function `__page_cache_release.part.0':
swap.c:(.text+0x4c): undefined reference to `lruvec_unlock_irqrestore'
>> swap.c:(.text+0x4c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
nios2-linux-ld: mm/swap.o: in function `__folio_put':
swap.c:(.text+0x2ac): undefined reference to `lruvec_unlock_irqrestore'
swap.c:(.text+0x2ac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
nios2-linux-ld: mm/swap.o: in function `folios_put_refs':
swap.c:(.text+0x384): undefined reference to `lruvec_unlock_irqrestore'
swap.c:(.text+0x384): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
nios2-linux-ld: mm/swap.o: in function `folio_batch_move_lru':
swap.c:(.text+0x4ac): undefined reference to `lruvec_unlock_irqrestore'
swap.c:(.text+0x4ac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
>> nios2-linux-ld: swap.c:(.text+0x50c): undefined reference to `lruvec_unlock_irqrestore'
swap.c:(.text+0x50c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
nios2-linux-ld: mm/swap.o: in function `folio_activate':
swap.c:(.text+0x21d8): undefined reference to `lruvec_unlock_irq'
>> swap.c:(.text+0x21d8): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
nios2-linux-ld: mm/vmscan.o: in function `move_folios_to_lru':
vmscan.c:(.text+0xa4c): undefined reference to `lruvec_unlock_irq'
>> vmscan.c:(.text+0xa4c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
>> nios2-linux-ld: vmscan.c:(.text+0xaa4): undefined reference to `lruvec_unlock_irq'
vmscan.c:(.text+0xaa4): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
nios2-linux-ld: vmscan.c:(.text+0xcac): undefined reference to `lruvec_unlock_irq'
vmscan.c:(.text+0xcac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
nios2-linux-ld: vmscan.c:(.text+0xd8c): undefined reference to `lruvec_unlock_irq'
vmscan.c:(.text+0xd8c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
nios2-linux-ld: mm/vmscan.o: in function `shrink_active_list':
vmscan.c:(.text+0x11e0): undefined reference to `lruvec_lock_irq'
vmscan.c:(.text+0x11e0): additional relocation overflows omitted from the output
nios2-linux-ld: vmscan.c:(.text+0x12a0): undefined reference to `lruvec_unlock_irq'
>> nios2-linux-ld: vmscan.c:(.text+0x144c): undefined reference to `lruvec_lock_irq'
nios2-linux-ld: mm/vmscan.o: in function `check_move_unevictable_folios':
vmscan.c:(.text+0x15d4): undefined reference to `lruvec_unlock_irq'
nios2-linux-ld: vmscan.c:(.text+0x1958): undefined reference to `lruvec_unlock_irq'
nios2-linux-ld: mm/vmscan.o: in function `shrink_inactive_list':
vmscan.c:(.text+0x2f8c): undefined reference to `lruvec_lock_irq'
nios2-linux-ld: vmscan.c:(.text+0x307c): undefined reference to `lruvec_unlock_irq'
nios2-linux-ld: vmscan.c:(.text+0x31dc): undefined reference to `lruvec_lock_irq'
nios2-linux-ld: mm/vmscan.o: in function `folio_isolate_lru':
vmscan.c:(.text+0x48a4): undefined reference to `lruvec_unlock_irq'
nios2-linux-ld: mm/mlock.o: in function `__munlock_folio':
mlock.c:(.text+0x968): undefined reference to `lruvec_unlock_irq'
nios2-linux-ld: mm/mlock.o: in function `__mlock_folio':
mlock.c:(.text+0xe5c): undefined reference to `lruvec_unlock_irq'
nios2-linux-ld: mm/mlock.o: in function `mlock_folio_batch.constprop.0':
mlock.c:(.text+0x158c): undefined reference to `lruvec_unlock_irq'
>> nios2-linux-ld: mlock.c:(.text+0x1808): undefined reference to `lruvec_unlock_irq'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2026-02-05 15:02 ` kernel test robot
@ 2026-02-06 6:13 ` Qi Zheng
2026-02-06 23:34 ` Shakeel Butt
0 siblings, 1 reply; 50+ messages in thread
From: Qi Zheng @ 2026-02-06 6:13 UTC (permalink / raw)
To: kernel test robot, hannes, hughd, mhocko, roman.gushchin,
shakeel.butt, muchun.song, david, lorenzo.stoakes, ziy,
harry.yoo, yosry.ahmed, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, chenridong, mkoutny, akpm,
hamzamahfooz, apais, lance.yang, bhe
Cc: oe-kbuild-all, linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
On 2/5/26 11:02 PM, kernel test robot wrote:
> Hi Qi,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on next-20260204]
> [cannot apply to akpm-mm/mm-everything brauner-vfs/vfs.all trace/for-next tj-cgroup/for-next linus/master dennis-percpu/for-next v6.19-rc8 v6.19-rc7 v6.19-rc6 v6.19-rc8]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-memcontrol-remove-dead-code-of-checking-parent-memory-cgroup/20260205-170812
> base: next-20260204
> patch link: https://lore.kernel.org/r/e27edb311dda624751cb41860237f290de8c16ae.1770279888.git.zhengqi.arch%40bytedance.com
> patch subject: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
> config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260205/202602052203.U8hxsh2N-lkp@intel.com/config)
> compiler: nios2-linux-gcc (GCC) 11.5.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260205/202602052203.U8hxsh2N-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202602052203.U8hxsh2N-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> nios2-linux-ld: mm/swap.o: in function `__page_cache_release.part.0':
> swap.c:(.text+0x4c): undefined reference to `lruvec_unlock_irqrestore'
>>> swap.c:(.text+0x4c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> nios2-linux-ld: mm/swap.o: in function `__folio_put':
> swap.c:(.text+0x2ac): undefined reference to `lruvec_unlock_irqrestore'
> swap.c:(.text+0x2ac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> nios2-linux-ld: mm/swap.o: in function `folios_put_refs':
> swap.c:(.text+0x384): undefined reference to `lruvec_unlock_irqrestore'
> swap.c:(.text+0x384): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> nios2-linux-ld: mm/swap.o: in function `folio_batch_move_lru':
> swap.c:(.text+0x4ac): undefined reference to `lruvec_unlock_irqrestore'
> swap.c:(.text+0x4ac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
>>> nios2-linux-ld: swap.c:(.text+0x50c): undefined reference to `lruvec_unlock_irqrestore'
> swap.c:(.text+0x50c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> nios2-linux-ld: mm/swap.o: in function `folio_activate':
> swap.c:(.text+0x21d8): undefined reference to `lruvec_unlock_irq'
>>> swap.c:(.text+0x21d8): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> nios2-linux-ld: mm/vmscan.o: in function `move_folios_to_lru':
> vmscan.c:(.text+0xa4c): undefined reference to `lruvec_unlock_irq'
>>> vmscan.c:(.text+0xa4c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
>>> nios2-linux-ld: vmscan.c:(.text+0xaa4): undefined reference to `lruvec_unlock_irq'
> vmscan.c:(.text+0xaa4): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> nios2-linux-ld: vmscan.c:(.text+0xcac): undefined reference to `lruvec_unlock_irq'
> vmscan.c:(.text+0xcac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> nios2-linux-ld: vmscan.c:(.text+0xd8c): undefined reference to `lruvec_unlock_irq'
> vmscan.c:(.text+0xd8c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> nios2-linux-ld: mm/vmscan.o: in function `shrink_active_list':
> vmscan.c:(.text+0x11e0): undefined reference to `lruvec_lock_irq'
> vmscan.c:(.text+0x11e0): additional relocation overflows omitted from the output
> nios2-linux-ld: vmscan.c:(.text+0x12a0): undefined reference to `lruvec_unlock_irq'
>>> nios2-linux-ld: vmscan.c:(.text+0x144c): undefined reference to `lruvec_lock_irq'
> nios2-linux-ld: mm/vmscan.o: in function `check_move_unevictable_folios':
> vmscan.c:(.text+0x15d4): undefined reference to `lruvec_unlock_irq'
> nios2-linux-ld: vmscan.c:(.text+0x1958): undefined reference to `lruvec_unlock_irq'
> nios2-linux-ld: mm/vmscan.o: in function `shrink_inactive_list':
> vmscan.c:(.text+0x2f8c): undefined reference to `lruvec_lock_irq'
> nios2-linux-ld: vmscan.c:(.text+0x307c): undefined reference to `lruvec_unlock_irq'
> nios2-linux-ld: vmscan.c:(.text+0x31dc): undefined reference to `lruvec_lock_irq'
> nios2-linux-ld: mm/vmscan.o: in function `folio_isolate_lru':
> vmscan.c:(.text+0x48a4): undefined reference to `lruvec_unlock_irq'
> nios2-linux-ld: mm/mlock.o: in function `__munlock_folio':
> mlock.c:(.text+0x968): undefined reference to `lruvec_unlock_irq'
> nios2-linux-ld: mm/mlock.o: in function `__mlock_folio':
> mlock.c:(.text+0xe5c): undefined reference to `lruvec_unlock_irq'
> nios2-linux-ld: mm/mlock.o: in function `mlock_folio_batch.constprop.0':
> mlock.c:(.text+0x158c): undefined reference to `lruvec_unlock_irq'
>>> nios2-linux-ld: mlock.c:(.text+0x1808): undefined reference to `lruvec_unlock_irq'
Ouch, I moved lruvec_lock_irq() and its friends to memcontrol.c to fix
the compilation errors related to __acquires/__releases, but I forgot
that memcontrol.c is only compiled under CONFIG_MEMCG.
Hi Shakeel, for simplicity, perhaps keeping lruvec_lock_irq() and its
friends in memcontrol.h and dropping __acquires/__releases would be a
better option?
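Something like this minimal sketch (untested; it assumes the helpers
stay thin wrappers around lruvec->lru_lock and simply drops the sparse
annotations):

static inline void lruvec_lock_irq(struct lruvec *lruvec)
{
	spin_lock_irq(&lruvec->lru_lock);
}

static inline void lruvec_unlock_irq(struct lruvec *lruvec)
{
	spin_unlock_irq(&lruvec->lru_lock);
}

static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
					    unsigned long flags)
{
	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}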
Thanks,
Qi
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2026-02-06 6:13 ` Qi Zheng
@ 2026-02-06 23:34 ` Shakeel Butt
0 siblings, 0 replies; 50+ messages in thread
From: Shakeel Butt @ 2026-02-06 23:34 UTC (permalink / raw)
To: Qi Zheng
Cc: kernel test robot, hannes, hughd, mhocko, roman.gushchin,
muchun.song, david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe,
oe-kbuild-all, linux-mm, linux-kernel, cgroups, Muchun Song,
Qi Zheng
On Fri, Feb 06, 2026 at 02:13:32PM +0800, Qi Zheng wrote:
>
>
> On 2/5/26 11:02 PM, kernel test robot wrote:
> > Hi Qi,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on next-20260204]
> > [cannot apply to akpm-mm/mm-everything brauner-vfs/vfs.all trace/for-next tj-cgroup/for-next linus/master dennis-percpu/for-next v6.19-rc8 v6.19-rc7 v6.19-rc6 v6.19-rc8]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-memcontrol-remove-dead-code-of-checking-parent-memory-cgroup/20260205-170812
> > base: next-20260204
> > patch link: https://lore.kernel.org/r/e27edb311dda624751cb41860237f290de8c16ae.1770279888.git.zhengqi.arch%40bytedance.com
> > patch subject: [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
> > config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260205/202602052203.U8hxsh2N-lkp@intel.com/config)
> > compiler: nios2-linux-gcc (GCC) 11.5.0
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260205/202602052203.U8hxsh2N-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202602052203.U8hxsh2N-lkp@intel.com/
> >
> > All errors (new ones prefixed by >>):
> >
> > nios2-linux-ld: mm/swap.o: in function `__page_cache_release.part.0':
> > swap.c:(.text+0x4c): undefined reference to `lruvec_unlock_irqrestore'
> > > > swap.c:(.text+0x4c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> > nios2-linux-ld: mm/swap.o: in function `__folio_put':
> > swap.c:(.text+0x2ac): undefined reference to `lruvec_unlock_irqrestore'
> > swap.c:(.text+0x2ac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> > nios2-linux-ld: mm/swap.o: in function `folios_put_refs':
> > swap.c:(.text+0x384): undefined reference to `lruvec_unlock_irqrestore'
> > swap.c:(.text+0x384): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> > nios2-linux-ld: mm/swap.o: in function `folio_batch_move_lru':
> > swap.c:(.text+0x4ac): undefined reference to `lruvec_unlock_irqrestore'
> > swap.c:(.text+0x4ac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> > > > nios2-linux-ld: swap.c:(.text+0x50c): undefined reference to `lruvec_unlock_irqrestore'
> > swap.c:(.text+0x50c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irqrestore'
> > nios2-linux-ld: mm/swap.o: in function `folio_activate':
> > swap.c:(.text+0x21d8): undefined reference to `lruvec_unlock_irq'
> > > > swap.c:(.text+0x21d8): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> > nios2-linux-ld: mm/vmscan.o: in function `move_folios_to_lru':
> > vmscan.c:(.text+0xa4c): undefined reference to `lruvec_unlock_irq'
> > > > vmscan.c:(.text+0xa4c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> > > > nios2-linux-ld: vmscan.c:(.text+0xaa4): undefined reference to `lruvec_unlock_irq'
> > vmscan.c:(.text+0xaa4): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> > nios2-linux-ld: vmscan.c:(.text+0xcac): undefined reference to `lruvec_unlock_irq'
> > vmscan.c:(.text+0xcac): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> > nios2-linux-ld: vmscan.c:(.text+0xd8c): undefined reference to `lruvec_unlock_irq'
> > vmscan.c:(.text+0xd8c): relocation truncated to fit: R_NIOS2_CALL26 against `lruvec_unlock_irq'
> > nios2-linux-ld: mm/vmscan.o: in function `shrink_active_list':
> > vmscan.c:(.text+0x11e0): undefined reference to `lruvec_lock_irq'
> > vmscan.c:(.text+0x11e0): additional relocation overflows omitted from the output
> > nios2-linux-ld: vmscan.c:(.text+0x12a0): undefined reference to `lruvec_unlock_irq'
> > > > nios2-linux-ld: vmscan.c:(.text+0x144c): undefined reference to `lruvec_lock_irq'
> > nios2-linux-ld: mm/vmscan.o: in function `check_move_unevictable_folios':
> > vmscan.c:(.text+0x15d4): undefined reference to `lruvec_unlock_irq'
> > nios2-linux-ld: vmscan.c:(.text+0x1958): undefined reference to `lruvec_unlock_irq'
> > nios2-linux-ld: mm/vmscan.o: in function `shrink_inactive_list':
> > vmscan.c:(.text+0x2f8c): undefined reference to `lruvec_lock_irq'
> > nios2-linux-ld: vmscan.c:(.text+0x307c): undefined reference to `lruvec_unlock_irq'
> > nios2-linux-ld: vmscan.c:(.text+0x31dc): undefined reference to `lruvec_lock_irq'
> > nios2-linux-ld: mm/vmscan.o: in function `folio_isolate_lru':
> > vmscan.c:(.text+0x48a4): undefined reference to `lruvec_unlock_irq'
> > nios2-linux-ld: mm/mlock.o: in function `__munlock_folio':
> > mlock.c:(.text+0x968): undefined reference to `lruvec_unlock_irq'
> > nios2-linux-ld: mm/mlock.o: in function `__mlock_folio':
> > mlock.c:(.text+0xe5c): undefined reference to `lruvec_unlock_irq'
> > nios2-linux-ld: mm/mlock.o: in function `mlock_folio_batch.constprop.0':
> > mlock.c:(.text+0x158c): undefined reference to `lruvec_unlock_irq'
> > > > nios2-linux-ld: mlock.c:(.text+0x1808): undefined reference to `lruvec_unlock_irq'
>
> Ouch, I moved lruvec_lock_irq() and its friends to memcontrol.c to fix
> the compilation errors related to __acquires/__releases, but I forgot
> that memcontrol.c is only compiled under CONFIG_MEMCG.
>
> Hi Shakeel, for simplicity, perhaps keeping lruvec_lock_irq() and its
> friends in memcontrol.h and dropping __acquires/__releases would be a
> better option?
Yes, let's proceed with that for now. We can always improve this later.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 25/31] mm: vmscan: prepare for reparenting traditional LRU folios
2026-02-05 9:01 ` [PATCH v4 25/31] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
@ 2026-02-07 1:28 ` Shakeel Butt
0 siblings, 0 replies; 50+ messages in thread
From: Shakeel Butt @ 2026-02-07 1:28 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
On Thu, Feb 05, 2026 at 05:01:44PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> To resolve the dying memcg issue, we need to reparent LRU folios of child
> memcg to its parent memcg. For traditional LRU list, each lruvec of every
> memcg comprises four LRU lists. Due to the symmetry of the LRU lists, it
> is feasible to transfer the LRU lists from a memcg to its parent memcg
> during the reparenting process.
>
> This commit implements the specific function, which will be used during
> the reparenting process.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Muchun Song <muchun.song@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
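As an illustration of the symmetry argument in the commit message, a
minimal sketch of such a per-node transfer (hypothetical helper name;
both lru_locks are assumed to be held and statistics updates are
elided):

static void lru_lists_transfer(struct lruvec *child, struct lruvec *parent)
{
	enum lru_list lru;

	/* The lists are symmetric: splice each child list onto the
	 * parent list of the same type. */
	for_each_lru(lru)
		list_splice_tail_init(&child->lists[lru], &parent->lists[lru]);
}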
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages
2026-02-05 9:01 ` [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages Qi Zheng
@ 2026-02-07 1:48 ` Shakeel Butt
2026-02-07 3:59 ` Muchun Song
1 sibling, 0 replies; 50+ messages in thread
From: Shakeel Butt @ 2026-02-07 1:48 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
On Thu, Feb 05, 2026 at 05:01:47PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> For cgroup v2, count_shadow_nodes() is the only place to read
> non-hierarchical stats (lruvec_stats->state_local). To avoid the need to
> consider cgroup v2 during subsequent non-hierarchical stats reparenting,
> use lruvec_lru_size() instead of lruvec_page_state_local() to get the
> number of lru pages.
>
> For NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B cases, it appears
> that the statistics here have already been problematic for a while since
> slab pages have been reparented. So just ignore it for now.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
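A minimal sketch of the substitution described in the commit message
(loop shape assumed; see the actual patch for the exact code):

	/* count_shadow_nodes(): sum the LRU sizes instead of reading
	 * the non-hierarchical lruvec_stats->state_local counters. */
	unsigned long pages = 0;
	enum lru_list lru;

	for_each_lru(lru)
		pages += lruvec_lru_size(lruvec, lru, MAX_NR_ZONES);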
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats
2026-02-05 9:01 ` [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats Qi Zheng
@ 2026-02-07 2:19 ` Shakeel Butt
2026-02-10 6:47 ` Qi Zheng
0 siblings, 1 reply; 50+ messages in thread
From: Shakeel Butt @ 2026-02-07 2:19 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
On Thu, Feb 05, 2026 at 05:01:48PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> To resolve the dying memcg issue, we need to reparent LRU folios of child
> memcg to its parent memcg. This could cause problems for non-hierarchical
> stats.
>
> As Yosry Ahmed pointed out:
>
> ```
> In short, if memory is charged to a dying cgroup at the time of
> reparenting, when the memory gets uncharged the stats updates will occur
> at the parent. This will update both hierarchical and non-hierarchical
> stats of the parent, which would corrupt the parent's non-hierarchical
> stats (because those counters were never incremented when the memory was
> charged).
> ```
>
> Now we have the following two types of non-hierarchical stats, and they
> are only used in CONFIG_MEMCG_V1:
>
> a. memcg->vmstats->state_local[i]
> b. pn->lruvec_stats->state_local[i]
>
> To ensure that these non-hierarchical stats work properly, we need to
> reparent these non-hierarchical stats after reparenting LRU folios. To
> this end, this commit makes the following preparations:
>
> 1. implement reparent_state_local() to reparent non-hierarchical stats
> 2. make css_killed_work_fn() to be called in rcu work, and implement
> get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid race
> between mod_memcg_state()/mod_memcg_lruvec_state()
> and reparent_state_local()
> 3. change these non-hierarchical stats to atomic_long_t type to avoid race
> between mem_cgroup_stat_aggregate() and reparent_state_local()
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Overall looks good just a couple of comments.
> ---
> include/linux/memcontrol.h | 4 ++
> kernel/cgroup/cgroup.c | 8 +--
> mm/memcontrol-v1.c | 16 ++++++
> mm/memcontrol-v1.h | 3 +
> mm/memcontrol.c | 113 ++++++++++++++++++++++++++++++++++---
> 5 files changed, 132 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3970c102fe741..a4f6ab7eb98d6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -957,12 +957,16 @@ static inline void mod_memcg_page_state(struct page *page,
>
> unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
> +void reparent_memcg_state_local(struct mem_cgroup *memcg,
> + struct mem_cgroup *parent, int idx);
Put the above in mm/memcontrol-v1.h file.
> unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
> bool memcg_stat_item_valid(int idx);
> bool memcg_vm_event_item_valid(enum vm_event_item idx);
> unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
> unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> enum node_stat_item idx);
> +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
> + struct mem_cgroup *parent, int idx);
Put the above in mm/memcontrol-v1.h file.
>
> void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
> void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 94788bd1fdf0e..dbf94a77018e6 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6043,8 +6043,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
> */
> static void css_killed_work_fn(struct work_struct *work)
> {
> - struct cgroup_subsys_state *css =
> - container_of(work, struct cgroup_subsys_state, destroy_work);
> + struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
> + struct cgroup_subsys_state, destroy_rwork);
>
> cgroup_lock();
>
> @@ -6065,8 +6065,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
> container_of(ref, struct cgroup_subsys_state, refcnt);
>
> if (atomic_dec_and_test(&css->online_cnt)) {
> - INIT_WORK(&css->destroy_work, css_killed_work_fn);
> - queue_work(cgroup_offline_wq, &css->destroy_work);
> + INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn);
> + queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork);
> }
> }
>
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index c6078cd7f7e53..a427bb205763b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -1887,6 +1887,22 @@ static const unsigned int memcg1_events[] = {
> PGMAJFAULT,
> };
>
> +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> +{
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++)
> + reparent_memcg_state_local(memcg, parent, memcg1_stats[i]);
> +}
> +
> +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> +{
> + int i;
> +
> + for (i = 0; i < NR_LRU_LISTS; i++)
> + reparent_memcg_lruvec_state_local(memcg, parent, i);
> +}
> +
> void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> {
> unsigned long memory, memsw;
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index eb3c3c1056574..45528195d3578 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -41,6 +41,7 @@ static inline bool do_memsw_account(void)
>
> unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
> unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
> +void mod_memcg_page_state_local(struct mem_cgroup *memcg, int idx, unsigned long val);
> unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
> bool memcg1_alloc_events(struct mem_cgroup *memcg);
> void memcg1_free_events(struct mem_cgroup *memcg);
> @@ -73,6 +74,8 @@ void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
> unsigned long nr_memory, int nid);
>
> void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
> +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
> +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
>
> void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages);
> static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c9b5dfd822d0a..e7d4e4ff411b6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -225,6 +225,26 @@ static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memc
> return objcg;
> }
>
> +#ifdef CONFIG_MEMCG_V1
> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force);
> +
> +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> +{
> + if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return;
> +
> + __mem_cgroup_flush_stats(memcg, true);
> +
> + /* The following counts are all non-hierarchical and need to be reparented. */
> + reparent_memcg1_state_local(memcg, parent);
> + reparent_memcg1_lruvec_state_local(memcg, parent);
> +}
> +#else
> +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> +{
> +}
> +#endif
> +
> static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> {
> spin_lock_irq(&objcg_lock);
> @@ -407,7 +427,7 @@ struct lruvec_stats {
> long state[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Non-hierarchical (CPU aggregated) state */
> - long state_local[NR_MEMCG_NODE_STAT_ITEMS];
> + atomic_long_t state_local[NR_MEMCG_NODE_STAT_ITEMS];
>
> /* Pending child counts during tree propagation */
> long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
> @@ -450,7 +470,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> return 0;
>
> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - x = READ_ONCE(pn->lruvec_stats->state_local[i]);
> + x = atomic_long_read(&(pn->lruvec_stats->state_local[i]));
> #ifdef CONFIG_SMP
> if (x < 0)
> x = 0;
> @@ -458,6 +478,27 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> return x;
> }
>
Please put the following function under CONFIG_MEMCG_V1. Just move it
into the same block as reparent_state_local().
> +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
> + struct mem_cgroup *parent, int idx)
> +{
> + int i = memcg_stats_index(idx);
> + int nid;
> +
> + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
> + return;
> +
> + for_each_node(nid) {
> + struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
> + struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
> + struct mem_cgroup_per_node *parent_pn;
> + unsigned long value = lruvec_page_state_local(child_lruvec, idx);
> +
> + parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec);
> +
> + atomic_long_add(value, &(parent_pn->lruvec_stats->state_local[i]));
> + }
> +}
> +
[...]
>
> +#ifdef CONFIG_MEMCG_V1
> +/*
> + * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid race with
> + * reparenting of non-hierarchical state_locals.
> + */
> +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
> +{
> + if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return memcg;
> +
> + rcu_read_lock();
> +
> + while (memcg_is_dying(memcg))
> + memcg = parent_mem_cgroup(memcg);
> +
> + return memcg;
> +}
> +
> +static inline void get_non_dying_memcg_end(void)
> +{
> + if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return;
> +
> + rcu_read_unlock();
> +}
> +#else
> +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
> +{
> + return memcg;
> +}
> +
> +static inline void get_non_dying_memcg_end(void)
> +{
> +}
> +#endif
Add the usage of these start and end functions in mod_memcg_state() and
mod_memcg_lruvec_state() in this patch.
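I.e. roughly (sketch only; the body elides the existing per-cpu stat
update, and the signature is assumed):

static inline void mod_memcg_state(struct mem_cgroup *memcg,
				   enum memcg_stat_item idx, int val)
{
	/* Walk up past dying memcgs under RCU (v1 only). */
	memcg = get_non_dying_memcg_start(memcg);

	/* ... existing stat update against 'memcg' ... */

	get_non_dying_memcg_end();	/* drops the RCU read lock on v1 */
}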
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages
2026-02-05 9:01 ` [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages Qi Zheng
2026-02-07 1:48 ` Shakeel Butt
@ 2026-02-07 3:59 ` Muchun Song
1 sibling, 0 replies; 50+ messages in thread
From: Muchun Song @ 2026-02-07 3:59 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
> On Feb 5, 2026, at 17:01, Qi Zheng <qi.zheng@linux.dev> wrote:
>
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> For cgroup v2, count_shadow_nodes() is the only place to read
> non-hierarchical stats (lruvec_stats->state_local). To avoid the need to
> consider cgroup v2 during subsequent non-hierarchical stats reparenting,
> use lruvec_lru_size() instead of lruvec_page_state_local() to get the
> number of lru pages.
>
> For NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B cases, it appears
> that the statistics here have already been problematic for a while since
> slab pages have been reparented. So just ignore it for now.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2026-02-05 9:01 ` [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
@ 2026-02-07 19:59 ` Usama Arif
2026-02-07 22:25 ` Shakeel Butt
1 sibling, 0 replies; 50+ messages in thread
From: Usama Arif @ 2026-02-07 19:59 UTC (permalink / raw)
To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
muchun.song, david, lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed,
imran.f.khan, kamalesh.babulal, axelrasmussen, yuanchu, weixugc,
chenridong, mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e7d4e4ff411b6..0e0efaa511d3d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -247,11 +247,25 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgr
>
> static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> {
> + int nid, nest = 0;
> +
> spin_lock_irq(&objcg_lock);
> + for_each_node(nid) {
> + spin_lock_nested(&mem_cgroup_lruvec(memcg,
> + NODE_DATA(nid))->lru_lock, nest++);
> + spin_lock_nested(&mem_cgroup_lruvec(parent,
> + NODE_DATA(nid))->lru_lock, nest++);
> + }
> }
>
Would this break lockdep on more than 4 NUMA nodes, since
MAX_LOCKDEP_SUBCLASSES = 8 and 2 locks are being acquired per node
(nest reaches 8 on the fifth node)?
> static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> {
> + int nid;
> +
> + for_each_node(nid) {
> + spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock);
> + spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock);
> + }
> spin_unlock_irq(&objcg_lock);
> }
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2026-02-05 9:01 ` [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
2026-02-07 19:59 ` Usama Arif
@ 2026-02-07 22:25 ` Shakeel Butt
2026-02-09 3:49 ` Qi Zheng
1 sibling, 1 reply; 50+ messages in thread
From: Shakeel Butt @ 2026-02-07 22:25 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng
On Thu, Feb 05, 2026 at 05:01:49PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
>
> Now that everything is set up, switch folio->memcg_data pointers to
> objcgs, update the accessors, and execute reparenting on cgroup death.
>
> Finally, folio->memcg_data of LRU folios and kmem folios will always
> point to an object cgroup pointer. The folio->memcg_data of slab
> folios will point to a vector of object cgroups.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>
> /*
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e7d4e4ff411b6..0e0efaa511d3d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -247,11 +247,25 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgr
>
> static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> {
> + int nid, nest = 0;
> +
> spin_lock_irq(&objcg_lock);
> + for_each_node(nid) {
> + spin_lock_nested(&mem_cgroup_lruvec(memcg,
> + NODE_DATA(nid))->lru_lock, nest++);
> + spin_lock_nested(&mem_cgroup_lruvec(parent,
> + NODE_DATA(nid))->lru_lock, nest++);
Is there a reason to acquire the locks for all the nodes together? Why
not do the for_each_node(nid) in memcg_reparent_objcgs() and then
reparent the LRUs for each node one by one, taking and releasing the
locks individually? The lock for the offlining memcg might not be
contended, but the parent's lock might be if a lot of memory has been
reparented.
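E.g. something like this untested sketch of the per-node alternative:

	for_each_node(nid) {
		struct lruvec *child = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
		struct lruvec *prnt = mem_cgroup_lruvec(parent, NODE_DATA(nid));

		spin_lock_irq(&child->lru_lock);
		spin_lock_nested(&prnt->lru_lock, SINGLE_DEPTH_NESTING);

		/* ... reparent this node's LRU folios from child to prnt ... */

		spin_unlock(&prnt->lru_lock);
		spin_unlock_irq(&child->lru_lock);
	}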
> + }
> }
>
> static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> {
> + int nid;
> +
> + for_each_node(nid) {
> + spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock);
> + spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock);
> + }
> spin_unlock_irq(&objcg_lock);
> }
>
> @@ -260,12 +274,28 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> struct obj_cgroup *objcg;
> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>
> +retry:
> + if (lru_gen_enabled())
> + max_lru_gen_memcg(parent);
> +
> reparent_locks(memcg, parent);
> + if (lru_gen_enabled()) {
> + if (!recheck_lru_gen_max_memcg(parent)) {
> + reparent_unlocks(memcg, parent);
> + cond_resched();
> + goto retry;
> + }
> + lru_gen_reparent_memcg(memcg, parent);
> + } else {
> + lru_reparent_memcg(memcg, parent);
> + }
>
> objcg = __memcg_reparent_objcgs(memcg, parent);
The above does not need the lru locks. With the per-node refactor, it
will be outside the lru lock.
>
> reparent_unlocks(memcg, parent);
>
> + reparent_state_local(memcg, parent);
> +
> percpu_ref_kill(&objcg->refcnt);
> }
>
>
[...]
> static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
> gfp_t gfp)
> {
> - int ret;
> -
> - ret = try_charge(memcg, gfp, folio_nr_pages(folio));
> - if (ret)
> - goto out;
> + int ret = 0;
> + struct obj_cgroup *objcg;
>
> - css_get(&memcg->css);
> - commit_charge(folio, memcg);
> + objcg = get_obj_cgroup_from_memcg(memcg);
> + /* Do not account at the root objcg level. */
> + if (!obj_cgroup_is_root(objcg))
> + ret = try_charge(memcg, gfp, folio_nr_pages(folio));
Use try_charge_memcg() directly; this will remove the last user of
try_charge(), so try_charge() can then be removed completely.
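I.e. (sketch, assuming try_charge_memcg() keeps its current
signature):

	/* Do not account at the root objcg level. */
	if (!obj_cgroup_is_root(objcg))
		ret = try_charge_memcg(memcg, gfp, folio_nr_pages(folio));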
> + if (ret) {
> + obj_cgroup_put(objcg);
> + return ret;
> + }
> + commit_charge(folio, objcg);
> memcg1_commit_charge(folio, memcg);
> -out:
> +
> return ret;
> }
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 31/31] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
2026-02-05 9:01 ` [PATCH v4 31/31] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
@ 2026-02-07 22:26 ` Shakeel Butt
0 siblings, 0 replies; 50+ messages in thread
From: Shakeel Butt @ 2026-02-07 22:26 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng
On Thu, Feb 05, 2026 at 05:01:50PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
>
> We must ensure the folio is deleted from or added to the correct lruvec
> list. So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users. The
> VM_BUG_ON_PAGE() in move_pages_to_lru() can be removed as
> add_page_to_lru_list() will perform the necessary check.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2026-02-07 22:25 ` Shakeel Butt
@ 2026-02-09 3:49 ` Qi Zheng
2026-02-09 17:53 ` Shakeel Butt
0 siblings, 1 reply; 50+ messages in thread
From: Qi Zheng @ 2026-02-09 3:49 UTC (permalink / raw)
To: Shakeel Butt
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng
On 2/8/26 6:25 AM, Shakeel Butt wrote:
> On Thu, Feb 05, 2026 at 05:01:49PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> Now that everything is set up, switch folio->memcg_data pointers to
>> objcgs, update the accessors, and execute reparenting on cgroup death.
>>
>> Finally, folio->memcg_data of LRU folios and kmem folios will always
>> point to an object cgroup pointer. The folio->memcg_data of slab
>> folios will point to a vector of object cgroups.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> /*
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e7d4e4ff411b6..0e0efaa511d3d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -247,11 +247,25 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgr
>>
>> static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> {
>> + int nid, nest = 0;
>> +
>> spin_lock_irq(&objcg_lock);
>> + for_each_node(nid) {
>> + spin_lock_nested(&mem_cgroup_lruvec(memcg,
>> + NODE_DATA(nid))->lru_lock, nest++);
>> + spin_lock_nested(&mem_cgroup_lruvec(parent,
>> + NODE_DATA(nid))->lru_lock, nest++);
>
> Is there a reason to acquire the locks for all the nodes together? Why
> not do the for_each_node(nid) in memcg_reparent_objcgs() and then
> reparent the LRUs for each node one by one, taking and releasing the
> locks individually? The lock for the offlining memcg might not be
To do this, we first need to convert the objcg from per-memcg to
per-memcg per-node. In this way, we can hold the lru lock and objcg
lock for each node to reparent the folios and the corresponding objcg
together. Otherwise, a folio might have been moved to the parent lruvec
while its objcg hasn't been reparented yet; in that case, we might be
holding the lock of the child lruvec while operating on a folio that is
already on the parent lruvec.
> contended, but the parent's lock might be if a lot of memory has been
> reparented.
>
>> + }
>> }
>>
>> static inline void reparent_unlocks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> {
>> + int nid;
>> +
>> + for_each_node(nid) {
>> + spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock);
>> + spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock);
>> + }
>> spin_unlock_irq(&objcg_lock);
>> }
>>
>> @@ -260,12 +274,28 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
>> struct obj_cgroup *objcg;
>> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>>
>> +retry:
>> + if (lru_gen_enabled())
>> + max_lru_gen_memcg(parent);
>> +
>> reparent_locks(memcg, parent);
>> + if (lru_gen_enabled()) {
>> + if (!recheck_lru_gen_max_memcg(parent)) {
>> + reparent_unlocks(memcg, parent);
>> + cond_resched();
>> + goto retry;
>> + }
>> + lru_gen_reparent_memcg(memcg, parent);
>> + } else {
>> + lru_reparent_memcg(memcg, parent);
>> + }
>>
>> objcg = __memcg_reparent_objcgs(memcg, parent);
>
> The above does not need lru locks. With the per-node refactor, it will
> be out of lru lock.
>
>>
>> reparent_unlocks(memcg, parent);
>>
>> + reparent_state_local(memcg, parent);
>> +
>> percpu_ref_kill(&objcg->refcnt);
>> }
>>
>>
>
> [...]
>
>> static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
>> gfp_t gfp)
>> {
>> - int ret;
>> -
>> - ret = try_charge(memcg, gfp, folio_nr_pages(folio));
>> - if (ret)
>> - goto out;
>> + int ret = 0;
>> + struct obj_cgroup *objcg;
>>
>> - css_get(&memcg->css);
>> - commit_charge(folio, memcg);
>> + objcg = get_obj_cgroup_from_memcg(memcg);
>> + /* Do not account at the root objcg level. */
>> + if (!obj_cgroup_is_root(objcg))
>> + ret = try_charge(memcg, gfp, folio_nr_pages(folio));
>
> Use try_charge_memcg() directly and then this will remove the last user
> of try_charge, so remove try_charge completely.
>
>> + if (ret) {
>> + obj_cgroup_put(objcg);
>> + return ret;
>> + }
>> + commit_charge(folio, objcg);
>> memcg1_commit_charge(folio, memcg);
>> -out:
>> +
>> return ret;
>> }
>>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2026-02-09 3:49 ` Qi Zheng
@ 2026-02-09 17:53 ` Shakeel Butt
2026-02-10 3:11 ` Qi Zheng
0 siblings, 1 reply; 50+ messages in thread
From: Shakeel Butt @ 2026-02-09 17:53 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng
On Mon, Feb 09, 2026 at 11:49:43AM +0800, Qi Zheng wrote:
>
>
> On 2/8/26 6:25 AM, Shakeel Butt wrote:
> > On Thu, Feb 05, 2026 at 05:01:49PM +0800, Qi Zheng wrote:
> > > From: Muchun Song <songmuchun@bytedance.com>
> > >
> > > Now that everything is set up, switch folio->memcg_data pointers to
> > > objcgs, update the accessors, and execute reparenting on cgroup death.
> > >
> > > Finally, folio->memcg_data of LRU folios and kmem folios will always
> > > point to an object cgroup pointer. The folio->memcg_data of slab
> > > folios will point to a vector of object cgroups.
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > /*
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index e7d4e4ff411b6..0e0efaa511d3d 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -247,11 +247,25 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgr
> > > static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> > > {
> > > + int nid, nest = 0;
> > > +
> > > spin_lock_irq(&objcg_lock);
> > > + for_each_node(nid) {
> > > + spin_lock_nested(&mem_cgroup_lruvec(memcg,
> > > + NODE_DATA(nid))->lru_lock, nest++);
> > > + spin_lock_nested(&mem_cgroup_lruvec(parent,
> > > + NODE_DATA(nid))->lru_lock, nest++);
> >
> > Is there a reason to acquire the locks for all the nodes together? Why
> > not do the for_each_node(nid) in memcg_reparent_objcgs() and then
> > reparent the LRUs for each node one by one, taking and releasing the
> > locks individually? The lock for the offlining memcg might not be
>
> To do this, we first need to convert the objcg from per-memcg to
> per-memcg per-node. In this way, we can hold the lru lock and objcg
> lock for each node to reparent the folios and the corresponding objcg
> together.
Oh, we want the reparenting of both the objcg and the folios to be
atomic. Let's add a comment here with the explanation.
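E.g. something like (suggested wording only):

	/*
	 * Hold the lru_locks of all nodes together with objcg_lock so
	 * that reparenting the LRU folios and reparenting the objcg
	 * appear atomic: otherwise a folio could already sit on the
	 * parent's lruvec while its objcg still points to the child,
	 * and we could end up holding the child's lru_lock while
	 * operating on a folio on the parent's lruvec.
	 */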
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2026-02-09 17:53 ` Shakeel Butt
@ 2026-02-10 3:11 ` Qi Zheng
0 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-10 3:11 UTC (permalink / raw)
To: Shakeel Butt
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng
On 2/10/26 1:53 AM, Shakeel Butt wrote:
> On Mon, Feb 09, 2026 at 11:49:43AM +0800, Qi Zheng wrote:
>>
>>
>> On 2/8/26 6:25 AM, Shakeel Butt wrote:
>>> On Thu, Feb 05, 2026 at 05:01:49PM +0800, Qi Zheng wrote:
>>>> From: Muchun Song <songmuchun@bytedance.com>
>>>>
>>>> Now that everything is set up, switch folio->memcg_data pointers to
>>>> objcgs, update the accessors, and execute reparenting on cgroup death.
>>>>
>>>> Finally, folio->memcg_data of LRU folios and kmem folios will always
>>>> point to an object cgroup pointer. The folio->memcg_data of slab
>>>> folios will point to a vector of object cgroups.
>>>>
>>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> /*
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index e7d4e4ff411b6..0e0efaa511d3d 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -247,11 +247,25 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgr
>>>> static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>>>> {
>>>> + int nid, nest = 0;
>>>> +
>>>> spin_lock_irq(&objcg_lock);
>>>> + for_each_node(nid) {
>>>> + spin_lock_nested(&mem_cgroup_lruvec(memcg,
>>>> + NODE_DATA(nid))->lru_lock, nest++);
>>>> + spin_lock_nested(&mem_cgroup_lruvec(parent,
>>>> + NODE_DATA(nid))->lru_lock, nest++);
>>>
>>> Is there a reason to acquire the locks for all the nodes together? Why
>>> not do the for_each_node(nid) in memcg_reparent_objcgs() and then
>>> reparent the LRUs for each node one by one, taking and releasing the
>>> locks individually? The lock for the offlining memcg might not be
>>
>> To do this, we first need to convert the objcg from per-memcg to
>> per-memcg per-node. In this way, we can hold the lru lock and objcg
>> lock for each node to reparent the folios and the corresponding objcg
>> together.
>
> Oh, we want the reparenting of both the objcg and the folios to be
> atomic. Let's add a
Right.
> comment here with the explanation.
OK, will do this refactoring and send v5.
Thanks,
Qi
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats
2026-02-07 2:19 ` Shakeel Butt
@ 2026-02-10 6:47 ` Qi Zheng
2026-02-11 0:38 ` Shakeel Butt
0 siblings, 1 reply; 50+ messages in thread
From: Qi Zheng @ 2026-02-10 6:47 UTC (permalink / raw)
To: Shakeel Butt
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
On 2/7/26 10:19 AM, Shakeel Butt wrote:
> On Thu, Feb 05, 2026 at 05:01:48PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> To resolve the dying memcg issue, we need to reparent LRU folios of child
>> memcg to its parent memcg. This could cause problems for non-hierarchical
>> stats.
>>
>> As Yosry Ahmed pointed out:
>>
>> ```
>> In short, if memory is charged to a dying cgroup at the time of
>> reparenting, when the memory gets uncharged the stats updates will occur
>> at the parent. This will update both hierarchical and non-hierarchical
>> stats of the parent, which would corrupt the parent's non-hierarchical
>> stats (because those counters were never incremented when the memory was
>> charged).
>> ```
>>
>> Now we have the following two types of non-hierarchical stats, and they
>> are only used in CONFIG_MEMCG_V1:
>>
>> a. memcg->vmstats->state_local[i]
>> b. pn->lruvec_stats->state_local[i]
>>
>> To ensure that these non-hierarchical stats work properly, we need to
>> reparent these non-hierarchical stats after reparenting LRU folios. To
>> this end, this commit makes the following preparations:
>>
>> 1. implement reparent_state_local() to reparent non-hierarchical stats
>> 2. make css_killed_work_fn() to be called in rcu work, and implement
>> get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid race
>> between mod_memcg_state()/mod_memcg_lruvec_state()
>> and reparent_state_local()
>> 3. change these non-hierarchical stats to atomic_long_t type to avoid race
>> between mem_cgroup_stat_aggregate() and reparent_state_local()
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>
> Overall looks good just a couple of comments.
>
>> ---
>> include/linux/memcontrol.h | 4 ++
>> kernel/cgroup/cgroup.c | 8 +--
>> mm/memcontrol-v1.c | 16 ++++++
>> mm/memcontrol-v1.h | 3 +
>> mm/memcontrol.c | 113 ++++++++++++++++++++++++++++++++++---
>> 5 files changed, 132 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3970c102fe741..a4f6ab7eb98d6 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -957,12 +957,16 @@ static inline void mod_memcg_page_state(struct page *page,
>>
>> unsigned long memcg_events(struct mem_cgroup *memcg, int event);
>> unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
>> +void reparent_memcg_state_local(struct mem_cgroup *memcg,
>> + struct mem_cgroup *parent, int idx);
>
> Put the above in mm/memcontrol-v1.h file.
OK.
>
>> unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
>> bool memcg_stat_item_valid(int idx);
>> bool memcg_vm_event_item_valid(enum vm_event_item idx);
>> unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
>> unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>> enum node_stat_item idx);
>> +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
>> + struct mem_cgroup *parent, int idx);
>
> Put the above in mm/memcontrol-v1.h file.
OK.
>
>>
>> void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>> void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 94788bd1fdf0e..dbf94a77018e6 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -6043,8 +6043,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
>> */
>> static void css_killed_work_fn(struct work_struct *work)
>> {
>> - struct cgroup_subsys_state *css =
>> - container_of(work, struct cgroup_subsys_state, destroy_work);
>> + struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
>> + struct cgroup_subsys_state, destroy_rwork);
>>
>> cgroup_lock();
>>
>> @@ -6065,8 +6065,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
>> container_of(ref, struct cgroup_subsys_state, refcnt);
>>
>> if (atomic_dec_and_test(&css->online_cnt)) {
>> - INIT_WORK(&css->destroy_work, css_killed_work_fn);
>> - queue_work(cgroup_offline_wq, &css->destroy_work);
>> + INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn);
>> + queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork);
>> }
>> }
>>
>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>> index c6078cd7f7e53..a427bb205763b 100644
>> --- a/mm/memcontrol-v1.c
>> +++ b/mm/memcontrol-v1.c
>> @@ -1887,6 +1887,22 @@ static const unsigned int memcg1_events[] = {
>> PGMAJFAULT,
>> };
>>
>> +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++)
>> + reparent_memcg_state_local(memcg, parent, memcg1_stats[i]);
>> +}
>> +
>> +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < NR_LRU_LISTS; i++)
>> + reparent_memcg_lruvec_state_local(memcg, parent, i);
>> +}
>> +
>> void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>> {
>> unsigned long memory, memsw;
>> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
>> index eb3c3c1056574..45528195d3578 100644
>> --- a/mm/memcontrol-v1.h
>> +++ b/mm/memcontrol-v1.h
>> @@ -41,6 +41,7 @@ static inline bool do_memsw_account(void)
>>
>> unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
>> unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
>> +void mod_memcg_page_state_local(struct mem_cgroup *memcg, int idx, unsigned long val);
>> unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
>> bool memcg1_alloc_events(struct mem_cgroup *memcg);
>> void memcg1_free_events(struct mem_cgroup *memcg);
>> @@ -73,6 +74,8 @@ void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>> unsigned long nr_memory, int nid);
>>
>> void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
>> +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
>> +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
>>
>> void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages);
>> static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c9b5dfd822d0a..e7d4e4ff411b6 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -225,6 +225,26 @@ static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memc
>> return objcg;
>> }
>>
>> +#ifdef CONFIG_MEMCG_V1
>> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force);
>> +
>> +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> + if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>> + return;
>> +
>> + __mem_cgroup_flush_stats(memcg, true);
>> +
>> + /* The following counts are all non-hierarchical and need to be reparented. */
>> + reparent_memcg1_state_local(memcg, parent);
>> + reparent_memcg1_lruvec_state_local(memcg, parent);
>> +}
>> +#else
>> +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> +}
>> +#endif
>> +
>> static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> {
>> spin_lock_irq(&objcg_lock);
>> @@ -407,7 +427,7 @@ struct lruvec_stats {
>> long state[NR_MEMCG_NODE_STAT_ITEMS];
>>
>> /* Non-hierarchical (CPU aggregated) state */
>> - long state_local[NR_MEMCG_NODE_STAT_ITEMS];
>> + atomic_long_t state_local[NR_MEMCG_NODE_STAT_ITEMS];
>>
>> /* Pending child counts during tree propagation */
>> long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
>> @@ -450,7 +470,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>> return 0;
>>
>> pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>> - x = READ_ONCE(pn->lruvec_stats->state_local[i]);
>> + x = atomic_long_read(&(pn->lruvec_stats->state_local[i]));
>> #ifdef CONFIG_SMP
>> if (x < 0)
>> x = 0;
>> @@ -458,6 +478,27 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>> return x;
>> }
>>
>
> Please put the following function under CONFIG_MEMCG_V1. Just move it in
> the same block as reparent_state_local().
OK, will try to do it.
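Roughly, the resulting layout would be as follows (a sketch only; I am
assuming the !CONFIG_MEMCG_V1 branch needs just the reparent_state_local()
stub, since the lruvec helper is only called from v1 code):

```c
#ifdef CONFIG_MEMCG_V1
static inline void reparent_state_local(struct mem_cgroup *memcg,
					struct mem_cgroup *parent)
{
	/* ... as in the hunk above ... */
}

void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
				       struct mem_cgroup *parent, int idx)
{
	/* ... moved here from its current location below ... */
}
#else /* !CONFIG_MEMCG_V1 */
static inline void reparent_state_local(struct mem_cgroup *memcg,
					struct mem_cgroup *parent)
{
}
#endif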
>
>> +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
>> + struct mem_cgroup *parent, int idx)
>> +{
>> + int i = memcg_stats_index(idx);
>> + int nid;
>> +
>> + if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
>> + return;
>> +
>> + for_each_node(nid) {
>> + struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>> + struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
>> + struct mem_cgroup_per_node *parent_pn;
>> + unsigned long value = lruvec_page_state_local(child_lruvec, idx);
>> +
>> + parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec);
>> +
>> + atomic_long_add(value, &(parent_pn->lruvec_stats->state_local[i]));
>> + }
>> +}
>> +
>
> [...]
>
>>
>> +#ifdef CONFIG_MEMCG_V1
>> +/*
>> + * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid race with
>> + * reparenting of non-hierarchical state_locals.
>> + */
>> +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
>> +{
>> + if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>> + return memcg;
>> +
>> + rcu_read_lock();
>> +
>> + while (memcg_is_dying(memcg))
>> + memcg = parent_mem_cgroup(memcg);
>> +
>> + return memcg;
>> +}
>> +
>> +static inline void get_non_dying_memcg_end(void)
>> +{
>> + if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>> + return;
>> +
>> + rcu_read_unlock();
>> +}
>> +#else
>> +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
>> +{
>> + return memcg;
>> +}
>> +
>> +static inline void get_non_dying_memcg_end(void)
>> +{
>> +}
>> +#endif
>
> Add the usage of these start and end functions in mod_memcg_state() and
> mod_memcg_lruvec_state() in this patch.
Using these two functions will change the behavior of mod_memcg_state()
and mod_memcg_lruvec_state(), but LRU folios have not yet been
reparented at that point in the series.
To ensure the patch itself is error-free, I chose to place the usage of
these two functions in patch #30.
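To sketch what that usage will look like (illustrative only; the exact
wiring in patch #30 may differ):

```c
/*
 * Sketch only. On cgroup v1, get_non_dying_memcg_start() takes
 * rcu_read_lock() and walks up past dying memcgs, so the update lands
 * on a live ancestor and cannot race with reparent_state_local(); on
 * the default hierarchy both helpers just return the memcg unchanged.
 */
static inline void mod_memcg_state(struct mem_cgroup *memcg,
				   enum memcg_stat_item idx, int val)
{
	memcg = get_non_dying_memcg_start(memcg);
	__mod_memcg_state(memcg, idx, val);
	get_non_dying_memcg_end();
}
```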
Thanks,
Qi
>
* Re: [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats
2026-02-10 6:47 ` Qi Zheng
@ 2026-02-11 0:38 ` Shakeel Butt
0 siblings, 0 replies; 50+ messages in thread
From: Shakeel Butt @ 2026-02-11 0:38 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
On Tue, Feb 10, 2026 at 02:47:51PM +0800, Qi Zheng wrote:
>
>
> >
> > Add the usage of these start and end functions in mod_memcg_state() and
> > mod_memcg_lruvec_state() in this patch.
>
> Using these two functions will change the behavior of mod_memcg_state()
> and mod_memcg_lruvec_state(), but LRU folios have not yet been
> reparented at that point in the series.
>
> To ensure the patch itself is error-free, I chose to place the usage of
> these two functions in patch #30.
Makes sense.
I think we are good with respect to this patch series (next version). It
will miss 7.0, but that is fine; 7.1 is not too far off.
Thanks for pushing this through.
* Re: [PATCH v4 26/31] mm: vmscan: prepare for reparenting MGLRU folios
2026-02-05 9:01 ` [PATCH v4 26/31] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
@ 2026-02-12 8:46 ` Harry Yoo
2026-02-15 7:28 ` Qi Zheng
0 siblings, 1 reply; 50+ messages in thread
From: Harry Yoo @ 2026-02-12 8:46 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
On Thu, Feb 05, 2026 at 05:01:45PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> Similar to traditional LRU folios, in order to solve the dying memcg
> problem, we also need to reparent MGLRU folios to the parent memcg when
> the memcg goes offline.
>
> However, there are the following challenges:
>
> 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, so
>    the number of generations in the parent and child memcg may differ,
>    and we cannot simply transfer MGLRU folios in the child memcg to the
>    parent memcg as we did for traditional LRU folios.
> 2. The generation information is stored in folio->flags, but we cannot
>    traverse these folios while holding the lru lock, otherwise it may
>    cause a softlockup.
> 3. In walk_update_folio(), the gen of a folio and the corresponding lru
>    size may be updated, but the folio is not immediately moved to the
>    corresponding lru list. Therefore, there may be folios of different
>    generations on an LRU list.
> 4. In lru_gen_del_folio(), the generation to which the folio belongs is
>    found based on the generation information in folio->flags, and the
>    corresponding LRU size will be updated. Therefore, we need to update
>    the lru size correctly during reparenting, otherwise the lru size may
>    be updated incorrectly in lru_gen_del_folio().
>
> Finally, this patch chooses a compromise: splice each lru list in the
> child memcg onto the lru list of the same generation in the parent memcg
> during reparenting. To ensure that the parent memcg has the matching
> generations, we increase the number of generations in the parent memcg
> to MAX_NR_GENS before reparenting.
>
> Of course, the same generation has different meanings in the parent and
> child memcg, so this blurs the hot and cold information of folios. But
> other than that, this method is simple enough, the lru size stays
> correct, and there is no need to consider certain concurrency issues
> (such as lru_gen_del_folio()).
>
> To prepare for the above work, this commit implements the required
> helper functions, which will be used during reparenting.
>
> Suggested-by: Harry Yoo <harry.yoo@oracle.com>
> Suggested-by: Imran Khan <imran.f.khan@oracle.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> include/linux/mmzone.h | 16 +++++
> mm/vmscan.c | 154 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 170 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3e51190a55e4c..0c18b17f0fe2e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e2d9ef9a5dedc..8c6f8f0df24b1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> +void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> +{
> + int nid;
> +
> + for_each_node(nid) {
> + struct lruvec *child_lruvec, *parent_lruvec;
> + int type, zid;
> + struct zone *zone;
> + enum lru_list lru;
> +
> + child_lruvec = get_lruvec(memcg, nid);
> + parent_lruvec = get_lruvec(parent, nid);
> +
> + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1)
> + for (type = 0; type < ANON_AND_FILE; type++)
> + __lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type);
> +
> + for_each_lru(lru) {
> + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
> + unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
> +
> + mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
This part looks fine, but I think the nr_pages parameter
in mem_cgroup_update_lru_size() should be long instead of int.
Could you please update the type as well?
Otherwise looks good to me,
Acked-by: Harry Yoo <harry.yoo@oracle.com>
> + }
> + }
> + }
> +}
> +
> #endif /* CONFIG_MEMCG */
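As an aside for readers of the archive: __lru_gen_reparent_memcg() is
trimmed from the quote above. Per-generation splicing could look roughly
like this (a sketch with field names taken from struct lru_gen_folio; the
actual patch's locking and accounting may differ, and the caller is
assumed to hold the relevant lruvec locks):

```c
static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec,
				     struct lruvec *parent_lruvec,
				     int zid, int type)
{
	struct lru_gen_folio *child = &child_lruvec->lrugen;
	struct lru_gen_folio *parent = &parent_lruvec->lrugen;
	int gen;

	for (gen = 0; gen < MAX_NR_GENS; gen++) {
		/* Splice gen-for-gen; the parent was bumped to MAX_NR_GENS first. */
		list_splice_tail_init(&child->folios[gen][type][zid],
				      &parent->folios[gen][type][zid]);
		/* Transfer the per-generation page counts along with the lists. */
		WRITE_ONCE(parent->nr_pages[gen][type][zid],
			   parent->nr_pages[gen][type][zid] +
			   child->nr_pages[gen][type][zid]);
		WRITE_ONCE(child->nr_pages[gen][type][zid], 0);
	}
}
```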
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v4 26/31] mm: vmscan: prepare for reparenting MGLRU folios
2026-02-12 8:46 ` Harry Yoo
@ 2026-02-15 7:28 ` Qi Zheng
0 siblings, 0 replies; 50+ messages in thread
From: Qi Zheng @ 2026-02-15 7:28 UTC (permalink / raw)
To: Harry Yoo
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, yosry.ahmed, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, chenridong,
mkoutny, akpm, hamzamahfooz, apais, lance.yang, bhe, linux-mm,
linux-kernel, cgroups, Qi Zheng
On 2/12/26 4:46 PM, Harry Yoo wrote:
> On Thu, Feb 05, 2026 at 05:01:45PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> [...]
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index e2d9ef9a5dedc..8c6f8f0df24b1 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> +void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> + int nid;
>> +
>> + for_each_node(nid) {
>> + struct lruvec *child_lruvec, *parent_lruvec;
>> + int type, zid;
>> + struct zone *zone;
>> + enum lru_list lru;
>> +
>> + child_lruvec = get_lruvec(memcg, nid);
>> + parent_lruvec = get_lruvec(parent, nid);
>> +
>> + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1)
>> + for (type = 0; type < ANON_AND_FILE; type++)
>> + __lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type);
>> +
>> + for_each_lru(lru) {
>> + for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
>> + unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
>> +
>> + mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
>
> This part looks fine, but I think the nr_pages parameter
> in mem_cgroup_update_lru_size() should be long instead of int.
> Could you please update the type as well?
Makes sense, and I think it would be better to do this in a separate
patch; will do that and add your Suggested-by.
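For reference, that separate patch would be roughly the following
prototype change (a sketch only; the definition in mm/memcontrol.c would
change the same way):

```c
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
-void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-				int zid, int nr_pages);
+void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
+				int zid, long nr_pages);
```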
>
> Otherwise looks good to me,
> Acked-by: Harry Yoo <harry.yoo@oracle.com>
Thanks!
>
>> + }
>> + }
>> + }
>> +}
>> +
>> #endif /* CONFIG_MEMCG */
>
Thread overview: 50+ messages
2026-02-05 8:54 [PATCH v4 00/31] Eliminate Dying Memory Cgroup Qi Zheng
2026-02-05 8:54 ` [PATCH v4 01/31] mm: memcontrol: remove dead code of checking parent memory cgroup Qi Zheng
2026-02-05 8:54 ` [PATCH v4 02/31] mm: workingset: use folio_lruvec() in workingset_refault() Qi Zheng
2026-02-05 8:54 ` [PATCH v4 03/31] mm: rename unlock_page_lruvec_irq and its variants Qi Zheng
2026-02-05 8:54 ` [PATCH v4 04/31] mm: vmscan: prepare for the refactoring the move_folios_to_lru() Qi Zheng
2026-02-05 8:54 ` [PATCH v4 05/31] mm: vmscan: refactor move_folios_to_lru() Qi Zheng
2026-02-05 8:54 ` [PATCH v4 06/31] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
2026-02-05 8:54 ` [PATCH v4 07/31] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
2026-02-05 9:01 ` [PATCH v4 08/31] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 09/31] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 10/31] writeback: prevent memory cgroup release in writeback module Qi Zheng
2026-02-05 9:01 ` [PATCH v4 11/31] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 12/31] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
2026-02-05 9:01 ` [PATCH v4 13/31] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 14/31] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
2026-02-05 9:01 ` [PATCH v4 15/31] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 16/31] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 17/31] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 18/31] mm: zswap: prevent memory cgroup release in zswap_compress() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 19/31] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 20/31] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 21/31] mm: swap: prevent lruvec release in lru_gen_clear_refs() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 22/31] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 23/31] mm: do not open-code lruvec lock Qi Zheng
2026-02-05 9:01 ` [PATCH v4 24/31] mm: memcontrol: prepare for reparenting LRU pages for " Qi Zheng
2026-02-05 15:02 ` kernel test robot
2026-02-05 15:02 ` kernel test robot
2026-02-06 6:13 ` Qi Zheng
2026-02-06 23:34 ` Shakeel Butt
2026-02-05 9:01 ` [PATCH v4 25/31] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
2026-02-07 1:28 ` Shakeel Butt
2026-02-05 9:01 ` [PATCH v4 26/31] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
2026-02-12 8:46 ` Harry Yoo
2026-02-15 7:28 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 27/31] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
2026-02-05 9:01 ` [PATCH v4 28/31] mm: workingset: use lruvec_lru_size() to get the number of lru pages Qi Zheng
2026-02-07 1:48 ` Shakeel Butt
2026-02-07 3:59 ` Muchun Song
2026-02-05 9:01 ` [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats Qi Zheng
2026-02-07 2:19 ` Shakeel Butt
2026-02-10 6:47 ` Qi Zheng
2026-02-11 0:38 ` Shakeel Butt
2026-02-05 9:01 ` [PATCH v4 30/31] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
2026-02-07 19:59 ` Usama Arif
2026-02-07 22:25 ` Shakeel Butt
2026-02-09 3:49 ` Qi Zheng
2026-02-09 17:53 ` Shakeel Butt
2026-02-10 3:11 ` Qi Zheng
2026-02-05 9:01 ` [PATCH v4 31/31] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
2026-02-07 22:26 ` Shakeel Butt