[PATCH v2 0/4] reparent the THP split queue

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/4] reparent the THP split queue
@ 2025-09-23  9:16 Qi Zheng
  2025-09-23  9:16 ` [PATCH v2 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged() Qi Zheng
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Qi Zheng @ 2025-09-23  9:16 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

Changes in v2:
 - fix build errors in [PATCH 2/4] and [PATCH 4/4]
 - some cleanups for [PATCH 3/4] (suggested by David Hildenbrand)
 - collect Acked-bys and Reviewed-bys
 - rebase onto the next-20250922

Hi all,

In the future, we will reparent LRU folios during memcg offline to eliminate
dying memory cgroups, which requires reparenting the THP split queue to its
parent memcg.

Similar to list_lru, the split queue is relatively independent and does not need
to be reparented along with objcg and LRU folios (holding objcg lock and lru
lock). Therefore, we can apply the same mechanism as list_lru to reparent the
split queue first when memcg is offine.

The first three patches in this series are separated from the series
"Eliminate Dying Memory Cgroup" [1], mainly to do some cleanup and preparatory
work.

The last patch reparents the THP split queue to its parent memcg during memcg
offline.

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/all/20250415024532.26632-1-songmuchun@bytedance.com/

Muchun Song (3):
  mm: thp: replace folio_memcg() with folio_memcg_charged()
  mm: thp: introduce folio_split_queue_lock and its variants
  mm: thp: use folio_batch to handle THP splitting in
    deferred_split_scan()

Qi Zheng (1):
  mm: thp: reparent the split queue during memcg offline

 include/linux/huge_mm.h    |   2 +
 include/linux/memcontrol.h |  10 ++
 include/linux/mmzone.h     |   1 +
 mm/huge_memory.c           | 229 +++++++++++++++++++++++++------------
 mm/memcontrol.c            |   1 +
 mm/mm_init.c               |   1 +
 6 files changed, 172 insertions(+), 72 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged()
  2025-09-23  9:16 [PATCH v2 0/4] reparent the THP split queue Qi Zheng
@ 2025-09-23  9:16 ` Qi Zheng
  2025-09-24  9:10   ` Roman Gushchin
  2025-09-23  9:16 ` [PATCH v2 2/4] mm: thp: introduce folio_split_queue_lock and its variants Qi Zheng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2025-09-23  9:16 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

folio_memcg_charged() is intended for use when the user is unconcerned
about the returned memcg pointer. It is more efficient than folio_memcg().
Therefore, replace folio_memcg() with folio_memcg_charged().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5acca24bbabbe..582628ddf3f33 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4014,7 +4014,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 	bool unqueued = false;
 
 	WARN_ON_ONCE(folio_ref_count(folio));
-	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
+	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
 
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-- 
2.20.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 2/4] mm: thp: introduce folio_split_queue_lock and its variants
  2025-09-23  9:16 [PATCH v2 0/4] reparent the THP split queue Qi Zheng
  2025-09-23  9:16 ` [PATCH v2 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged() Qi Zheng
@ 2025-09-23  9:16 ` Qi Zheng
  2025-09-23  9:16 ` [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Qi Zheng
  2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
  3 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2025-09-23  9:16 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In future memcg removal, the binding between a folio and a memcg may
change, making the split lock within the memcg unstable when held.

A new approach is required to reparent the split queue to its parent. This
patch starts introducing a unified way to acquire the split lock for
future work.

It's a code-only refactoring with no functional changes.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memcontrol.h |  10 ++++
 mm/huge_memory.c           | 104 ++++++++++++++++++++++++++++---------
 2 files changed, 89 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 16fe0306e50ea..99876af13c315 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1662,6 +1662,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
 void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
 void reparent_shrinker_deferred(struct mem_cgroup *memcg);
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+	return shrinker->id;
+}
 #else
 #define mem_cgroup_sockets_enabled 0
 
@@ -1693,6 +1698,11 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
 				    int nid, int shrinker_id)
 {
 }
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+	return -1;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 582628ddf3f33..2f41b8f0d4871 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1078,26 +1078,83 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 #ifdef CONFIG_MEMCG
 static inline
-struct deferred_split *get_deferred_split_queue(struct folio *folio)
+struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
+					   struct deferred_split *queue)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
-	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
+	if (mem_cgroup_disabled())
+		return NULL;
+	if (&NODE_DATA(folio_nid(folio))->deferred_split_queue == queue)
+		return NULL;
+	return container_of(queue, struct mem_cgroup, deferred_split_queue);
+}
 
-	if (memcg)
-		return &memcg->deferred_split_queue;
-	else
-		return &pgdat->deferred_split_queue;
+static struct deferred_split *folio_split_queue_lock(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+	struct deferred_split *queue;
+
+	memcg = folio_memcg(folio);
+	queue = memcg ? &memcg->deferred_split_queue :
+			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
+	spin_lock(&queue->split_queue_lock);
+
+	return queue;
+}
+
+static struct deferred_split *
+folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
+{
+	struct mem_cgroup *memcg;
+	struct deferred_split *queue;
+
+	memcg = folio_memcg(folio);
+	queue = memcg ? &memcg->deferred_split_queue :
+			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
+	spin_lock_irqsave(&queue->split_queue_lock, *flags);
+
+	return queue;
 }
 #else
 static inline
-struct deferred_split *get_deferred_split_queue(struct folio *folio)
+struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
+					   struct deferred_split *queue)
+{
+	return NULL;
+}
+
+static struct deferred_split *folio_split_queue_lock(struct folio *folio)
 {
 	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
+	struct deferred_split *queue = &pgdat->deferred_split_queue;
+
+	spin_lock(&queue->split_queue_lock);
 
-	return &pgdat->deferred_split_queue;
+	return queue;
+}
+
+static struct deferred_split *
+folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
+{
+	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
+	struct deferred_split *queue = &pgdat->deferred_split_queue;
+
+	spin_lock_irqsave(&queue->split_queue_lock, *flags);
+
+	return queue;
 }
 #endif
 
+static inline void split_queue_unlock(struct deferred_split *queue)
+{
+	spin_unlock(&queue->split_queue_lock);
+}
+
+static inline void split_queue_unlock_irqrestore(struct deferred_split *queue,
+						 unsigned long flags)
+{
+	spin_unlock_irqrestore(&queue->split_queue_lock, flags);
+}
+
 static inline bool is_transparent_hugepage(const struct folio *folio)
 {
 	if (!folio_test_large(folio))
@@ -3579,7 +3636,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
 		struct list_head *list, bool uniform_split)
 {
-	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+	struct deferred_split *ds_queue;
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
 	struct folio *end_folio = folio_next(folio);
 	bool is_anon = folio_test_anon(folio);
@@ -3718,7 +3775,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	}
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
-	spin_lock(&ds_queue->split_queue_lock);
+	ds_queue = folio_split_queue_lock(folio);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
 		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
@@ -3740,7 +3797,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			 */
 			list_del_init(&folio->_deferred_list);
 		}
-		spin_unlock(&ds_queue->split_queue_lock);
+		split_queue_unlock(ds_queue);
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
@@ -3835,7 +3892,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		if (ci)
 			swap_cluster_unlock(ci);
 	} else {
-		spin_unlock(&ds_queue->split_queue_lock);
+		split_queue_unlock(ds_queue);
 		ret = -EAGAIN;
 	}
 fail:
@@ -4016,8 +4073,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 	WARN_ON_ONCE(folio_ref_count(folio));
 	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
 
-	ds_queue = get_deferred_split_queue(folio);
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
 	if (!list_empty(&folio->_deferred_list)) {
 		ds_queue->split_queue_len--;
 		if (folio_test_partially_mapped(folio)) {
@@ -4028,7 +4084,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 		list_del_init(&folio->_deferred_list);
 		unqueued = true;
 	}
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	split_queue_unlock_irqrestore(ds_queue, flags);
 
 	return unqueued;	/* useful for debug warnings */
 }
@@ -4036,10 +4092,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
 void deferred_split_folio(struct folio *folio, bool partially_mapped)
 {
-	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
-#ifdef CONFIG_MEMCG
-	struct mem_cgroup *memcg = folio_memcg(folio);
-#endif
+	struct deferred_split *ds_queue;
 	unsigned long flags;
 
 	/*
@@ -4062,7 +4115,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 	if (folio_test_swapcache(folio))
 		return;
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
 	if (partially_mapped) {
 		if (!folio_test_partially_mapped(folio)) {
 			folio_set_partially_mapped(folio);
@@ -4077,15 +4130,16 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 		VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
 	}
 	if (list_empty(&folio->_deferred_list)) {
+		struct mem_cgroup *memcg;
+
+		memcg = folio_split_queue_memcg(folio, ds_queue);
 		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
 		ds_queue->split_queue_len++;
-#ifdef CONFIG_MEMCG
 		if (memcg)
 			set_shrinker_bit(memcg, folio_nid(folio),
-					 deferred_split_shrinker->id);
-#endif
+					 shrinker_id(deferred_split_shrinker));
 	}
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	split_queue_unlock_irqrestore(ds_queue, flags);
 }
 
 static unsigned long deferred_split_count(struct shrinker *shrink,
-- 
2.20.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
  2025-09-23  9:16 [PATCH v2 0/4] reparent the THP split queue Qi Zheng
  2025-09-23  9:16 ` [PATCH v2 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged() Qi Zheng
  2025-09-23  9:16 ` [PATCH v2 2/4] mm: thp: introduce folio_split_queue_lock and its variants Qi Zheng
@ 2025-09-23  9:16 ` Qi Zheng
  2025-09-23 15:31   ` Zi Yan
  2025-09-24 12:31   ` David Hildenbrand
  2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
  3 siblings, 2 replies; 25+ messages in thread
From: Qi Zheng @ 2025-09-23  9:16 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

The maintenance of the folio->_deferred_list is intricate because it's
reused in a local list.

Here are some peculiarities:

   1) When a folio is removed from its split queue and added to a local
      on-stack list in deferred_split_scan(), the ->split_queue_len isn't
      updated, leading to an inconsistency between it and the actual
      number of folios in the split queue.

   2) When the folio is split via split_folio() later, it's removed from
      the local list while holding the split queue lock. At this time,
      this lock protects the local list, not the split queue.

   3) To handle the race condition with a third-party freeing or migrating
      the preceding folio, we must ensure there's always one safe (with
      raised refcount) folio before by delaying its folio_put(). More
      details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
      split queue not partially_mapped"). It's rather tricky.

We can use the folio_batch infrastructure to handle this clearly. In this
case, ->split_queue_len will be consistent with the real number of folios
in the split queue. If list_empty(&folio->_deferred_list) returns false,
it's clear the folio must be in its split queue (not in a local list
anymore).

In the future, we will reparent LRU folios during memcg offline to
eliminate dying memory cgroups, which requires reparenting the split queue
to its parent first. So this patch prepares for using
folio_split_queue_lock_irqsave() as the memcg may change then.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/huge_memory.c | 84 ++++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 46 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f41b8f0d4871..48b51e6230a67 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3781,21 +3781,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct lruvec *lruvec;
 		int expected_refs;
 
-		if (folio_order(folio) > 1 &&
-		    !list_empty(&folio->_deferred_list)) {
-			ds_queue->split_queue_len--;
+		if (folio_order(folio) > 1) {
+			if (!list_empty(&folio->_deferred_list)) {
+				ds_queue->split_queue_len--;
+				/*
+				 * Reinitialize page_deferred_list after removing the
+				 * page from the split_queue, otherwise a subsequent
+				 * split will see list corruption when checking the
+				 * page_deferred_list.
+				 */
+				list_del_init(&folio->_deferred_list);
+			}
 			if (folio_test_partially_mapped(folio)) {
 				folio_clear_partially_mapped(folio);
 				mod_mthp_stat(folio_order(folio),
 					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 			}
-			/*
-			 * Reinitialize page_deferred_list after removing the
-			 * page from the split_queue, otherwise a subsequent
-			 * split will see list corruption when checking the
-			 * page_deferred_list.
-			 */
-			list_del_init(&folio->_deferred_list);
 		}
 		split_queue_unlock(ds_queue);
 		if (mapping) {
@@ -4194,40 +4195,44 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
-	struct folio *folio, *next, *prev = NULL;
-	int split = 0, removed = 0;
+	struct folio *folio, *next;
+	int split = 0, i;
+	struct folio_batch fbatch;
 
 #ifdef CONFIG_MEMCG
 	if (sc->memcg)
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&fbatch);
+retry:
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
 		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
+			folio_batch_add(&fbatch, folio);
+		} else if (folio_test_partially_mapped(folio)) {
 			/* We lost race with folio_put() */
-			if (folio_test_partially_mapped(folio)) {
-				folio_clear_partially_mapped(folio);
-				mod_mthp_stat(folio_order(folio),
-					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
-			}
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+			folio_clear_partially_mapped(folio);
+			mod_mthp_stat(folio_order(folio),
+				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
+		list_del_init(&folio->_deferred_list);
+		ds_queue->split_queue_len--;
 		if (!--sc->nr_to_scan)
 			break;
+		if (!folio_batch_space(&fbatch))
+			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	for (i = 0; i < folio_batch_count(&fbatch); i++) {
 		bool did_split = false;
 		bool underused = false;
+		struct deferred_split *fqueue;
 
+		folio = fbatch.folios[i];
 		if (!folio_test_partially_mapped(folio)) {
 			/*
 			 * See try_to_map_unused_to_zeropage(): we cannot
@@ -4250,38 +4255,25 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		}
 		folio_unlock(folio);
 next:
+		if (did_split || !folio_test_partially_mapped(folio))
+			continue;
 		/*
-		 * split_folio() removes folio from list on success.
 		 * Only add back to the queue if folio is partially mapped.
 		 * If thp_underused returns false, or if split_folio fails
 		 * in the case it was underused, then consider it used and
 		 * don't add it back to split_queue.
 		 */
-		if (did_split) {
-			; /* folio already removed from list */
-		} else if (!folio_test_partially_mapped(folio)) {
-			list_del_init(&folio->_deferred_list);
-			removed++;
-		} else {
-			/*
-			 * That unlocked list_del_init() above would be unsafe,
-			 * unless its folio is separated from any earlier folios
-			 * left on the list (which may be concurrently unqueued)
-			 * by one safe folio with refcount still raised.
-			 */
-			swap(folio, prev);
+		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
+		if (list_empty(&folio->_deferred_list)) {
+			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
+			fqueue->split_queue_len++;
 		}
-		if (folio)
-			folio_put(folio);
+		split_queue_unlock_irqrestore(fqueue, flags);
 	}
+	folios_put(&fbatch);
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
-	ds_queue->split_queue_len -= removed;
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
-
-	if (prev)
-		folio_put(prev);
+	if (sc->nr_to_scan)
+		goto retry;
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
-- 
2.20.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-23  9:16 [PATCH v2 0/4] reparent the THP split queue Qi Zheng
                   ` (2 preceding siblings ...)
  2025-09-23  9:16 ` [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Qi Zheng
@ 2025-09-23  9:16 ` Qi Zheng
  2025-09-23 15:44   ` Zi Yan
                     ` (4 more replies)
  3 siblings, 5 replies; 25+ messages in thread
From: Qi Zheng @ 2025-09-23  9:16 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

In the future, we will reparent LRU folios during memcg offline to
eliminate dying memory cgroups, which requires reparenting the split queue
to its parent.

Similar to list_lru, the split queue is relatively independent and does
not need to be reparented along with objcg and LRU folios (holding
objcg lock and lru lock). So let's apply the same mechanism as list_lru
to reparent the split queue separately when memcg is offine.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/huge_mm.h |  2 ++
 include/linux/mmzone.h  |  1 +
 mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c         |  1 +
 mm/mm_init.c            |  1 +
 5 files changed, 44 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f327d62fc9852..a0d4b751974d2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
 	return split_huge_page_to_list_to_order(page, NULL, ret);
 }
 void deferred_split_folio(struct folio *folio, bool partially_mapped);
+void reparent_deferred_split_queue(struct mem_cgroup *memcg);
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
@@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
 }
 
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
+static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c57250..f3eb81fee056a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1346,6 +1346,7 @@ struct deferred_split {
 	spinlock_t split_queue_lock;
 	struct list_head split_queue;
 	unsigned long split_queue_len;
+	bool is_dying;
 };
 #endif
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 48b51e6230a67..de7806f759cba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
 	struct deferred_split *queue;
 
 	memcg = folio_memcg(folio);
+retry:
 	queue = memcg ? &memcg->deferred_split_queue :
 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
 	spin_lock(&queue->split_queue_lock);
+	if (unlikely(queue->is_dying == true)) {
+		spin_unlock(&queue->split_queue_lock);
+		memcg = parent_mem_cgroup(memcg);
+		goto retry;
+	}
 
 	return queue;
 }
@@ -1108,9 +1114,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
 	struct deferred_split *queue;
 
 	memcg = folio_memcg(folio);
+retry:
 	queue = memcg ? &memcg->deferred_split_queue :
 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
 	spin_lock_irqsave(&queue->split_queue_lock, *flags);
+	if (unlikely(queue->is_dying == true)) {
+		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
+		memcg = parent_mem_cgroup(memcg);
+		goto retry;
+	}
 
 	return queue;
 }
@@ -4284,6 +4296,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	return split;
 }
 
+void reparent_deferred_split_queue(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
+	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
+	int nid;
+
+	spin_lock_irq(&ds_queue->split_queue_lock);
+	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
+
+	if (!ds_queue->split_queue_len)
+		goto unlock;
+
+	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
+	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
+	ds_queue->split_queue_len = 0;
+	/* Mark the ds_queue dead */
+	ds_queue->is_dying = true;
+
+	for_each_node(nid)
+		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
+
+unlock:
+	spin_unlock(&parent_ds_queue->split_queue_lock);
+	spin_unlock_irq(&ds_queue->split_queue_lock);
+}
+
 #ifdef CONFIG_DEBUG_FS
 static void split_huge_pages_all(void)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e090f29eb03bd..d03da72e7585d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	zswap_memcg_offline_cleanup(memcg);
 
 	memcg_offline_kmem(memcg);
+	reparent_deferred_split_queue(memcg);
 	reparent_shrinker_deferred(memcg);
 	wb_memcg_offline(memcg);
 	lru_gen_offline_memcg(memcg);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3db2dea7db4c5..cbda5c2ee3241 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1387,6 +1387,7 @@ static void pgdat_init_split_queue(struct pglist_data *pgdat)
 	spin_lock_init(&ds_queue->split_queue_lock);
 	INIT_LIST_HEAD(&ds_queue->split_queue);
 	ds_queue->split_queue_len = 0;
+	ds_queue->is_dying = false;
 }
 #else
 static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
-- 
2.20.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
  2025-09-23  9:16 ` [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Qi Zheng
@ 2025-09-23 15:31   ` Zi Yan
  2025-09-24  9:57     ` Qi Zheng
  2025-09-24 12:31   ` David Hildenbrand
  1 sibling, 1 reply; 25+ messages in thread
From: Zi Yan @ 2025-09-23 15:31 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups, Muchun Song

On 23 Sep 2025, at 5:16, Qi Zheng wrote:

> From: Muchun Song <songmuchun@bytedance.com>
>
> The maintenance of the folio->_deferred_list is intricate because it's
> reused in a local list.
>
> Here are some peculiarities:
>
>    1) When a folio is removed from its split queue and added to a local
>       on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>       updated, leading to an inconsistency between it and the actual
>       number of folios in the split queue.
>
>    2) When the folio is split via split_folio() later, it's removed from
>       the local list while holding the split queue lock. At this time,
>       this lock protects the local list, not the split queue.
>
>    3) To handle the race condition with a third-party freeing or migrating
>       the preceding folio, we must ensure there's always one safe (with
>       raised refcount) folio before by delaying its folio_put(). More
>       details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
>       split queue not partially_mapped"). It's rather tricky.
>
> We can use the folio_batch infrastructure to handle this clearly. In this

Can you add more details on how folio_batch handles the above three concerns
in the original code? That would guide me what to look for during code review.

> case, ->split_queue_len will be consistent with the real number of folios
> in the split queue. If list_empty(&folio->_deferred_list) returns false,
> it's clear the folio must be in its split queue (not in a local list
> anymore).
>
> In the future, we will reparent LRU folios during memcg offline to
> eliminate dying memory cgroups, which requires reparenting the split queue
> to its parent first. So this patch prepares for using
> folio_split_queue_lock_irqsave() as the memcg may change then.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/huge_memory.c | 84 ++++++++++++++++++++++--------------------------
>  1 file changed, 38 insertions(+), 46 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f41b8f0d4871..48b51e6230a67 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3781,21 +3781,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		struct lruvec *lruvec;
>  		int expected_refs;
>
> -		if (folio_order(folio) > 1 &&
> -		    !list_empty(&folio->_deferred_list)) {
> -			ds_queue->split_queue_len--;
> +		if (folio_order(folio) > 1) {
> +			if (!list_empty(&folio->_deferred_list)) {
> +				ds_queue->split_queue_len--;
> +				/*
> +				 * Reinitialize page_deferred_list after removing the
> +				 * page from the split_queue, otherwise a subsequent
> +				 * split will see list corruption when checking the
> +				 * page_deferred_list.
> +				 */
> +				list_del_init(&folio->_deferred_list);
> +			}
>  			if (folio_test_partially_mapped(folio)) {
>  				folio_clear_partially_mapped(folio);
>  				mod_mthp_stat(folio_order(folio),
>  					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>  			}

folio_test_partially_mapped() is done regardless the folio is on _deferred_list
or not, is it because the folio can be on a folio batch and its _deferred_list
is empty?

> -			/*
> -			 * Reinitialize page_deferred_list after removing the
> -			 * page from the split_queue, otherwise a subsequent
> -			 * split will see list corruption when checking the
> -			 * page_deferred_list.
> -			 */
> -			list_del_init(&folio->_deferred_list);
>  		}
>  		split_queue_unlock(ds_queue);
>  		if (mapping) {
> @@ -4194,40 +4195,44 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
> -	LIST_HEAD(list);
> -	struct folio *folio, *next, *prev = NULL;
> -	int split = 0, removed = 0;
> +	struct folio *folio, *next;
> +	int split = 0, i;
> +	struct folio_batch fbatch;
>
>  #ifdef CONFIG_MEMCG
>  	if (sc->memcg)
>  		ds_queue = &sc->memcg->deferred_split_queue;
>  #endif
>
> +	folio_batch_init(&fbatch);
> +retry:
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	/* Take pin on all head pages to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
>  		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> +			folio_batch_add(&fbatch, folio);
> +		} else if (folio_test_partially_mapped(folio)) {
>  			/* We lost race with folio_put() */
> -			if (folio_test_partially_mapped(folio)) {
> -				folio_clear_partially_mapped(folio);
> -				mod_mthp_stat(folio_order(folio),
> -					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> -			}
> -			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +			folio_clear_partially_mapped(folio);
> +			mod_mthp_stat(folio_order(folio),
> +				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>  		}
> +		list_del_init(&folio->_deferred_list);
> +		ds_queue->split_queue_len--;

At this point, the folio can be following conditions:
1. deferred_split_scan() gets it,
2. it is freed by folio_put().

In both cases, it is removed from deferred_split_queue, right?

>  		if (!--sc->nr_to_scan)
>  			break;
> +		if (!folio_batch_space(&fbatch))
> +			break;
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> +	for (i = 0; i < folio_batch_count(&fbatch); i++) {
>  		bool did_split = false;
>  		bool underused = false;
> +		struct deferred_split *fqueue;
>
> +		folio = fbatch.folios[i];
>  		if (!folio_test_partially_mapped(folio)) {
>  			/*
>  			 * See try_to_map_unused_to_zeropage(): we cannot
> @@ -4250,38 +4255,25 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		}
>  		folio_unlock(folio);
>  next:
> +		if (did_split || !folio_test_partially_mapped(folio))
> +			continue;
>  		/*
> -		 * split_folio() removes folio from list on success.
>  		 * Only add back to the queue if folio is partially mapped.
>  		 * If thp_underused returns false, or if split_folio fails
>  		 * in the case it was underused, then consider it used and
>  		 * don't add it back to split_queue.
>  		 */
> -		if (did_split) {
> -			; /* folio already removed from list */
> -		} else if (!folio_test_partially_mapped(folio)) {
> -			list_del_init(&folio->_deferred_list);
> -			removed++;
> -		} else {
> -			/*
> -			 * That unlocked list_del_init() above would be unsafe,
> -			 * unless its folio is separated from any earlier folios
> -			 * left on the list (which may be concurrently unqueued)
> -			 * by one safe folio with refcount still raised.
> -			 */
> -			swap(folio, prev);
> +		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
> +		if (list_empty(&folio->_deferred_list)) {
> +			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
> +			fqueue->split_queue_len++;

fqueue should be the same as ds_queue, right? Just want to make sure
I understand the code.

>  		}
> -		if (folio)
> -			folio_put(folio);
> +		split_queue_unlock_irqrestore(fqueue, flags);
>  	}
> +	folios_put(&fbatch);
>
> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	list_splice_tail(&list, &ds_queue->split_queue);
> -	ds_queue->split_queue_len -= removed;
> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> -
> -	if (prev)
> -		folio_put(prev);
> +	if (sc->nr_to_scan)
> +		goto retry;
>
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
> -- 
> 2.20.1


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
@ 2025-09-23 15:44   ` Zi Yan
  2025-09-24  9:58     ` Qi Zheng
  2025-09-24  9:23   ` Roman Gushchin
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Zi Yan @ 2025-09-23 15:44 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups

On 23 Sep 2025, at 5:16, Qi Zheng wrote:

> In the future, we will reparent LRU folios during memcg offline to
> eliminate dying memory cgroups, which requires reparenting the split queue
> to its parent.
>
> Similar to list_lru, the split queue is relatively independent and does
> not need to be reparented along with objcg and LRU folios (holding
> objcg lock and lru lock). So let's apply the same mechanism as list_lru
> to reparent the split queue separately when memcg is offine.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/huge_mm.h |  2 ++
>  include/linux/mmzone.h  |  1 +
>  mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c         |  1 +
>  mm/mm_init.c            |  1 +
>  5 files changed, 44 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f327d62fc9852..a0d4b751974d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>  	return split_huge_page_to_list_to_order(page, NULL, ret);
>  }
>  void deferred_split_folio(struct folio *folio, bool partially_mapped);
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long address, bool freeze);
> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>  }
>
>  static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c57250..f3eb81fee056a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1346,6 +1346,7 @@ struct deferred_split {
>  	spinlock_t split_queue_lock;
>  	struct list_head split_queue;
>  	unsigned long split_queue_len;
> +	bool is_dying;
>  };
>  #endif
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 48b51e6230a67..de7806f759cba 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>  	struct deferred_split *queue;
>
>  	memcg = folio_memcg(folio);
> +retry:
>  	queue = memcg ? &memcg->deferred_split_queue :
>  			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>  	spin_lock(&queue->split_queue_lock);
> +	if (unlikely(queue->is_dying == true)) {
> +		spin_unlock(&queue->split_queue_lock);
> +		memcg = parent_mem_cgroup(memcg);
> +		goto retry;
> +	}
>
>  	return queue;
>  }
> @@ -1108,9 +1114,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>  	struct deferred_split *queue;
>
>  	memcg = folio_memcg(folio);
> +retry:
>  	queue = memcg ? &memcg->deferred_split_queue :
>  			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>  	spin_lock_irqsave(&queue->split_queue_lock, *flags);
> +	if (unlikely(queue->is_dying == true)) {
> +		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
> +		memcg = parent_mem_cgroup(memcg);
> +		goto retry;
> +	}
>
>  	return queue;
>  }
> @@ -4284,6 +4296,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	return split;
>  }
>
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> +	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
> +	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
> +	int nid;
> +
> +	spin_lock_irq(&ds_queue->split_queue_lock);
> +	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
> +
> +	if (!ds_queue->split_queue_len)
> +		goto unlock;

Should ds_queue still be marked as dying even if it is empty?
Otherwise, new folios still can be added to it, based on my
understanding of the changes to folio_split_queue_lock*().

> +
> +	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
> +	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
> +	ds_queue->split_queue_len = 0;
> +	/* Mark the ds_queue dead */
> +	ds_queue->is_dying = true;
> +
> +	for_each_node(nid)
> +		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
> +
> +unlock:
> +	spin_unlock(&parent_ds_queue->split_queue_lock);
> +	spin_unlock_irq(&ds_queue->split_queue_lock);
> +}
> +
>  #ifdef CONFIG_DEBUG_FS
>  static void split_huge_pages_all(void)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e090f29eb03bd..d03da72e7585d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	zswap_memcg_offline_cleanup(memcg);
>
>  	memcg_offline_kmem(memcg);
> +	reparent_deferred_split_queue(memcg);
>  	reparent_shrinker_deferred(memcg);
>  	wb_memcg_offline(memcg);
>  	lru_gen_offline_memcg(memcg);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 3db2dea7db4c5..cbda5c2ee3241 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1387,6 +1387,7 @@ static void pgdat_init_split_queue(struct pglist_data *pgdat)
>  	spin_lock_init(&ds_queue->split_queue_lock);
>  	INIT_LIST_HEAD(&ds_queue->split_queue);
>  	ds_queue->split_queue_len = 0;
> +	ds_queue->is_dying = false;
>  }
>  #else
>  static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
> -- 
> 2.20.1


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged()
  2025-09-23  9:16 ` [PATCH v2 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged() Qi Zheng
@ 2025-09-24  9:10   ` Roman Gushchin
  0 siblings, 0 replies; 25+ messages in thread
From: Roman Gushchin @ 2025-09-24  9:10 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups, Muchun Song

Qi Zheng <zhengqi.arch@bytedance.com> writes:

> From: Muchun Song <songmuchun@bytedance.com>
>
> folio_memcg_charged() is intended for use when the user is unconcerned
> about the returned memcg pointer. It is more efficient than folio_memcg().
> Therefore, replace folio_memcg() with folio_memcg_charged().
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
  2025-09-23 15:44   ` Zi Yan
@ 2025-09-24  9:23   ` Roman Gushchin
  2025-09-24 10:06     ` Qi Zheng
  2025-09-24 12:38   ` David Hildenbrand
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Roman Gushchin @ 2025-09-24  9:23 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups

Qi Zheng <zhengqi.arch@bytedance.com> writes:

> In the future, we will reparent LRU folios during memcg offline to
> eliminate dying memory cgroups, which requires reparenting the split queue
> to its parent.

Nit: commit logs should really focus on the actual change, not the future
plans.

>
> Similar to list_lru, the split queue is relatively independent and does
> not need to be reparented along with objcg and LRU folios (holding
> objcg lock and lru lock). So let's apply the same mechanism as list_lru
> to reparent the split queue separately when memcg is offine.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/huge_mm.h |  2 ++
>  include/linux/mmzone.h  |  1 +
>  mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c         |  1 +
>  mm/mm_init.c            |  1 +
>  5 files changed, 44 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f327d62fc9852..a0d4b751974d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>  	return split_huge_page_to_list_to_order(page, NULL, ret);
>  }
>  void deferred_split_folio(struct folio *folio, bool partially_mapped);
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long address, bool freeze);
> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>  }
>  
>  static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
>  
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c57250..f3eb81fee056a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1346,6 +1346,7 @@ struct deferred_split {
>  	spinlock_t split_queue_lock;
>  	struct list_head split_queue;
>  	unsigned long split_queue_len;
> +	bool is_dying;
>  };
>  #endif
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 48b51e6230a67..de7806f759cba 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>  	struct deferred_split *queue;
>  
>  	memcg = folio_memcg(folio);
> +retry:
>  	queue = memcg ? &memcg->deferred_split_queue :
>  			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>  	spin_lock(&queue->split_queue_lock);
> +	if (unlikely(queue->is_dying == true)) {
> +		spin_unlock(&queue->split_queue_lock);
> +		memcg = parent_mem_cgroup(memcg);
> +		goto retry;
> +	}
>  
>  	return queue;
>  }
> @@ -1108,9 +1114,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>  	struct deferred_split *queue;
>  
>  	memcg = folio_memcg(folio);
> +retry:
>  	queue = memcg ? &memcg->deferred_split_queue :
>  			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>  	spin_lock_irqsave(&queue->split_queue_lock, *flags);
> +	if (unlikely(queue->is_dying == true)) {
> +		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
> +		memcg = parent_mem_cgroup(memcg);
> +		goto retry;
> +	}
>  
>  	return queue;
>  }
> @@ -4284,6 +4296,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	return split;
>  }
>  
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> +	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
> +	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
> +	int nid;
> +
> +	spin_lock_irq(&ds_queue->split_queue_lock);
> +	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
> +
> +	if (!ds_queue->split_queue_len)
> +		goto unlock;
> +
> +	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
> +	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
> +	ds_queue->split_queue_len = 0;
> +	/* Mark the ds_queue dead */
> +	ds_queue->is_dying = true;
> +
> +	for_each_node(nid)
> +		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));

Does this loop need to be under locks?

> +
> +unlock:
> +	spin_unlock(&parent_ds_queue->split_queue_lock);
> +	spin_unlock_irq(&ds_queue->split_queue_lock);
> +}
> +
>  #ifdef CONFIG_DEBUG_FS
>  static void split_huge_pages_all(void)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e090f29eb03bd..d03da72e7585d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	zswap_memcg_offline_cleanup(memcg);
>  
>  	memcg_offline_kmem(memcg);
> +	reparent_deferred_split_queue(memcg);
>  	reparent_shrinker_deferred(memcg);

I guess the naming can be a bit more consistent here :)

Thanks!


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
  2025-09-23 15:31   ` Zi Yan
@ 2025-09-24  9:57     ` Qi Zheng
  2025-09-24 14:57       ` Zi Yan
  0 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2025-09-24  9:57 UTC (permalink / raw)
  To: Zi Yan
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups, Muchun Song

Hi Zi,

On 9/23/25 11:31 PM, Zi Yan wrote:
> On 23 Sep 2025, at 5:16, Qi Zheng wrote:
> 
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> The maintenance of the folio->_deferred_list is intricate because it's
>> reused in a local list.
>>
>> Here are some peculiarities:
>>
>>     1) When a folio is removed from its split queue and added to a local
>>        on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>>        updated, leading to an inconsistency between it and the actual
>>        number of folios in the split queue.
>>
>>     2) When the folio is split via split_folio() later, it's removed from
>>        the local list while holding the split queue lock. At this time,
>>        this lock protects the local list, not the split queue.
>>
>>     3) To handle the race condition with a third-party freeing or migrating
>>        the preceding folio, we must ensure there's always one safe (with
>>        raised refcount) folio before by delaying its folio_put(). More
>>        details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
>>        split queue not partially_mapped"). It's rather tricky.
>>
>> We can use the folio_batch infrastructure to handle this clearly. In this
> 
> Can you add more details on how folio_batch handles the above three concerns
> in the original code? That would guide me what to look for during code review.

Sure.

For 1), after adding folio to folio_batch, we immediatelly decrement the
ds_queue->split_queue_len, so there are no more inconsistencies.

For 2), after adding folio to folio_batch, we will see list_empty() in
__folio_split(), so there is no longer a situation where
split_queue_lock protects the local list.

For 3), after adding folio to folio_batch, we call folios_put() at the
end to decrement the refcount of folios, which looks more natural.

> 
>> case, ->split_queue_len will be consistent with the real number of folios
>> in the split queue. If list_empty(&folio->_deferred_list) returns false,
>> it's clear the folio must be in its split queue (not in a local list
>> anymore).
>>
>> In the future, we will reparent LRU folios during memcg offline to
>> eliminate dying memory cgroups, which requires reparenting the split queue
>> to its parent first. So this patch prepares for using
>> folio_split_queue_lock_irqsave() as the memcg may change then.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/huge_memory.c | 84 ++++++++++++++++++++++--------------------------
>>   1 file changed, 38 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2f41b8f0d4871..48b51e6230a67 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3781,21 +3781,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>   		struct lruvec *lruvec;
>>   		int expected_refs;
>>
>> -		if (folio_order(folio) > 1 &&
>> -		    !list_empty(&folio->_deferred_list)) {
>> -			ds_queue->split_queue_len--;
>> +		if (folio_order(folio) > 1) {
>> +			if (!list_empty(&folio->_deferred_list)) {
>> +				ds_queue->split_queue_len--;
>> +				/*
>> +				 * Reinitialize page_deferred_list after removing the
>> +				 * page from the split_queue, otherwise a subsequent
>> +				 * split will see list corruption when checking the
>> +				 * page_deferred_list.
>> +				 */
>> +				list_del_init(&folio->_deferred_list);
>> +			}
>>   			if (folio_test_partially_mapped(folio)) {
>>   				folio_clear_partially_mapped(folio);
>>   				mod_mthp_stat(folio_order(folio),
>>   					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>>   			}
> 
> folio_test_partially_mapped() is done regardless the folio is on _deferred_list
> or not, is it because the folio can be on a folio batch and its _deferred_list
> is empty?

Yes.

> 
>> -			/*
>> -			 * Reinitialize page_deferred_list after removing the
>> -			 * page from the split_queue, otherwise a subsequent
>> -			 * split will see list corruption when checking the
>> -			 * page_deferred_list.
>> -			 */
>> -			list_del_init(&folio->_deferred_list);
>>   		}
>>   		split_queue_unlock(ds_queue);
>>   		if (mapping) {
>> @@ -4194,40 +4195,44 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>   	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>   	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>   	unsigned long flags;
>> -	LIST_HEAD(list);
>> -	struct folio *folio, *next, *prev = NULL;
>> -	int split = 0, removed = 0;
>> +	struct folio *folio, *next;
>> +	int split = 0, i;
>> +	struct folio_batch fbatch;
>>
>>   #ifdef CONFIG_MEMCG
>>   	if (sc->memcg)
>>   		ds_queue = &sc->memcg->deferred_split_queue;
>>   #endif
>>
>> +	folio_batch_init(&fbatch);
>> +retry:
>>   	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>   	/* Take pin on all head pages to avoid freeing them under us */
>>   	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>   							_deferred_list) {
>>   		if (folio_try_get(folio)) {
>> -			list_move(&folio->_deferred_list, &list);
>> -		} else {
>> +			folio_batch_add(&fbatch, folio);
>> +		} else if (folio_test_partially_mapped(folio)) {
>>   			/* We lost race with folio_put() */
>> -			if (folio_test_partially_mapped(folio)) {
>> -				folio_clear_partially_mapped(folio);
>> -				mod_mthp_stat(folio_order(folio),
>> -					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>> -			}
>> -			list_del_init(&folio->_deferred_list);
>> -			ds_queue->split_queue_len--;
>> +			folio_clear_partially_mapped(folio);
>> +			mod_mthp_stat(folio_order(folio),
>> +				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>>   		}
>> +		list_del_init(&folio->_deferred_list);
>> +		ds_queue->split_queue_len--;
> 
> At this point, the folio can be following conditions:
> 1. deferred_split_scan() gets it,
> 2. it is freed by folio_put().
> 
> In both cases, it is removed from deferred_split_queue, right?

Right. For the case 1), we may add folio back to deferred_split_queue.

> 
>>   		if (!--sc->nr_to_scan)
>>   			break;
>> +		if (!folio_batch_space(&fbatch))
>> +			break;
>>   	}
>>   	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>
>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>> +	for (i = 0; i < folio_batch_count(&fbatch); i++) {
>>   		bool did_split = false;
>>   		bool underused = false;
>> +		struct deferred_split *fqueue;
>>
>> +		folio = fbatch.folios[i];
>>   		if (!folio_test_partially_mapped(folio)) {
>>   			/*
>>   			 * See try_to_map_unused_to_zeropage(): we cannot
>> @@ -4250,38 +4255,25 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>   		}
>>   		folio_unlock(folio);
>>   next:
>> +		if (did_split || !folio_test_partially_mapped(folio))
>> +			continue;
>>   		/*
>> -		 * split_folio() removes folio from list on success.
>>   		 * Only add back to the queue if folio is partially mapped.
>>   		 * If thp_underused returns false, or if split_folio fails
>>   		 * in the case it was underused, then consider it used and
>>   		 * don't add it back to split_queue.
>>   		 */
>> -		if (did_split) {
>> -			; /* folio already removed from list */
>> -		} else if (!folio_test_partially_mapped(folio)) {
>> -			list_del_init(&folio->_deferred_list);
>> -			removed++;
>> -		} else {
>> -			/*
>> -			 * That unlocked list_del_init() above would be unsafe,
>> -			 * unless its folio is separated from any earlier folios
>> -			 * left on the list (which may be concurrently unqueued)
>> -			 * by one safe folio with refcount still raised.
>> -			 */
>> -			swap(folio, prev);
>> +		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
>> +		if (list_empty(&folio->_deferred_list)) {
>> +			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
>> +			fqueue->split_queue_len++;
> 
> fqueue should be the same as ds_queue, right? Just want to make sure
> I understand the code.

After patch #4, fqueue may be the deferred_split of parent memcg.

Thanks,
Qi

> 
>>   		}
>> -		if (folio)
>> -			folio_put(folio);
>> +		split_queue_unlock_irqrestore(fqueue, flags);
>>   	}
>> +	folios_put(&fbatch);
>>
>> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> -	list_splice_tail(&list, &ds_queue->split_queue);
>> -	ds_queue->split_queue_len -= removed;
>> -	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> -
>> -	if (prev)
>> -		folio_put(prev);
>> +	if (sc->nr_to_scan)
>> +		goto retry;
>>
>>   	/*
>>   	 * Stop shrinker if we didn't split any page, but the queue is empty.
>> -- 
>> 2.20.1
> 
> 
> Best Regards,
> Yan, Zi



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-23 15:44   ` Zi Yan
@ 2025-09-24  9:58     ` Qi Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2025-09-24  9:58 UTC (permalink / raw)
  To: Zi Yan
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups



On 9/23/25 11:44 PM, Zi Yan wrote:
> On 23 Sep 2025, at 5:16, Qi Zheng wrote:
> 
>> In the future, we will reparent LRU folios during memcg offline to
>> eliminate dying memory cgroups, which requires reparenting the split queue
>> to its parent.
>>
>> Similar to list_lru, the split queue is relatively independent and does
>> not need to be reparented along with objcg and LRU folios (holding
>> objcg lock and lru lock). So let's apply the same mechanism as list_lru
>> to reparent the split queue separately when memcg is offine.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/huge_mm.h |  2 ++
>>   include/linux/mmzone.h  |  1 +
>>   mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>>   mm/memcontrol.c         |  1 +
>>   mm/mm_init.c            |  1 +
>>   5 files changed, 44 insertions(+)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index f327d62fc9852..a0d4b751974d2 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>>   	return split_huge_page_to_list_to_order(page, NULL, ret);
>>   }
>>   void deferred_split_folio(struct folio *folio, bool partially_mapped);
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>>
>>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>   		unsigned long address, bool freeze);
>> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>>   }
>>
>>   static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>>   #define split_huge_pmd(__vma, __pmd, __address)	\
>>   	do { } while (0)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 7fb7331c57250..f3eb81fee056a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1346,6 +1346,7 @@ struct deferred_split {
>>   	spinlock_t split_queue_lock;
>>   	struct list_head split_queue;
>>   	unsigned long split_queue_len;
>> +	bool is_dying;
>>   };
>>   #endif
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 48b51e6230a67..de7806f759cba 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>   	struct deferred_split *queue;
>>
>>   	memcg = folio_memcg(folio);
>> +retry:
>>   	queue = memcg ? &memcg->deferred_split_queue :
>>   			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>   	spin_lock(&queue->split_queue_lock);
>> +	if (unlikely(queue->is_dying == true)) {
>> +		spin_unlock(&queue->split_queue_lock);
>> +		memcg = parent_mem_cgroup(memcg);
>> +		goto retry;
>> +	}
>>
>>   	return queue;
>>   }
>> @@ -1108,9 +1114,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>   	struct deferred_split *queue;
>>
>>   	memcg = folio_memcg(folio);
>> +retry:
>>   	queue = memcg ? &memcg->deferred_split_queue :
>>   			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>   	spin_lock_irqsave(&queue->split_queue_lock, *flags);
>> +	if (unlikely(queue->is_dying == true)) {
>> +		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
>> +		memcg = parent_mem_cgroup(memcg);
>> +		goto retry;
>> +	}
>>
>>   	return queue;
>>   }
>> @@ -4284,6 +4296,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>   	return split;
>>   }
>>
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
>> +{
>> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>> +	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
>> +	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
>> +	int nid;
>> +
>> +	spin_lock_irq(&ds_queue->split_queue_lock);
>> +	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
>> +
>> +	if (!ds_queue->split_queue_len)
>> +		goto unlock;
> 
> Should ds_queue still be marked as dying even if it is empty?
> Otherwise, new folios still can be added to it, based on my
> understanding of the changes to folio_split_queue_lock*().

I think you are right, will do in the next version.

Thanks,
Qi

> 
>> +
>> +	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
>> +	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
>> +	ds_queue->split_queue_len = 0;
>> +	/* Mark the ds_queue dead */
>> +	ds_queue->is_dying = true;
>> +
>> +	for_each_node(nid)
>> +		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
>> +
>> +unlock:
>> +	spin_unlock(&parent_ds_queue->split_queue_lock);
>> +	spin_unlock_irq(&ds_queue->split_queue_lock);
>> +}
>> +
>>   #ifdef CONFIG_DEBUG_FS
>>   static void split_huge_pages_all(void)
>>   {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e090f29eb03bd..d03da72e7585d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>>   	zswap_memcg_offline_cleanup(memcg);
>>
>>   	memcg_offline_kmem(memcg);
>> +	reparent_deferred_split_queue(memcg);
>>   	reparent_shrinker_deferred(memcg);
>>   	wb_memcg_offline(memcg);
>>   	lru_gen_offline_memcg(memcg);
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index 3db2dea7db4c5..cbda5c2ee3241 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -1387,6 +1387,7 @@ static void pgdat_init_split_queue(struct pglist_data *pgdat)
>>   	spin_lock_init(&ds_queue->split_queue_lock);
>>   	INIT_LIST_HEAD(&ds_queue->split_queue);
>>   	ds_queue->split_queue_len = 0;
>> +	ds_queue->is_dying = false;
>>   }
>>   #else
>>   static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
>> -- 
>> 2.20.1
> 
> 
> Best Regards,
> Yan, Zi



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-24  9:23   ` Roman Gushchin
@ 2025-09-24 10:06     ` Qi Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2025-09-24 10:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, hughd, mhocko, shakeel.butt, muchun.song, david,
	lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups

Hi Roman,

On 9/24/25 5:23 PM, Roman Gushchin wrote:
> Qi Zheng <zhengqi.arch@bytedance.com> writes:
> 
>> In the future, we will reparent LRU folios during memcg offline to
>> eliminate dying memory cgroups, which requires reparenting the split queue
>> to its parent.
> 
> Nit: commit logs should really focus on the actual change, not the future
> plans.

Got it.

> 
>>
>> Similar to list_lru, the split queue is relatively independent and does
>> not need to be reparented along with objcg and LRU folios (holding
>> objcg lock and lru lock). So let's apply the same mechanism as list_lru
>> to reparent the split queue separately when memcg is offine.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/huge_mm.h |  2 ++
>>   include/linux/mmzone.h  |  1 +
>>   mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>>   mm/memcontrol.c         |  1 +
>>   mm/mm_init.c            |  1 +
>>   5 files changed, 44 insertions(+)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index f327d62fc9852..a0d4b751974d2 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>>   	return split_huge_page_to_list_to_order(page, NULL, ret);
>>   }
>>   void deferred_split_folio(struct folio *folio, bool partially_mapped);
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>>   
>>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>   		unsigned long address, bool freeze);
>> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>>   }
>>   
>>   static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>>   #define split_huge_pmd(__vma, __pmd, __address)	\
>>   	do { } while (0)
>>   
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 7fb7331c57250..f3eb81fee056a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1346,6 +1346,7 @@ struct deferred_split {
>>   	spinlock_t split_queue_lock;
>>   	struct list_head split_queue;
>>   	unsigned long split_queue_len;
>> +	bool is_dying;
>>   };
>>   #endif
>>   
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 48b51e6230a67..de7806f759cba 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>   	struct deferred_split *queue;
>>   
>>   	memcg = folio_memcg(folio);
>> +retry:
>>   	queue = memcg ? &memcg->deferred_split_queue :
>>   			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>   	spin_lock(&queue->split_queue_lock);
>> +	if (unlikely(queue->is_dying == true)) {
>> +		spin_unlock(&queue->split_queue_lock);
>> +		memcg = parent_mem_cgroup(memcg);
>> +		goto retry;
>> +	}
>>   
>>   	return queue;
>>   }
>> @@ -1108,9 +1114,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>   	struct deferred_split *queue;
>>   
>>   	memcg = folio_memcg(folio);
>> +retry:
>>   	queue = memcg ? &memcg->deferred_split_queue :
>>   			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>   	spin_lock_irqsave(&queue->split_queue_lock, *flags);
>> +	if (unlikely(queue->is_dying == true)) {
>> +		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
>> +		memcg = parent_mem_cgroup(memcg);
>> +		goto retry;
>> +	}
>>   
>>   	return queue;
>>   }
>> @@ -4284,6 +4296,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>   	return split;
>>   }
>>   
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
>> +{
>> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>> +	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
>> +	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
>> +	int nid;
>> +
>> +	spin_lock_irq(&ds_queue->split_queue_lock);
>> +	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
>> +
>> +	if (!ds_queue->split_queue_len)
>> +		goto unlock;
>> +
>> +	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
>> +	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
>> +	ds_queue->split_queue_len = 0;
>> +	/* Mark the ds_queue dead */
>> +	ds_queue->is_dying = true;
>> +
>> +	for_each_node(nid)
>> +		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
> 
> Does this loop need to be under locks?

I think it is not necessary, but the loop overhead should not be high.

> 
>> +
>> +unlock:
>> +	spin_unlock(&parent_ds_queue->split_queue_lock);
>> +	spin_unlock_irq(&ds_queue->split_queue_lock);
>> +}
>> +
>>   #ifdef CONFIG_DEBUG_FS
>>   static void split_huge_pages_all(void)
>>   {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e090f29eb03bd..d03da72e7585d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>>   	zswap_memcg_offline_cleanup(memcg);
>>   
>>   	memcg_offline_kmem(memcg);
>> +	reparent_deferred_split_queue(memcg);
>>   	reparent_shrinker_deferred(memcg);
> 
> I guess the naming can be a bit more consistent here :)

Do you mean to change them all to:

memcg_offline_xxx()

or

reparent_xxx() ?

Thanks,
Qi

> 
> Thanks!



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
  2025-09-23  9:16 ` [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Qi Zheng
  2025-09-23 15:31   ` Zi Yan
@ 2025-09-24 12:31   ` David Hildenbrand
  1 sibling, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2025-09-24 12:31 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song

On 23.09.25 11:16, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> The maintenance of the folio->_deferred_list is intricate because it's
> reused in a local list.
> 
> Here are some peculiarities:
> 
>     1) When a folio is removed from its split queue and added to a local
>        on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>        updated, leading to an inconsistency between it and the actual
>        number of folios in the split queue.
> 
>     2) When the folio is split via split_folio() later, it's removed from
>        the local list while holding the split queue lock. At this time,
>        this lock protects the local list, not the split queue.
> 
>     3) To handle the race condition with a third-party freeing or migrating
>        the preceding folio, we must ensure there's always one safe (with
>        raised refcount) folio before by delaying its folio_put(). More
>        details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
>        split queue not partially_mapped"). It's rather tricky.
> 
> We can use the folio_batch infrastructure to handle this clearly. In this
> case, ->split_queue_len will be consistent with the real number of folios
> in the split queue. If list_empty(&folio->_deferred_list) returns false,
> it's clear the folio must be in its split queue (not in a local list
> anymore).
> 
> In the future, we will reparent LRU folios during memcg offline to
> eliminate dying memory cgroups, which requires reparenting the split queue
> to its parent first. So this patch prepares for using
> folio_split_queue_lock_irqsave() as the memcg may change then.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---

Nothing jumped at me

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
  2025-09-23 15:44   ` Zi Yan
  2025-09-24  9:23   ` Roman Gushchin
@ 2025-09-24 12:38   ` David Hildenbrand
  2025-09-25  6:11     ` Qi Zheng
  2025-09-24 13:49   ` kernel test robot
  2025-09-24 14:22   ` Harry Yoo
  4 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-09-24 12:38 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups

On 23.09.25 11:16, Qi Zheng wrote:
> In the future, we will reparent LRU folios during memcg offline to
> eliminate dying memory cgroups, which requires reparenting the split queue
> to its parent.
> 
> Similar to list_lru, the split queue is relatively independent and does
> not need to be reparented along with objcg and LRU folios (holding
> objcg lock and lru lock). So let's apply the same mechanism as list_lru
> to reparent the split queue separately when memcg is offine.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   include/linux/huge_mm.h |  2 ++
>   include/linux/mmzone.h  |  1 +
>   mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>   mm/memcontrol.c         |  1 +
>   mm/mm_init.c            |  1 +
>   5 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f327d62fc9852..a0d4b751974d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>   	return split_huge_page_to_list_to_order(page, NULL, ret);
>   }
>   void deferred_split_folio(struct folio *folio, bool partially_mapped);
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>   
>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   		unsigned long address, bool freeze);
> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>   }
>   
>   static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>   #define split_huge_pmd(__vma, __pmd, __address)	\
>   	do { } while (0)
>   
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c57250..f3eb81fee056a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1346,6 +1346,7 @@ struct deferred_split {
>   	spinlock_t split_queue_lock;
>   	struct list_head split_queue;
>   	unsigned long split_queue_len;
> +	bool is_dying;

It's a bit weird to query whether the "struct deferred_split" is dying. 
Shouldn't this be a memcg property? (and in particular, not exist for 
the pglist_data part where it might not make sense at all?).

>   };
>   #endif
>   
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 48b51e6230a67..de7806f759cba 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>   	struct deferred_split *queue;
>   
>   	memcg = folio_memcg(folio);
> +retry:
>   	queue = memcg ? &memcg->deferred_split_queue :
>   			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>   	spin_lock(&queue->split_queue_lock);
> +	if (unlikely(queue->is_dying == true)) {

if (unlikely(queue->is_dying))

> +		spin_unlock(&queue->split_queue_lock);
> +		memcg = parent_mem_cgroup(memcg);
> +		goto retry;
> +	}
>   
>   	return queue;
>   }
> @@ -1108,9 +1114,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>   	struct deferred_split *queue;
>   
>   	memcg = folio_memcg(folio);
> +retry:
>   	queue = memcg ? &memcg->deferred_split_queue :
>   			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>   	spin_lock_irqsave(&queue->split_queue_lock, *flags);
> +	if (unlikely(queue->is_dying == true)) {

if (unlikely(queue->is_dying))

> +		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
> +		memcg = parent_mem_cgroup(memcg);
> +		goto retry;
> +	}
>   
>   	return queue;
>   }

Nothing else jumped at me, but I am not a memcg expert :)

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
                     ` (2 preceding siblings ...)
  2025-09-24 12:38   ` David Hildenbrand
@ 2025-09-24 13:49   ` kernel test robot
  2025-09-24 14:22   ` Harry Yoo
  4 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-09-24 13:49 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: oe-kbuild-all, linux-mm, linux-kernel, cgroups, Qi Zheng

Hi Qi,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20250922]
[also build test ERROR on v6.17-rc7]
[cannot apply to akpm-mm/mm-everything rppt-memblock/for-next rppt-memblock/fixes linus/master v6.17-rc7 v6.17-rc6 v6.17-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-thp-replace-folio_memcg-with-folio_memcg_charged/20250923-171935
base:   next-20250922
patch link:    https://lore.kernel.org/r/55370bda7b2df617033ac12116c1712144bb7591.1758618527.git.zhengqi.arch%40bytedance.com
patch subject: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
config: riscv-randconfig-001-20250924 (https://download.01.org/0day-ci/archive/20250924/202509242123.QwwFy7gc-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250924/202509242123.QwwFy7gc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509242123.QwwFy7gc-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/huge_memory.c: In function 'reparent_deferred_split_queue':
>> mm/huge_memory.c:4302:42: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
     struct deferred_split *ds_queue = &memcg->deferred_split_queue;
                                             ^~

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for ARCH_HAS_ELF_CORE_EFLAGS
   Depends on [n]: BINFMT_ELF [=n] && ELF_CORE [=n]
   Selected by [y]:
   - RISCV [=y]


vim +4302 mm/huge_memory.c

  4298	
  4299	void reparent_deferred_split_queue(struct mem_cgroup *memcg)
  4300	{
  4301		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> 4302		struct deferred_split *ds_queue = &memcg->deferred_split_queue;
  4303		struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
  4304		int nid;
  4305	
  4306		spin_lock_irq(&ds_queue->split_queue_lock);
  4307		spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
  4308	
  4309		if (!ds_queue->split_queue_len)
  4310			goto unlock;
  4311	
  4312		list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
  4313		parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
  4314		ds_queue->split_queue_len = 0;
  4315		/* Mark the ds_queue dead */
  4316		ds_queue->is_dying = true;
  4317	
  4318		for_each_node(nid)
  4319			set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
  4320	
  4321	unlock:
  4322		spin_unlock(&parent_ds_queue->split_queue_lock);
  4323		spin_unlock_irq(&ds_queue->split_queue_lock);
  4324	}
  4325	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
                     ` (3 preceding siblings ...)
  2025-09-24 13:49   ` kernel test robot
@ 2025-09-24 14:22   ` Harry Yoo
  2025-09-25  6:29     ` Qi Zheng
  4 siblings, 1 reply; 25+ messages in thread
From: Harry Yoo @ 2025-09-24 14:22 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, akpm, linux-mm,
	linux-kernel, cgroups

On Tue, Sep 23, 2025 at 05:16:25PM +0800, Qi Zheng wrote:
> In the future, we will reparent LRU folios during memcg offline to
> eliminate dying memory cgroups, which requires reparenting the split queue
> to its parent.
> 
> Similar to list_lru, the split queue is relatively independent and does
> not need to be reparented along with objcg and LRU folios (holding
> objcg lock and lru lock). So let's apply the same mechanism as list_lru
> to reparent the split queue separately when memcg is offine.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/huge_mm.h |  2 ++
>  include/linux/mmzone.h  |  1 +
>  mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c         |  1 +
>  mm/mm_init.c            |  1 +
>  5 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f327d62fc9852..a0d4b751974d2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>  	return split_huge_page_to_list_to_order(page, NULL, ret);
>  }
>  void deferred_split_folio(struct folio *folio, bool partially_mapped);
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long address, bool freeze);
> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>  }
>  
>  static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
>  
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c57250..f3eb81fee056a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1346,6 +1346,7 @@ struct deferred_split {
>  	spinlock_t split_queue_lock;
>  	struct list_head split_queue;
>  	unsigned long split_queue_len;
> +	bool is_dying;
>  };
>  #endif

The scheme in Muchun's version was:

retry:
queue = folio_split_queue(folio);
spin_lock(&queue->split_queue_lock);
if (folio_memcg(folio) != folio_split_queue_memcg(folio, queue)) {
    /* split queue was reparented, retry */
    spin_unlock(&queue->split_queue_lock);
    goto retry;
}
/* now we have a stable mapping between the folio and the split queue */
spin_unlock(&queue->split_queue_lock);

Oh, I see. We can't use this scheme yet because we don't reparent LRU
folios. (I was wondering why we're adding is_dying property)

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 48b51e6230a67..de7806f759cba 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>  	struct deferred_split *queue;


For now it's safe to not call rcu_read_lock() here because memcgs won't
disappear under us as long as there are folios to split (we don't reparent
LRU folios), right?

>  	memcg = folio_memcg(folio);
> +retry:
>  	queue = memcg ? &memcg->deferred_split_queue :
>  			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>  	spin_lock(&queue->split_queue_lock);
> +	if (unlikely(queue->is_dying == true)) {
> +		spin_unlock(&queue->split_queue_lock);
> +		memcg = parent_mem_cgroup(memcg);
> +		goto retry;
> +	}
>  	return queue;
>  }

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
  2025-09-24  9:57     ` Qi Zheng
@ 2025-09-24 14:57       ` Zi Yan
  0 siblings, 0 replies; 25+ messages in thread
From: Zi Yan @ 2025-09-24 14:57 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups, Muchun Song

On 24 Sep 2025, at 5:57, Qi Zheng wrote:

> Hi Zi,
>
> On 9/23/25 11:31 PM, Zi Yan wrote:
>> On 23 Sep 2025, at 5:16, Qi Zheng wrote:
>>
>>> From: Muchun Song <songmuchun@bytedance.com>
>>>
>>> The maintenance of the folio->_deferred_list is intricate because it's
>>> reused in a local list.
>>>
>>> Here are some peculiarities:
>>>
>>>     1) When a folio is removed from its split queue and added to a local
>>>        on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>>>        updated, leading to an inconsistency between it and the actual
>>>        number of folios in the split queue.
>>>
>>>     2) When the folio is split via split_folio() later, it's removed from
>>>        the local list while holding the split queue lock. At this time,
>>>        this lock protects the local list, not the split queue.
>>>
>>>     3) To handle the race condition with a third-party freeing or migrating
>>>        the preceding folio, we must ensure there's always one safe (with
>>>        raised refcount) folio before by delaying its folio_put(). More
>>>        details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
>>>        split queue not partially_mapped"). It's rather tricky.
>>>
>>> We can use the folio_batch infrastructure to handle this clearly. In this
>>
>> Can you add more details on how folio_batch handles the above three concerns
>> in the original code? That would guide me what to look for during code review.
>
> Sure.
>
> For 1), after adding folio to folio_batch, we immediatelly decrement the
> ds_queue->split_queue_len, so there are no more inconsistencies.
>
> For 2), after adding folio to folio_batch, we will see list_empty() in
> __folio_split(), so there is no longer a situation where
> split_queue_lock protects the local list.
>
> For 3), after adding folio to folio_batch, we call folios_put() at the
> end to decrement the refcount of folios, which looks more natural.
>
>>
>>> case, ->split_queue_len will be consistent with the real number of folios
>>> in the split queue. If list_empty(&folio->_deferred_list) returns false,
>>> it's clear the folio must be in its split queue (not in a local list
>>> anymore).
>>>
>>> In the future, we will reparent LRU folios during memcg offline to
>>> eliminate dying memory cgroups, which requires reparenting the split queue
>>> to its parent first. So this patch prepares for using
>>> folio_split_queue_lock_irqsave() as the memcg may change then.
>>>
>>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>   mm/huge_memory.c | 84 ++++++++++++++++++++++--------------------------
>>>   1 file changed, 38 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 2f41b8f0d4871..48b51e6230a67 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3781,21 +3781,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>   		struct lruvec *lruvec;
>>>   		int expected_refs;
>>>
>>> -		if (folio_order(folio) > 1 &&
>>> -		    !list_empty(&folio->_deferred_list)) {
>>> -			ds_queue->split_queue_len--;
>>> +		if (folio_order(folio) > 1) {
>>> +			if (!list_empty(&folio->_deferred_list)) {
>>> +				ds_queue->split_queue_len--;
>>> +				/*
>>> +				 * Reinitialize page_deferred_list after removing the
>>> +				 * page from the split_queue, otherwise a subsequent
>>> +				 * split will see list corruption when checking the
>>> +				 * page_deferred_list.
>>> +				 */
>>> +				list_del_init(&folio->_deferred_list);
>>> +			}
>>>   			if (folio_test_partially_mapped(folio)) {
>>>   				folio_clear_partially_mapped(folio);
>>>   				mod_mthp_stat(folio_order(folio),
>>>   					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>>>   			}
>>
>> folio_test_partially_mapped() is done regardless the folio is on _deferred_list
>> or not, is it because the folio can be on a folio batch and its _deferred_list
>> is empty?
>
> Yes.
>
>>
>>> -			/*
>>> -			 * Reinitialize page_deferred_list after removing the
>>> -			 * page from the split_queue, otherwise a subsequent
>>> -			 * split will see list corruption when checking the
>>> -			 * page_deferred_list.
>>> -			 */
>>> -			list_del_init(&folio->_deferred_list);
>>>   		}
>>>   		split_queue_unlock(ds_queue);
>>>   		if (mapping) {
>>> @@ -4194,40 +4195,44 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>   	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>>>   	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>>>   	unsigned long flags;
>>> -	LIST_HEAD(list);
>>> -	struct folio *folio, *next, *prev = NULL;
>>> -	int split = 0, removed = 0;
>>> +	struct folio *folio, *next;
>>> +	int split = 0, i;
>>> +	struct folio_batch fbatch;
>>>
>>>   #ifdef CONFIG_MEMCG
>>>   	if (sc->memcg)
>>>   		ds_queue = &sc->memcg->deferred_split_queue;
>>>   #endif
>>>
>>> +	folio_batch_init(&fbatch);
>>> +retry:
>>>   	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>   	/* Take pin on all head pages to avoid freeing them under us */
>>>   	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>>   							_deferred_list) {
>>>   		if (folio_try_get(folio)) {
>>> -			list_move(&folio->_deferred_list, &list);
>>> -		} else {
>>> +			folio_batch_add(&fbatch, folio);
>>> +		} else if (folio_test_partially_mapped(folio)) {
>>>   			/* We lost race with folio_put() */
>>> -			if (folio_test_partially_mapped(folio)) {
>>> -				folio_clear_partially_mapped(folio);
>>> -				mod_mthp_stat(folio_order(folio),
>>> -					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>>> -			}
>>> -			list_del_init(&folio->_deferred_list);
>>> -			ds_queue->split_queue_len--;
>>> +			folio_clear_partially_mapped(folio);
>>> +			mod_mthp_stat(folio_order(folio),
>>> +				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>>>   		}
>>> +		list_del_init(&folio->_deferred_list);
>>> +		ds_queue->split_queue_len--;
>>
>> At this point, the folio can be following conditions:
>> 1. deferred_split_scan() gets it,
>> 2. it is freed by folio_put().
>>
>> In both cases, it is removed from deferred_split_queue, right?
>
> Right. For the case 1), we may add folio back to deferred_split_queue.
>
>>
>>>   		if (!--sc->nr_to_scan)
>>>   			break;
>>> +		if (!folio_batch_space(&fbatch))
>>> +			break;
>>>   	}
>>>   	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>
>>> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>>> +	for (i = 0; i < folio_batch_count(&fbatch); i++) {
>>>   		bool did_split = false;
>>>   		bool underused = false;
>>> +		struct deferred_split *fqueue;
>>>
>>> +		folio = fbatch.folios[i];
>>>   		if (!folio_test_partially_mapped(folio)) {
>>>   			/*
>>>   			 * See try_to_map_unused_to_zeropage(): we cannot
>>> @@ -4250,38 +4255,25 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>   		}
>>>   		folio_unlock(folio);
>>>   next:
>>> +		if (did_split || !folio_test_partially_mapped(folio))
>>> +			continue;
>>>   		/*
>>> -		 * split_folio() removes folio from list on success.
>>>   		 * Only add back to the queue if folio is partially mapped.
>>>   		 * If thp_underused returns false, or if split_folio fails
>>>   		 * in the case it was underused, then consider it used and
>>>   		 * don't add it back to split_queue.
>>>   		 */
>>> -		if (did_split) {
>>> -			; /* folio already removed from list */
>>> -		} else if (!folio_test_partially_mapped(folio)) {
>>> -			list_del_init(&folio->_deferred_list);
>>> -			removed++;
>>> -		} else {
>>> -			/*
>>> -			 * That unlocked list_del_init() above would be unsafe,
>>> -			 * unless its folio is separated from any earlier folios
>>> -			 * left on the list (which may be concurrently unqueued)
>>> -			 * by one safe folio with refcount still raised.
>>> -			 */
>>> -			swap(folio, prev);
>>> +		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
>>> +		if (list_empty(&folio->_deferred_list)) {
>>> +			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
>>> +			fqueue->split_queue_len++;
>>
>> fqueue should be the same as ds_queue, right? Just want to make sure
>> I understand the code.
>
> After patch #4, fqueue may be the deferred_split of parent memcg.
Thank you for the explanation. The changes look good to me.

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-24 12:38   ` David Hildenbrand
@ 2025-09-25  6:11     ` Qi Zheng
  2025-09-25 19:35       ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2025-09-25  6:11 UTC (permalink / raw)
  To: David Hildenbrand, hannes, hughd, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, lorenzo.stoakes, ziy, harry.yoo,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, akpm
  Cc: linux-mm, linux-kernel, cgroups

Hi David,

On 9/24/25 8:38 PM, David Hildenbrand wrote:
> On 23.09.25 11:16, Qi Zheng wrote:
>> In the future, we will reparent LRU folios during memcg offline to
>> eliminate dying memory cgroups, which requires reparenting the split 
>> queue
>> to its parent.
>>
>> Similar to list_lru, the split queue is relatively independent and does
>> not need to be reparented along with objcg and LRU folios (holding
>> objcg lock and lru lock). So let's apply the same mechanism as list_lru
>> to reparent the split queue separately when memcg is offine.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/huge_mm.h |  2 ++
>>   include/linux/mmzone.h  |  1 +
>>   mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>>   mm/memcontrol.c         |  1 +
>>   mm/mm_init.c            |  1 +
>>   5 files changed, 44 insertions(+)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index f327d62fc9852..a0d4b751974d2 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>>       return split_huge_page_to_list_to_order(page, NULL, ret);
>>   }
>>   void deferred_split_folio(struct folio *folio, bool partially_mapped);
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>           unsigned long address, bool freeze);
>> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio 
>> *folio, struct page *page,
>>   }
>>   static inline void deferred_split_folio(struct folio *folio, bool 
>> partially_mapped) {}
>> +static inline void reparent_deferred_split_queue(struct mem_cgroup 
>> *memcg) {}
>>   #define split_huge_pmd(__vma, __pmd, __address)    \
>>       do { } while (0)
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 7fb7331c57250..f3eb81fee056a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1346,6 +1346,7 @@ struct deferred_split {
>>       spinlock_t split_queue_lock;
>>       struct list_head split_queue;
>>       unsigned long split_queue_len;
>> +    bool is_dying;
> 
> It's a bit weird to query whether the "struct deferred_split" is dying. 
> Shouldn't this be a memcg property? (and in particular, not exist for 

There is indeed a CSS_DYING flag. But we must modify 'is_dying' under
the protection of the split_queue_lock, otherwise the folio may be added
back to the deferred_split of child memcg.

> the pglist_data part where it might not make sense at all?).

Maybe:

#ifdef CONFIG_MEMCG
     bool is_dying;
#endif

> 
>>   };
>>   #endif
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 48b51e6230a67..de7806f759cba 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1094,9 +1094,15 @@ static struct deferred_split 
>> *folio_split_queue_lock(struct folio *folio)
>>       struct deferred_split *queue;
>>       memcg = folio_memcg(folio);
>> +retry:
>>       queue = memcg ? &memcg->deferred_split_queue :
>>               &NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>       spin_lock(&queue->split_queue_lock);
>> +    if (unlikely(queue->is_dying == true)) {
> 
> if (unlikely(queue->is_dying))

Will do.

> 
>> +        spin_unlock(&queue->split_queue_lock);
>> +        memcg = parent_mem_cgroup(memcg);
>> +        goto retry;
>> +    }
>>       return queue;
>>   }
>> @@ -1108,9 +1114,15 @@ folio_split_queue_lock_irqsave(struct folio 
>> *folio, unsigned long *flags)
>>       struct deferred_split *queue;
>>       memcg = folio_memcg(folio);
>> +retry:
>>       queue = memcg ? &memcg->deferred_split_queue :
>>               &NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>       spin_lock_irqsave(&queue->split_queue_lock, *flags);
>> +    if (unlikely(queue->is_dying == true)) {
> 
> if (unlikely(queue->is_dying))

Will do.

> 
>> +        spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
>> +        memcg = parent_mem_cgroup(memcg);
>> +        goto retry;
>> +    }
>>       return queue;
>>   }
> 
> Nothing else jumped at me, but I am not a memcg expert :)

Thanks,
Qi

> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-24 14:22   ` Harry Yoo
@ 2025-09-25  6:29     ` Qi Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2025-09-25  6:29 UTC (permalink / raw)
  To: Harry Yoo
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, akpm, linux-mm,
	linux-kernel, cgroups

Hi Harry,

On 9/24/25 10:22 PM, Harry Yoo wrote:
> On Tue, Sep 23, 2025 at 05:16:25PM +0800, Qi Zheng wrote:
>> In the future, we will reparent LRU folios during memcg offline to
>> eliminate dying memory cgroups, which requires reparenting the split queue
>> to its parent.
>>
>> Similar to list_lru, the split queue is relatively independent and does
>> not need to be reparented along with objcg and LRU folios (holding
>> objcg lock and lru lock). So let's apply the same mechanism as list_lru
>> to reparent the split queue separately when memcg is offine.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/huge_mm.h |  2 ++
>>   include/linux/mmzone.h  |  1 +
>>   mm/huge_memory.c        | 39 +++++++++++++++++++++++++++++++++++++++
>>   mm/memcontrol.c         |  1 +
>>   mm/mm_init.c            |  1 +
>>   5 files changed, 44 insertions(+)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index f327d62fc9852..a0d4b751974d2 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -417,6 +417,7 @@ static inline int split_huge_page(struct page *page)
>>   	return split_huge_page_to_list_to_order(page, NULL, ret);
>>   }
>>   void deferred_split_folio(struct folio *folio, bool partially_mapped);
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>>   
>>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>   		unsigned long address, bool freeze);
>> @@ -611,6 +612,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>>   }
>>   
>>   static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>>   #define split_huge_pmd(__vma, __pmd, __address)	\
>>   	do { } while (0)
>>   
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 7fb7331c57250..f3eb81fee056a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1346,6 +1346,7 @@ struct deferred_split {
>>   	spinlock_t split_queue_lock;
>>   	struct list_head split_queue;
>>   	unsigned long split_queue_len;
>> +	bool is_dying;
>>   };
>>   #endif
> 
> The scheme in Muchun's version was:
> 
> retry:
> queue = folio_split_queue(folio);
> spin_lock(&queue->split_queue_lock);
> if (folio_memcg(folio) != folio_split_queue_memcg(folio, queue)) {
>      /* split queue was reparented, retry */
>      spin_unlock(&queue->split_queue_lock);
>      goto retry;
> }
> /* now we have a stable mapping between the folio and the split queue */
> spin_unlock(&queue->split_queue_lock);
> 
> Oh, I see. We can't use this scheme yet because we don't reparent LRU
> folios. (I was wondering why we're adding is_dying property)

Right. And reparenting THP split queue independently can avoid the
following situations:

```
acquire child and parent split_queue_lock
acquire child and parent objcg_lock
acquire child and parent lru lock

reparent THP split queue
reparent objcg
reparent LRU folios

release child and parent lru lock
release child and parent objcg_lock
release child and parent split_queue_lock
```

> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 48b51e6230a67..de7806f759cba 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1094,9 +1094,15 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>   	struct deferred_split *queue;
> 
> 
> For now it's safe to not call rcu_read_lock() here because memcgs won't
> disappear under us as long as there are folios to split (we don't reparent
> LRU folios), right?

Right. We will add rcu_read_lock() when reparenting LRU folios.

Thanks,
Qi

> 
>>   	memcg = folio_memcg(folio);
>> +retry:
>>   	queue = memcg ? &memcg->deferred_split_queue :
>>   			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>   	spin_lock(&queue->split_queue_lock);
>> +	if (unlikely(queue->is_dying == true)) {
>> +		spin_unlock(&queue->split_queue_lock);
>> +		memcg = parent_mem_cgroup(memcg);
>> +		goto retry;
>> +	}
>>   	return queue;
>>   }
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-25  6:11     ` Qi Zheng
@ 2025-09-25 19:35       ` David Hildenbrand
  2025-09-25 19:49         ` Zi Yan
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-09-25 19:35 UTC (permalink / raw)
  To: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, ziy, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm
  Cc: linux-mm, linux-kernel, cgroups

On 25.09.25 08:11, Qi Zheng wrote:
> Hi David,

Hi :)

[...]

>>> +++ b/include/linux/mmzone.h
>>> @@ -1346,6 +1346,7 @@ struct deferred_split {
>>>        spinlock_t split_queue_lock;
>>>        struct list_head split_queue;
>>>        unsigned long split_queue_len;
>>> +    bool is_dying;
>>
>> It's a bit weird to query whether the "struct deferred_split" is dying.
>> Shouldn't this be a memcg property? (and in particular, not exist for
> 
> There is indeed a CSS_DYING flag. But we must modify 'is_dying' under
> the protection of the split_queue_lock, otherwise the folio may be added
> back to the deferred_split of child memcg.

Is there no way to reuse the existing mechanisms, and find a way to have 
the shrinker / queue locking sync against that?

There is also the offline_css() function where we clear CSS_ONLINE. But 
it happens after calling ss->css_offline(css);

Being able to query "is the memcg going offline" and having a way to 
sync against that would be probably cleanest.

I'll let all the memcg people comment on how that could be done best.

> 
>> the pglist_data part where it might not make sense at all?).
> 
> Maybe:
> 
> #ifdef CONFIG_MEMCG
>       bool is_dying;
> #endif
> 

Still doesn't quite look like it would belong here :(

Also, is "dying" really the right terminology? It's more like "going 
offline"?

But then, the queue is not going offline, the memcg is ...

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-25 19:35       ` David Hildenbrand
@ 2025-09-25 19:49         ` Zi Yan
  2025-09-25 22:15           ` Shakeel Butt
  0 siblings, 1 reply; 25+ messages in thread
From: Zi Yan @ 2025-09-25 19:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, lorenzo.stoakes, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm, linux-mm, linux-kernel, cgroups

On 25 Sep 2025, at 15:35, David Hildenbrand wrote:

> On 25.09.25 08:11, Qi Zheng wrote:
>> Hi David,
>
> Hi :)
>
> [...]
>
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -1346,6 +1346,7 @@ struct deferred_split {
>>>>        spinlock_t split_queue_lock;
>>>>        struct list_head split_queue;
>>>>        unsigned long split_queue_len;
>>>> +    bool is_dying;
>>>
>>> It's a bit weird to query whether the "struct deferred_split" is dying.
>>> Shouldn't this be a memcg property? (and in particular, not exist for
>>
>> There is indeed a CSS_DYING flag. But we must modify 'is_dying' under
>> the protection of the split_queue_lock, otherwise the folio may be added
>> back to the deferred_split of child memcg.
>
> Is there no way to reuse the existing mechanisms, and find a way to have the shrinker / queue locking sync against that?
>
> There is also the offline_css() function where we clear CSS_ONLINE. But it happens after calling ss->css_offline(css);

I see CSS_DYING will be set by kill_css() before offline_css() is called.
Probably the code can check CSS_DYING instead.

>
> Being able to query "is the memcg going offline" and having a way to sync against that would be probably cleanest.

So basically, something like:
1. at folio_split_queue_lock*() time, get folio’s memcg or
   its parent memcg until there is no CSS_DYING set or CSS_ONLINE is set.
2. return the associated deferred_split_queue.

>
> I'll let all the memcg people comment on how that could be done best.
>
>>
>>> the pglist_data part where it might not make sense at all?).
>>
>> Maybe:
>>
>> #ifdef CONFIG_MEMCG
>>       bool is_dying;
>> #endif
>>
>
> Still doesn't quite look like it would belong here :(
>
> Also, is "dying" really the right terminology? It's more like "going offline"?
>
> But then, the queue is not going offline, the memcg is ...
>
> -- 
> Cheers
>
> David / dhildenb


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-25 19:49         ` Zi Yan
@ 2025-09-25 22:15           ` Shakeel Butt
  2025-09-25 22:35             ` Shakeel Butt
  0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-09-25 22:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Qi Zheng, hannes, hughd, mhocko,
	roman.gushchin, muchun.song, lorenzo.stoakes, harry.yoo,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, akpm, linux-mm, linux-kernel, cgroups

On Thu, Sep 25, 2025 at 03:49:52PM -0400, Zi Yan wrote:
> On 25 Sep 2025, at 15:35, David Hildenbrand wrote:
> 
> > On 25.09.25 08:11, Qi Zheng wrote:
> >> Hi David,
> >
> > Hi :)
> >
> > [...]
> >
> >>>> +++ b/include/linux/mmzone.h
> >>>> @@ -1346,6 +1346,7 @@ struct deferred_split {
> >>>>        spinlock_t split_queue_lock;
> >>>>        struct list_head split_queue;
> >>>>        unsigned long split_queue_len;
> >>>> +    bool is_dying;
> >>>
> >>> It's a bit weird to query whether the "struct deferred_split" is dying.
> >>> Shouldn't this be a memcg property? (and in particular, not exist for
> >>
> >> There is indeed a CSS_DYING flag. But we must modify 'is_dying' under
> >> the protection of the split_queue_lock, otherwise the folio may be added
> >> back to the deferred_split of child memcg.
> >
> > Is there no way to reuse the existing mechanisms, and find a way to have the shrinker / queue locking sync against that?
> >
> > There is also the offline_css() function where we clear CSS_ONLINE. But it happens after calling ss->css_offline(css);
> 
> I see CSS_DYING will be set by kill_css() before offline_css() is called.
> Probably the code can check CSS_DYING instead.
> 
> >
> > Being able to query "is the memcg going offline" and having a way to sync against that would be probably cleanest.
> 
> So basically, something like:
> 1. at folio_split_queue_lock*() time, get folio’s memcg or
>    its parent memcg until there is no CSS_DYING set or CSS_ONLINE is set.
> 2. return the associated deferred_split_queue.
> 

Yes, css_is_dying() can be used but please note that there is a rcu
grace period between setting CSS_DYING and clearing CSS_ONLINE (i.e.
reparenting deferred split queue) and during that period the deferred
split THPs of the dying memcg will be hidden from shrinkers (which
might be fine).


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-25 22:15           ` Shakeel Butt
@ 2025-09-25 22:35             ` Shakeel Butt
  2025-09-26  6:57               ` Qi Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-09-25 22:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand, Qi Zheng, hannes, hughd, mhocko,
	roman.gushchin, muchun.song, lorenzo.stoakes, harry.yoo,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, akpm, linux-mm, linux-kernel, cgroups

On Thu, Sep 25, 2025 at 03:15:26PM -0700, Shakeel Butt wrote:
> On Thu, Sep 25, 2025 at 03:49:52PM -0400, Zi Yan wrote:
> > On 25 Sep 2025, at 15:35, David Hildenbrand wrote:
> > 
> > > On 25.09.25 08:11, Qi Zheng wrote:
> > >> Hi David,
> > >
> > > Hi :)
> > >
> > > [...]
> > >
> > >>>> +++ b/include/linux/mmzone.h
> > >>>> @@ -1346,6 +1346,7 @@ struct deferred_split {
> > >>>>        spinlock_t split_queue_lock;
> > >>>>        struct list_head split_queue;
> > >>>>        unsigned long split_queue_len;
> > >>>> +    bool is_dying;
> > >>>
> > >>> It's a bit weird to query whether the "struct deferred_split" is dying.
> > >>> Shouldn't this be a memcg property? (and in particular, not exist for
> > >>
> > >> There is indeed a CSS_DYING flag. But we must modify 'is_dying' under
> > >> the protection of the split_queue_lock, otherwise the folio may be added
> > >> back to the deferred_split of child memcg.
> > >
> > > Is there no way to reuse the existing mechanisms, and find a way to have the shrinker / queue locking sync against that?
> > >
> > > There is also the offline_css() function where we clear CSS_ONLINE. But it happens after calling ss->css_offline(css);
> > 
> > I see CSS_DYING will be set by kill_css() before offline_css() is called.
> > Probably the code can check CSS_DYING instead.
> > 
> > >
> > > Being able to query "is the memcg going offline" and having a way to sync against that would be probably cleanest.
> > 
> > So basically, something like:
> > 1. at folio_split_queue_lock*() time, get folio’s memcg or
> >    its parent memcg until there is no CSS_DYING set or CSS_ONLINE is set.
> > 2. return the associated deferred_split_queue.
> > 
> 
> Yes, css_is_dying() can be used but please note that there is a rcu
> grace period between setting CSS_DYING and clearing CSS_ONLINE (i.e.
> reparenting deferred split queue) and during that period the deferred
> split THPs of the dying memcg will be hidden from shrinkers (which
> might be fine).

BTW if this period is not acceptable and we don't want to add is_dying
to struct deferred_split, we can use something similar to what list_lru
does in the similar situation i.e. set a special value (LONG_MIN) in its
nr_items variable. That is make split_queue_len a long and set it to
LONG_MIN during memcg offlining/reparenting.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-25 22:35             ` Shakeel Butt
@ 2025-09-26  6:57               ` Qi Zheng
  2025-09-26 16:36                 ` Shakeel Butt
  0 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2025-09-26  6:57 UTC (permalink / raw)
  To: Shakeel Butt, Zi Yan, David Hildenbrand
  Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song,
	lorenzo.stoakes, harry.yoo, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, akpm, linux-mm,
	linux-kernel, cgroups



On 9/26/25 6:35 AM, Shakeel Butt wrote:
> On Thu, Sep 25, 2025 at 03:15:26PM -0700, Shakeel Butt wrote:
>> On Thu, Sep 25, 2025 at 03:49:52PM -0400, Zi Yan wrote:
>>> On 25 Sep 2025, at 15:35, David Hildenbrand wrote:
>>>
>>>> On 25.09.25 08:11, Qi Zheng wrote:
>>>>> Hi David,
>>>>
>>>> Hi :)
>>>>
>>>> [...]
>>>>
>>>>>>> +++ b/include/linux/mmzone.h
>>>>>>> @@ -1346,6 +1346,7 @@ struct deferred_split {
>>>>>>>         spinlock_t split_queue_lock;
>>>>>>>         struct list_head split_queue;
>>>>>>>         unsigned long split_queue_len;
>>>>>>> +    bool is_dying;
>>>>>>
>>>>>> It's a bit weird to query whether the "struct deferred_split" is dying.
>>>>>> Shouldn't this be a memcg property? (and in particular, not exist for
>>>>>
>>>>> There is indeed a CSS_DYING flag. But we must modify 'is_dying' under
>>>>> the protection of the split_queue_lock, otherwise the folio may be added
>>>>> back to the deferred_split of child memcg.
>>>>
>>>> Is there no way to reuse the existing mechanisms, and find a way to have the shrinker / queue locking sync against that?
>>>>
>>>> There is also the offline_css() function where we clear CSS_ONLINE. But it happens after calling ss->css_offline(css);
>>>
>>> I see CSS_DYING will be set by kill_css() before offline_css() is called.
>>> Probably the code can check CSS_DYING instead.
>>>
>>>>
>>>> Being able to query "is the memcg going offline" and having a way to sync against that would be probably cleanest.
>>>
>>> So basically, something like:
>>> 1. at folio_split_queue_lock*() time, get folio’s memcg or
>>>     its parent memcg until there is no CSS_DYING set or CSS_ONLINE is set.
>>> 2. return the associated deferred_split_queue.
>>>
>>
>> Yes, css_is_dying() can be used but please note that there is a rcu
>> grace period between setting CSS_DYING and clearing CSS_ONLINE (i.e.
>> reparenting deferred split queue) and during that period the deferred
>> split THPs of the dying memcg will be hidden from shrinkers (which
>> might be fine).

My mistake, now I think using css_is_dying() is safe.

> 
> BTW if this period is not acceptable and we don't want to add is_dying
> to struct deferred_split, we can use something similar to what list_lru
> does in the similar situation i.e. set a special value (LONG_MIN) in its
> nr_items variable. That is make split_queue_len a long and set it to
> LONG_MIN during memcg offlining/reparenting.

I've considered this option, but I am concerned about the risk of
overflow.

So I will try to use css_is_dying() in the next version.

Thanks,
Qi





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline
  2025-09-26  6:57               ` Qi Zheng
@ 2025-09-26 16:36                 ` Shakeel Butt
  0 siblings, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-09-26 16:36 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Zi Yan, David Hildenbrand, hannes, hughd, mhocko, roman.gushchin,
	muchun.song, lorenzo.stoakes, harry.yoo, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	akpm, linux-mm, linux-kernel, cgroups

On Fri, Sep 26, 2025 at 02:57:39PM +0800, Qi Zheng wrote:
> 
> 
> On 9/26/25 6:35 AM, Shakeel Butt wrote:
[...]
> > > 
> > > Yes, css_is_dying() can be used but please note that there is a rcu
> > > grace period between setting CSS_DYING and clearing CSS_ONLINE (i.e.
> > > reparenting deferred split queue) and during that period the deferred
> > > split THPs of the dying memcg will be hidden from shrinkers (which
> > > might be fine).
> 
> My mistake, now I think using css_is_dying() is safe.
> 
> > 
> > BTW if this period is not acceptable and we don't want to add is_dying
> > to struct deferred_split, we can use something similar to what list_lru
> > does in the similar situation i.e. set a special value (LONG_MIN) in its
> > nr_items variable. That is make split_queue_len a long and set it to
> > LONG_MIN during memcg offlining/reparenting.
> 
> I've considered this option, but I am concerned about the risk of
> overflow.

If you modify it only within a lock then why would it over or underflow?

> 
> So I will try to use css_is_dying() in the next version.

No objection from me but add a comment on the potential window where the
deferred thps will be hidden from the shrinkers.



^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-09-26 16:37 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-23  9:16 [PATCH v2 0/4] reparent the THP split queue Qi Zheng
2025-09-23  9:16 ` [PATCH v2 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged() Qi Zheng
2025-09-24  9:10   ` Roman Gushchin
2025-09-23  9:16 ` [PATCH v2 2/4] mm: thp: introduce folio_split_queue_lock and its variants Qi Zheng
2025-09-23  9:16 ` [PATCH v2 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Qi Zheng
2025-09-23 15:31   ` Zi Yan
2025-09-24  9:57     ` Qi Zheng
2025-09-24 14:57       ` Zi Yan
2025-09-24 12:31   ` David Hildenbrand
2025-09-23  9:16 ` [PATCH v2 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
2025-09-23 15:44   ` Zi Yan
2025-09-24  9:58     ` Qi Zheng
2025-09-24  9:23   ` Roman Gushchin
2025-09-24 10:06     ` Qi Zheng
2025-09-24 12:38   ` David Hildenbrand
2025-09-25  6:11     ` Qi Zheng
2025-09-25 19:35       ` David Hildenbrand
2025-09-25 19:49         ` Zi Yan
2025-09-25 22:15           ` Shakeel Butt
2025-09-25 22:35             ` Shakeel Butt
2025-09-26  6:57               ` Qi Zheng
2025-09-26 16:36                 ` Shakeel Butt
2025-09-24 13:49   ` kernel test robot
2025-09-24 14:22   ` Harry Yoo
2025-09-25  6:29     ` Qi Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox