[PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
@ 2024-10-24  4:10 Hugh Dickins
  2024-10-24  4:13 ` [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Hugh Dickins @ 2024-10-24  4:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, linux-kernel, linux-mm

Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without).  The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.

The new unlocked list_del_init() in deferred_split_scan() is buggy.
I gave bad advice, it looks plausible since that's a local on-stack
list, but the fact is that it can race with a third party freeing or
migrating the preceding folio (properly unqueueing it with refcount 0
while holding split_queue_lock), thereby corrupting the list linkage.

The obvious answer would be to take split_queue_lock there: but it has
a long history of contention, so I'm reluctant to add to that. Instead,
make sure that there is always one safe (raised refcount) folio before,
by delaying its folio_put().  (And of course I was wrong to suggest
updating split_queue_len without the lock: leave that until the splice.)

And remove two over-eager partially_mapped checks, restoring those tests
to how they were before: if uncharge_folio() or free_tail_page_prepare()
finds _deferred_list non-empty, it's in trouble whether or not that folio
is partially_mapped (and the flag was already cleared in the latter case).

Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/huge_memory.c | 21 +++++++++++++++++----
 mm/memcontrol.c  |  3 +--
 mm/page_alloc.c  |  5 ++---
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2fb328880b50..a1d345f1680c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3718,8 +3718,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
 	LIST_HEAD(list);
-	struct folio *folio, *next;
-	int split = 0;
+	struct folio *folio, *next, *prev = NULL;
+	int split = 0, removed = 0;
 
 #ifdef CONFIG_MEMCG
 	if (sc->memcg)
@@ -3775,15 +3775,28 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		 */
 		if (!did_split && !folio_test_partially_mapped(folio)) {
 			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+			removed++;
+		} else {
+			/*
+			 * That unlocked list_del_init() above would be unsafe,
+			 * unless its folio is separated from any earlier folios
+			 * left on the list (which may be concurrently unqueued)
+			 * by one safe folio with refcount still raised.
+			 */
+			swap(folio, prev);
 		}
-		folio_put(folio);
+		if (folio)
+			folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	list_splice_tail(&list, &ds_queue->split_queue);
+	ds_queue->split_queue_len -= removed;
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
+	if (prev)
+		folio_put(prev);
+
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
 	 * This can happen if pages were freed under us.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..2703227cce88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4631,8 +4631,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
 			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list) &&
-			folio_test_partially_mapped(folio), folio);
+			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afab64814dc..4b21a368b4e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -961,9 +961,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/* the second tail page: deferred_list overlaps ->mapping */
-		if (unlikely(!list_empty(&folio->_deferred_list) &&
-		    folio_test_partially_mapped(folio))) {
-			bad_page(page, "partially mapped folio on deferred list");
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "on deferred list");
 			goto out;
 		}
 		break;
-- 
2.35.3


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-24  4:10 [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
@ 2024-10-24  4:13 ` Hugh Dickins
  2024-10-24 20:00   ` Yang Shi
  2024-10-24 20:52   ` David Hildenbrand
  2024-10-24 10:20 ` [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Usama Arif
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Hugh Dickins @ 2024-10-24  4:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, linux-kernel, linux-mm

Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without).  The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.

Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does.  But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).

Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list.  __remove_mapping() has frozen refcount to
0 here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.

That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue (which can now be removed), but no check on deferred
queue before adding THP to swapcache (maybe other circumstances prevented
it at that time, but not now).

Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails.  Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.

The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
care deferred split queue in memcg charge move path"), which argued that
the existence of a PMD mapping guarantees that the THP cannot be on a
deferred list.  I'm not sure if that was true at the time (swapcache
remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
("mm: split underused THPs").

[Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
large but not PMD-mapped: that's safe, but perhaps not intended: it's
arguable whether the deprecated feature should be updated to work
better with the new feature; but certainly not in this patch.]

Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included.  It
is unclear whether these fixes are realistically needed before 6.12.

Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
 mm/internal.h      | 10 +++++-----
 mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
 mm/memcontrol.c    |  8 +++++---
 mm/migrate.c       |  4 ++--
 mm/page_alloc.c    |  4 +++-
 mm/swap.c          |  4 ++--
 mm/vmscan.c        |  4 ++--
 8 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a1d345f1680c..dc7d5bb76495 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
 	return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }
 
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
 	unsigned long flags;
+	bool unqueued = false;
+
+	WARN_ON_ONCE(folio_ref_count(folio));
+	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
 
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
 				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
 		list_del_init(&folio->_deferred_list);
+		unqueued = true;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+	return unqueued;	/* useful for debug warnings */
 }
 
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
@@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 	if (!partially_mapped && !split_underused_thp)
 		return;
 
-	/*
-	 * The try_to_unmap() in page reclaim path might reach here too,
-	 * this may cause a race condition to corrupt deferred split queue.
-	 * And, if page reclaim is already handling the same folio, it is
-	 * unnecessary to handle it again in shrinker.
-	 *
-	 * Check the swapcache flag to determine if the folio is being
-	 * handled by page reclaim since THP swap would add the folio into
-	 * swap cache before calling try_to_unmap().
-	 */
-	if (folio_test_swapcache(folio))
-		return;
-
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (partially_mapped) {
 		if (!folio_test_partially_mapped(folio)) {
diff --git a/mm/internal.h b/mm/internal.h
index 93083bbeeefa..16c1f3cd599e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 #endif
 }
 
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
 {
 	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
-		return;
+		return false;
 
 	/*
 	 * At this point, there is no one trying to add the folio to
@@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
 	 * to check without acquiring the split_queue_lock.
 	 */
 	if (data_race(list_empty(&folio->_deferred_list)))
-		return;
+		return false;
 
-	__folio_undo_large_rmappable(folio);
+	return __folio_unqueue_deferred_split(folio);
 }
 
 static inline struct folio *page_rmappable_folio(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 81d8819f13cd..f8744f5630bb 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
 	css_get(&to->css);
 	css_put(&from->css);
 
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 	folio->memcg_data = (unsigned long)to;
 
 	__folio_memcg_unlock(from);
@@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct folio *folio;
+	bool tried_split_before = false;
 
+retry_pmd:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (mc.precharge < HPAGE_PMD_NR) {
@@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			folio = target.folio;
+			/*
+			 * Deferred split queue locking depends on memcg,
+			 * and unqueue is unsafe unless folio refcount is 0:
+			 * split or skip if on the queue? first try to split.
+			 */
+			if (!list_empty(&folio->_deferred_list)) {
+				spin_unlock(ptl);
+				if (!tried_split_before)
+					split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (tried_split_before)
+					return 0;
+				tried_split_before = true;
+				goto retry_pmd;
+			}
+			/*
+			 * So long as that pmd lock is held, the folio cannot
+			 * be racily added to the _deferred_list, because
+			 * __folio_remove_rmap() will find !partially_mapped.
+			 */
 			if (folio_isolate_lru(folio)) {
 				if (!mem_cgroup_move_account(folio, true,
 							     mc.from, mc.to)) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2703227cce88..06df2af97415 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
-	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
-			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
@@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 			ug->nr_memory += nr_pages;
 		ug->pgpgout++;
 
+		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 		folio->memcg_data = 0;
 	}
 
@@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 
 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
+
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(old));
 	old->memcg_data = 0;
 }
 
@@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
 
 	if (!mem_cgroup_is_root(memcg))
diff --git a/mm/migrate.c b/mm/migrate.c
index df91248755e4..691f25ee2489 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		    folio_test_large_rmappable(folio)) {
 			if (!folio_ref_freeze(folio, expected_count))
 				return -EAGAIN;
-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			folio_ref_unfreeze(folio, expected_count);
 		}
 
@@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	/* Take off deferred split queue while frozen and memcg set */
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 
 	/*
 	 * Now we know that no one else is looking at the folio:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b21a368b4e2..57f64b5d0004 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = folio_order(folio);
 
-		folio_undo_large_rmappable(folio);
+		if (mem_cgroup_disabled())
+			folio_unqueue_deferred_split(folio);
+
 		if (!free_pages_prepare(&folio->page, order))
 			continue;
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..b8e3259ea2c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
 	}
 
 	page_cache_release(folio);
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, folio_order(folio));
 }
@@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			free_huge_folio(folio);
 			continue;
 		}
-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		__page_cache_release(folio, &lruvec, &flags);
 
 		if (j != i)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb4e8440c507..635d45745b73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1475,7 +1475,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 */
 		nr_reclaimed += nr_pages;
 
-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		if (folio_batch_add(&free_folios, folio) == 0) {
 			mem_cgroup_uncharge_folios(&free_folios);
 			try_to_unmap_flush();
@@ -1863,7 +1863,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);
 
-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge_folios(&free_folios);
-- 
2.35.3


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-24  4:13 ` [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
@ 2024-10-24 20:00   ` Yang Shi
  2024-10-25  1:21     ` Yang Shi
  2024-10-25  6:57     ` Hugh Dickins
  2024-10-24 20:52   ` David Hildenbrand
  1 sibling, 2 replies; 20+ messages in thread
From: Yang Shi @ 2024-10-24 20:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, linux-kernel, linux-mm

On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
>
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
>
> Before fixing locking: rename misleading folio_undo_large_rmappable(),
> which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> which is what it does.  But that and its out-of-line __callee are mm
> internals of very limited usability: add comment and WARN_ON_ONCEs to
> check usage; and return a bool to say if a deferred split was unqueued,
> which can then be used in WARN_ON_ONCEs around safety checks (sparing
> callers the arcane conditionals in __folio_unqueue_deferred_split()).
>
> Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> without checking and unqueueing a THP folio from deferred split list;
> which is unfortunate, since the split_queue_lock depends on the memcg
> (when memcg is enabled); so swapout has been unqueueing such THPs later,
> when freeing the folio, using the pgdat's lock instead: potentially
> corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> 0 here, so no problem with calling folio_unqueue_deferred_split() before
> resetting memcg_data.
>
> That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> shrinker memcg aware"): which included a check on swapcache before adding
> to deferred queue (which can now be removed), but no check on deferred
> queue before adding THP to swapcache (maybe other circumstances prevented
> it at that time, but not now).

If I remember correctly, THP just can be added to deferred list when
there is no PMD map before mTHP swapout, so shrink_page_list() did
check THP's compound_mapcount (called _entire_mapcount now) before
adding it to swap cache.

Now the code just checked whether the large folio is on deferred list or not.

>
> Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> folio->memcg_data without checking and unqueueing a THP folio from the
> deferred list, sometimes corrupting "from" memcg's list, like swapout.
> Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> of course unqueues), or skip it if that fails.  Not ideal, but moving
> charge has been requested, and khugepaged should repair the THP later:
> nobody wants new custom unqueueing code just for this deprecated case.
>
> The 87eaceb3faa5 commit did have the code to move from one deferred list
> to another (but was not conscious of its unsafety while refcount non-0);
> but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> care deferred split queue in memcg charge move path"), which argued that
> the existence of a PMD mapping guarantees that the THP cannot be on a
> deferred list.  I'm not sure if that was true at the time (swapcache
> remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> ("mm: split underused THPs").

Same reason as above.

>
> [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> large but not PMD-mapped: that's safe, but perhaps not intended: it's
> arguable whether the deprecated feature should be updated to work
> better with the new feature; but certainly not in this patch.]
>
> Backport to 6.11 should be straightforward. Earlier backports must take
> care that other _deferred_list fixes and dependencies are included.  It
> is unclear whether these fixes are realistically needed before 6.12.
>
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
>  mm/internal.h      | 10 +++++-----
>  mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
>  mm/memcontrol.c    |  8 +++++---
>  mm/migrate.c       |  4 ++--
>  mm/page_alloc.c    |  4 +++-
>  mm/swap.c          |  4 ++--
>  mm/vmscan.c        |  4 ++--
>  8 files changed, 65 insertions(+), 29 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a1d345f1680c..dc7d5bb76495 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>         return split_huge_page_to_list_to_order(&folio->page, list, ret);
>  }
>
> -void __folio_undo_large_rmappable(struct folio *folio)
> +/*
> + * __folio_unqueue_deferred_split() is not to be called directly:
> + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> + * limits its calls to those folios which may have a _deferred_list for
> + * queueing THP splits, and that list is (racily observed to be) non-empty.
> + *
> + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> + *
> + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> + * therefore important to unqueue deferred split before changing folio memcg.
> + */
> +bool __folio_unqueue_deferred_split(struct folio *folio)
>  {
>         struct deferred_split *ds_queue;
>         unsigned long flags;
> +       bool unqueued = false;
> +
> +       WARN_ON_ONCE(folio_ref_count(folio));
> +       WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
>
>         ds_queue = get_deferred_split_queue(folio);
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
>                                       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>                 }
>                 list_del_init(&folio->_deferred_list);
> +               unqueued = true;
>         }
>         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +
> +       return unqueued;        /* useful for debug warnings */
>  }
>
>  /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>         if (!partially_mapped && !split_underused_thp)
>                 return;
>
> -       /*
> -        * The try_to_unmap() in page reclaim path might reach here too,
> -        * this may cause a race condition to corrupt deferred split queue.
> -        * And, if page reclaim is already handling the same folio, it is
> -        * unnecessary to handle it again in shrinker.
> -        *
> -        * Check the swapcache flag to determine if the folio is being
> -        * handled by page reclaim since THP swap would add the folio into
> -        * swap cache before calling try_to_unmap().
> -        */
> -       if (folio_test_swapcache(folio))
> -               return;
> -
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>         if (partially_mapped) {
>                 if (!folio_test_partially_mapped(folio)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index 93083bbeeefa..16c1f3cd599e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
>  #endif
>  }
>
> -void __folio_undo_large_rmappable(struct folio *folio);
> -static inline void folio_undo_large_rmappable(struct folio *folio)
> +bool __folio_unqueue_deferred_split(struct folio *folio);
> +static inline bool folio_unqueue_deferred_split(struct folio *folio)
>  {
>         if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
> -               return;
> +               return false;
>
>         /*
>          * At this point, there is no one trying to add the folio to
> @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
>          * to check without acquiring the split_queue_lock.
>          */
>         if (data_race(list_empty(&folio->_deferred_list)))
> -               return;
> +               return false;
>
> -       __folio_undo_large_rmappable(folio);
> +       return __folio_unqueue_deferred_split(folio);
>  }
>
>  static inline struct folio *page_rmappable_folio(struct page *page)
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 81d8819f13cd..f8744f5630bb 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
>         css_get(&to->css);
>         css_put(&from->css);
>
> +       /* Warning should never happen, so don't worry about refcount non-0 */
> +       WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>         folio->memcg_data = (unsigned long)to;
>
>         __folio_memcg_unlock(from);
> @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>         enum mc_target_type target_type;
>         union mc_target target;
>         struct folio *folio;
> +       bool tried_split_before = false;
>
> +retry_pmd:
>         ptl = pmd_trans_huge_lock(pmd, vma);
>         if (ptl) {
>                 if (mc.precharge < HPAGE_PMD_NR) {
> @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>                 target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
>                 if (target_type == MC_TARGET_PAGE) {
>                         folio = target.folio;
> +                       /*
> +                        * Deferred split queue locking depends on memcg,
> +                        * and unqueue is unsafe unless folio refcount is 0:
> +                        * split or skip if on the queue? first try to split.
> +                        */
> +                       if (!list_empty(&folio->_deferred_list)) {
> +                               spin_unlock(ptl);
> +                               if (!tried_split_before)
> +                                       split_folio(folio);
> +                               folio_unlock(folio);
> +                               folio_put(folio);
> +                               if (tried_split_before)
> +                                       return 0;
> +                               tried_split_before = true;
> +                               goto retry_pmd;
> +                       }
> +                       /*
> +                        * So long as that pmd lock is held, the folio cannot
> +                        * be racily added to the _deferred_list, because
> +                        * __folio_remove_rmap() will find !partially_mapped.
> +                        */
>                         if (folio_isolate_lru(folio)) {
>                                 if (!mem_cgroup_move_account(folio, true,
>                                                              mc.from, mc.to)) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2703227cce88..06df2af97415 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>         struct obj_cgroup *objcg;
>
>         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> -       VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> -                       !folio_test_hugetlb(folio) &&
> -                       !list_empty(&folio->_deferred_list), folio);
>
>         /*
>          * Nobody should be changing or seriously looking at
> @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>                         ug->nr_memory += nr_pages;
>                 ug->pgpgout++;
>
> +               WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>                 folio->memcg_data = 0;
>         }
>
> @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>
>         /* Transfer the charge and the css ref */
>         commit_charge(new, memcg);
> +
> +       /* Warning should never happen, so don't worry about refcount non-0 */
> +       WARN_ON_ONCE(folio_unqueue_deferred_split(old));
>         old->memcg_data = 0;
>  }
>
> @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
>         VM_BUG_ON_FOLIO(oldid, folio);
>         mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
>
> +       folio_unqueue_deferred_split(folio);
>         folio->memcg_data = 0;
>
>         if (!mem_cgroup_is_root(memcg))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index df91248755e4..691f25ee2489 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>                     folio_test_large_rmappable(folio)) {
>                         if (!folio_ref_freeze(folio, expected_count))
>                                 return -EAGAIN;
> -                       folio_undo_large_rmappable(folio);
> +                       folio_unqueue_deferred_split(folio);
>                         folio_ref_unfreeze(folio, expected_count);
>                 }
>
> @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>         }
>
>         /* Take off deferred split queue while frozen and memcg set */
> -       folio_undo_large_rmappable(folio);
> +       folio_unqueue_deferred_split(folio);
>
>         /*
>          * Now we know that no one else is looking at the folio:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4b21a368b4e2..57f64b5d0004 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
>                 unsigned long pfn = folio_pfn(folio);
>                 unsigned int order = folio_order(folio);
>
> -               folio_undo_large_rmappable(folio);
> +               if (mem_cgroup_disabled())
> +                       folio_unqueue_deferred_split(folio);

This looks confusing. It looks all callsites of free_unref_folios()
have folio_unqueue_deferred_split() and memcg uncharge called before
it. If there is any problem, memcg uncharge should catch it. Did I
miss something?

> +
>                 if (!free_pages_prepare(&folio->page, order))
>                         continue;
>                 /*
> diff --git a/mm/swap.c b/mm/swap.c
> index 835bdf324b76..b8e3259ea2c4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
>         }
>
>         page_cache_release(folio);
> -       folio_undo_large_rmappable(folio);
> +       folio_unqueue_deferred_split(folio);
>         mem_cgroup_uncharge(folio);
>         free_unref_page(&folio->page, folio_order(folio));
>  }
> @@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
>                         free_huge_folio(folio);
>                         continue;
>                 }
> -               folio_undo_large_rmappable(folio);
> +               folio_unqueue_deferred_split(folio);
>                 __page_cache_release(folio, &lruvec, &flags);
>
>                 if (j != i)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eb4e8440c507..635d45745b73 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1475,7 +1475,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                  */
>                 nr_reclaimed += nr_pages;
>
> -               folio_undo_large_rmappable(folio);
> +               folio_unqueue_deferred_split(folio);
>                 if (folio_batch_add(&free_folios, folio) == 0) {
>                         mem_cgroup_uncharge_folios(&free_folios);
>                         try_to_unmap_flush();
> @@ -1863,7 +1863,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>                 if (unlikely(folio_put_testzero(folio))) {
>                         __folio_clear_lru_flags(folio);
>
> -                       folio_undo_large_rmappable(folio);
> +                       folio_unqueue_deferred_split(folio);
>                         if (folio_batch_add(&free_folios, folio) == 0) {
>                                 spin_unlock_irq(&lruvec->lru_lock);
>                                 mem_cgroup_uncharge_folios(&free_folios);
> --
> 2.35.3


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-24 20:00   ` Yang Shi
@ 2024-10-25  1:21     ` Yang Shi
  2024-10-25  6:57     ` Hugh Dickins
  1 sibling, 0 replies; 20+ messages in thread
From: Yang Shi @ 2024-10-25  1:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, linux-kernel, linux-mm

On Thu, Oct 24, 2024 at 1:00 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without).  The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.  But that and its out-of-line __callee are mm
> > internals of very limited usability: add comment and WARN_ON_ONCEs to
> > check usage; and return a bool to say if a deferred split was unqueued,
> > which can then be used in WARN_ON_ONCEs around safety checks (sparing
> > callers the arcane conditionals in __folio_unqueue_deferred_split()).
> >
> > Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> > without checking and unqueueing a THP folio from deferred split list;
> > which is unfortunate, since the split_queue_lock depends on the memcg
> > (when memcg is enabled); so swapout has been unqueueing such THPs later,
> > when freeing the folio, using the pgdat's lock instead: potentially
> > corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> > 0 here, so no problem with calling folio_unqueue_deferred_split() before
> > resetting memcg_data.
> >
> > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > shrinker memcg aware"): which included a check on swapcache before adding
> > to deferred queue (which can now be removed), but no check on deferred
> > queue before adding THP to swapcache (maybe other circumstances prevented
> > it at that time, but not now).
>
> If I remember correctly, THP just can be added to deferred list when
> there is no PMD map before mTHP swapout, so shrink_page_list() did
> check THP's compound_mapcount (called _entire_mapcount now) before
> adding it to swap cache.

But as David pointed out there may be PMD mapped THP on deferred list.
We should just not hit it by luck. The migration failure may be very
rare because migration may split the THP or retry a few times, and
migration is called with holding extra refcount and having THP off
lru.

>
> Now the code just checked whether the large folio is on deferred list or not.
>
> >
> > Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> > folio->memcg_data without checking and unqueueing a THP folio from the
> > deferred list, sometimes corrupting "from" memcg's list, like swapout.
> > Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> > used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> > mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> > of course unqueues), or skip it if that fails.  Not ideal, but moving
> > charge has been requested, and khugepaged should repair the THP later:
> > nobody wants new custom unqueueing code just for this deprecated case.
> >
> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of its unsafety while refcount non-0);
> > but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.  I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
>
> Same reason as above.
>
> >
> > [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> > large but not PMD-mapped: that's safe, but perhaps not intended: it's
> > arguable whether the deprecated feature should be updated to work
> > better with the new feature; but certainly not in this patch.]
> >
> > Backport to 6.11 should be straightforward. Earlier backports must take
> > care that other _deferred_list fixes and dependencies are included.  It
> > is unclear whether these fixes are realistically needed before 6.12.
> >
> > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
> >  mm/internal.h      | 10 +++++-----
> >  mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
> >  mm/memcontrol.c    |  8 +++++---
> >  mm/migrate.c       |  4 ++--
> >  mm/page_alloc.c    |  4 +++-
> >  mm/swap.c          |  4 ++--
> >  mm/vmscan.c        |  4 ++--
> >  8 files changed, 65 insertions(+), 29 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a1d345f1680c..dc7d5bb76495 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
> >         return split_huge_page_to_list_to_order(&folio->page, list, ret);
> >  }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio)
> > +/*
> > + * __folio_unqueue_deferred_split() is not to be called directly:
> > + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> > + * limits its calls to those folios which may have a _deferred_list for
> > + * queueing THP splits, and that list is (racily observed to be) non-empty.
> > + *
> > + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> > + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> > + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> > + *
> > + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> > + * therefore important to unqueue deferred split before changing folio memcg.
> > + */
> > +bool __folio_unqueue_deferred_split(struct folio *folio)
> >  {
> >         struct deferred_split *ds_queue;
> >         unsigned long flags;
> > +       bool unqueued = false;
> > +
> > +       WARN_ON_ONCE(folio_ref_count(folio));
> > +       WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
> >
> >         ds_queue = get_deferred_split_queue(folio);
> >         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
> >                                       MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> >                 }
> >                 list_del_init(&folio->_deferred_list);
> > +               unqueued = true;
> >         }
> >         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > +
> > +       return unqueued;        /* useful for debug warnings */
> >  }
> >
> >  /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >         if (!partially_mapped && !split_underused_thp)
> >                 return;
> >
> > -       /*
> > -        * The try_to_unmap() in page reclaim path might reach here too,
> > -        * this may cause a race condition to corrupt deferred split queue.
> > -        * And, if page reclaim is already handling the same folio, it is
> > -        * unnecessary to handle it again in shrinker.
> > -        *
> > -        * Check the swapcache flag to determine if the folio is being
> > -        * handled by page reclaim since THP swap would add the folio into
> > -        * swap cache before calling try_to_unmap().
> > -        */
> > -       if (folio_test_swapcache(folio))
> > -               return;
> > -
> >         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >         if (partially_mapped) {
> >                 if (!folio_test_partially_mapped(folio)) {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
> >  #endif
> >  }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> >  {
> >         if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
> > -               return;
> > +               return false;
> >
> >         /*
> >          * At this point, there is no one trying to add the folio to
> > @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
> >          * to check without acquiring the split_queue_lock.
> >          */
> >         if (data_race(list_empty(&folio->_deferred_list)))
> > -               return;
> > +               return false;
> >
> > -       __folio_undo_large_rmappable(folio);
> > +       return __folio_unqueue_deferred_split(folio);
> >  }
> >
> >  static inline struct folio *page_rmappable_folio(struct page *page)
> > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> > index 81d8819f13cd..f8744f5630bb 100644
> > --- a/mm/memcontrol-v1.c
> > +++ b/mm/memcontrol-v1.c
> > @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
> >         css_get(&to->css);
> >         css_put(&from->css);
> >
> > +       /* Warning should never happen, so don't worry about refcount non-0 */
> > +       WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >         folio->memcg_data = (unsigned long)to;
> >
> >         __folio_memcg_unlock(from);
> > @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >         enum mc_target_type target_type;
> >         union mc_target target;
> >         struct folio *folio;
> > +       bool tried_split_before = false;
> >
> > +retry_pmd:
> >         ptl = pmd_trans_huge_lock(pmd, vma);
> >         if (ptl) {
> >                 if (mc.precharge < HPAGE_PMD_NR) {
> > @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >                 target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> >                 if (target_type == MC_TARGET_PAGE) {
> >                         folio = target.folio;
> > +                       /*
> > +                        * Deferred split queue locking depends on memcg,
> > +                        * and unqueue is unsafe unless folio refcount is 0:
> > +                        * split or skip if on the queue? first try to split.
> > +                        */
> > +                       if (!list_empty(&folio->_deferred_list)) {
> > +                               spin_unlock(ptl);
> > +                               if (!tried_split_before)
> > +                                       split_folio(folio);
> > +                               folio_unlock(folio);
> > +                               folio_put(folio);
> > +                               if (tried_split_before)
> > +                                       return 0;
> > +                               tried_split_before = true;
> > +                               goto retry_pmd;
> > +                       }
> > +                       /*
> > +                        * So long as that pmd lock is held, the folio cannot
> > +                        * be racily added to the _deferred_list, because
> > +                        * __folio_remove_rmap() will find !partially_mapped.
> > +                        */
> >                         if (folio_isolate_lru(folio)) {
> >                                 if (!mem_cgroup_move_account(folio, true,
> >                                                              mc.from, mc.to)) {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2703227cce88..06df2af97415 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >         struct obj_cgroup *objcg;
> >
> >         VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > -       VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> > -                       !folio_test_hugetlb(folio) &&
> > -                       !list_empty(&folio->_deferred_list), folio);
> >
> >         /*
> >          * Nobody should be changing or seriously looking at
> > @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >                         ug->nr_memory += nr_pages;
> >                 ug->pgpgout++;
> >
> > +               WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >                 folio->memcg_data = 0;
> >         }
> >
> > @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> >
> >         /* Transfer the charge and the css ref */
> >         commit_charge(new, memcg);
> > +
> > +       /* Warning should never happen, so don't worry about refcount non-0 */
> > +       WARN_ON_ONCE(folio_unqueue_deferred_split(old));
> >         old->memcg_data = 0;
> >  }
> >
> > @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
> >         VM_BUG_ON_FOLIO(oldid, folio);
> >         mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
> >
> > +       folio_unqueue_deferred_split(folio);
> >         folio->memcg_data = 0;
> >
> >         if (!mem_cgroup_is_root(memcg))
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index df91248755e4..691f25ee2489 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >                     folio_test_large_rmappable(folio)) {
> >                         if (!folio_ref_freeze(folio, expected_count))
> >                                 return -EAGAIN;
> > -                       folio_undo_large_rmappable(folio);
> > +                       folio_unqueue_deferred_split(folio);
> >                         folio_ref_unfreeze(folio, expected_count);
> >                 }
> >
> > @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >         }
> >
> >         /* Take off deferred split queue while frozen and memcg set */
> > -       folio_undo_large_rmappable(folio);
> > +       folio_unqueue_deferred_split(folio);
> >
> >         /*
> >          * Now we know that no one else is looking at the folio:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >                 unsigned long pfn = folio_pfn(folio);
> >                 unsigned int order = folio_order(folio);
> >
> > -               folio_undo_large_rmappable(folio);
> > +               if (mem_cgroup_disabled())
> > +                       folio_unqueue_deferred_split(folio);
>
> This looks confusing. It looks all callsites of free_unref_folios()
> have folio_unqueue_deferred_split() and memcg uncharge called before
> it. If there is any problem, memcg uncharge should catch it. Did I
> miss something?
>
> > +
> >                 if (!free_pages_prepare(&folio->page, order))
> >                         continue;
> >                 /*
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 835bdf324b76..b8e3259ea2c4 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
> >         }
> >
> >         page_cache_release(folio);
> > -       folio_undo_large_rmappable(folio);
> > +       folio_unqueue_deferred_split(folio);
> >         mem_cgroup_uncharge(folio);
> >         free_unref_page(&folio->page, folio_order(folio));
> >  }
> > @@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
> >                         free_huge_folio(folio);
> >                         continue;
> >                 }
> > -               folio_undo_large_rmappable(folio);
> > +               folio_unqueue_deferred_split(folio);
> >                 __page_cache_release(folio, &lruvec, &flags);
> >
> >                 if (j != i)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index eb4e8440c507..635d45745b73 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1475,7 +1475,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                  */
> >                 nr_reclaimed += nr_pages;
> >
> > -               folio_undo_large_rmappable(folio);
> > +               folio_unqueue_deferred_split(folio);
> >                 if (folio_batch_add(&free_folios, folio) == 0) {
> >                         mem_cgroup_uncharge_folios(&free_folios);
> >                         try_to_unmap_flush();
> > @@ -1863,7 +1863,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> >                 if (unlikely(folio_put_testzero(folio))) {
> >                         __folio_clear_lru_flags(folio);
> >
> > -                       folio_undo_large_rmappable(folio);
> > +                       folio_unqueue_deferred_split(folio);
> >                         if (folio_batch_add(&free_folios, folio) == 0) {
> >                                 spin_unlock_irq(&lruvec->lru_lock);
> >                                 mem_cgroup_uncharge_folios(&free_folios);
> > --
> > 2.35.3


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-24 20:00   ` Yang Shi
  2024-10-25  1:21     ` Yang Shi
@ 2024-10-25  6:57     ` Hugh Dickins
  2024-10-25 16:34       ` Yang Shi
  1 sibling, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2024-10-25  6:57 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Zi Yan, Chris Li, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 3137 bytes --]

On Thu, 24 Oct 2024, Yang Shi wrote:
> On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > shrinker memcg aware"): which included a check on swapcache before adding
> > to deferred queue (which can now be removed), but no check on deferred
> > queue before adding THP to swapcache (maybe other circumstances prevented
> > it at that time, but not now).
> 
> If I remember correctly, THP just can be added to deferred list when
> there is no PMD map before mTHP swapout, so shrink_page_list() did
> check THP's compound_mapcount (called _entire_mapcount now) before
> adding it to swap cache.
> 
> Now the code just checked whether the large folio is on deferred list or not.

I've continued to find it hard to think about, hard to be convinced by
that sequence of checks, without an actual explicit _deferred_list check.

David has brilliantly come up with the failed THP migration example;
and I think now perhaps 5.8's 5503fbf2b0b8 ("khugepaged: allow to
collapse PTE-mapped compound pages") introduced another way?

But I certainly need to reword that wagging finger pointing to your
commit: these are much more exceptional cases than I was thinking there.

I have this evening tried running swapping load on 5.10 and 6.6 and 6.11,
each with just a BUG_ON(!list_empty(the deferred list)) before resetting
memcg in mem_cgroup_swapout() - it would of course be much easier to hit
such a BUG_ON() than for the consequent wrong locking to be so unlucky
as to actually result in list corruption.

None of those BUG_ONs hit; though I was only running each for 1.5 hour,
and looking at vmstats at the end, see the were really not exercising
deferred split very much at all.  I'd been hoping for an immediate hit
(as on 6.12-rc) to confirm my doubt, but no.  That doesn't *prove* you're
right, but (excepting David's and my weird cases) I bet you are right.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >                 unsigned long pfn = folio_pfn(folio);
> >                 unsigned int order = folio_order(folio);
> >
> > -               folio_undo_large_rmappable(folio);
> > +               if (mem_cgroup_disabled())
> > +                       folio_unqueue_deferred_split(folio);
> 
> This looks confusing. It looks all callsites of free_unref_folios()
> have folio_unqueue_deferred_split() and memcg uncharge called before
> it. If there is any problem, memcg uncharge should catch it. Did I
> miss something?

I don't understand what you're suggesting there.  But David remarked
on it too, so it seems that I do need at least to add some comment.

I'd better re-examine the memcg/non-memcg forking paths again: but by
strange coincidence (or suggestion?), I'm suddenly now too tired here,
precisely where David stopped too.  I'll have to come back to this
tomorrow, sorry.

Hugh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-25  6:57     ` Hugh Dickins
@ 2024-10-25 16:34       ` Yang Shi
  2024-10-27  5:35         ` Hugh Dickins
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2024-10-25 16:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, linux-kernel, linux-mm

On Thu, Oct 24, 2024 at 11:57 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 24 Oct 2024, Yang Shi wrote:
> > On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > > shrinker memcg aware"): which included a check on swapcache before adding
> > > to deferred queue (which can now be removed), but no check on deferred
> > > queue before adding THP to swapcache (maybe other circumstances prevented
> > > it at that time, but not now).
> >
> > If I remember correctly, THP just can be added to deferred list when
> > there is no PMD map before mTHP swapout, so shrink_page_list() did
> > check THP's compound_mapcount (called _entire_mapcount now) before
> > adding it to swap cache.
> >
> > Now the code just checked whether the large folio is on deferred list or not.
>
> I've continued to find it hard to think about, hard to be convinced by
> that sequence of checks, without an actual explicit _deferred_list check.

You meant the swap cache check? I was trying to recall the reason. If
I remember correctly (sorry, memory is still vague), if the THP was
PMD-mapped and PTE-mapped by two processes, the THP may be added to
swap cache since just compound_mapcount was checked. Then
try_to_unmap() in shrink_page_list() may add it to deferred list if
PMD mapping was unmapped first. The potential list corruption fixed by
you now may be triggered.

But this was based on the assumption that there can't be PMD-mapped
THP on deferred list. If this happens (as David's migration fail
example), the swap cache check should be not enough. This case was
overlooked.

>
> David has brilliantly come up with the failed THP migration example;
> and I think now perhaps 5.8's 5503fbf2b0b8 ("khugepaged: allow to
> collapse PTE-mapped compound pages") introduced another way?
>
> But I certainly need to reword that wagging finger pointing to your
> commit: these are much more exceptional cases than I was thinking there.
>
> I have this evening tried running swapping load on 5.10 and 6.6 and 6.11,
> each with just a BUG_ON(!list_empty(the deferred list)) before resetting
> memcg in mem_cgroup_swapout() - it would of course be much easier to hit
> such a BUG_ON() than for the consequent wrong locking to be so unlucky
> as to actually result in list corruption.
>
> None of those BUG_ONs hit; though I was only running each for 1.5 hour,
> and looking at vmstats at the end, see the were really not exercising
> deferred split very much at all.  I'd been hoping for an immediate hit
> (as on 6.12-rc) to confirm my doubt, but no.  That doesn't *prove* you're
> right, but (excepting David's and my weird cases) I bet you are right.

Maybe it just happened rarely and was hard to hit. But still
theoretically possible. Your fix is more reliable.

>
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 4b21a368b4e2..57f64b5d0004 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> > >                 unsigned long pfn = folio_pfn(folio);
> > >                 unsigned int order = folio_order(folio);
> > >
> > > -               folio_undo_large_rmappable(folio);
> > > +               if (mem_cgroup_disabled())
> > > +                       folio_unqueue_deferred_split(folio);
> >
> > This looks confusing. It looks all callsites of free_unref_folios()
> > have folio_unqueue_deferred_split() and memcg uncharge called before
> > it. If there is any problem, memcg uncharge should catch it. Did I
> > miss something?
>
> I don't understand what you're suggesting there.  But David remarked
> on it too, so it seems that I do need at least to add some comment.
>
> I'd better re-examine the memcg/non-memcg forking paths again: but by
> strange coincidence (or suggestion?), I'm suddenly now too tired here,
> precisely where David stopped too.  I'll have to come back to this
> tomorrow, sorry.

I perhaps misunderstood this code. Just feel free to correct me if it
doesn't make sense to you. But, yes, some comments are definitely
welcome and helpful for understanding the code and review.

>
> Hugh


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-25 16:34       ` Yang Shi
@ 2024-10-27  5:35         ` Hugh Dickins
  0 siblings, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2024-10-27  5:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Zi Yan, Chris Li, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 2942 bytes --]

On Fri, 25 Oct 2024, Yang Shi wrote:
> On Thu, Oct 24, 2024 at 11:57 PM Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 24 Oct 2024, Yang Shi wrote:
> > > On Wed, Oct 23, 2024 at 9:13 PM Hugh Dickins <hughd@google.com> wrote:
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> > > >                 unsigned long pfn = folio_pfn(folio);
> > > >                 unsigned int order = folio_order(folio);
> > > >
> > > > -               folio_undo_large_rmappable(folio);
> > > > +               if (mem_cgroup_disabled())
> > > > +                       folio_unqueue_deferred_split(folio);
> > >
> > > This looks confusing. It looks all callsites of free_unref_folios()
> > > have folio_unqueue_deferred_split() and memcg uncharge called before
> > > it. If there is any problem, memcg uncharge should catch it. Did I
> > > miss something?
> >
> > I don't understand what you're suggesting there.  But David remarked
> > on it too, so it seems that I do need at least to add some comment.
> >
> > I'd better re-examine the memcg/non-memcg forking paths again: but by
> > strange coincidence (or suggestion?), I'm suddenly now too tired here,
> > precisely where David stopped too.  I'll have to come back to this
> > tomorrow, sorry.
> 
> I perhaps misunderstood this code. Just feel free to correct me if it
> doesn't make sense to you. But, yes, some comments are definitely
> welcome and helpful for understanding the code and review.

Thanks a lot for challenging that: it was me who misunderstood, not you.

I might just be inventing this excuse, but I think what happened was,
I'd been staring at an earlier release tree, and in that earlier tree
the prior unqueueing was tucked away inside a memcg function, but not
done in the #ifndef CONFIG_MEMCG stub: so I thought that this
folio_unqueue_deferred_split() in free_unref_folios() was needed just
to do it when mem_cgroup_disabled() (either by CONFIG or bootoption).

And I thought the "if (mem_cgroup_disabled())" was comment enough:
except it made no sense to you and David who saw what I was blind to
(and what you describe perfectly clearly above - it depresses me
sometimes, how I cannot even read what someone wrote, until I've
arrived at the same conclusion myself!).

If my story about !memcg stubs is true, then I think Matthew has
been cleaning all that up recently.  Except for put_pages_list()
(where I now see he wanted to insert a VM_BUG_ON(folio_memcg) in
April, but was forced to retreat): that one does not have a
folio_unqueue_deferred_split() in, but the good news is that
it no longer has any callers - I'll send a patch to delete it.

And instead of my misunderstood code above in free_unref_folios(),
just deleting the folio_undo_large_unmappable() line, with a
comment in the commit message.

Thanks!
Hugh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-24  4:13 ` [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
  2024-10-24 20:00   ` Yang Shi
@ 2024-10-24 20:52   ` David Hildenbrand
  2024-10-25  1:25     ` Yang Shi
  2024-10-27  7:07     ` Hugh Dickins
  1 sibling, 2 replies; 20+ messages in thread
From: David Hildenbrand @ 2024-10-24 20:52 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, Johannes Weiner, Baolin Wang, Barry Song,
	Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan, Chris Li,
	linux-kernel, linux-mm

On 24.10.24 06:13, Hugh Dickins wrote:
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
> 
> Before fixing locking: rename misleading folio_undo_large_rmappable(),
> which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> which is what it does.  

Yes please. I stumbled into that myself recently -- leftover from 
previous rework.

It would have been reasonable to move that into a separate (follow-up?) 
patch.

> But that and its out-of-line __callee are mm
> internals of very limited usability: add comment and WARN_ON_ONCEs to
> check usage; and return a bool to say if a deferred split was unqueued,
> which can then be used in WARN_ON_ONCEs around safety checks (sparing
> callers the arcane conditionals in __folio_unqueue_deferred_split()).
> 
> Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> without checking and unqueueing a THP folio from deferred split list;
> which is unfortunate, since the split_queue_lock depends on the memcg
> (when memcg is enabled); so swapout has been unqueueing such THPs later,
> when freeing the folio, using the pgdat's lock instead: potentially
> corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> 0 here, so no problem with calling folio_unqueue_deferred_split() before
> resetting memcg_data.
> 
> That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> shrinker memcg aware"): which included a check on swapcache before adding
> to deferred queue (which can now be removed), but no check on deferred
> queue before adding THP to swapcache (maybe other circumstances prevented
> it at that time, but not now).
> 
> Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> folio->memcg_data without checking and unqueueing a THP folio from the
> deferred list, sometimes corrupting "from" memcg's list, like swapout.
> Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> of course unqueues), or skip it if that fails.  Not ideal, but moving
> charge has been requested, and khugepaged should repair the THP later:
> nobody wants new custom unqueueing code just for this deprecated case.
> 
> The 87eaceb3faa5 commit did have the code to move from one deferred list
> to another (but was not conscious of its unsafety while refcount non-0);
> but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> care deferred split queue in memcg charge move path"), which argued that
> the existence of a PMD mapping guarantees that the THP cannot be on a
> deferred list.  

I recall this can happen, not sure on 5.6 though: assume we have an anon 
THP with 1 PMD mapping and a single PTE mapping for simplicity.

Assume we want to migrate that folio and first remove the PMD mapping, 
then the PTE mapping. After removing the PMD mapping, we add it to the 
deferred split queue (single PTE mapped).

Now assume migration fails and we remove migration entries -> remap.

We now have a PMD-mapped THP on the deferred split queue.

(again, I might be wrong but that's from memory without digging into the 
code)


> I'm not sure if that was true at the time (swapcache
> remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> ("mm: split underused THPs").

We only remap PTEs from the swapcache, never PMDs.

> 
> [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> large but not PMD-mapped: that's safe, but perhaps not intended: it's
> arguable whether the deprecated feature should be updated to work
> better with the new feature; but certainly not in this patch.]
> 
> Backport to 6.11 should be straightforward. Earlier backports must take
> care that other _deferred_list fixes and dependencies are included.  It
> is unclear whether these fixes are realistically needed before 6.12.
> 
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>   mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
>   mm/internal.h      | 10 +++++-----
>   mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
>   mm/memcontrol.c    |  8 +++++---
>   mm/migrate.c       |  4 ++--
>   mm/page_alloc.c    |  4 +++-
>   mm/swap.c          |  4 ++--
>   mm/vmscan.c        |  4 ++--
>   8 files changed, 65 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a1d345f1680c..dc7d5bb76495 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
>   	return split_huge_page_to_list_to_order(&folio->page, list, ret);
>   }
>   
> -void __folio_undo_large_rmappable(struct folio *folio)
> +/*
> + * __folio_unqueue_deferred_split() is not to be called directly:
> + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> + * limits its calls to those folios which may have a _deferred_list for
> + * queueing THP splits, and that list is (racily observed to be) non-empty.
> + *
> + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> + *
> + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> + * therefore important to unqueue deferred split before changing folio memcg.
> + */
> +bool __folio_unqueue_deferred_split(struct folio *folio)
>   {
>   	struct deferred_split *ds_queue;
>   	unsigned long flags;
> +	bool unqueued = false;
> +
> +	WARN_ON_ONCE(folio_ref_count(folio));
> +	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
>   
>   	ds_queue = get_deferred_split_queue(folio);
>   	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
>   				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>   		}
>   		list_del_init(&folio->_deferred_list);
> +		unqueued = true;
>   	}
>   	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +
> +	return unqueued;	/* useful for debug warnings */
>   }
>   
>   /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>   	if (!partially_mapped && !split_underused_thp)
>   		return;
>   
> -	/*
> -	 * The try_to_unmap() in page reclaim path might reach here too,
> -	 * this may cause a race condition to corrupt deferred split queue.
> -	 * And, if page reclaim is already handling the same folio, it is
> -	 * unnecessary to handle it again in shrinker.
> -	 *
> -	 * Check the swapcache flag to determine if the folio is being
> -	 * handled by page reclaim since THP swap would add the folio into
> -	 * swap cache before calling try_to_unmap().
> -	 */
> -	if (folio_test_swapcache(folio))
> -		return;
> -
>   	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>   	if (partially_mapped) {
>   		if (!folio_test_partially_mapped(folio)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index 93083bbeeefa..16c1f3cd599e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
>   #endif
>   }
>   
> -void __folio_undo_large_rmappable(struct folio *folio);
> -static inline void folio_undo_large_rmappable(struct folio *folio)
> +bool __folio_unqueue_deferred_split(struct folio *folio);
> +static inline bool folio_unqueue_deferred_split(struct folio *folio)
>   {
>   	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))

The rmappable check here is still confusing for me. I assume we want to 
exclude hugetlb or others that reuse the field for other purposes ...

> -		return;
> +		return false;
>   
>   	/*
>   	 * At this point, there is no one trying to add the folio to
> @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
>   	 * to check without acquiring the split_queue_lock.
>   	 */
>   	if (data_race(list_empty(&folio->_deferred_list)))
> -		return;
> +		return false;
>   
> -	__folio_undo_large_rmappable(folio);
> +	return __folio_unqueue_deferred_split(folio);
>   }
>   
>   static inline struct folio *page_rmappable_folio(struct page *page)
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 81d8819f13cd..f8744f5630bb 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
>   	css_get(&to->css);
>   	css_put(&from->css);
>   
> +	/* Warning should never happen, so don't worry about refcount non-0 */
> +	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>   	folio->memcg_data = (unsigned long)to;
>   
>   	__folio_memcg_unlock(from);
> @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>   	enum mc_target_type target_type;
>   	union mc_target target;
>   	struct folio *folio;
> +	bool tried_split_before = false;
>   
> +retry_pmd:
>   	ptl = pmd_trans_huge_lock(pmd, vma);
>   	if (ptl) {
>   		if (mc.precharge < HPAGE_PMD_NR) {
> @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>   		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
>   		if (target_type == MC_TARGET_PAGE) {
>   			folio = target.folio;
> +			/*
> +			 * Deferred split queue locking depends on memcg,
> +			 * and unqueue is unsafe unless folio refcount is 0:
> +			 * split or skip if on the queue? first try to split.
> +			 */
> +			if (!list_empty(&folio->_deferred_list)) {
> +				spin_unlock(ptl);
> +				if (!tried_split_before)
> +					split_folio(folio);
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				if (tried_split_before)
> +					return 0;
> +				tried_split_before = true;
> +				goto retry_pmd;
> +			}
> +			/*
> +			 * So long as that pmd lock is held, the folio cannot
> +			 * be racily added to the _deferred_list, because
> +			 * __folio_remove_rmap() will find !partially_mapped.
> +			 */
>   			if (folio_isolate_lru(folio)) {
>   				if (!mem_cgroup_move_account(folio, true,
>   							     mc.from, mc.to)) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2703227cce88..06df2af97415 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>   	struct obj_cgroup *objcg;
>   
>   	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> -	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> -			!folio_test_hugetlb(folio) &&
> -			!list_empty(&folio->_deferred_list), folio);
>   
>   	/*
>   	 * Nobody should be changing or seriously looking at
> @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>   			ug->nr_memory += nr_pages;
>   		ug->pgpgout++;
>   
> +		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>   		folio->memcg_data = 0;
>   	}
>   
> @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
>   
>   	/* Transfer the charge and the css ref */
>   	commit_charge(new, memcg);
> +
> +	/* Warning should never happen, so don't worry about refcount non-0 */
> +	WARN_ON_ONCE(folio_unqueue_deferred_split(old));
>   	old->memcg_data = 0;
>   }
>   
> @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
>   	VM_BUG_ON_FOLIO(oldid, folio);
>   	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
>   
> +	folio_unqueue_deferred_split(folio);
>   	folio->memcg_data = 0;
>   
>   	if (!mem_cgroup_is_root(memcg))
> diff --git a/mm/migrate.c b/mm/migrate.c
> index df91248755e4..691f25ee2489 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>   		    folio_test_large_rmappable(folio)) {
>   			if (!folio_ref_freeze(folio, expected_count))
>   				return -EAGAIN;
> -			folio_undo_large_rmappable(folio);
> +			folio_unqueue_deferred_split(folio);
>   			folio_ref_unfreeze(folio, expected_count);
>   		}
>   
> @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>   	}
>   
>   	/* Take off deferred split queue while frozen and memcg set */
> -	folio_undo_large_rmappable(folio);
> +	folio_unqueue_deferred_split(folio);
>   
>   	/*
>   	 * Now we know that no one else is looking at the folio:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4b21a368b4e2..57f64b5d0004 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
>   		unsigned long pfn = folio_pfn(folio);
>   		unsigned int order = folio_order(folio);
>   
> -		folio_undo_large_rmappable(folio);
> +		if (mem_cgroup_disabled())
> +			folio_unqueue_deferred_split(folio);
> +

Is it worth adding a comment here where we take care of ths with memcg 
enabled?

It's complicated stuff, nothing jumped at me, but it's late here so my 
brain is not fully functional ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-24 20:52   ` David Hildenbrand
@ 2024-10-25  1:25     ` Yang Shi
  2024-10-27  7:07     ` Hugh Dickins
  1 sibling, 0 replies; 20+ messages in thread
From: Yang Shi @ 2024-10-25  1:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, linux-kernel, linux-mm

On Thu, Oct 24, 2024 at 1:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.10.24 06:13, Hugh Dickins wrote:
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without).  The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.
>
> Yes please. I stumbled into that myself recently -- leftover from
> previous rework.
>
> It would have been reasonable to move that into a separate (follow-up?)
> patch.
>
> > But that and its out-of-line __callee are mm
> > internals of very limited usability: add comment and WARN_ON_ONCEs to
> > check usage; and return a bool to say if a deferred split was unqueued,
> > which can then be used in WARN_ON_ONCEs around safety checks (sparing
> > callers the arcane conditionals in __folio_unqueue_deferred_split()).
> >
> > Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
> > without checking and unqueueing a THP folio from deferred split list;
> > which is unfortunate, since the split_queue_lock depends on the memcg
> > (when memcg is enabled); so swapout has been unqueueing such THPs later,
> > when freeing the folio, using the pgdat's lock instead: potentially
> > corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> > 0 here, so no problem with calling folio_unqueue_deferred_split() before
> > resetting memcg_data.
> >
> > That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> > shrinker memcg aware"): which included a check on swapcache before adding
> > to deferred queue (which can now be removed), but no check on deferred
> > queue before adding THP to swapcache (maybe other circumstances prevented
> > it at that time, but not now).
> >
> > Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> > folio->memcg_data without checking and unqueueing a THP folio from the
> > deferred list, sometimes corrupting "from" memcg's list, like swapout.
> > Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> > used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> > mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
> > of course unqueues), or skip it if that fails.  Not ideal, but moving
> > charge has been requested, and khugepaged should repair the THP later:
> > nobody wants new custom unqueueing code just for this deprecated case.
> >
> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of its unsafety while refcount non-0);
> > but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.
>
> I recall this can happen, not sure on 5.6 though: assume we have an anon
> THP with 1 PMD mapping and a single PTE mapping for simplicity.
>
> Assume we want to migrate that folio and first remove the PMD mapping,
> then the PTE mapping. After removing the PMD mapping, we add it to the
> deferred split queue (single PTE mapped).
>
> Now assume migration fails and we remove migration entries -> remap.
>
> We now have a PMD-mapped THP on the deferred split queue.
>
> (again, I might be wrong but that's from memory without digging into the
> code)

It sounds possible to me on 5.6 too. I didn't see
remove_migration_ptes() remove PMD-mapped THP from deferred list. As I
said in earlier email, this case should be not common and we were luck
not to hit it.

>
>
> > I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
>
> We only remap PTEs from the swapcache, never PMDs.
>
> >
> > [Note in passing: mem_cgroup_move_charge_pte_range() just skips mTHPs,
> > large but not PMD-mapped: that's safe, but perhaps not intended: it's
> > arguable whether the deprecated feature should be updated to work
> > better with the new feature; but certainly not in this patch.]
> >
> > Backport to 6.11 should be straightforward. Earlier backports must take
> > care that other _deferred_list fixes and dependencies are included.  It
> > is unclear whether these fixes are realistically needed before 6.12.
> >
> > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   mm/huge_memory.c   | 35 +++++++++++++++++++++--------------
> >   mm/internal.h      | 10 +++++-----
> >   mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
> >   mm/memcontrol.c    |  8 +++++---
> >   mm/migrate.c       |  4 ++--
> >   mm/page_alloc.c    |  4 +++-
> >   mm/swap.c          |  4 ++--
> >   mm/vmscan.c        |  4 ++--
> >   8 files changed, 65 insertions(+), 29 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a1d345f1680c..dc7d5bb76495 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
> >       return split_huge_page_to_list_to_order(&folio->page, list, ret);
> >   }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio)
> > +/*
> > + * __folio_unqueue_deferred_split() is not to be called directly:
> > + * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
> > + * limits its calls to those folios which may have a _deferred_list for
> > + * queueing THP splits, and that list is (racily observed to be) non-empty.
> > + *
> > + * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
> > + * zero: because even when split_queue_lock is held, a non-empty _deferred_list
> > + * might be in use on deferred_split_scan()'s unlocked on-stack list.
> > + *
> > + * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
> > + * therefore important to unqueue deferred split before changing folio memcg.
> > + */
> > +bool __folio_unqueue_deferred_split(struct folio *folio)
> >   {
> >       struct deferred_split *ds_queue;
> >       unsigned long flags;
> > +     bool unqueued = false;
> > +
> > +     WARN_ON_ONCE(folio_ref_count(folio));
> > +     WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
> >
> >       ds_queue = get_deferred_split_queue(folio);
> >       spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > @@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
> >                                     MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> >               }
> >               list_del_init(&folio->_deferred_list);
> > +             unqueued = true;
> >       }
> >       spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> > +
> > +     return unqueued;        /* useful for debug warnings */
> >   }
> >
> >   /* partially_mapped=false won't clear PG_partially_mapped folio flag */
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> >       if (!partially_mapped && !split_underused_thp)
> >               return;
> >
> > -     /*
> > -      * The try_to_unmap() in page reclaim path might reach here too,
> > -      * this may cause a race condition to corrupt deferred split queue.
> > -      * And, if page reclaim is already handling the same folio, it is
> > -      * unnecessary to handle it again in shrinker.
> > -      *
> > -      * Check the swapcache flag to determine if the folio is being
> > -      * handled by page reclaim since THP swap would add the folio into
> > -      * swap cache before calling try_to_unmap().
> > -      */
> > -     if (folio_test_swapcache(folio))
> > -             return;
> > -
> >       spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >       if (partially_mapped) {
> >               if (!folio_test_partially_mapped(folio)) {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
> >   #endif
> >   }
> >
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> >   {
> >       if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
>
> The rmappable check here is still confusing for me. I assume we want to
> exclude hugetlb or others that reuse the field for other purposes ...
>
> > -             return;
> > +             return false;
> >
> >       /*
> >        * At this point, there is no one trying to add the folio to
> > @@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
> >        * to check without acquiring the split_queue_lock.
> >        */
> >       if (data_race(list_empty(&folio->_deferred_list)))
> > -             return;
> > +             return false;
> >
> > -     __folio_undo_large_rmappable(folio);
> > +     return __folio_unqueue_deferred_split(folio);
> >   }
> >
> >   static inline struct folio *page_rmappable_folio(struct page *page)
> > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> > index 81d8819f13cd..f8744f5630bb 100644
> > --- a/mm/memcontrol-v1.c
> > +++ b/mm/memcontrol-v1.c
> > @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
> >       css_get(&to->css);
> >       css_put(&from->css);
> >
> > +     /* Warning should never happen, so don't worry about refcount non-0 */
> > +     WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >       folio->memcg_data = (unsigned long)to;
> >
> >       __folio_memcg_unlock(from);
> > @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >       enum mc_target_type target_type;
> >       union mc_target target;
> >       struct folio *folio;
> > +     bool tried_split_before = false;
> >
> > +retry_pmd:
> >       ptl = pmd_trans_huge_lock(pmd, vma);
> >       if (ptl) {
> >               if (mc.precharge < HPAGE_PMD_NR) {
> > @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >               target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> >               if (target_type == MC_TARGET_PAGE) {
> >                       folio = target.folio;
> > +                     /*
> > +                      * Deferred split queue locking depends on memcg,
> > +                      * and unqueue is unsafe unless folio refcount is 0:
> > +                      * split or skip if on the queue? first try to split.
> > +                      */
> > +                     if (!list_empty(&folio->_deferred_list)) {
> > +                             spin_unlock(ptl);
> > +                             if (!tried_split_before)
> > +                                     split_folio(folio);
> > +                             folio_unlock(folio);
> > +                             folio_put(folio);
> > +                             if (tried_split_before)
> > +                                     return 0;
> > +                             tried_split_before = true;
> > +                             goto retry_pmd;
> > +                     }
> > +                     /*
> > +                      * So long as that pmd lock is held, the folio cannot
> > +                      * be racily added to the _deferred_list, because
> > +                      * __folio_remove_rmap() will find !partially_mapped.
> > +                      */
> >                       if (folio_isolate_lru(folio)) {
> >                               if (!mem_cgroup_move_account(folio, true,
> >                                                            mc.from, mc.to)) {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2703227cce88..06df2af97415 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >       struct obj_cgroup *objcg;
> >
> >       VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> > -     VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> > -                     !folio_test_hugetlb(folio) &&
> > -                     !list_empty(&folio->_deferred_list), folio);
> >
> >       /*
> >        * Nobody should be changing or seriously looking at
> > @@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> >                       ug->nr_memory += nr_pages;
> >               ug->pgpgout++;
> >
> > +             WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >               folio->memcg_data = 0;
> >       }
> >
> > @@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> >
> >       /* Transfer the charge and the css ref */
> >       commit_charge(new, memcg);
> > +
> > +     /* Warning should never happen, so don't worry about refcount non-0 */
> > +     WARN_ON_ONCE(folio_unqueue_deferred_split(old));
> >       old->memcg_data = 0;
> >   }
> >
> > @@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
> >       VM_BUG_ON_FOLIO(oldid, folio);
> >       mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
> >
> > +     folio_unqueue_deferred_split(folio);
> >       folio->memcg_data = 0;
> >
> >       if (!mem_cgroup_is_root(memcg))
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index df91248755e4..691f25ee2489 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >                   folio_test_large_rmappable(folio)) {
> >                       if (!folio_ref_freeze(folio, expected_count))
> >                               return -EAGAIN;
> > -                     folio_undo_large_rmappable(folio);
> > +                     folio_unqueue_deferred_split(folio);
> >                       folio_ref_unfreeze(folio, expected_count);
> >               }
> >
> > @@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> >       }
> >
> >       /* Take off deferred split queue while frozen and memcg set */
> > -     folio_undo_large_rmappable(folio);
> > +     folio_unqueue_deferred_split(folio);
> >
> >       /*
> >        * Now we know that no one else is looking at the folio:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >               unsigned long pfn = folio_pfn(folio);
> >               unsigned int order = folio_order(folio);
> >
> > -             folio_undo_large_rmappable(folio);
> > +             if (mem_cgroup_disabled())
> > +                     folio_unqueue_deferred_split(folio);
> > +
>
> Is it worth adding a comment here where we take care of ths with memcg
> enabled?
>
> It's complicated stuff, nothing jumped at me, but it's late here so my
> brain is not fully functional ...
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-24 20:52   ` David Hildenbrand
  2024-10-25  1:25     ` Yang Shi
@ 2024-10-27  7:07     ` Hugh Dickins
  1 sibling, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2024-10-27  7:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, linux-kernel, linux-mm

On Thu, 24 Oct 2024, David Hildenbrand wrote:
> On 24.10.24 06:13, Hugh Dickins wrote:
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without).  The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> > 
> > Before fixing locking: rename misleading folio_undo_large_rmappable(),
> > which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> > which is what it does.  
> 
> Yes please. I stumbled into that myself recently -- leftover from previous
> rework.
> 
> It would have been reasonable to move that into a separate (follow-up?) patch.

I was expecting someone to ask for a breakup: I plead, as the one who
will try some backports, to keep this all in one to make that job easier.

I doubt I'll go back further than 6.6 LTS.  Porting this patch is easy,
the hard part is gathering together the little bits and pieces it also
needs to be good (I'm thinking of folio migration freezing anon to 0
in particular, that will lead to others).

> > The 87eaceb3faa5 commit did have the code to move from one deferred list
> > to another (but was not conscious of its unsafety while refcount non-0);
> > but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> > care deferred split queue in memcg charge move path"), which argued that
> > the existence of a PMD mapping guarantees that the THP cannot be on a
> > deferred list.  
> 
> I recall this can happen, not sure on 5.6 though: assume we have an anon THP
> with 1 PMD mapping and a single PTE mapping for simplicity.
> 
> Assume we want to migrate that folio and first remove the PMD mapping, then
> the PTE mapping. After removing the PMD mapping, we add it to the deferred
> split queue (single PTE mapped).
> 
> Now assume migration fails and we remove migration entries -> remap.
> 
> We now have a PMD-mapped THP on the deferred split queue.
> 
> (again, I might be wrong but that's from memory without digging into the code)

I believe you're right, thank you for doing the heavy thinking for me.
Then I came up with 5.8's 5503fbf2b0b8 ("khugepaged: allow to
collapse PTE-mapped compound pages") introducing another way.

It looks to me like there is a case for backporting this patch,
but not a very strong case: worth some effort, but not too much -
I've not heard of the predicted issues in practice.

> 
> 
> > I'm not sure if that was true at the time (swapcache
> > remapped?), but it's clearly not true since 6.12 commit dafff3f4c850
> > ("mm: split underused THPs").
> 
> We only remap PTEs from the swapcache, never PMDs.

Right, thanks.

> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3626,19 +3646,6 @@ void deferred_split_folio(struct folio *folio, bool
> > partially_mapped)
> >    if (!partially_mapped && !split_underused_thp)
> >     return;
> >   -	/*
> > -	 * The try_to_unmap() in page reclaim path might reach here too,
> > -	 * this may cause a race condition to corrupt deferred split queue.
> > -	 * And, if page reclaim is already handling the same folio, it is
> > -	 * unnecessary to handle it again in shrinker.
> > -	 *
> > -	 * Check the swapcache flag to determine if the folio is being
> > -	 * handled by page reclaim since THP swap would add the folio into
> > -	 * swap cache before calling try_to_unmap().
> > -	 */
> > -	if (folio_test_swapcache(folio))
> > -		return;
> > -
> >    spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >    if (partially_mapped) {
> >   		if (!folio_test_partially_mapped(folio)) {

I'm changing this part in v2: keeping the return on swapcache, updating
the comment to say it's no longer essential, but a likely optimization.

I noticed the stats for deferred split were very different when I made
this change, largely because I'm doing a lot of unrealistic swapping
no doubt, but I don't want to take the risk of changing behaviour
when all we're trying to fix is list corruption.

> > diff --git a/mm/internal.h b/mm/internal.h
> > index 93083bbeeefa..16c1f3cd599e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio
> > *folio, unsigned int order)
> >   #endif
> >   }
> >   
> > -void __folio_undo_large_rmappable(struct folio *folio);
> > -static inline void folio_undo_large_rmappable(struct folio *folio)
> > +bool __folio_unqueue_deferred_split(struct folio *folio);
> > +static inline bool folio_unqueue_deferred_split(struct folio *folio)
> >   {
> >    if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
> 
> The rmappable check here is still confusing for me. I assume we want to
> exclude hugetlb or others that reuse the field for other purposes ...

Yes, I believe that is what it's about.  Somewhere else excludes hugetlb
explicitly instead.  I know Matthew would have liked to find a better name
than "large_rmappable" (aren't hugetlb folios large and rmappable?), but
nobody suggested a better.

You've reminded me to worry about how hugetlb pages get through
free_tail_page_prepare()'s case 2 check for list_empty(_deferred_list).
Ah, there's an INIT_LIST_HEAD(&folio->_deferred_list) in mm/hugetlb.c,
before freeing a hugetlb folio.  (The kind of detail which may be
different in backports.)

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4b21a368b4e2..57f64b5d0004 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2681,7 +2681,9 @@ void free_unref_folios(struct folio_batch *folios)
> >     unsigned long pfn = folio_pfn(folio);
> >     unsigned int order = folio_order(folio);
> >   -		folio_undo_large_rmappable(folio);
> > +		if (mem_cgroup_disabled())
> > +			folio_unqueue_deferred_split(folio);
> > +
> 
> Is it worth adding a comment here where we take care of ths with memcg
> enabled?

I've confessed to Yang Shi that I got this wrong: so in v2 there is
not even a line to add such a comment on (but comment in commit message).

> 
> It's complicated stuff, nothing jumped at me, but it's late here so my brain
> is not fully functional ...

Even your not fully functional brain is so much quicker than mine:
many thanks for applying it.

Hugh


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-24  4:10 [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
  2024-10-24  4:13 ` [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
@ 2024-10-24 10:20 ` Usama Arif
  2024-10-24 20:39 ` David Hildenbrand
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Usama Arif @ 2024-10-24 10:20 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Yang Shi, Wei Yang, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Johannes Weiner, Baolin Wang, Barry Song,
	Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan, Chris Li,
	linux-kernel, linux-mm



On 24/10/2024 05:10, Hugh Dickins wrote:
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
> 
> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> I gave bad advice, it looks plausible since that's a local on-stack
> list, but the fact is that it can race with a third party freeing or
> migrating the preceding folio (properly unqueueing it with refcount 0
> while holding split_queue_lock), thereby corrupting the list linkage.
> 
> The obvious answer would be to take split_queue_lock there: but it has
> a long history of contention, so I'm reluctant to add to that. Instead,
> make sure that there is always one safe (raised refcount) folio before,
> by delaying its folio_put().  (And of course I was wrong to suggest
> updating split_queue_len without the lock: leave that until the splice.)
> 

Thanks for this, I imagine this was quite tricky to debug.

Avoiding taking the split_queue_lock by keeping reference of the
preceding folio is really nice.

And I should have spotted in the original patch that split_queue_len
shouldn't be changed without holding split_queue_lock.

Acked-by: Usama Arif <usamaarif642@gmail.com>

> And remove two over-eager partially_mapped checks, restoring those tests
> to how they were before: if uncharge_folio() or free_tail_page_prepare()
> finds _deferred_list non-empty, it's in trouble whether or not that folio
> is partially_mapped (and the flag was already cleared in the latter case).
> 
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/huge_memory.c | 21 +++++++++++++++++----
>  mm/memcontrol.c  |  3 +--
>  mm/page_alloc.c  |  5 ++---
>  3 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2fb328880b50..a1d345f1680c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3718,8 +3718,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
>  	LIST_HEAD(list);
> -	struct folio *folio, *next;
> -	int split = 0;
> +	struct folio *folio, *next, *prev = NULL;
> +	int split = 0, removed = 0;
>  
>  #ifdef CONFIG_MEMCG
>  	if (sc->memcg)
> @@ -3775,15 +3775,28 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		 */
>  		if (!did_split && !folio_test_partially_mapped(folio)) {
>  			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +			removed++;
> +		} else {
> +			/*
> +			 * That unlocked list_del_init() above would be unsafe,
> +			 * unless its folio is separated from any earlier folios
> +			 * left on the list (which may be concurrently unqueued)
> +			 * by one safe folio with refcount still raised.
> +			 */
> +			swap(folio, prev);
>  		}
> -		folio_put(folio);
> +		if (folio)
> +			folio_put(folio);
>  	}
>  
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	list_splice_tail(&list, &ds_queue->split_queue);
> +	ds_queue->split_queue_len -= removed;
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> +	if (prev)
> +		folio_put(prev);
> +
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>  	 * This can happen if pages were freed under us.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7845c64a2c57..2703227cce88 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4631,8 +4631,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
>  			!folio_test_hugetlb(folio) &&
> -			!list_empty(&folio->_deferred_list) &&
> -			folio_test_partially_mapped(folio), folio);
> +			!list_empty(&folio->_deferred_list), folio);
>  
>  	/*
>  	 * Nobody should be changing or seriously looking at
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8afab64814dc..4b21a368b4e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -961,9 +961,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
>  		break;
>  	case 2:
>  		/* the second tail page: deferred_list overlaps ->mapping */
> -		if (unlikely(!list_empty(&folio->_deferred_list) &&
> -		    folio_test_partially_mapped(folio))) {
> -			bad_page(page, "partially mapped folio on deferred list");
> +		if (unlikely(!list_empty(&folio->_deferred_list))) {
> +			bad_page(page, "on deferred list");
>  			goto out;
>  		}
>  		break;



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-24  4:10 [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
  2024-10-24  4:13 ` [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
  2024-10-24 10:20 ` [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Usama Arif
@ 2024-10-24 20:39 ` David Hildenbrand
  2024-10-24 22:37 ` Zi Yan
  2024-10-25  1:56 ` Baolin Wang
  4 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2024-10-24 20:39 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, Johannes Weiner, Baolin Wang, Barry Song,
	Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan, Chris Li,
	linux-kernel, linux-mm

On 24.10.24 06:10, Hugh Dickins wrote:
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
> 
> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> I gave bad advice, it looks plausible since that's a local on-stack
> list, but the fact is that it can race with a third party freeing or
> migrating the preceding folio (properly unqueueing it with refcount 0
> while holding split_queue_lock), thereby corrupting the list linkage.
> 
> The obvious answer would be to take split_queue_lock there: but it has
> a long history of contention, so I'm reluctant to add to that. Instead,
> make sure that there is always one safe (raised refcount) folio before,
> by delaying its folio_put().  (And of course I was wrong to suggest
> updating split_queue_len without the lock: leave that until the splice.)
> 
> And remove two over-eager partially_mapped checks, restoring those tests
> to how they were before: if uncharge_folio() or free_tail_page_prepare()
> finds _deferred_list non-empty, it's in trouble whether or not that folio
> is partially_mapped (and the flag was already cleared in the latter case).
> 
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---

Challenging problem and elegant fix!

Reviewed-by: David Hildenbrand <david@redhat.com>

At some point you have to tell me your secrets, how you are able to 
trigger all these hard-to-trigger problems (6.8 stuff and no reports on 
the list) :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-24  4:10 [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
                   ` (2 preceding siblings ...)
  2024-10-24 20:39 ` David Hildenbrand
@ 2024-10-24 22:37 ` Zi Yan
  2024-10-25  5:41   ` Hugh Dickins
  2024-10-25  1:56 ` Baolin Wang
  4 siblings, 1 reply; 20+ messages in thread
From: Zi Yan @ 2024-10-24 22:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, linux-kernel, linux-mm

On 24 Oct 2024, at 0:10, Hugh Dickins wrote:

> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
>
> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> I gave bad advice, it looks plausible since that's a local on-stack
> list, but the fact is that it can race with a third party freeing or
> migrating the preceding folio (properly unqueueing it with refcount 0
> while holding split_queue_lock), thereby corrupting the list linkage.
>
> The obvious answer would be to take split_queue_lock there: but it has
> a long history of contention, so I'm reluctant to add to that. Instead,
> make sure that there is always one safe (raised refcount) folio before,
> by delaying its folio_put().  (And of course I was wrong to suggest
> updating split_queue_len without the lock: leave that until the splice.)

I feel like this is not the right approach, since it breaks the existing
condition of changing folio->_deferred_list, namely taking
ds_queue->split_queue_lock for serialization. The contention might not be
as high as you think, since if a folio were split, the split_queue_lock
needed to be taken during split anyway. So the worse case is the same
as all folios are split. Do you see significant perf degradation due to
taking the lock when doing list_del_init()?

I am afraid if we take this route, we might hit hard-to-debug bugs
in the future when someone touches the code.

Thanks.

>
> And remove two over-eager partially_mapped checks, restoring those tests
> to how they were before: if uncharge_folio() or free_tail_page_prepare()
> finds _deferred_list non-empty, it's in trouble whether or not that folio
> is partially_mapped (and the flag was already cleared in the latter case).
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/huge_memory.c | 21 +++++++++++++++++----
>  mm/memcontrol.c  |  3 +--
>  mm/page_alloc.c  |  5 ++---
>  3 files changed, 20 insertions(+), 9 deletions(-)

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-24 22:37 ` Zi Yan
@ 2024-10-25  5:41   ` Hugh Dickins
  2024-10-25 15:32     ` Zi Yan
  0 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2024-10-25  5:41 UTC (permalink / raw)
  To: Zi Yan
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, linux-kernel, linux-mm

On Thu, 24 Oct 2024, Zi Yan wrote:
> On 24 Oct 2024, at 0:10, Hugh Dickins wrote:
> 
> > The new unlocked list_del_init() in deferred_split_scan() is buggy.
> > I gave bad advice, it looks plausible since that's a local on-stack
> > list, but the fact is that it can race with a third party freeing or
> > migrating the preceding folio (properly unqueueing it with refcount 0
> > while holding split_queue_lock), thereby corrupting the list linkage.
> >
> > The obvious answer would be to take split_queue_lock there: but it has
> > a long history of contention, so I'm reluctant to add to that. Instead,
> > make sure that there is always one safe (raised refcount) folio before,
> > by delaying its folio_put().  (And of course I was wrong to suggest
> > updating split_queue_len without the lock: leave that until the splice.)
> 
> I feel like this is not the right approach, since it breaks the existing
> condition of changing folio->_deferred_list, namely taking
> ds_queue->split_queue_lock for serialization. The contention might not be
> as high as you think, since if a folio were split, the split_queue_lock
> needed to be taken during split anyway. So the worse case is the same
> as all folios are split. Do you see significant perf degradation due to
> taking the lock when doing list_del_init()?
> 
> I am afraid if we take this route, we might hit hard-to-debug bugs
> in the future when someone touches the code.

You have a good point: I am adding another element of trickiness
to that already-tricky local-but-not-quite list - which has tripped
us up a few times in the past.

I do still feel that this solution is right in the spirit of that list;
but I've certainly not done any performance measurement to justify it,
nor would I ever trust my skill to do so.  I just tried to solve the
corruptions in what I thought was the best way.

(To be honest, I found this solution to the corruptions first, and thought
the bug went back to the original implemention: that its put_page() at the
end of the loop was premature all along.  It was only when writing the
commit message two days ago, that I came to realize that even put_page()
or folio_put() would be safely using the lock to unqueue: that it is only
this new list_del_init() which is the exception which introduces the bug.)

Looking at vmstats, I'm coming to believe that the performance advantage
of this way is likely to be in the noise: that mTHPs alone, and the
!partially_mapped case on top, are greatly increasing the split_deferred
stats: and may give rise to renewed complaints of lock contention, with
or without this optimization.

While I still prefer to stick with what's posted and most tested, I am
giving the locked version a run overnight.  Thanks a lot for the reviews
and acks everyone: at present Zi Yan is in the minority preferring a
locked version, but please feel free to change your vote if you wish.

Hugh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-25  5:41   ` Hugh Dickins
@ 2024-10-25 15:32     ` Zi Yan
  2024-10-25 18:36       ` Yang Shi
  2024-10-27  4:43       ` Hugh Dickins
  0 siblings, 2 replies; 20+ messages in thread
From: Zi Yan @ 2024-10-25 15:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, linux-kernel, linux-mm

On 25 Oct 2024, at 1:41, Hugh Dickins wrote:

> On Thu, 24 Oct 2024, Zi Yan wrote:
>> On 24 Oct 2024, at 0:10, Hugh Dickins wrote:
>>
>>> The new unlocked list_del_init() in deferred_split_scan() is buggy.
>>> I gave bad advice, it looks plausible since that's a local on-stack
>>> list, but the fact is that it can race with a third party freeing or
>>> migrating the preceding folio (properly unqueueing it with refcount 0
>>> while holding split_queue_lock), thereby corrupting the list linkage.
>>>
>>> The obvious answer would be to take split_queue_lock there: but it has
>>> a long history of contention, so I'm reluctant to add to that. Instead,
>>> make sure that there is always one safe (raised refcount) folio before,
>>> by delaying its folio_put().  (And of course I was wrong to suggest
>>> updating split_queue_len without the lock: leave that until the splice.)
>>
>> I feel like this is not the right approach, since it breaks the existing
>> condition of changing folio->_deferred_list, namely taking
>> ds_queue->split_queue_lock for serialization. The contention might not be
>> as high as you think, since if a folio were split, the split_queue_lock
>> needed to be taken during split anyway. So the worse case is the same
>> as all folios are split. Do you see significant perf degradation due to
>> taking the lock when doing list_del_init()?
>>
>> I am afraid if we take this route, we might hit hard-to-debug bugs
>> in the future when someone touches the code.
>
> You have a good point: I am adding another element of trickiness
> to that already-tricky local-but-not-quite list - which has tripped
> us up a few times in the past.
>
> I do still feel that this solution is right in the spirit of that list;
> but I've certainly not done any performance measurement to justify it,
> nor would I ever trust my skill to do so.  I just tried to solve the
> corruptions in what I thought was the best way.
>
> (To be honest, I found this solution to the corruptions first, and thought
> the bug went back to the original implemention: that its put_page() at the
> end of the loop was premature all along.  It was only when writing the
> commit message two days ago, that I came to realize that even put_page()
> or folio_put() would be safely using the lock to unqueue: that it is only
> this new list_del_init() which is the exception which introduces the bug.)
>
> Looking at vmstats, I'm coming to believe that the performance advantage
> of this way is likely to be in the noise: that mTHPs alone, and the
> !partially_mapped case on top, are greatly increasing the split_deferred
> stats: and may give rise to renewed complaints of lock contention, with
> or without this optimization.
>
> While I still prefer to stick with what's posted and most tested, I am
> giving the locked version a run overnight.  Thanks a lot for the reviews
> and acks everyone: at present Zi Yan is in the minority preferring a
> locked version, but please feel free to change your vote if you wish.

Thank you a lot for taking the time to check the locked version. Looking
forward to the result. BTW, I am not going to block this patch since it
fixes the bug.

The tricky part in deferred_list_scan() is always the use of
folio->_deferred_list without taking split_queue_lock. I am thinking about
use folio_batch to store the out-of-split_queue folios, so that _deferred_list
will not be touched when these folios are tried to be split. Basically,

1. loop through split_queue and move folios to a folio_batch until the
   folio_batch is full;
2. loop through the folio_batch to try to split each folio;
3. move the remaining folios back to split_queue.

With this approach, split_queue_lock might be taken more if there are
more than 31 (folio_batch max size) folios on split_queue and split_queue_lock
will be held longer in step 3, since the remaining folios need to be
added back to split_queue one by one instead of a single list splice.

Let me know your thoughts. I can look into this if this approach sounds
promising. Thanks.


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-25 15:32     ` Zi Yan
@ 2024-10-25 18:36       ` Yang Shi
  2024-10-27  5:08         ` Hugh Dickins
  2024-10-27  4:43       ` Hugh Dickins
  1 sibling, 1 reply; 20+ messages in thread
From: Yang Shi @ 2024-10-25 18:36 UTC (permalink / raw)
  To: Zi Yan
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, linux-kernel, linux-mm

On Fri, Oct 25, 2024 at 8:32 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 25 Oct 2024, at 1:41, Hugh Dickins wrote:
>
> > On Thu, 24 Oct 2024, Zi Yan wrote:
> >> On 24 Oct 2024, at 0:10, Hugh Dickins wrote:
> >>
> >>> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> >>> I gave bad advice, it looks plausible since that's a local on-stack
> >>> list, but the fact is that it can race with a third party freeing or
> >>> migrating the preceding folio (properly unqueueing it with refcount 0
> >>> while holding split_queue_lock), thereby corrupting the list linkage.
> >>>
> >>> The obvious answer would be to take split_queue_lock there: but it has
> >>> a long history of contention, so I'm reluctant to add to that. Instead,
> >>> make sure that there is always one safe (raised refcount) folio before,
> >>> by delaying its folio_put().  (And of course I was wrong to suggest
> >>> updating split_queue_len without the lock: leave that until the splice.)
> >>
> >> I feel like this is not the right approach, since it breaks the existing
> >> condition of changing folio->_deferred_list, namely taking
> >> ds_queue->split_queue_lock for serialization. The contention might not be
> >> as high as you think, since if a folio were split, the split_queue_lock
> >> needed to be taken during split anyway. So the worse case is the same
> >> as all folios are split. Do you see significant perf degradation due to
> >> taking the lock when doing list_del_init()?
> >>
> >> I am afraid if we take this route, we might hit hard-to-debug bugs
> >> in the future when someone touches the code.
> >
> > You have a good point: I am adding another element of trickiness
> > to that already-tricky local-but-not-quite list - which has tripped
> > us up a few times in the past.
> >
> > I do still feel that this solution is right in the spirit of that list;
> > but I've certainly not done any performance measurement to justify it,
> > nor would I ever trust my skill to do so.  I just tried to solve the
> > corruptions in what I thought was the best way.
> >
> > (To be honest, I found this solution to the corruptions first, and thought
> > the bug went back to the original implemention: that its put_page() at the
> > end of the loop was premature all along.  It was only when writing the
> > commit message two days ago, that I came to realize that even put_page()
> > or folio_put() would be safely using the lock to unqueue: that it is only
> > this new list_del_init() which is the exception which introduces the bug.)
> >
> > Looking at vmstats, I'm coming to believe that the performance advantage
> > of this way is likely to be in the noise: that mTHPs alone, and the
> > !partially_mapped case on top, are greatly increasing the split_deferred
> > stats: and may give rise to renewed complaints of lock contention, with
> > or without this optimization.
> >
> > While I still prefer to stick with what's posted and most tested, I am
> > giving the locked version a run overnight.  Thanks a lot for the reviews
> > and acks everyone: at present Zi Yan is in the minority preferring a
> > locked version, but please feel free to change your vote if you wish.
>
> Thank you a lot for taking the time to check the locked version. Looking
> forward to the result. BTW, I am not going to block this patch since it
> fixes the bug.
>
> The tricky part in deferred_list_scan() is always the use of
> folio->_deferred_list without taking split_queue_lock. I am thinking about
> use folio_batch to store the out-of-split_queue folios, so that _deferred_list
> will not be touched when these folios are tried to be split. Basically,
>
> 1. loop through split_queue and move folios to a folio_batch until the
>    folio_batch is full;
> 2. loop through the folio_batch to try to split each folio;
> 3. move the remaining folios back to split_queue.
>
> With this approach, split_queue_lock might be taken more if there are
> more than 31 (folio_batch max size) folios on split_queue and split_queue_lock
> will be held longer in step 3, since the remaining folios need to be
> added back to split_queue one by one instead of a single list splice.

IMHO, the folio_batch approach is worth trying. The deferred list lock
is just held when deleting folio from deferred list and updating the
list len. Re-acquiring the lock every 31 folios seems not very bad. Of
course, some benchmark is needed.

The other subtle thing is folio->_deferred_list is reused when the
folio is moved to the local on-stack list. And some
list_empty(deferred_list) checks return true even though the folio is
actually on the local on-stack list. Some code may depend on or
inadvertently depend on this behavior. Using folio_batch may break
some assumptions, but depending on this subtle behavior is definitely
not reliable IMHO.

>
> Let me know your thoughts. I can look into this if this approach sounds
> promising. Thanks.
>
>
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-25 18:36       ` Yang Shi
@ 2024-10-27  5:08         ` Hugh Dickins
  2024-10-28 18:36           ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: Hugh Dickins @ 2024-10-27  5:08 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zi Yan, Hugh Dickins, Andrew Morton, Usama Arif, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, linux-kernel, linux-mm

On Fri, 25 Oct 2024, Yang Shi wrote:
> 
> The other subtle thing is folio->_deferred_list is reused when the
> folio is moved to the local on-stack list. And some

Yes.

> list_empty(deferred_list) checks return true even though the folio is
> actually on the local on-stack list. Some code may depend on or

The code definitely depends on that behaviour: that's how folios get
unqueued when refcount reaches 0, whether they are on the public list
or on the local list at that time.

> inadvertently depend on this behavior. Using folio_batch may break
> some assumptions, but depending on this subtle behavior is definitely
> not reliable IMHO.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-27  5:08         ` Hugh Dickins
@ 2024-10-28 18:36           ` Yang Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2024-10-28 18:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Zi Yan, Andrew Morton, Usama Arif, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Chris Li,
	linux-kernel, linux-mm

On Sat, Oct 26, 2024 at 10:08 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Fri, 25 Oct 2024, Yang Shi wrote:
> >
> > The other subtle thing is folio->_deferred_list is reused when the
> > folio is moved to the local on-stack list. And some
>
> Yes.
>
> > list_empty(deferred_list) checks return true even though the folio is
> > actually on the local on-stack list. Some code may depend on or
>
> The code definitely depends on that behaviour: that's how folios get
> unqueued when refcount reaches 0, whether they are on the public list
> or on the local list at that time.

Yeah, folio may have 0 refcount on the local list after that
folio_put() before it is moved back to deferred list.

The main purpose for using folio_batch is to disambiguate list_empty()
so that we don't rely on this subtle behavior. But I soon realized
this may make deferred list lock contention worse when moving the
folios back to deferred list. Currently we just need to do list
splice, but we have to add every single folio back to deferred list
one by one with folio_batch. It depends on how often folio split
fails.

>
> > inadvertently depend on this behavior. Using folio_batch may break
> > some assumptions, but depending on this subtle behavior is definitely
> > not reliable IMHO.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-25 15:32     ` Zi Yan
  2024-10-25 18:36       ` Yang Shi
@ 2024-10-27  4:43       ` Hugh Dickins
  1 sibling, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2024-10-27  4:43 UTC (permalink / raw)
  To: Zi Yan
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, linux-kernel, linux-mm

On Fri, 25 Oct 2024, Zi Yan wrote:
> 
> Thank you a lot for taking the time to check the locked version. Looking
> forward to the result.

The locked version worked fine (I did see some unusual filesystem on loop
errors on this laptop, but very much doubt that was related to the extra
deferred split queue locking).  But I do still prefer the original version.

> BTW, I am not going to block this patch since it fixes the bug.

Thanks!  I'll put out the same as v2 1/2.

> 
> The tricky part in deferred_list_scan() is always the use of
> folio->_deferred_list without taking split_queue_lock. I am thinking about
> use folio_batch to store the out-of-split_queue folios, so that _deferred_list
> will not be touched when these folios are tried to be split. Basically,
> 
> 1. loop through split_queue and move folios to a folio_batch until the
>    folio_batch is full;
> 2. loop through the folio_batch to try to split each folio;
> 3. move the remaining folios back to split_queue.
> 
> With this approach, split_queue_lock might be taken more if there are
> more than 31 (folio_batch max size) folios on split_queue and split_queue_lock
> will be held longer in step 3, since the remaining folios need to be
> added back to split_queue one by one instead of a single list splice.
> 
> Let me know your thoughts. I can look into this if this approach sounds
> promising. Thanks.

TBH, I just don't see the point: we have a working mechanism, and
complicating it to rely on more infrastructure, just to reach the
same endpoint, seems a waste of developer time to me.  It's not as
if a folio_batch there would make the split handling more efficient.

It would disambiguate the list_empty(), sure; but as it stands,
there's nothing in the handling which needs that disambiguated.

I can half-imagine that a folio_batch might become a better technique,
if we go on to need less restricted use of the deferred split queue:
if for some reason we grow to want to unqueue a THP while it's still
in use (as memcg move wanted in 2/2, but was not worth recoding for).

But I'm not sure whether a folio_batch would actually help there,
and I wouldn't worry about it unless there's a need.

Hugh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-24  4:10 [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
                   ` (3 preceding siblings ...)
  2024-10-24 22:37 ` Zi Yan
@ 2024-10-25  1:56 ` Baolin Wang
  4 siblings, 0 replies; 20+ messages in thread
From: Baolin Wang @ 2024-10-25  1:56 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Barry Song,
	Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan, Chris Li,
	linux-kernel, linux-mm



On 2024/10/24 12:10, Hugh Dickins wrote:
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
> 
> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> I gave bad advice, it looks plausible since that's a local on-stack
> list, but the fact is that it can race with a third party freeing or
> migrating the preceding folio (properly unqueueing it with refcount 0
> while holding split_queue_lock), thereby corrupting the list linkage.
> 
> The obvious answer would be to take split_queue_lock there: but it has
> a long history of contention, so I'm reluctant to add to that. Instead,
> make sure that there is always one safe (raised refcount) folio before,
> by delaying its folio_put().  (And of course I was wrong to suggest
> updating split_queue_len without the lock: leave that until the splice.)
> 
> And remove two over-eager partially_mapped checks, restoring those tests
> to how they were before: if uncharge_folio() or free_tail_page_prepare()
> finds _deferred_list non-empty, it's in trouble whether or not that folio
> is partially_mapped (and the flag was already cleared in the latter case).
> 
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Good catch. LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-10-28 18:37 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-24  4:10 [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
2024-10-24  4:13 ` [PATCH hotfix 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
2024-10-24 20:00   ` Yang Shi
2024-10-25  1:21     ` Yang Shi
2024-10-25  6:57     ` Hugh Dickins
2024-10-25 16:34       ` Yang Shi
2024-10-27  5:35         ` Hugh Dickins
2024-10-24 20:52   ` David Hildenbrand
2024-10-25  1:25     ` Yang Shi
2024-10-27  7:07     ` Hugh Dickins
2024-10-24 10:20 ` [PATCH hotfix 1/2] mm/thp: fix deferred split queue not partially_mapped Usama Arif
2024-10-24 20:39 ` David Hildenbrand
2024-10-24 22:37 ` Zi Yan
2024-10-25  5:41   ` Hugh Dickins
2024-10-25 15:32     ` Zi Yan
2024-10-25 18:36       ` Yang Shi
2024-10-27  5:08         ` Hugh Dickins
2024-10-28 18:36           ` Yang Shi
2024-10-27  4:43       ` Hugh Dickins
2024-10-25  1:56 ` Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox