linux-mm.kvack.org archive mirror
* [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped
@ 2024-10-27 19:59 Hugh Dickins
  2024-10-27 20:02 ` [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
  2024-10-27 20:06 ` [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped Zi Yan
  0 siblings, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2024-10-27 19:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, Shakeel Butt, linux-kernel, linux-mm

Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without).  The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.

The new unlocked list_del_init() in deferred_split_scan() is buggy.
I gave bad advice, it looks plausible since that's a local on-stack
list, but the fact is that it can race with a third party freeing or
migrating the preceding folio (properly unqueueing it with refcount 0
while holding split_queue_lock), thereby corrupting the list linkage.

The obvious answer would be to take split_queue_lock there: but it has
a long history of contention, so I'm reluctant to add to that. Instead,
make sure that there is always one safe (raised refcount) folio before,
by delaying its folio_put().  (And of course I was wrong to suggest
updating split_queue_len without the lock: leave that until the splice.)
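
For reference, a condensed sketch of the fixed loop (split attempt and
stats elided; the real thing is in the mm/huge_memory.c diff below):
"prev" trails the scan as the one folio whose raised refcount stops a
concurrent unqueue of any earlier folio from touching our linkage, so
each folio_put() is delayed by one position.

	struct folio *folio, *next, *prev = NULL;
	int split = 0, removed = 0;

	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		bool did_split = false;

		/* trylock and split_folio() set did_split, as before */

		if (!did_split && !folio_test_partially_mapped(folio)) {
			list_del_init(&folio->_deferred_list);
			removed++;
		} else {
			swap(folio, prev);	/* folio is the new barrier */
		}
		if (folio)
			folio_put(folio);	/* put the old barrier */
	}

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_splice_tail(&list, &ds_queue->split_queue);
	ds_queue->split_queue_len -= removed;
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

	if (prev)
		folio_put(prev);		/* last barrier, now safe */

(As it turns out, the did_split case of that test is refined by the
follow-up fix at the end of this thread: a successfully split folio is
already off the list and must not become the barrier.)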

And remove two over-eager partially_mapped checks, restoring those tests
to how they were before: if uncharge_folio() or free_tail_page_prepare()
finds _deferred_list non-empty, it's in trouble whether or not that folio
is partially_mapped (and the flag was already cleared in the latter case).

Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Based on 6.12-rc4
v2: added ack and reviewed-bys

 mm/huge_memory.c | 21 +++++++++++++++++----
 mm/memcontrol.c  |  3 +--
 mm/page_alloc.c  |  5 ++---
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2fb328880b50..a1d345f1680c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3718,8 +3718,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
 	LIST_HEAD(list);
-	struct folio *folio, *next;
-	int split = 0;
+	struct folio *folio, *next, *prev = NULL;
+	int split = 0, removed = 0;
 
 #ifdef CONFIG_MEMCG
 	if (sc->memcg)
@@ -3775,15 +3775,28 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		 */
 		if (!did_split && !folio_test_partially_mapped(folio)) {
 			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+			removed++;
+		} else {
+			/*
+			 * That unlocked list_del_init() above would be unsafe,
+			 * unless its folio is separated from any earlier folios
+			 * left on the list (which may be concurrently unqueued)
+			 * by one safe folio with refcount still raised.
+			 */
+			swap(folio, prev);
 		}
-		folio_put(folio);
+		if (folio)
+			folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	list_splice_tail(&list, &ds_queue->split_queue);
+	ds_queue->split_queue_len -= removed;
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
+	if (prev)
+		folio_put(prev);
+
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
 	 * This can happen if pages were freed under us.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..2703227cce88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4631,8 +4631,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
 			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list) &&
-			folio_test_partially_mapped(folio), folio);
+			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8afab64814dc..4b21a368b4e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -961,9 +961,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/* the second tail page: deferred_list overlaps ->mapping */
-		if (unlikely(!list_empty(&folio->_deferred_list) &&
-		    folio_test_partially_mapped(folio))) {
-			bad_page(page, "partially mapped folio on deferred list");
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "on deferred list");
 			goto out;
 		}
 		break;
-- 
2.35.3



* [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-27 19:59 [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
@ 2024-10-27 20:02 ` Hugh Dickins
  2024-10-28 10:43   ` David Hildenbrand
  2024-10-28 18:39   ` Yang Shi
  2024-10-27 20:06 ` [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped Zi Yan
  1 sibling, 2 replies; 13+ messages in thread
From: Hugh Dickins @ 2024-10-27 20:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, Shakeel Butt, linux-kernel, linux-mm

Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without).  The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.

Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does.  But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).
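
The two caller idioms this enables can be seen in the hunks below: a
site which may legitimately find the folio still queued (refcount
frozen to 0) calls it plainly, while a site which expects the folio to
be long gone from the queue wraps the call, turning a violated
assumption into a warning rather than later list corruption:

	/* e.g. mem_cgroup_swapout(): may really unqueue */
	folio_unqueue_deferred_split(folio);

	/* e.g. uncharge_folio(): should never find it queued */
	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));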

Just omit the folio_unqueue_deferred_split() from free_unref_folios(),
all of whose callers now call it beforehand (and if any forget then
bad_page() will tell) - except for its caller put_pages_list(), which
itself no longer has any callers (and will be deleted separately).

Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data to 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list.  __remove_mapping() has frozen refcount to
0 here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.
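
Condensed from the mm/memcontrol.c hunk below: get_deferred_split_queue()
derives split_queue_lock from folio_memcg(), so the unqueue must come
while the old memcg is still visible, and the frozen refcount is what
makes calling it here safe:

	/* in mem_cgroup_swapout(), refcount already frozen to 0: */
	folio_unqueue_deferred_split(folio);	/* memcg's lock still taken */
	folio->memcg_data = 0;			/* after this, wrong lock */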

That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache.  That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.

Keep the check on swapcache before adding to deferred queue?  Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.
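
The retained test itself is unchanged; only its justifying comment is
rewritten in the diff below:

	/* in deferred_split_folio(): */
	if (folio_test_swapcache(folio))
		return;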

Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first tries to split the THP (splitting
of course unqueues), or skips it if that fails.  Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.
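
Condensed from the mm/memcontrol-v1.c hunk below: on finding the THP
queued, drop the pmd lock and try one split (success unqueues it),
then retry the pmd walk once; if it is still queued after that, give
up and leave the charge where it was:

	if (!list_empty(&folio->_deferred_list)) {
		spin_unlock(ptl);
		if (!tried_split_before)
			split_folio(folio);	/* unqueues on success */
		folio_unlock(folio);
		folio_put(folio);
		if (tried_split_before)
			return 0;		/* skip this pmd */
		tried_split_before = true;
		goto retry_pmd;
	}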

The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
care deferred split queue in memcg charge move path"), which argued that
the existence of a PMD mapping guarantees that the THP cannot be on a
deferred list.  As above, false in rare cases, and now commonly false.

Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included.
There is not a strong case for backports, but they can fix corner cases.

Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
Based on 6.12-rc4
v2: adjusted commit message following info from Yang and David
    reinstated deferred_split_folio swapcache check, adjusting comment
    omitted (mem_cgroup_disabled) unqueue from free_unref_folios

 mm/huge_memory.c   | 35 ++++++++++++++++++++++++++---------
 mm/internal.h      | 10 +++++-----
 mm/memcontrol-v1.c | 25 +++++++++++++++++++++++++
 mm/memcontrol.c    |  8 +++++---
 mm/migrate.c       |  4 ++--
 mm/page_alloc.c    |  1 -
 mm/swap.c          |  4 ++--
 mm/vmscan.c        |  4 ++--
 8 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a1d345f1680c..03fd4bc39ea1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3588,10 +3588,27 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
 	return split_huge_page_to_list_to_order(&folio->page, list, ret);
 }
 
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
 	unsigned long flags;
+	bool unqueued = false;
+
+	WARN_ON_ONCE(folio_ref_count(folio));
+	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
 
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -3603,8 +3620,11 @@ void __folio_undo_large_rmappable(struct folio *folio)
 				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
 		list_del_init(&folio->_deferred_list);
+		unqueued = true;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+	return unqueued;	/* useful for debug warnings */
 }
 
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
@@ -3627,14 +3647,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 		return;
 
 	/*
-	 * The try_to_unmap() in page reclaim path might reach here too,
-	 * this may cause a race condition to corrupt deferred split queue.
-	 * And, if page reclaim is already handling the same folio, it is
-	 * unnecessary to handle it again in shrinker.
-	 *
-	 * Check the swapcache flag to determine if the folio is being
-	 * handled by page reclaim since THP swap would add the folio into
-	 * swap cache before calling try_to_unmap().
+	 * Exclude swapcache: originally to avoid a corrupt deferred split
+	 * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+	 * but if page reclaim is already handling the same folio, it is
+	 * unnecessary to handle it again in the shrinker, so excluding
+	 * swapcache here may still be a useful optimization.
 	 */
 	if (folio_test_swapcache(folio))
 		return;
diff --git a/mm/internal.h b/mm/internal.h
index 93083bbeeefa..16c1f3cd599e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -639,11 +639,11 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 #endif
 }
 
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
 {
 	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
-		return;
+		return false;
 
 	/*
 	 * At this point, there is no one trying to add the folio to
@@ -651,9 +651,9 @@ static inline void folio_undo_large_rmappable(struct folio *folio)
 	 * to check without acquiring the split_queue_lock.
 	 */
 	if (data_race(list_empty(&folio->_deferred_list)))
-		return;
+		return false;
 
-	__folio_undo_large_rmappable(folio);
+	return __folio_unqueue_deferred_split(folio);
 }
 
 static inline struct folio *page_rmappable_folio(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 81d8819f13cd..f8744f5630bb 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
 	css_get(&to->css);
 	css_put(&from->css);
 
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 	folio->memcg_data = (unsigned long)to;
 
 	__folio_memcg_unlock(from);
@@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct folio *folio;
+	bool tried_split_before = false;
 
+retry_pmd:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (mc.precharge < HPAGE_PMD_NR) {
@@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			folio = target.folio;
+			/*
+			 * Deferred split queue locking depends on memcg,
+			 * and unqueue is unsafe unless folio refcount is 0:
+			 * split or skip if on the queue? first try to split.
+			 */
+			if (!list_empty(&folio->_deferred_list)) {
+				spin_unlock(ptl);
+				if (!tried_split_before)
+					split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (tried_split_before)
+					return 0;
+				tried_split_before = true;
+				goto retry_pmd;
+			}
+			/*
+			 * So long as that pmd lock is held, the folio cannot
+			 * be racily added to the _deferred_list, because
+			 * __folio_remove_rmap() will find !partially_mapped.
+			 */
 			if (folio_isolate_lru(folio)) {
 				if (!mem_cgroup_move_account(folio, true,
 							     mc.from, mc.to)) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2703227cce88..06df2af97415 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4629,9 +4629,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
-	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
-			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
@@ -4678,6 +4675,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 			ug->nr_memory += nr_pages;
 		ug->pgpgout++;
 
+		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 		folio->memcg_data = 0;
 	}
 
@@ -4789,6 +4787,9 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 
 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
+
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(old));
 	old->memcg_data = 0;
 }
 
@@ -4975,6 +4976,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
 
 	if (!mem_cgroup_is_root(memcg))
diff --git a/mm/migrate.c b/mm/migrate.c
index df91248755e4..691f25ee2489 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -489,7 +489,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		    folio_test_large_rmappable(folio)) {
 			if (!folio_ref_freeze(folio, expected_count))
 				return -EAGAIN;
-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			folio_ref_unfreeze(folio, expected_count);
 		}
 
@@ -514,7 +514,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	/* Take off deferred split queue while frozen and memcg set */
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 
 	/*
 	 * Now we know that no one else is looking at the folio:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b21a368b4e2..815100a45b25 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2681,7 +2681,6 @@ void free_unref_folios(struct folio_batch *folios)
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = folio_order(folio);
 
-		folio_undo_large_rmappable(folio);
 		if (!free_pages_prepare(&folio->page, order))
 			continue;
 		/*
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..b8e3259ea2c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -121,7 +121,7 @@ void __folio_put(struct folio *folio)
 	}
 
 	page_cache_release(folio);
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, folio_order(folio));
 }
@@ -988,7 +988,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 			free_huge_folio(folio);
 			continue;
 		}
-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		__page_cache_release(folio, &lruvec, &flags);
 
 		if (j != i)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb4e8440c507..635d45745b73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1475,7 +1475,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		 */
 		nr_reclaimed += nr_pages;
 
-		folio_undo_large_rmappable(folio);
+		folio_unqueue_deferred_split(folio);
 		if (folio_batch_add(&free_folios, folio) == 0) {
 			mem_cgroup_uncharge_folios(&free_folios);
 			try_to_unmap_flush();
@@ -1863,7 +1863,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		if (unlikely(folio_put_testzero(folio))) {
 			__folio_clear_lru_flags(folio);
 
-			folio_undo_large_rmappable(folio);
+			folio_unqueue_deferred_split(folio);
 			if (folio_batch_add(&free_folios, folio) == 0) {
 				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge_folios(&free_folios);
-- 
2.35.3



* Re: [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-27 19:59 [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
  2024-10-27 20:02 ` [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
@ 2024-10-27 20:06 ` Zi Yan
  2024-11-10 21:08   ` Hugh Dickins
  1 sibling, 1 reply; 13+ messages in thread
From: Zi Yan @ 2024-10-27 20:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, Shakeel Butt, linux-kernel,
	linux-mm


On 27 Oct 2024, at 15:59, Hugh Dickins wrote:

> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
>
> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> I gave bad advice, it looks plausible since that's a local on-stack
> list, but the fact is that it can race with a third party freeing or
> migrating the preceding folio (properly unqueueing it with refcount 0
> while holding split_queue_lock), thereby corrupting the list linkage.
>
> The obvious answer would be to take split_queue_lock there: but it has
> a long history of contention, so I'm reluctant to add to that. Instead,
> make sure that there is always one safe (raised refcount) folio before,
> by delaying its folio_put().  (And of course I was wrong to suggest
> updating split_queue_len without the lock: leave that until the splice.)
>
> And remove two over-eager partially_mapped checks, restoring those tests
> to how they were before: if uncharge_folio() or free_tail_page_prepare()
> finds _deferred_list non-empty, it's in trouble whether or not that folio
> is partially_mapped (and the flag was already cleared in the latter case).
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> Based on 6.12-rc4
> v2: added ack and reviewed-bys

Acked-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi



* Re: [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-27 20:02 ` [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
@ 2024-10-28 10:43   ` David Hildenbrand
  2024-10-28 17:19     ` Hugh Dickins
  2024-10-28 18:39   ` Yang Shi
  1 sibling, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2024-10-28 10:43 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, Johannes Weiner, Baolin Wang, Barry Song,
	Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan, Chris Li,
	Shakeel Butt, linux-kernel, linux-mm

Hi Hugh,

mostly looks good to me, one comment:

> +++ b/mm/memcontrol-v1.c
> @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
>   	css_get(&to->css);
>   	css_put(&from->css);
>   
> +	/* Warning should never happen, so don't worry about refcount non-0 */
> +	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
>   	folio->memcg_data = (unsigned long)to;
>   
>   	__folio_memcg_unlock(from);
> @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>   	enum mc_target_type target_type;
>   	union mc_target target;
>   	struct folio *folio;
> +	bool tried_split_before = false;
>   
> +retry_pmd:
>   	ptl = pmd_trans_huge_lock(pmd, vma);
>   	if (ptl) {
>   		if (mc.precharge < HPAGE_PMD_NR) {
> @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>   		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
>   		if (target_type == MC_TARGET_PAGE) {
>   			folio = target.folio;
> +			/*
> +			 * Deferred split queue locking depends on memcg,
> +			 * and unqueue is unsafe unless folio refcount is 0:
> +			 * split or skip if on the queue? first try to split.
> +			 */
> +			if (!list_empty(&folio->_deferred_list)) {
> +				spin_unlock(ptl);
> +				if (!tried_split_before)
> +					split_folio(folio);
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				if (tried_split_before)
> +					return 0;
> +				tried_split_before = true;
> +				goto retry_pmd;
> +			}
> +			/*
> +			 * So long as that pmd lock is held, the folio cannot
> +			 * be racily added to the _deferred_list, because
> +			 * __folio_remove_rmap() will find !partially_mapped.
> +			 */

Fortunately that code is getting ripped out.

https://lkml.kernel.org/r/20241025012304.2473312-3-shakeel.butt@linux.dev

So I wonder ... as a quick fix should we simply handle it like the code 
further down where we refuse PTE-mapped large folios completely?

"ignore such a partial THP and keep it in original memcg"

...

and simply skip this folio similarly? I mean, it's a corner case either way.

-- 
Cheers,

David / dhildenb




* Re: [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-28 10:43   ` David Hildenbrand
@ 2024-10-28 17:19     ` Hugh Dickins
  2024-10-28 17:26       ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2024-10-28 17:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, Shakeel Butt, linux-kernel, linux-mm

On Mon, 28 Oct 2024, David Hildenbrand wrote:

> Hi Hugh,
> 
> mostly looks good to me, one comment:

Thanks...

> 
> > +++ b/mm/memcontrol-v1.c
> > @@ -848,6 +848,8 @@ static int mem_cgroup_move_account(struct folio *folio,
> >    css_get(&to->css);
> >    css_put(&from->css);
> >   +	/* Warning should never happen, so don't worry about refcount non-0 */
> > +	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
> >    folio->memcg_data = (unsigned long)to;
> >   
> >   	__folio_memcg_unlock(from);
> > @@ -1217,7 +1219,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t
> > *pmd,
> >    enum mc_target_type target_type;
> >    union mc_target target;
> >    struct folio *folio;
> > +	bool tried_split_before = false;
> >   +retry_pmd:
> >    ptl = pmd_trans_huge_lock(pmd, vma);
> >    if (ptl) {
> >   		if (mc.precharge < HPAGE_PMD_NR) {
> > @@ -1227,6 +1231,27 @@ static int mem_cgroup_move_charge_pte_range(pmd_t
> > *pmd,
> >     target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> >     if (target_type == MC_TARGET_PAGE) {
> >   			folio = target.folio;
> > +			/*
> > +			 * Deferred split queue locking depends on memcg,
> > +			 * and unqueue is unsafe unless folio refcount is 0:
> > +			 * split or skip if on the queue? first try to split.
> > +			 */
> > +			if (!list_empty(&folio->_deferred_list)) {
> > +				spin_unlock(ptl);
> > +				if (!tried_split_before)
> > +					split_folio(folio);
> > +				folio_unlock(folio);
> > +				folio_put(folio);
> > +				if (tried_split_before)
> > +					return 0;
> > +				tried_split_before = true;
> > +				goto retry_pmd;
> > +			}
> > +			/*
> > +			 * So long as that pmd lock is held, the folio cannot
> > +			 * be racily added to the _deferred_list, because
> > +			 * __folio_remove_rmap() will find !partially_mapped.
> > +			 */
> 
> Fortunately that code is getting ripped out.

Yes, and even more fortunately, we're in time to fix its final incarnation!

> 
> https://lkml.kernel.org/r/20241025012304.2473312-3-shakeel.butt@linux.dev
> 
> So I wonder ... as a quick fix should we simply handle it like the code
> further down where we refuse PTE-mapped large folios completely?

(I went through the same anxiety attack as you did, wondering what
happens to the large-but-not-PMD-large folios: then noticed it's safe
as you did.  The v1 commit message had a paragraph pondering whether
the deprecated code will need a patch to extend it for the new feature:
but once Shakeel posted the ripout, I ripped out that paragraph -
no longer any need for an answer.)

> 
> "ignore such a partial THP and keep it in original memcg"
> 
> ...
> 
> and simply skip this folio similarly? I mean, it's a corner case either way.

I certainly considered that option: it's known to give up like that
for many reasons.  But my thinking (in the commit message) was "Not ideal,
but moving charge has been requested, and khugepaged should repair the THP
later" - if someone is still using move_charge_at_immigrate, I thought
this change would generate fewer surprises - that huge charge likely
to be moved as it used to be.

Hugh



* Re: [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-28 17:19     ` Hugh Dickins
@ 2024-10-28 17:26       ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2024-10-28 17:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, Shakeel Butt, linux-kernel, linux-mm

>> https://lkml.kernel.org/r/20241025012304.2473312-3-shakeel.butt@linux.dev
>>
>> So I wonder ... as a quick fix should we simply handle it like the code
>> further down where we refuse PTE-mapped large folios completely?
> 
> (I went through the same anxiety attack as you did, wondering what
> happens to the large-but-not-PMD-large folios: then noticed it's safe
> as you did.  The v1 commit message had a paragraph pondering whether
> the deprecated code will need a patch to extend it for the new feature:
> but once Shakeel posted the ripout, I ripped out that paragraph -
> no longer any need for an answer.)

Ah, missed that.

> 
>>
>> "ignore such a partial THP and keep it in original memcg"
>>
>> ...
>>
>> and simply skip this folio similarly? I mean, it's a corner case either way.
> 
> I certainly considered that option: it's known to give up like that
> for many reasons.  But my thinking (in the commit message) was "Not ideal,
> but moving charge has been requested, and khugepaged should repair the THP
> later" - if someone is still using move_charge_at_immigrate, I thought
> this change would generate fewer surprises - that huge charge likely
> to be moved as it used to be.

Fair enough, I'd have kept it simpler for this almost-dead code :)

Looks good to me, thanks!

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking
  2024-10-27 20:02 ` [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
  2024-10-28 10:43   ` David Hildenbrand
@ 2024-10-28 18:39   ` Yang Shi
  1 sibling, 0 replies; 13+ messages in thread
From: Yang Shi @ 2024-10-28 18:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Zi Yan,
	Chris Li, Shakeel Butt, linux-kernel, linux-mm

On Sun, Oct 27, 2024 at 1:02 PM Hugh Dickins <hughd@google.com> wrote:
>
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without).  The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
>
> Before fixing locking: rename misleading folio_undo_large_rmappable(),
> which does not undo large_rmappable, to folio_unqueue_deferred_split(),
> which is what it does.  But that and its out-of-line __callee are mm
> internals of very limited usability: add comment and WARN_ON_ONCEs to
> check usage; and return a bool to say if a deferred split was unqueued,
> which can then be used in WARN_ON_ONCEs around safety checks (sparing
> callers the arcane conditionals in __folio_unqueue_deferred_split()).
>
> Just omit the folio_unqueue_deferred_split() from free_unref_folios(),
> all of whose callers now call it beforehand (and if any forget then
> bad_page() will tell) - except for its caller put_pages_list(), which
> itself no longer has any callers (and will be deleted separately).
>
> Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data to 0
> without checking and unqueueing a THP folio from deferred split list;
> which is unfortunate, since the split_queue_lock depends on the memcg
> (when memcg is enabled); so swapout has been unqueueing such THPs later,
> when freeing the folio, using the pgdat's lock instead: potentially
> corrupting the memcg's list.  __remove_mapping() has frozen refcount to
> 0 here, so no problem with calling folio_unqueue_deferred_split() before
> resetting memcg_data.
>
> That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
> shrinker memcg aware"): which included a check on swapcache before adding
> to deferred queue, but no check on deferred queue before adding THP to
> swapcache.  That worked fine with the usual sequence of events in reclaim
> (though there were a couple of rare ways in which a THP on deferred queue
> could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
> underused THPs") avoids splitting underused THPs in reclaim, which makes
> swapcache THPs on deferred queue commonplace.
>
> Keep the check on swapcache before adding to deferred queue?  Yes: it is
> no longer essential, but preserves the existing behaviour, and is likely
> to be a worthwhile optimization (vmstat showed much more traffic on the
> queue under swapping load if the check was removed); update its comment.
>
> Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
> folio->memcg_data without checking and unqueueing a THP folio from the
> deferred list, sometimes corrupting "from" memcg's list, like swapout.
> Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
> used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
> mem_cgroup_move_charge_pte_range() first tries to split the THP (splitting
> of course unqueues), or skips it if that fails.  Not ideal, but moving
> charge has been requested, and khugepaged should repair the THP later:
> nobody wants new custom unqueueing code just for this deprecated case.
>
> The 87eaceb3faa5 commit did have the code to move from one deferred list
> to another (but was not conscious of its unsafety while refcount non-0);
> but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need
> care deferred split queue in memcg charge move path"), which argued that
> the existence of a PMD mapping guarantees that the THP cannot be on a
> deferred list.  As above, false in rare cases, and now commonly false.
>
> Backport to 6.11 should be straightforward. Earlier backports must take
> care that other _deferred_list fixes and dependencies are included.
> There is not a strong case for backports, but they can fix corner cases.
>
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
> Based on 6.12-rc4
> v2: adjusted commit message following info from Yang and David
>     reinstated deferred_split_folio swapcache check, adjusting comment
>     omitted (mem_cgroup_disabled) unqueue from free_unref_folios

Reviewed-by: Yang Shi <shy828301@gmail.com>




* Re: [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped
  2024-10-27 20:06 ` [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped Zi Yan
@ 2024-11-10 21:08   ` Hugh Dickins
  2024-11-10 21:11     ` [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2024-11-10 21:08 UTC (permalink / raw)
  To: Zi Yan
  Cc: Hugh Dickins, Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, Shakeel Butt, linux-kernel,
	linux-mm

On Sun, 27 Oct 2024, Zi Yan wrote:
> On 27 Oct 2024, at 15:59, Hugh Dickins wrote:
> 
> > Recent changes are putting more pressure on THP deferred split queues:
> > under load revealing long-standing races, causing list_del corruptions,
> > "Bad page state"s and worse (I keep BUGs in both of those, so usually
> > don't get to see how badly they end up without).  The relevant recent
> > changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> > improved swap allocation, and underused THP splitting.
> >
> > The new unlocked list_del_init() in deferred_split_scan() is buggy.
> > I gave bad advice, it looks plausible since that's a local on-stack
> > list, but the fact is that it can race with a third party freeing or
> > migrating the preceding folio (properly unqueueing it with refcount 0
> > while holding split_queue_lock), thereby corrupting the list linkage.
> >
> > The obvious answer would be to take split_queue_lock there: but it has
> > a long history of contention, so I'm reluctant to add to that. Instead,
> > make sure that there is always one safe (raised refcount) folio before,
> > by delaying its folio_put().  (And of course I was wrong to suggest
> > updating split_queue_len without the lock: leave that until the splice.)
> >
> > And remove two over-eager partially_mapped checks, restoring those tests
> > to how they were before: if uncharge_folio() or free_tail_page_prepare()
> > finds _deferred_list non-empty, it's in trouble whether or not that folio
> > is partially_mapped (and the flag was already cleared in the latter case).
> >
> > Fixes: dafff3f4c850 ("mm: split underused THPs")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Acked-by: Usama Arif <usamaarif642@gmail.com>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > ---
> > Based on 6.12-rc4
> > v2: added ack and reviewed-bys
> 
> Acked-by: Zi Yan <ziy@nvidia.com>

Thank you: but I owe you and Andrew and everyone else an apology.

Those 1/2 and 2/2, which have gone into Linus's tree this morning
(thank you all), have still left a once-a-week list_del corruption on
the deferred split queue: which I've been agonizing over then giving up
on repeatedly for three weeks now (last weekend's seemed to get fixed by
applying a missed microcode update; but then another crash this Friday).

Sorry if the timing makes it look as if I'm trying to game the system
in some way, but it was only yesterday evening that at last I understood
the reason for (I hope the last of) these deferred split queue corruptions;
and the fix turns out to be to this patch.  Perhaps if I'd worked out why
sooner, I'd have just switched to proper spinlocking as you asked; but now
that I do understand, I still prefer to continue this much more tested way.

My ability to reproduce these crashes seems to be one or two orders of
magnitude weaker than it used to be (generally a good thing I suppose: but
frustrating when I want to test), and there's no way I can satisfy myself
that the crashes are completely eliminated in a single week.

But I have been successful in adding temporary debug code, to check that
the preceding "safe" folio on the local list has non-0 refcount: that
check fails much sooner than reaching corruption, and I've run it often
enough now to confirm that the fix does fix that.

Fix patch follows... as you'll see, it's very obvious *in retrospect*.

Hugh



* [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix
  2024-11-10 21:08   ` Hugh Dickins
@ 2024-11-10 21:11     ` Hugh Dickins
  2024-11-10 21:22       ` Usama Arif
                         ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Hugh Dickins @ 2024-11-10 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zi Yan, Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Baolin Wang,
	Barry Song, Kefeng Wang, Ryan Roberts, Nhat Pham, Chris Li,
	Shakeel Butt, linux-kernel, linux-mm

Though even more elusive than before, list_del corruption has still been
seen on THP's deferred split queue.

The idea in commit e66f3185fa04 was right, but its implementation wrong.
The context omitted an important comment just before the critical test:
"split_folio() removes folio from list on success."  In ignoring that
comment, when a THP split succeeded, the code went on to release the
preceding safe folio, preserving instead an irrelevant (formerly head)
folio: which gives no safety because it's not on the list. Fix the logic.
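
Combined with the swap(folio, prev) branch from e66f3185fa04 (outside
the context of the hunk below), the tail of the loop now reads:

	if (did_split) {
		; /* folio already removed from list */
	} else if (!folio_test_partially_mapped(folio)) {
		list_del_init(&folio->_deferred_list);
		removed++;
	} else {
		/* keep a raised-refcount folio behind as the barrier */
		swap(folio, prev);
	}
	if (folio)
		folio_put(folio);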

Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/huge_memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03fd4bc39ea1..5734d5d5060f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3790,7 +3790,9 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		 * in the case it was underused, then consider it used and
 		 * don't add it back to split_queue.
 		 */
-		if (!did_split && !folio_test_partially_mapped(folio)) {
+		if (did_split) {
+			; /* folio already removed from list */
+		} else if (!folio_test_partially_mapped(folio)) {
 			list_del_init(&folio->_deferred_list);
 			removed++;
 		} else {
-- 
2.35.3



* Re: [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix
  2024-11-10 21:11     ` [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix Hugh Dickins
@ 2024-11-10 21:22       ` Usama Arif
  2024-11-11  3:10       ` Zi Yan
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2024-11-10 21:22 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Zi Yan, Yang Shi, Wei Yang, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Johannes Weiner, Baolin Wang, Barry Song,
	Kefeng Wang, Ryan Roberts, Nhat Pham, Chris Li, Shakeel Butt,
	linux-kernel, linux-mm



On 10/11/2024 21:11, Hugh Dickins wrote:
> Though even more elusive than before, list_del corruption has still been
> seen on THP's deferred split queue.
> 
> The idea in commit e66f3185fa04 was right, but its implementation wrong.
> The context omitted an important comment just before the critical test:
> "split_folio() removes folio from list on success."  In ignoring that
> comment, when a THP split succeeded, the code went on to release the
> preceding safe folio, preserving instead an irrelevant (formerly head)
> folio: which gives no safety because it's not on the list. Fix the logic.
> 
> Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Usama Arif <usamaarif642@gmail.com>





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix
  2024-11-10 21:11     ` [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix Hugh Dickins
  2024-11-10 21:22       ` Usama Arif
@ 2024-11-11  3:10       ` Zi Yan
  2024-11-12  1:36       ` Baolin Wang
  2024-11-13 22:57       ` Chris Li
  3 siblings, 0 replies; 13+ messages in thread
From: Zi Yan @ 2024-11-11  3:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Chris Li, Shakeel Butt, linux-kernel,
	linux-mm


On 10 Nov 2024, at 16:11, Hugh Dickins wrote:

> Though even more elusive than before, list_del corruption has still been
> seen on THP's deferred split queue.
>
> The idea in commit e66f3185fa04 was right, but its implementation wrong.
> The context omitted an important comment just before the critical test:
> "split_folio() removes folio from list on success."  In ignoring that
> comment, when a THP split succeeded, the code went on to release the
> preceding safe folio, preserving instead an irrelevant (formerly head)
> folio: which gives no safety because it's not on the list. Fix the logic.
>
> Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/huge_memory.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Thanks. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix
  2024-11-10 21:11     ` [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix Hugh Dickins
  2024-11-10 21:22       ` Usama Arif
  2024-11-11  3:10       ` Zi Yan
@ 2024-11-12  1:36       ` Baolin Wang
  2024-11-13 22:57       ` Chris Li
  3 siblings, 0 replies; 13+ messages in thread
From: Baolin Wang @ 2024-11-12  1:36 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Zi Yan, Usama Arif, Yang Shi, Wei Yang, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Johannes Weiner, Barry Song,
	Kefeng Wang, Ryan Roberts, Nhat Pham, Chris Li, Shakeel Butt,
	linux-kernel, linux-mm



On 2024/11/11 05:11, Hugh Dickins wrote:
> Though even more elusive than before, list_del corruption has still been
> seen on THP's deferred split queue.
> 
> The idea in commit e66f3185fa04 was right, but its implementation wrong.
> The context omitted an important comment just before the critical test:
> "split_folio() removes folio from list on success."  In ignoring that
> comment, when a THP split succeeded, the code went on to release the
> preceding safe folio, preserving instead an irrelevant (formerly head)
> folio: which gives no safety because it's not on the list. Fix the logic.
> 
> Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> ---
>   mm/huge_memory.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 03fd4bc39ea1..5734d5d5060f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3790,7 +3790,9 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>   		 * in the case it was underused, then consider it used and
>   		 * don't add it back to split_queue.
>   		 */
> -		if (!did_split && !folio_test_partially_mapped(folio)) {
> +		if (did_split) {
> +			; /* folio already removed from list */
> +		} else if (!folio_test_partially_mapped(folio)) {
>   			list_del_init(&folio->_deferred_list);
>   			removed++;
>   		} else {


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix
  2024-11-10 21:11     ` [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix Hugh Dickins
                         ` (2 preceding siblings ...)
  2024-11-12  1:36       ` Baolin Wang
@ 2024-11-13 22:57       ` Chris Li
  3 siblings, 0 replies; 13+ messages in thread
From: Chris Li @ 2024-11-13 22:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Zi Yan, Usama Arif, Yang Shi, Wei Yang,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Johannes Weiner, Baolin Wang, Barry Song, Kefeng Wang,
	Ryan Roberts, Nhat Pham, Shakeel Butt, linux-kernel, linux-mm

Hi Hugh,

LGTM.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Sun, Nov 10, 2024 at 1:11 PM Hugh Dickins <hughd@google.com> wrote:
>
> Though even more elusive than before, list_del corruption has still been
> seen on THP's deferred split queue.
>
> The idea in commit e66f3185fa04 was right, but its implementation wrong.
> The context omitted an important comment just before the critical test:
> "split_folio() removes folio from list on success."  In ignoring that
> comment, when a THP split succeeded, the code went on to release the
> preceding safe folio, preserving instead an irrelevant (formerly head)
> folio: which gives no safety because it's not on the list. Fix the logic.
>
> Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/huge_memory.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 03fd4bc39ea1..5734d5d5060f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3790,7 +3790,9 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>                  * in the case it was underused, then consider it used and
>                  * don't add it back to split_queue.
>                  */
> -               if (!did_split && !folio_test_partially_mapped(folio)) {
> +               if (did_split) {
> +                       ; /* folio already removed from list */
> +               } else if (!folio_test_partially_mapped(folio)) {
>                         list_del_init(&folio->_deferred_list);
>                         removed++;
>                 } else {
> --
> 2.35.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-11-13 22:57 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-27 19:59 [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped Hugh Dickins
2024-10-27 20:02 ` [PATCH hotfix v2 2/2] mm/thp: fix deferred split unqueue naming and locking Hugh Dickins
2024-10-28 10:43   ` David Hildenbrand
2024-10-28 17:19     ` Hugh Dickins
2024-10-28 17:26       ` David Hildenbrand
2024-10-28 18:39   ` Yang Shi
2024-10-27 20:06 ` [PATCH hotfix v2 1/2] mm/thp: fix deferred split queue not partially_mapped Zi Yan
2024-11-10 21:08   ` Hugh Dickins
2024-11-10 21:11     ` [PATCH hotfix] mm/thp: fix deferred split queue not partially_mapped: fix Hugh Dickins
2024-11-10 21:22       ` Usama Arif
2024-11-11  3:10       ` Zi Yan
2024-11-12  1:36       ` Baolin Wang
2024-11-13 22:57       ` Chris Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox