[PATCH 0/4] mm, swap: misc cleanup and bugfix

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] mm, swap: misc cleanup and bugfix
@ 2025-10-06 20:02 Kairui Song
  2025-10-06 20:02 ` [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation Kairui Song
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-06 20:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, Kairui Song, linux-kernel, stable

A few cleanups and a bugfix that are either suitable after the swap
table phase I or found during code review.

Patch 1 is a bugfix and needs to be included in the stable branch,
the rest have no behavior change.

---
Kairui Song (4):
      mm, swap: do not perform synchronous discard during allocation
      mm, swap: rename helper for setup bad slots
      mm, swap: cleanup swap entry allocation parameter
      mm/migrate, swap: drop usage of folio_index

 include/linux/swap.h |  4 ++--
 mm/migrate.c         |  4 ++--
 mm/shmem.c           |  2 +-
 mm/swap.h            | 21 -----------------
 mm/swapfile.c        | 64 ++++++++++++++++++++++++++++++++++++----------------
 mm/vmscan.c          |  4 ++--
 6 files changed, 52 insertions(+), 47 deletions(-)
---
base-commit: 53e573001f2b5168f9b65d2b79e9563a3b479c17
change-id: 20251007-swap-clean-after-swap-table-p1-b9a7635ee3fa

Best regards,
-- 
Kairui Song <kasong@tencent.com>



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-06 20:02 [PATCH 0/4] mm, swap: misc cleanup and bugfix Kairui Song
@ 2025-10-06 20:02 ` Kairui Song
  2025-10-07 23:52   ` Nhat Pham
  2025-10-08 20:54   ` Chris Li
  2025-10-06 20:02 ` [PATCH 2/4] mm, swap: rename helper for setup bad slots Kairui Song
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-06 20:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, Kairui Song, linux-kernel, stable

From: Kairui Song <kasong@tencent.com>

Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
fast path"), swap allocation is protected by a local lock, which means
we can't do any sleeping calls during allocation.

However, the discard routine is not taken well care of. When the swap
allocator failed to find any usable cluster, it would look at the
pending discard cluster and try to issue some blocking discards. It may
not necessarily sleep, but the cond_resched at the bio layer indicates
this is wrong when combined with a local lock. And the bio GFP flag used
for discard bio is also wrong (not atomic).

It's arguable whether this synchronous discard is helpful at all. In
most cases, the async discard is good enough. And the swap allocator is
doing very differently at organizing the clusters since the recent
change, so it is very rare to see discard clusters piling up.

So far, no issues have been observed or reported with typical SSD setups
under months of high pressure. This issue was found during my code
review. But by hacking the kernel a bit: adding a mdelay(100) in the
async discard path, this issue will be observable with WARNING triggered
by the wrong GFP and cond_resched in the bio layer.

So let's fix this issue in a safe way: remove the synchronous discard in
the swap allocation path. And when order 0 is failing with all cluster
list drained on all swap devices, try to do a discard following the swap
device priority list. If any discards released some cluster, try the
allocation again. This way, we can still avoid OOM due to swap failure
if the hardware is very slow and memory pressure is extremely high.

Cc: <stable@vger.kernel.org>
Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index cb2392ed8e0e..0d1924f6f495 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			goto done;
 	}

-	/*
-	 * We don't have free cluster but have some clusters in discarding,
-	 * do discard now and reclaim them.
-	 */
-	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
-		goto new_cluster;
-
 	if (order)
 		goto done;

@@ -1394,6 +1387,33 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 	return false;
 }

+/*
+ * Discard pending clusters in a synchronized way when under high pressure.
+ * Return: true if any cluster is discarded.
+ */
+static bool swap_sync_discard(void)
+{
+	bool ret = false;
+	int nid = numa_node_id();
+	struct swap_info_struct *si, *next;
+
+	spin_lock(&swap_avail_lock);
+	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
+		spin_unlock(&swap_avail_lock);
+		if (get_swap_device_info(si)) {
+			if (si->flags & SWP_PAGE_DISCARD)
+				ret = swap_do_scheduled_discard(si);
+			put_swap_device(si);
+		}
+		if (ret)
+			break;
+		spin_lock(&swap_avail_lock);
+	}
+	spin_unlock(&swap_avail_lock);
+
+	return ret;
+}
+
 /**
  * folio_alloc_swap - allocate swap space for a folio
  * @folio: folio we want to move to swap
@@ -1432,11 +1452,17 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 		}
 	}

+again:
 	local_lock(&percpu_swap_cluster.lock);
 	if (!swap_alloc_fast(&entry, order))
 		swap_alloc_slow(&entry, order);
 	local_unlock(&percpu_swap_cluster.lock);

+	if (unlikely(!order && !entry.val)) {
+		if (swap_sync_discard())
+			goto again;
+	}
+
 	/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
 	if (mem_cgroup_try_charge_swap(folio, entry))
 		goto out_free;

-- 
2.51.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/4] mm, swap: rename helper for setup bad slots
  2025-10-06 20:02 [PATCH 0/4] mm, swap: misc cleanup and bugfix Kairui Song
  2025-10-06 20:02 ` [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation Kairui Song
@ 2025-10-06 20:02 ` Kairui Song
  2025-10-07 23:47   ` Nhat Pham
                     ` (2 more replies)
  2025-10-06 20:02 ` [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter Kairui Song
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-06 20:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, Kairui Song, linux-kernel

From: Kairui Song <kasong@tencent.com>

The name inc_cluster_info_page is very confusing, as this helper is only
used during swapon to mark bad slots. Rename it properly and turn the
VM_BUG_ON in it into WARN_ON to expose more potential issues. Swapon is
a cold path, so adding more checks should be a good idea.

No feature change except new WARN_ON.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0d1924f6f495..732e07c70ce9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -751,14 +751,14 @@ static void relocate_cluster(struct swap_info_struct *si,
 }
 
 /*
- * The cluster corresponding to page_nr will be used. The cluster will not be
- * added to free cluster list and its usage counter will be increased by 1.
- * Only used for initialization.
+ * The cluster corresponding to @offset will be accounted as having one bad
+ * slot. The cluster will not be added to the free cluster list, and its
+ * usage counter will be increased by 1. Only used for initialization.
  */
-static int inc_cluster_info_page(struct swap_info_struct *si,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
+				       unsigned long offset)
 {
-	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
 	struct swap_table *table;
 	struct swap_cluster_info *ci;
 
@@ -772,8 +772,8 @@ static int inc_cluster_info_page(struct swap_info_struct *si,
 
 	ci->count++;
 
-	VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
-	VM_BUG_ON(ci->flags);
+	WARN_ON(ci->count > SWAPFILE_CLUSTER);
+	WARN_ON(ci->flags);
 
 	return 0;
 }
@@ -3396,7 +3396,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * See setup_swap_map(): header page, bad pages,
 	 * and the EOF part of the last cluster.
 	 */
-	err = inc_cluster_info_page(si, cluster_info, 0);
+	err = swap_cluster_setup_bad_slot(cluster_info, 0);
 	if (err)
 		goto err;
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
@@ -3404,12 +3404,12 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 
 		if (page_nr >= maxpages)
 			continue;
-		err = inc_cluster_info_page(si, cluster_info, page_nr);
+		err = swap_cluster_setup_bad_slot(cluster_info, page_nr);
 		if (err)
 			goto err;
 	}
 	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
-		err = inc_cluster_info_page(si, cluster_info, i);
+		err = swap_cluster_setup_bad_slot(cluster_info, i);
 		if (err)
 			goto err;
 	}

-- 
2.51.0




^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter
  2025-10-06 20:02 [PATCH 0/4] mm, swap: misc cleanup and bugfix Kairui Song
  2025-10-06 20:02 ` [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation Kairui Song
  2025-10-06 20:02 ` [PATCH 2/4] mm, swap: rename helper for setup bad slots Kairui Song
@ 2025-10-06 20:02 ` Kairui Song
  2025-10-06 20:07   ` Kairui Song
                     ` (2 more replies)
  2025-10-06 20:02 ` [PATCH 4/4] mm/migrate, swap: drop usage of folio_index Kairui Song
  2025-10-07 22:20 ` [PATCH 0/4] mm, swap: misc cleanup and bugfix Andrew Morton
  4 siblings, 3 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-06 20:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, Kairui Song, linux-kernel

From: Kairui Song <kasong@tencent.com>

We no longer need this GFP parameter after commit 8578e0c00dcf ("mm, swap:
use the swap table for the swap cache and switch API"). Before that
commit the GFP parameter is already almost identical for all callers, so
nothing changed by that commit. Swap table just moved the GFP to lower
layer and make it more defined and changes depend on atomic or sleep
allocation.

Now this parameter is no longer used, just remove it. No behavior
change.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h | 4 ++--
 mm/shmem.c           | 2 +-
 mm/swapfile.c        | 2 +-
 mm/vmscan.c          | 4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e818fbade1e2..a4b264817735 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -462,7 +462,7 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
+int folio_alloc_swap(struct folio *folio);
 bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
@@ -560,7 +560,7 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
+static inline int folio_alloc_swap(struct folio *folio)
 {
 	return -EINVAL;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 45f51745ad88..63092cc0b141 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1617,7 +1617,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		folio_mark_uptodate(folio);
 	}
 
-	if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
+	if (!folio_alloc_swap(folio)) {
 		bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
 		int error;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 732e07c70ce9..534b21aeef5a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1425,7 +1425,7 @@ static bool swap_sync_discard(void)
  * Context: Caller needs to hold the folio lock.
  * Return: Whether the folio was added to the swap cache.
  */
-int folio_alloc_swap(struct folio *folio, gfp_t gfp)
+int folio_alloc_swap(struct folio *folio)
 {
 	unsigned int order = folio_order(folio);
 	unsigned int size = 1 << order;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aadbee50a851..c99f7d6d5dd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1296,7 +1296,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					    split_folio_to_list(folio, folio_list))
 						goto activate_locked;
 				}
-				if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN)) {
+				if (folio_alloc_swap(folio)) {
 					int __maybe_unused order = folio_order(folio);
 
 					if (!folio_test_large(folio))
@@ -1312,7 +1312,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					}
 #endif
 					count_mthp_stat(order, MTHP_STAT_SWPOUT_FALLBACK);
-					if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
+					if (folio_alloc_swap(folio))
 						goto activate_locked_split;
 				}
 				/*

-- 
2.51.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 4/4] mm/migrate, swap: drop usage of folio_index
  2025-10-06 20:02 [PATCH 0/4] mm, swap: misc cleanup and bugfix Kairui Song
                   ` (2 preceding siblings ...)
  2025-10-06 20:02 ` [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter Kairui Song
@ 2025-10-06 20:02 ` Kairui Song
  2025-10-07 23:48   ` Nhat Pham
                     ` (2 more replies)
  2025-10-07 22:20 ` [PATCH 0/4] mm, swap: misc cleanup and bugfix Andrew Morton
  4 siblings, 3 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-06 20:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, Kairui Song, linux-kernel

From: Kairui Song <kasong@tencent.com>

This helper was used when swap cache was mixed with swap cache. Now they
are completely separate from each other, access to the swap cache is all
wrapped by the swap_cache_* helpers, which expect the folio's swap entry
as a parameter.

This helper is no longer used, remove the last redundant user and drop it.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/migrate.c |  4 ++--
 mm/swap.h    | 21 ---------------------
 2 files changed, 2 insertions(+), 23 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index e3065c9edb55..97c931b31940 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -561,7 +561,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 static int __folio_migrate_mapping(struct address_space *mapping,
 		struct folio *newfolio, struct folio *folio, int expected_count)
 {
-	XA_STATE(xas, &mapping->i_pages, folio_index(folio));
+	XA_STATE(xas, &mapping->i_pages, folio->index);
 	struct swap_cluster_info *ci = NULL;
 	struct zone *oldzone, *newzone;
 	int dirty;
@@ -714,7 +714,7 @@ EXPORT_SYMBOL(folio_migrate_mapping);
 int migrate_huge_page_move_mapping(struct address_space *mapping,
 				   struct folio *dst, struct folio *src)
 {
-	XA_STATE(xas, &mapping->i_pages, folio_index(src));
+	XA_STATE(xas, &mapping->i_pages, src->index);
 	int rc, expected_count = folio_expected_ref_count(src) + 1;
 
 	if (folio_ref_count(src) != expected_count)
diff --git a/mm/swap.h b/mm/swap.h
index 8d8efdf1297a..d034c13d8dd2 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -445,25 +445,4 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 	return 0;
 }
 #endif /* CONFIG_SWAP */
-
-/**
- * folio_index - File index of a folio.
- * @folio: The folio.
- *
- * For a folio which is either in the page cache or the swap cache,
- * return its index within the address_space it belongs to.  If you know
- * the folio is definitely in the page cache, you can look at the folio's
- * index directly.
- *
- * Return: The index (offset in units of pages) of a folio in its file.
- */
-static inline pgoff_t folio_index(struct folio *folio)
-{
-#ifdef CONFIG_SWAP
-	if (unlikely(folio_test_swapcache(folio)))
-		return swp_offset(folio->swap);
-#endif
-	return folio->index;
-}
-
 #endif /* _MM_SWAP_H */

-- 
2.51.0



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter
  2025-10-06 20:02 ` [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter Kairui Song
@ 2025-10-06 20:07   ` Kairui Song
  2025-10-07 23:49     ` Nhat Pham
  2025-10-08 20:59   ` Chris Li
  2025-10-14  3:12   ` Baolin Wang
  2 siblings, 1 reply; 32+ messages in thread
From: Kairui Song @ 2025-10-06 20:07 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Chris Li, Baolin Wang, David Hildenbrand, Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

On Tue, Oct 7, 2025 at 4:03 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We no longer need this GFP parameter after commit 8578e0c00dcf ("mm, swap:
> use the swap table for the swap cache and switch API"). Before that
> commit the GFP parameter is already almost identical for all callers, so
> nothing changed by that commit. Swap table just moved the GFP to lower
> layer and make it more defined and changes depend on atomic or sleep
> allocation.
>
> Now this parameter is no longer used, just remove it. No behavior
> change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h | 4 ++--
>  mm/shmem.c           | 2 +-
>  mm/swapfile.c        | 2 +-
>  mm/vmscan.c          | 4 ++--
>  4 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e818fbade1e2..a4b264817735 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -462,7 +462,7 @@ static inline long get_nr_swap_pages(void)
>  }
>
>  extern void si_swapinfo(struct sysinfo *);
> -int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
> +int folio_alloc_swap(struct folio *folio);
>  bool folio_free_swap(struct folio *folio);
>  void put_swap_folio(struct folio *folio, swp_entry_t entry);
>  extern swp_entry_t get_swap_page_of_type(int);
> @@ -560,7 +560,7 @@ static inline int swp_swapcount(swp_entry_t entry)
>         return 0;
>  }
>
> -static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
> +static inline int folio_alloc_swap(struct folio *folio)
>  {
>         return -EINVAL;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 45f51745ad88..63092cc0b141 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1617,7 +1617,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>                 folio_mark_uptodate(folio);
>         }
>
> -       if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
> +       if (!folio_alloc_swap(folio)) {
>                 bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
>                 int error;
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 732e07c70ce9..534b21aeef5a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1425,7 +1425,7 @@ static bool swap_sync_discard(void)
>   * Context: Caller needs to hold the folio lock.
>   * Return: Whether the folio was added to the swap cache.
>   */
> -int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> +int folio_alloc_swap(struct folio *folio)
>  {
>         unsigned int order = folio_order(folio);
>         unsigned int size = 1 << order;

One trivial issue for myself, I better update the kerneldoc too...
sorry about this:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 534b21aeef5a..0c2174d6b924 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1417,7 +1417,6 @@ static bool swap_sync_discard(void)
 /**
  * folio_alloc_swap - allocate swap space for a folio
  * @folio: folio we want to move to swap
- * @gfp: gfp mask for shadow nodes
  *
  * Allocate swap space for the folio and add the folio to the
  * swap cache.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] mm, swap: misc cleanup and bugfix
  2025-10-06 20:02 [PATCH 0/4] mm, swap: misc cleanup and bugfix Kairui Song
                   ` (3 preceding siblings ...)
  2025-10-06 20:02 ` [PATCH 4/4] mm/migrate, swap: drop usage of folio_index Kairui Song
@ 2025-10-07 22:20 ` Andrew Morton
  4 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2025-10-07 22:20 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Tue, 07 Oct 2025 04:02:32 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> A few cleanups and a bugfix that are either suitable after the swap
> table phase I or found during code review.
> 
> Patch 1 is a bugfix and needs to be included in the stable branch,
> the rest have no behavior change.

fyi, the presentation of the series suggests that [1/4] is not a hotfix
- that it won't hit mainline (and then -stable) until after 6.19-rc1.

Which sounds OK given this:

> So far, no issues have been observed or reported with typical SSD setups
> under months of high pressure. This issue was found during my code
> review. But by hacking the kernel a bit: adding a mdelay(100) in the
> async discard path, this issue will be observable with WARNING triggered
> by the wrong GFP and cond_resched in the bio layer.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm, swap: rename helper for setup bad slots
  2025-10-06 20:02 ` [PATCH 2/4] mm, swap: rename helper for setup bad slots Kairui Song
@ 2025-10-07 23:47   ` Nhat Pham
  2025-10-08 10:25   ` David Hildenbrand
  2025-10-08 20:58   ` Chris Li
  2 siblings, 0 replies; 32+ messages in thread
From: Nhat Pham @ 2025-10-07 23:47 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Kairui Song, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The name inc_cluster_info_page is very confusing, as this helper is only
> used during swapon to mark bad slots. Rename it properly and turn the
> VM_BUG_ON in it into WARN_ON to expose more potential issues. Swapon is
> a cold path, so adding more checks should be a good idea.
>
> No feature change except new WARN_ON.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/migrate, swap: drop usage of folio_index
  2025-10-06 20:02 ` [PATCH 4/4] mm/migrate, swap: drop usage of folio_index Kairui Song
@ 2025-10-07 23:48   ` Nhat Pham
  2025-10-08  1:20     ` Andrew Morton
  2025-10-09 15:33     ` Kairui Song
  2025-10-08 21:03   ` Chris Li
  2025-10-14  3:15   ` Baolin Wang
  2 siblings, 2 replies; 32+ messages in thread
From: Nhat Pham @ 2025-10-07 23:48 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Kairui Song, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This helper was used when swap cache was mixed with swap cache. Now they

mixed with page cache? ;)

With that nit fixed:

Acked-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter
  2025-10-06 20:07   ` Kairui Song
@ 2025-10-07 23:49     ` Nhat Pham
  2025-10-08 10:26       ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Nhat Pham @ 2025-10-07 23:49 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Baoquan He, Barry Song,
	Chris Li, Baolin Wang, David Hildenbrand, Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

On Mon, Oct 6, 2025 at 1:08 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 4:03 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > We no longer need this GFP parameter after commit 8578e0c00dcf ("mm, swap:
> > use the swap table for the swap cache and switch API"). Before that
> > commit the GFP parameter is already almost identical for all callers, so
> > nothing changed by that commit. Swap table just moved the GFP to lower
> > layer and make it more defined and changes depend on atomic or sleep
> > allocation.
> >
> > Now this parameter is no longer used, just remove it. No behavior
> > change.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  include/linux/swap.h | 4 ++--
> >  mm/shmem.c           | 2 +-
> >  mm/swapfile.c        | 2 +-
> >  mm/vmscan.c          | 4 ++--
> >  4 files changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index e818fbade1e2..a4b264817735 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -462,7 +462,7 @@ static inline long get_nr_swap_pages(void)
> >  }
> >
> >  extern void si_swapinfo(struct sysinfo *);
> > -int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
> > +int folio_alloc_swap(struct folio *folio);
> >  bool folio_free_swap(struct folio *folio);
> >  void put_swap_folio(struct folio *folio, swp_entry_t entry);
> >  extern swp_entry_t get_swap_page_of_type(int);
> > @@ -560,7 +560,7 @@ static inline int swp_swapcount(swp_entry_t entry)
> >         return 0;
> >  }
> >
> > -static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
> > +static inline int folio_alloc_swap(struct folio *folio)
> >  {
> >         return -EINVAL;
> >  }
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 45f51745ad88..63092cc0b141 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1617,7 +1617,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >                 folio_mark_uptodate(folio);
> >         }
> >
> > -       if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
> > +       if (!folio_alloc_swap(folio)) {
> >                 bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
> >                 int error;
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 732e07c70ce9..534b21aeef5a 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1425,7 +1425,7 @@ static bool swap_sync_discard(void)
> >   * Context: Caller needs to hold the folio lock.
> >   * Return: Whether the folio was added to the swap cache.
> >   */
> > -int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> > +int folio_alloc_swap(struct folio *folio)
> >  {
> >         unsigned int order = folio_order(folio);
> >         unsigned int size = 1 << order;
>
> One trivial issue for myself, I better update the kerneldoc too...
> sorry about this:
>

LGTM, with the kerneldoc update:

Acked-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-06 20:02 ` [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation Kairui Song
@ 2025-10-07 23:52   ` Nhat Pham
  2025-10-08 20:54   ` Chris Li
  1 sibling, 0 replies; 32+ messages in thread
From: Nhat Pham @ 2025-10-07 23:52 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Kairui Song, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> fast path"), swap allocation is protected by a local lock, which means
> we can't do any sleeping calls during allocation.
>
> However, the discard routine is not taken well care of. When the swap
> allocator failed to find any usable cluster, it would look at the
> pending discard cluster and try to issue some blocking discards. It may
> not necessarily sleep, but the cond_resched at the bio layer indicates
> this is wrong when combined with a local lock. And the bio GFP flag used
> for discard bio is also wrong (not atomic).
>
> It's arguable whether this synchronous discard is helpful at all. In
> most cases, the async discard is good enough. And the swap allocator is
> doing very differently at organizing the clusters since the recent
> change, so it is very rare to see discard clusters piling up.
>
> So far, no issues have been observed or reported with typical SSD setups
> under months of high pressure. This issue was found during my code
> review. But by hacking the kernel a bit: adding a mdelay(100) in the
> async discard path, this issue will be observable with WARNING triggered
> by the wrong GFP and cond_resched in the bio layer.
>
> So let's fix this issue in a safe way: remove the synchronous discard in
> the swap allocation path. And when order 0 is failing with all cluster
> list drained on all swap devices, try to do a discard following the swap
> device priority list. If any discards released some cluster, try the
> allocation again. This way, we can still avoid OOM due to swap failure
> if the hardware is very slow and memory pressure is extremely high.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Seems reasonable to me.

Acked-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/migrate, swap: drop usage of folio_index
  2025-10-07 23:48   ` Nhat Pham
@ 2025-10-08  1:20     ` Andrew Morton
  2025-10-09 15:33     ` Kairui Song
  1 sibling, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2025-10-08  1:20 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kairui Song, linux-mm, Kemeng Shi, Kairui Song, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

On Tue, 7 Oct 2025 16:48:40 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> > This helper was used when swap cache was mixed with swap cache. Now they
> 
> mixed with page cache? ;)

I locally made that edit, thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm, swap: rename helper for setup bad slots
  2025-10-06 20:02 ` [PATCH 2/4] mm, swap: rename helper for setup bad slots Kairui Song
  2025-10-07 23:47   ` Nhat Pham
@ 2025-10-08 10:25   ` David Hildenbrand
  2025-10-08 20:58   ` Chris Li
  2 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2025-10-08 10:25 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Baolin Wang, Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

On 06.10.25 22:02, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The name inc_cluster_info_page is very confusing, as this helper is only
> used during swapon to mark bad slots. Rename it properly and turn the
> VM_BUG_ON in it into WARN_ON to expose more potential issues. Swapon is
> a cold path, so adding more checks should be a good idea.
> 
> No feature change except new WARN_ON.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter
  2025-10-07 23:49     ` Nhat Pham
@ 2025-10-08 10:26       ` David Hildenbrand
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2025-10-08 10:26 UTC (permalink / raw)
  To: Nhat Pham, Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Baoquan He, Barry Song,
	Chris Li, Baolin Wang, Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 732e07c70ce9..534b21aeef5a 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -1425,7 +1425,7 @@ static bool swap_sync_discard(void)
>>>    * Context: Caller needs to hold the folio lock.
>>>    * Return: Whether the folio was added to the swap cache.
>>>    */
>>> -int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>>> +int folio_alloc_swap(struct folio *folio)
>>>   {
>>>          unsigned int order = folio_order(folio);
>>>          unsigned int size = 1 << order;
>>
>> One trivial issue for myself, I better update the kerneldoc too...
>> sorry about this:
>>
> 
> LGTM, with the kerneldoc update:
> 
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-06 20:02 ` [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation Kairui Song
  2025-10-07 23:52   ` Nhat Pham
@ 2025-10-08 20:54   ` Chris Li
  2025-10-09 15:32     ` Kairui Song
  2025-10-12 16:49     ` Kairui Song
  1 sibling, 2 replies; 32+ messages in thread
From: Chris Li @ 2025-10-08 20:54 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

Hi Kairui,

First of all, your title is a bit misleading:
"do not perform synchronous discard during allocation"

You still do the synchronous discard, just limited to order 0 failing.

Also your commit did not describe the behavior change of this patch.
The behavior change is that, it now prefers to allocate from the
fragment list before waiting for the discard. Which I feel is not
justified.

After reading your patch, I feel that you still do the synchronous
discard, just now you do it with less lock held.
I suggest you just fix the lock held issue without changing the
discard ordering behavior.

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> fast path"), swap allocation is protected by a local lock, which means
> we can't do any sleeping calls during allocation.
>
> However, the discard routine is not taken well care of. When the swap
> allocator failed to find any usable cluster, it would look at the
> pending discard cluster and try to issue some blocking discards. It may
> not necessarily sleep, but the cond_resched at the bio layer indicates
> this is wrong when combined with a local lock. And the bio GFP flag used
> for discard bio is also wrong (not atomic).

If lock is the issue, let's fix the lock issue.

> It's arguable whether this synchronous discard is helpful at all. In
> most cases, the async discard is good enough. And the swap allocator is
> doing very differently at organizing the clusters since the recent
> change, so it is very rare to see discard clusters piling up.

Very rare does not mean this never happens. If you have a cluster on
the discarding queue, I think it is better to wait for the discard to
complete before using the fragmented list, to reduce the
fragmentation. So it seems the real issue is holding a lock while
doing the block discard?

> So far, no issues have been observed or reported with typical SSD setups
> under months of high pressure. This issue was found during my code
> review. But by hacking the kernel a bit: adding a mdelay(100) in the
> async discard path, this issue will be observable with WARNING triggered
> by the wrong GFP and cond_resched in the bio layer.

I think that makes an assumption on how slow the SSD discard is. Some
SSD can be really slow. We want our kernel to work for those slow
discard SSD cases as well.

> So let's fix this issue in a safe way: remove the synchronous discard in
> the swap allocation path. And when order 0 is failing with all cluster
> list drained on all swap devices, try to do a discard following the swap

I don't feel that changing the discard behavior is justified here, the
real fix is discarding with less lock held. Am I missing something?
If I understand correctly, we should be able to keep the current
discard ordering behavior, discard before the fragment list. But with
less lock held as your current patch does.

I suggest the allocation here detects there is a discard pending and
running out of free blocks. Return there and indicate the need to
discard. The caller performs the discard without holding the lock,
similar to what you do with the order == 0 case.

> device priority list. If any discards released some cluster, try the
> allocation again. This way, we can still avoid OOM due to swap failure
> if the hardware is very slow and memory pressure is extremely high.
>
> Cc: <stable@vger.kernel.org>
> Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cb2392ed8e0e..0d1924f6f495 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         goto done;
>         }
>
> -       /*
> -        * We don't have free cluster but have some clusters in discarding,
> -        * do discard now and reclaim them.
> -        */
> -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> -               goto new_cluster;

Assume you follow my suggestion.
Change this to some function to detect if there is a pending discard
on this device. Return to the caller indicating that you need a
discard for this device that has a pending discard.
Add an output argument to indicate the discard device "discard" if needed.

> -
>         if (order)
>                 goto done;
>
> @@ -1394,6 +1387,33 @@ static bool swap_alloc_slow(swp_entry_t *entry,
>         return false;
>  }
>
> +/*
> + * Discard pending clusters in a synchronized way when under high pressure.
> + * Return: true if any cluster is discarded.
> + */
> +static bool swap_sync_discard(void)
> +{

This function discards all swap devices. I am wondering if we should
just discard the current working device instead.
Another device supposedly discarded is already on going with the work
queue. We don't have to wait for that.

To unblock the current swap allocation.  We only need to wait for the
discard on the current swap device to indicate it needs to wait for
discard. Assume you take my above suggestion.

> +       bool ret = false;
> +       int nid = numa_node_id();
> +       struct swap_info_struct *si, *next;
> +
> +       spin_lock(&swap_avail_lock);
> +       plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
> +               spin_unlock(&swap_avail_lock);
> +               if (get_swap_device_info(si)) {
> +                       if (si->flags & SWP_PAGE_DISCARD)
> +                               ret = swap_do_scheduled_discard(si);
> +                       put_swap_device(si);
> +               }
> +               if (ret)
> +                       break;
> +               spin_lock(&swap_avail_lock);
> +       }
> +       spin_unlock(&swap_avail_lock);
> +
> +       return ret;
> +}
> +
>  /**
>   * folio_alloc_swap - allocate swap space for a folio
>   * @folio: folio we want to move to swap
> @@ -1432,11 +1452,17 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>                 }
>         }
>
> +again:
>         local_lock(&percpu_swap_cluster.lock);
>         if (!swap_alloc_fast(&entry, order))
>                 swap_alloc_slow(&entry, order);

Here we can have a "discard" output function argument to indicate
which swap device needs to be discarded.

>         local_unlock(&percpu_swap_cluster.lock);
>
> +       if (unlikely(!order && !entry.val)) {

If you take the above suggestion, here will be just check if the
"discard" device is not NULL, perform discard on that device and done.

> +               if (swap_sync_discard())
> +                       goto again;
> +       }
> +
>         /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
>         if (mem_cgroup_try_charge_swap(folio, entry))
>                 goto out_free;

Chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] mm, swap: rename helper for setup bad slots
  2025-10-06 20:02 ` [PATCH 2/4] mm, swap: rename helper for setup bad slots Kairui Song
  2025-10-07 23:47   ` Nhat Pham
  2025-10-08 10:25   ` David Hildenbrand
@ 2025-10-08 20:58   ` Chris Li
  2 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2025-10-08 20:58 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The name inc_cluster_info_page is very confusing, as this helper is only
> used during swapon to mark bad slots. Rename it properly and turn the
> VM_BUG_ON in it into WARN_ON to expose more potential issues. Swapon is
> a cold path, so adding more checks should be a good idea.
>
> No feature change except new WARN_ON.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0d1924f6f495..732e07c70ce9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -751,14 +751,14 @@ static void relocate_cluster(struct swap_info_struct *si,
>  }
>
>  /*
> - * The cluster corresponding to page_nr will be used. The cluster will not be
> - * added to free cluster list and its usage counter will be increased by 1.
> - * Only used for initialization.
> + * The cluster corresponding to @offset will be accounted as having one bad
> + * slot. The cluster will not be added to the free cluster list, and its
> + * usage counter will be increased by 1. Only used for initialization.
>   */
> -static int inc_cluster_info_page(struct swap_info_struct *si,
> -       struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> +                                      unsigned long offset)
>  {
> -       unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> +       unsigned long idx = offset / SWAPFILE_CLUSTER;
>         struct swap_table *table;
>         struct swap_cluster_info *ci;
>
> @@ -772,8 +772,8 @@ static int inc_cluster_info_page(struct swap_info_struct *si,
>
>         ci->count++;
>
> -       VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
> -       VM_BUG_ON(ci->flags);
> +       WARN_ON(ci->count > SWAPFILE_CLUSTER);
> +       WARN_ON(ci->flags);
>
>         return 0;
>  }
> @@ -3396,7 +3396,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>          * See setup_swap_map(): header page, bad pages,
>          * and the EOF part of the last cluster.
>          */
> -       err = inc_cluster_info_page(si, cluster_info, 0);
> +       err = swap_cluster_setup_bad_slot(cluster_info, 0);
>         if (err)
>                 goto err;
>         for (i = 0; i < swap_header->info.nr_badpages; i++) {
> @@ -3404,12 +3404,12 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>
>                 if (page_nr >= maxpages)
>                         continue;
> -               err = inc_cluster_info_page(si, cluster_info, page_nr);
> +               err = swap_cluster_setup_bad_slot(cluster_info, page_nr);
>                 if (err)
>                         goto err;
>         }
>         for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
> -               err = inc_cluster_info_page(si, cluster_info, i);
> +               err = swap_cluster_setup_bad_slot(cluster_info, i);
>                 if (err)
>                         goto err;
>         }
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter
  2025-10-06 20:02 ` [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter Kairui Song
  2025-10-06 20:07   ` Kairui Song
@ 2025-10-08 20:59   ` Chris Li
  2025-10-14  3:12   ` Baolin Wang
  2 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2025-10-08 20:59 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We no longer need this GFP parameter after commit 8578e0c00dcf ("mm, swap:
> use the swap table for the swap cache and switch API"). Before that
> commit the GFP parameter is already almost identical for all callers, so
> nothing changed by that commit. Swap table just moved the GFP to lower
> layer and make it more defined and changes depend on atomic or sleep
> allocation.
>
> Now this parameter is no longer used, just remove it. No behavior
> change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h | 4 ++--
>  mm/shmem.c           | 2 +-
>  mm/swapfile.c        | 2 +-
>  mm/vmscan.c          | 4 ++--
>  4 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e818fbade1e2..a4b264817735 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -462,7 +462,7 @@ static inline long get_nr_swap_pages(void)
>  }
>
>  extern void si_swapinfo(struct sysinfo *);
> -int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
> +int folio_alloc_swap(struct folio *folio);
>  bool folio_free_swap(struct folio *folio);
>  void put_swap_folio(struct folio *folio, swp_entry_t entry);
>  extern swp_entry_t get_swap_page_of_type(int);
> @@ -560,7 +560,7 @@ static inline int swp_swapcount(swp_entry_t entry)
>         return 0;
>  }
>
> -static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
> +static inline int folio_alloc_swap(struct folio *folio)
>  {
>         return -EINVAL;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 45f51745ad88..63092cc0b141 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1617,7 +1617,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>                 folio_mark_uptodate(folio);
>         }
>
> -       if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
> +       if (!folio_alloc_swap(folio)) {
>                 bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
>                 int error;
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 732e07c70ce9..534b21aeef5a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1425,7 +1425,7 @@ static bool swap_sync_discard(void)
>   * Context: Caller needs to hold the folio lock.
>   * Return: Whether the folio was added to the swap cache.
>   */
> -int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> +int folio_alloc_swap(struct folio *folio)
>  {
>         unsigned int order = folio_order(folio);
>         unsigned int size = 1 << order;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index aadbee50a851..c99f7d6d5dd9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1296,7 +1296,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                             split_folio_to_list(folio, folio_list))
>                                                 goto activate_locked;
>                                 }
> -                               if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN)) {
> +                               if (folio_alloc_swap(folio)) {
>                                         int __maybe_unused order = folio_order(folio);
>
>                                         if (!folio_test_large(folio))
> @@ -1312,7 +1312,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         }
>  #endif
>                                         count_mthp_stat(order, MTHP_STAT_SWPOUT_FALLBACK);
> -                                       if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
> +                                       if (folio_alloc_swap(folio))
>                                                 goto activate_locked_split;
>                                 }
>                                 /*
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/migrate, swap: drop usage of folio_index
  2025-10-06 20:02 ` [PATCH 4/4] mm/migrate, swap: drop usage of folio_index Kairui Song
  2025-10-07 23:48   ` Nhat Pham
@ 2025-10-08 21:03   ` Chris Li
  2025-10-14  3:15   ` Baolin Wang
  2 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2025-10-08 21:03 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

Thanks for the cleanup.

Agree with the other fix suggested by Nhat regarding swap cache vs page cache.

On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This helper was used when swap cache was mixed with swap cache. Now they
> are completely separate from each other, access to the swap cache is all
> wrapped by the swap_cache_* helpers, which expect the folio's swap entry
> as a parameter.
>
> This helper is no longer used, remove the last redundant user and drop it.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/migrate.c |  4 ++--
>  mm/swap.h    | 21 ---------------------

Nice.

Acked-by: Chris Li <chrisl@kernel.org>

Chris


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-08 20:54   ` Chris Li
@ 2025-10-09 15:32     ` Kairui Song
  2025-10-09 16:58       ` Chris Li
  2025-10-12 16:49     ` Kairui Song
  1 sibling, 1 reply; 32+ messages in thread
From: Kairui Song @ 2025-10-09 15:32 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Thu, Oct 9, 2025 at 5:10 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> First of all, your title is a bit misleading:
> "do not perform synchronous discard during allocation"
>
> You still do the synchronous discard, just limited to order 0 failing.
>
> Also your commit did not describe the behavior change of this patch.
> The behavior change is that, it now prefers to allocate from the
> fragment list before waiting for the discard. Which I feel is not
> justified.
>
> After reading your patch, I feel that you still do the synchronous
> discard, just now you do it with less lock held.
> I suggest you just fix the lock held issue without changing the
> discard ordering behavior.
>
> On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> > fast path"), swap allocation is protected by a local lock, which means
> > we can't do any sleeping calls during allocation.
> >
> > However, the discard routine is not taken well care of. When the swap
> > allocator failed to find any usable cluster, it would look at the
> > pending discard cluster and try to issue some blocking discards. It may
> > not necessarily sleep, but the cond_resched at the bio layer indicates
> > this is wrong when combined with a local lock. And the bio GFP flag used
> > for discard bio is also wrong (not atomic).
>
> If lock is the issue, let's fix the lock issue.
>
> > It's arguable whether this synchronous discard is helpful at all. In
> > most cases, the async discard is good enough. And the swap allocator is
> > doing very differently at organizing the clusters since the recent
> > change, so it is very rare to see discard clusters piling up.
>
> Very rare does not mean this never happens. If you have a cluster on
> the discarding queue, I think it is better to wait for the discard to
> complete before using the fragmented list, to reduce the
> fragmentation. So it seems the real issue is holding a lock while
> doing the block discard?
>
> > So far, no issues have been observed or reported with typical SSD setups
> > under months of high pressure. This issue was found during my code
> > review. But by hacking the kernel a bit: adding a mdelay(100) in the
> > async discard path, this issue will be observable with WARNING triggered
> > by the wrong GFP and cond_resched in the bio layer.
>
> I think that makes an assumption on how slow the SSD discard is. Some
> SSD can be really slow. We want our kernel to work for those slow
> discard SSD cases as well.
>
> > So let's fix this issue in a safe way: remove the synchronous discard in
> > the swap allocation path. And when order 0 is failing with all cluster
> > list drained on all swap devices, try to do a discard following the swap
>
> I don't feel that changing the discard behavior is justified here, the
> real fix is discarding with less lock held. Am I missing something?
> If I understand correctly, we should be able to keep the current
> discard ordering behavior, discard before the fragment list. But with
> less lock held as your current patch does.
>
> I suggest the allocation here detects there is a discard pending and
> running out of free blocks. Return there and indicate the need to
> discard. The caller performs the discard without holding the lock,
> similar to what you do with the order == 0 case.

Thanks for the suggestion. Right, that sounds even better. My initial
though was that maybe we can just remove this discard completely since
it rarely helps, and if the SSD is really that slow, OOM under heavy
pressure might even be an acceptable behaviour. But to make it safer,
I made it do discard only when order 0 is failing so the code is
simpler.

Let me sent a V2 to handle the discard carefully to reduce potential impact.

> > device priority list. If any discards released some cluster, try the
> > allocation again. This way, we can still avoid OOM due to swap failure
> > if the hardware is very slow and memory pressure is extremely high.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
> >  1 file changed, 33 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index cb2392ed8e0e..0d1924f6f495 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                         goto done;
> >         }
> >
> > -       /*
> > -        * We don't have free cluster but have some clusters in discarding,
> > -        * do discard now and reclaim them.
> > -        */
> > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > -               goto new_cluster;
>
> Assume you follow my suggestion.
> Change this to some function to detect if there is a pending discard
> on this device. Return to the caller indicating that you need a
> discard for this device that has a pending discard.

Checking `!list_empty(si->discard_clusters)` should be good enough.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/migrate, swap: drop usage of folio_index
  2025-10-07 23:48   ` Nhat Pham
  2025-10-08  1:20     ` Andrew Morton
@ 2025-10-09 15:33     ` Kairui Song
  1 sibling, 0 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-09 15:33 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Baoquan He, Barry Song,
	Chris Li, Baolin Wang, David Hildenbrand, Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel

On Wed, Oct 8, 2025 at 7:48 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > This helper was used when swap cache was mixed with swap cache. Now they
>
> mixed with page cache? ;)

Yes you're right :), thanks!
>
> With that nit fixed:
>
> Acked-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-09 15:32     ` Kairui Song
@ 2025-10-09 16:58       ` Chris Li
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2025-10-09 16:58 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Thu, Oct 9, 2025 at 8:33 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Oct 9, 2025 at 5:10 AM Chris Li <chrisl@kernel.org> wrote:
> > I suggest the allocation here detects there is a discard pending and
> > running out of free blocks. Return there and indicate the need to
> > discard. The caller performs the discard without holding the lock,
> > similar to what you do with the order == 0 case.
>
> Thanks for the suggestion. Right, that sounds even better. My initial
> though was that maybe we can just remove this discard completely since
> it rarely helps, and if the SSD is really that slow, OOM under heavy

Your argument is that cases happen very rarely. I agree with you on
that. The follow up question is that, if that rare case does happen,
are we doing the best we can in that situation? The V1 patch is not
doing the best as we can, it is pretty much I don't care about the
discard much, just ignore it unless order 0 failing forces our hand.
As far as I can tell, properly handling that having discard pending
condition is not much more complicated than your V1 patch, it might be
even simpler because you don't have that order 0 failing logic any
more.

> pressure might even be an acceptable behaviour. But to make it safer,
> I made it do discard only when order 0 is failing so the code is
> simpler.
>
> Let me sent a V2 to handle the discard carefully to reduce potential impact.

Great. Looking forward to it.

BTW, In the caller retry loop, the caller can retry the very swap
device that has discard just perform on it, it does not need to retry
from the very first swap device. In that regard, it is also a better
behavior than V1 or even existing discard behavior, which waits for
all devices to discard.

Chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-08 20:54   ` Chris Li
  2025-10-09 15:32     ` Kairui Song
@ 2025-10-12 16:49     ` Kairui Song
  2025-10-14 21:27       ` Chris Li
  1 sibling, 1 reply; 32+ messages in thread
From: Kairui Song @ 2025-10-12 16:49 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Thu, Oct 9, 2025 at 5:10 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> First of all, your title is a bit misleading:
> "do not perform synchronous discard during allocation"
>
> You still do the synchronous discard, just limited to order 0 failing.
>
> Also your commit did not describe the behavior change of this patch.
> The behavior change is that, it now prefers to allocate from the
> fragment list before waiting for the discard. Which I feel is not
> justified.
>
> After reading your patch, I feel that you still do the synchronous
> discard, just now you do it with less lock held.
> I suggest you just fix the lock held issue without changing the
> discard ordering behavior.
>
> On Mon, Oct 6, 2025 at 1:03 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Since commit 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation
> > fast path"), swap allocation is protected by a local lock, which means
> > we can't do any sleeping calls during allocation.
> >
> > However, the discard routine is not taken well care of. When the swap
> > allocator failed to find any usable cluster, it would look at the
> > pending discard cluster and try to issue some blocking discards. It may
> > not necessarily sleep, but the cond_resched at the bio layer indicates
> > this is wrong when combined with a local lock. And the bio GFP flag used
> > for discard bio is also wrong (not atomic).
>
> If lock is the issue, let's fix the lock issue.
>
> > It's arguable whether this synchronous discard is helpful at all. In
> > most cases, the async discard is good enough. And the swap allocator is
> > doing very differently at organizing the clusters since the recent
> > change, so it is very rare to see discard clusters piling up.
>
> Very rare does not mean this never happens. If you have a cluster on
> the discarding queue, I think it is better to wait for the discard to
> complete before using the fragmented list, to reduce the
> fragmentation. So it seems the real issue is holding a lock while
> doing the block discard?
>
> > So far, no issues have been observed or reported with typical SSD setups
> > under months of high pressure. This issue was found during my code
> > review. But by hacking the kernel a bit: adding a mdelay(100) in the
> > async discard path, this issue will be observable with WARNING triggered
> > by the wrong GFP and cond_resched in the bio layer.
>
> I think that makes an assumption on how slow the SSD discard is. Some
> SSD can be really slow. We want our kernel to work for those slow
> discard SSD cases as well.
>
> > So let's fix this issue in a safe way: remove the synchronous discard in
> > the swap allocation path. And when order 0 is failing with all cluster
> > list drained on all swap devices, try to do a discard following the swap
>
> I don't feel that changing the discard behavior is justified here, the
> real fix is discarding with less lock held. Am I missing something?
> If I understand correctly, we should be able to keep the current
> discard ordering behavior, discard before the fragment list. But with
> less lock held as your current patch does.
>
> I suggest the allocation here detects there is a discard pending and
> running out of free blocks. Return there and indicate the need to
> discard. The caller performs the discard without holding the lock,
> similar to what you do with the order == 0 case.
>
> > device priority list. If any discards released some cluster, try the
> > allocation again. This way, we can still avoid OOM due to swap failure
> > if the hardware is very slow and memory pressure is extremely high.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: 1b7e90020eb77 ("mm, swap: use percpu cluster as allocation fast path")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swapfile.c | 40 +++++++++++++++++++++++++++++++++-------
> >  1 file changed, 33 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index cb2392ed8e0e..0d1924f6f495 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                         goto done;
> >         }
> >
> > -       /*
> > -        * We don't have free cluster but have some clusters in discarding,
> > -        * do discard now and reclaim them.
> > -        */
> > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > -               goto new_cluster;
>
> Assume you follow my suggestion.
> Change this to some function to detect if there is a pending discard
> on this device. Return to the caller indicating that you need a
> discard for this device that has a pending discard.
> Add an output argument to indicate the discard device "discard" if needed.

The problem I just realized is that, if we just bail out here, we are
forbidding order 0 to steal if there is any discarding cluster. We
just return here to let the caller handle the discard outside
the lock.

It may just discard the cluster just fine, then retry from free clusters.
Then everything is fine, that's the easy part.

But it might also fail, and interestingly, in the failure case we need
to try again as well. It might fail with a race with another discard,
in that case order 0 steal is still feasible. Or it fail with
get_swap_device_info (we have to release the device to return here),
in that case we should go back to the plist and try other devices.

This is doable but seems kind of fragile, we'll have something like
this in the folio_alloc_swap function:

local_lock(&percpu_swap_cluster.lock);
if (!swap_alloc_fast(&entry, order))
    swap_alloc_slow(&entry, order, &discard_si);
local_unlock(&percpu_swap_cluster.lock);

+if (discard_si) {
+    if (get_swap_device_info(discard_si)) {
+        swap_do_scheduled_discard(discard_si);
+        put_swap_device(discard_si);
+        /*
+         * Ignoring the return value, since we need to try
+         * again even if the discard failed. If failed due to
+         * race with another discard, we should still try
+         * order 0 steal.
+         */
+    } else {
+        discard_si = NULL;
+        /*
+         * If raced with swapoff, we should try again too but
+         * not using the discard device anymore.
+         */
+    }
+    goto again;
+}

And the `again` retry we'll have to always start from free_clusters again,
unless we have another parameter just to indicate that we want to skip
everything and jump to stealing, or pass and reuse the discard_si
pointer as return argument to cluster_alloc_swap_entry as well,
as the indicator to jump to stealing directly.

It looks kind of strange. So far swap_do_scheduled_discard can only
fail due to a race with another successful discard, so retrying is
safe and won't run into an endless loop. But it seems easy to break,
e.g. if we may handle bio alloc failure of discard request in the
future. And trying again if get_swap_device_info failed makes no sense
if there is only one device, but has to be done here to cover
multi-device usage, or we have to add more special checks.

swap_alloc_slow will be a bit longer too if we want to prevent
touching plist again:
+/*
+ * Resuming after trying to discard cluster on a swap device,
+ * try the discarded device first.
+ */
+si = *discard_si;
+if (unlikely(si)) {
+    *discard_si = NULL;
+    if (get_swap_device_info(si)) {
+        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE,
&need_discard);
+        put_swap_device(si);
+        if (offset) {
+            *entry = swp_entry(si->type, offset);
+            return true;
+        }
+        if (need_discard) {
+            *discard_si = si;
+            return false;
+        }
+    }
+}

The logic of the workflow jumping between several functions might also
be kind of hard to follow. Some cleanup can be done later though.

Considering the discard issue is really rare, I'm not sure if this is
the right way to go? How do you think?

BTW: The logic of V1 can be optimized a little bit to let discards also
happen with order > 0 cases too. That seems closer to what the current
upstream kernel was doing except: Allocator prefers to try another
device instead of waiting for discard, which seems OK?
And order 0 steal can happen without waiting for discard.
Fragmentation under extreme pressure might not be that
serious an issue if we are having really slow SSDs, and
might even be no longer an issue if we have a generic
solution for frags?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter
  2025-10-06 20:02 ` [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter Kairui Song
  2025-10-06 20:07   ` Kairui Song
  2025-10-08 20:59   ` Chris Li
@ 2025-10-14  3:12   ` Baolin Wang
  2 siblings, 0 replies; 32+ messages in thread
From: Baolin Wang @ 2025-10-14  3:12 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, David Hildenbrand, Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel



On 2025/10/7 04:02, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> We no longer need this GFP parameter after commit 8578e0c00dcf ("mm, swap:
> use the swap table for the swap cache and switch API"). Before that
> commit the GFP parameter is already almost identical for all callers, so
> nothing changed by that commit. Swap table just moved the GFP to lower
> layer and make it more defined and changes depend on atomic or sleep
> allocation.
> 
> Now this parameter is no longer used, just remove it. No behavior
> change.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] mm/migrate, swap: drop usage of folio_index
  2025-10-06 20:02 ` [PATCH 4/4] mm/migrate, swap: drop usage of folio_index Kairui Song
  2025-10-07 23:48   ` Nhat Pham
  2025-10-08 21:03   ` Chris Li
@ 2025-10-14  3:15   ` Baolin Wang
  2 siblings, 0 replies; 32+ messages in thread
From: Baolin Wang @ 2025-10-14  3:15 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Kemeng Shi, Kairui Song, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, David Hildenbrand, Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel



On 2025/10/7 04:02, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> This helper was used when swap cache was mixed with swap cache. Now they
> are completely separate from each other, access to the swap cache is all
> wrapped by the swap_cache_* helpers, which expect the folio's swap entry
> as a parameter.
> 
> This helper is no longer used, remove the last redundant user and drop it.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

With the commit message fixed, LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-12 16:49     ` Kairui Song
@ 2025-10-14 21:27       ` Chris Li
  2025-10-15  2:55         ` Chris Li
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2025-10-14 21:27 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index cb2392ed8e0e..0d1924f6f495 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >                         goto done;
> > >         }
> > >
> > > -       /*
> > > -        * We don't have free cluster but have some clusters in discarding,
> > > -        * do discard now and reclaim them.
> > > -        */
> > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > -               goto new_cluster;
> >
> > Assume you follow my suggestion.
> > Change this to some function to detect if there is a pending discard
> > on this device. Return to the caller indicating that you need a
> > discard for this device that has a pending discard.
> > Add an output argument to indicate the discard device "discard" if needed.
>
> The problem I just realized is that, if we just bail out here, we are
> forbidding order 0 to steal if there is any discarding cluster. We
> just return here to let the caller handle the discard outside
> the lock.

Oh, yes, there might be a bit of change in behavior. However I can't
see it is such a bad thing if we wait for the pending discard to
complete before stealing and fragmenting the existing folio list. We
will have less fragments compared to the original result. Again, my
point is not that we always keep 100% the old behavior, then there is
no room for improvement.

My point is that, are we doing the best we can in that situation,
regardless how unlikely it is.

>
> It may just discard the cluster just fine, then retry from free clusters.
> Then everything is fine, that's the easy part.

Ack.

> But it might also fail, and interestingly, in the failure case we need

Can you spell out the failure case you have in mind? Do you mean the
discard did happen but another thread stole "the recently discarded
then became free cluster"?

Anyway, in such a case, the swap allocator should continue and find
out we don't have things to discard now, it will continue to the
"steal from other order > 0 list".

> to try again as well. It might fail with a race with another discard,
> in that case order 0 steal is still feasible. Or it fail with
> get_swap_device_info (we have to release the device to return here),
> in that case we should go back to the plist and try other devices.

When stealing from the other order >0 list failed, we should try
another device in the plist.

>
> This is doable but seems kind of fragile, we'll have something like
> this in the folio_alloc_swap function:
>
> local_lock(&percpu_swap_cluster.lock);
> if (!swap_alloc_fast(&entry, order))
>     swap_alloc_slow(&entry, order, &discard_si);
> local_unlock(&percpu_swap_cluster.lock);
>
> +if (discard_si) {

I feel the discard logic should be inside the swap_alloc_slow().
There is a  plist_for_each_entry_safe(), inside that loop to do the
discard and retry().
If I previously suggested it change in here, sorry I have changed my
mind after reasoning the code a bit more.

The fast path layer should not know about the discard() and also
should not retry the fast path if after waiting for the discard to
complete.

The discard should be on the slow path for sure.

> +    if (get_swap_device_info(discard_si)) {

Inside the slow path there is get_swap_device_info(si), you should be
able to reuse those?

> +        swap_do_scheduled_discard(discard_si);
> +        put_swap_device(discard_si);
> +        /*
> +         * Ignoring the return value, since we need to try
> +         * again even if the discard failed. If failed due to
> +         * race with another discard, we should still try
> +         * order 0 steal.
> +         */
> +    } else {

Shouldn't need the "else", the swap_alloc_slow() can always set
dicard_si = NULL internally if no device to discard or just set
discard = NULL regardless.

> +        discard_si = NULL;
> +        /*
> +         * If raced with swapoff, we should try again too but
> +         * not using the discard device anymore.
> +         */
> +    }
> +    goto again;
> +}
>
> And the `again` retry we'll have to always start from free_clusters again,

That is fine, because discard causes clusters to move into free_clusters now.

> unless we have another parameter just to indicate that we want to skip
> everything and jump to stealing, or pass and reuse the discard_si
> pointer as return argument to cluster_alloc_swap_entry as well,
> as the indicator to jump to stealing directly.

It is a rare case, we don't have to jump directly to stealing. If the
discard happens and that discarded cluster gets stolen by other
threads, I think it is fine going through the fragment list before
going to the order 0 stealing from another order fragment list.

> It looks kind of strange. So far swap_do_scheduled_discard can only
> fail due to a race with another successful discard, so retrying is
> safe and won't run into an endless loop. But it seems easy to break,
> e.g. if we may handle bio alloc failure of discard request in the
> future. And trying again if get_swap_device_info failed makes no sense
> if there is only one device, but has to be done here to cover
> multi-device usage, or we have to add more special checks.

Well, you can have sync wait check check for discard if there is >0
number of clusters successfully discarded.

>
> swap_alloc_slow will be a bit longer too if we want to prevent
> touching plist again:
> +/*
> + * Resuming after trying to discard cluster on a swap device,
> + * try the discarded device first.
> + */
> +si = *discard_si;
> +if (unlikely(si)) {
> +    *discard_si = NULL;
> +    if (get_swap_device_info(si)) {
> +        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE,
> &need_discard);
> +        put_swap_device(si);
> +        if (offset) {
> +            *entry = swp_entry(si->type, offset);
> +            return true;
> +        }
> +        if (need_discard) {
> +            *discard_si = si;

> +            return false;

I haven't tried it myself. but I feel we should move the sync wait for
discard here but with the lock released then re-acquire the lock.
That might simplify the logic. The discard should belong to the slow
path behavior, definitely not part of the fast path.

> +        }
> +    }
> +}
>
> The logic of the workflow jumping between several functions might also
> be kind of hard to follow. Some cleanup can be done later though.
>
> Considering the discard issue is really rare, I'm not sure if this is
> the right way to go? How do you think?

Let's try moving the discard and retry inside the slow path but
release the lock and see how it feels.
If you want, I can also give it a try, I just don't want to step on your toes.

> BTW: The logic of V1 can be optimized a little bit to let discards also
> happen with order > 0 cases too. That seems closer to what the current
> upstream kernel was doing except: Allocator prefers to try another
> device instead of waiting for discard, which seems OK?

I think we should wait for the discard. Having discard means the
device can have maybe (many?) free clusters soon. We can wait. It is a
rare case anyway. From the swap.tiers point of view, it would be
better to exhaust the current high priority device before consuming
the low priority device. Otherwise you will have very minor swap
device priority inversion for a few swap entries, those swap entries
otherwise can be allocated on the discarded free cluster from high
priority swapdevice.

> And order 0 steal can happen without waiting for discard.

I am OK to change the behavior to let order 0 wait for the discard as
well. It happens so rarely and we have less fragmented clusters
compared to the alternatives of stealing from higher order clusters
now. I think that is OK. We end up having less fragmented clusters,
which is a
good thing.

> Fragmentation under extreme pressure might not be that
> serious an issue if we are having really slow SSDs, and
> might even be no longer an issue if we have a generic
> solution for frags?

Chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-14 21:27       ` Chris Li
@ 2025-10-15  2:55         ` Chris Li
  2025-10-15  6:24           ` Kairui Song
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2025-10-15  2:55 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > >                         goto done;
> > > >         }
> > > >
> > > > -       /*
> > > > -        * We don't have free cluster but have some clusters in discarding,
> > > > -        * do discard now and reclaim them.
> > > > -        */
> > > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > -               goto new_cluster;
> > >
> > > Assume you follow my suggestion.
> > > Change this to some function to detect if there is a pending discard
> > > on this device. Return to the caller indicating that you need a
> > > discard for this device that has a pending discard.
> > > Add an output argument to indicate the discard device "discard" if needed.
> >
> > The problem I just realized is that, if we just bail out here, we are
> > forbidding order 0 to steal if there is any discarding cluster. We
> > just return here to let the caller handle the discard outside
> > the lock.
>
> Oh, yes, there might be a bit of change in behavior. However I can't
> see it is such a bad thing if we wait for the pending discard to
> complete before stealing and fragmenting the existing folio list. We
> will have less fragments compared to the original result. Again, my
> point is not that we always keep 100% the old behavior, then there is
> no room for improvement.
>
> My point is that, are we doing the best we can in that situation,
> regardless how unlikely it is.
>
> >
> > It may just discard the cluster just fine, then retry from free clusters.
> > Then everything is fine, that's the easy part.
>
> Ack.
>
> > But it might also fail, and interestingly, in the failure case we need
>
> Can you spell out the failure case you have in mind? Do you mean the
> discard did happen but another thread stole "the recently discarded
> then became free cluster"?
>
> Anyway, in such a case, the swap allocator should continue and find
> out we don't have things to discard now, it will continue to the
> "steal from other order > 0 list".
>
> > to try again as well. It might fail with a race with another discard,
> > in that case order 0 steal is still feasible. Or it fail with
> > get_swap_device_info (we have to release the device to return here),
> > in that case we should go back to the plist and try other devices.
>
> When stealing from the other order >0 list failed, we should try
> another device in the plist.
>
> >
> > This is doable but seems kind of fragile, we'll have something like
> > this in the folio_alloc_swap function:
> >
> > local_lock(&percpu_swap_cluster.lock);
> > if (!swap_alloc_fast(&entry, order))
> >     swap_alloc_slow(&entry, order, &discard_si);
> > local_unlock(&percpu_swap_cluster.lock);
> >
> > +if (discard_si) {
>
> I feel the discard logic should be inside the swap_alloc_slow().
> There is a  plist_for_each_entry_safe(), inside that loop to do the
> discard and retry().
> If I previously suggested it change in here, sorry I have changed my
> mind after reasoning the code a bit more.

Actually now I have given it a bit more thought, one thing I realized
is that you might need to hold the percpu_swap_cluster lock all the
time during allocation. That might force you to do the release lock
and discard in the current position.

If that is the case, then just making the small change in your patch
to allow hold waiting to discard before trying the fragmentation list
might be good enough.

Chris

>
> The fast path layer should not know about the discard() and also
> should not retry the fast path if after waiting for the discard to
> complete.
>
> The discard should be on the slow path for sure.
>
> > +    if (get_swap_device_info(discard_si)) {
>
> Inside the slow path there is get_swap_device_info(si), you should be
> able to reuse those?
>
> > +        swap_do_scheduled_discard(discard_si);
> > +        put_swap_device(discard_si);
> > +        /*
> > +         * Ignoring the return value, since we need to try
> > +         * again even if the discard failed. If failed due to
> > +         * race with another discard, we should still try
> > +         * order 0 steal.
> > +         */
> > +    } else {
>
> Shouldn't need the "else", the swap_alloc_slow() can always set
> dicard_si = NULL internally if no device to discard or just set
> discard = NULL regardless.
>
> > +        discard_si = NULL;
> > +        /*
> > +         * If raced with swapoff, we should try again too but
> > +         * not using the discard device anymore.
> > +         */
> > +    }
> > +    goto again;
> > +}
> >
> > And the `again` retry we'll have to always start from free_clusters again,
>
> That is fine, because discard causes clusters to move into free_clusters now.
>
> > unless we have another parameter just to indicate that we want to skip
> > everything and jump to stealing, or pass and reuse the discard_si
> > pointer as return argument to cluster_alloc_swap_entry as well,
> > as the indicator to jump to stealing directly.
>
> It is a rare case, we don't have to jump directly to stealing. If the
> discard happens and that discarded cluster gets stolen by other
> threads, I think it is fine going through the fragment list before
> going to the order 0 stealing from another order fragment list.
>
> > It looks kind of strange. So far swap_do_scheduled_discard can only
> > fail due to a race with another successful discard, so retrying is
> > safe and won't run into an endless loop. But it seems easy to break,
> > e.g. if we may handle bio alloc failure of discard request in the
> > future. And trying again if get_swap_device_info failed makes no sense
> > if there is only one device, but has to be done here to cover
> > multi-device usage, or we have to add more special checks.
>
> Well, you can have sync wait check check for discard if there is >0
> number of clusters successfully discarded.
>
> >
> > swap_alloc_slow will be a bit longer too if we want to prevent
> > touching plist again:
> > +/*
> > + * Resuming after trying to discard cluster on a swap device,
> > + * try the discarded device first.
> > + */
> > +si = *discard_si;
> > +if (unlikely(si)) {
> > +    *discard_si = NULL;
> > +    if (get_swap_device_info(si)) {
> > +        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE,
> > &need_discard);
> > +        put_swap_device(si);
> > +        if (offset) {
> > +            *entry = swp_entry(si->type, offset);
> > +            return true;
> > +        }
> > +        if (need_discard) {
> > +            *discard_si = si;
>
> > +            return false;
>
> I haven't tried it myself. but I feel we should move the sync wait for
> discard here but with the lock released then re-acquire the lock.
> That might simplify the logic. The discard should belong to the slow
> path behavior, definitely not part of the fast path.
>
> > +        }
> > +    }
> > +}
> >
> > The logic of the workflow jumping between several functions might also
> > be kind of hard to follow. Some cleanup can be done later though.
> >
> > Considering the discard issue is really rare, I'm not sure if this is
> > the right way to go? How do you think?
>
> Let's try moving the discard and retry inside the slow path but
> release the lock and see how it feels.
> If you want, I can also give it a try, I just don't want to step on your toes.
>
> > BTW: The logic of V1 can be optimized a little bit to let discards also
> > happen with order > 0 cases too. That seems closer to what the current
> > upstream kernel was doing except: Allocator prefers to try another
> > device instead of waiting for discard, which seems OK?
>
> I think we should wait for the discard. Having discard means the
> device can have maybe (many?) free clusters soon. We can wait. It is a
> rare case anyway. From the swap.tiers point of view, it would be
> better to exhaust the current high priority device before consuming
> the low priority device. Otherwise you will have very minor swap
> device priority inversion for a few swap entries, those swap entries
> otherwise can be allocated on the discarded free cluster from high
> priority swapdevice.
>
> > And order 0 steal can happen without waiting for discard.
>
> I am OK to change the behavior to let order 0 wait for the discard as
> well. It happens so rarely and we have less fragmented clusters
> compared to the alternatives of stealing from higher order clusters
> now. I think that is OK. We end up having less fragmented clusters,
> which is a
> good thing.
>
> > Fragmentation under extreme pressure might not be that
> > serious an issue if we are having really slow SSDs, and
> > might even be no longer an issue if we have a generic
> > solution for frags?
>
> Chris


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-15  2:55         ` Chris Li
@ 2025-10-15  6:24           ` Kairui Song
  2025-10-15 16:45             ` Kairui Song
  0 siblings, 1 reply; 32+ messages in thread
From: Kairui Song @ 2025-10-15  6:24 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Wed, Oct 15, 2025 at 12:00 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > >                         goto done;
> > > > >         }
> > > > >
> > > > > -       /*
> > > > > -        * We don't have free cluster but have some clusters in discarding,
> > > > > -        * do discard now and reclaim them.
> > > > > -        */
> > > > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > -               goto new_cluster;
> > > >
> > > > Assume you follow my suggestion.
> > > > Change this to some function to detect if there is a pending discard
> > > > on this device. Return to the caller indicating that you need a
> > > > discard for this device that has a pending discard.
> > > > Add an output argument to indicate the discard device "discard" if needed.
> > >
> > > The problem I just realized is that, if we just bail out here, we are
> > > forbidding order 0 to steal if there is any discarding cluster. We
> > > just return here to let the caller handle the discard outside
> > > the lock.
> >
> > Oh, yes, there might be a bit of change in behavior. However I can't
> > see it is such a bad thing if we wait for the pending discard to
> > complete before stealing and fragmenting the existing folio list. We
> > will have less fragments compared to the original result. Again, my
> > point is not that we always keep 100% the old behavior, then there is
> > no room for improvement.
> >
> > My point is that, are we doing the best we can in that situation,
> > regardless how unlikely it is.
> >
> > >
> > > It may just discard the cluster just fine, then retry from free clusters.
> > > Then everything is fine, that's the easy part.
> >
> > Ack.
> >
> > > But it might also fail, and interestingly, in the failure case we need
> >
> > Can you spell out the failure case you have in mind? Do you mean the
> > discard did happen but another thread stole "the recently discarded
> > then became free cluster"?
> >
> > Anyway, in such a case, the swap allocator should continue and find
> > out we don't have things to discard now, it will continue to the
> > "steal from other order > 0 list".
> >
> > > to try again as well. It might fail with a race with another discard,
> > > in that case order 0 steal is still feasible. Or it fail with
> > > get_swap_device_info (we have to release the device to return here),
> > > in that case we should go back to the plist and try other devices.
> >
> > When stealing from the other order >0 list failed, we should try
> > another device in the plist.
> >
> > >
> > > This is doable but seems kind of fragile, we'll have something like
> > > this in the folio_alloc_swap function:
> > >
> > > local_lock(&percpu_swap_cluster.lock);
> > > if (!swap_alloc_fast(&entry, order))
> > >     swap_alloc_slow(&entry, order, &discard_si);
> > > local_unlock(&percpu_swap_cluster.lock);
> > >
> > > +if (discard_si) {
> >
> > I feel the discard logic should be inside the swap_alloc_slow().
> > There is a  plist_for_each_entry_safe(), inside that loop to do the
> > discard and retry().
> > If I previously suggested it change in here, sorry I have changed my
> > mind after reasoning the code a bit more.
>
> Actually now I have given it a bit more thought, one thing I realized
> is that you might need to hold the percpu_swap_cluster lock all the
> time during allocation. That might force you to do the release lock
> and discard in the current position.
>
> If that is the case, then just making the small change in your patch
> to allow hold waiting to discard before trying the fragmentation list
> might be good enough.
>
> Chris
>

Thanks, I was composing a reply on this and just saw your new comment.
I agree with this.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-15  6:24           ` Kairui Song
@ 2025-10-15 16:45             ` Kairui Song
  2025-10-21  6:48               ` Chris Li
  2025-10-21  7:34               ` YoungJun Park
  0 siblings, 2 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-15 16:45 UTC (permalink / raw)
  To: Chris Li, YoungJun Park
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Wed, Oct 15, 2025 at 2:24 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 12:00 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > > --- a/mm/swapfile.c
> > > > > > +++ b/mm/swapfile.c
> > > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > > >                         goto done;
> > > > > >         }
> > > > > >
> > > > > > -       /*
> > > > > > -        * We don't have free cluster but have some clusters in discarding,
> > > > > > -        * do discard now and reclaim them.
> > > > > > -        */
> > > > > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > > -               goto new_cluster;
> > > > >
> > > > > Assume you follow my suggestion.
> > > > > Change this to some function to detect if there is a pending discard
> > > > > on this device. Return to the caller indicating that you need a
> > > > > discard for this device that has a pending discard.
> > > > > Add an output argument to indicate the discard device "discard" if needed.
> > > >
> > > > The problem I just realized is that, if we just bail out here, we are
> > > > forbidding order 0 to steal if there is any discarding cluster. We
> > > > just return here to let the caller handle the discard outside
> > > > the lock.
> > >
> > > Oh, yes, there might be a bit of change in behavior. However I can't
> > > see it is such a bad thing if we wait for the pending discard to
> > > complete before stealing and fragmenting the existing folio list. We
> > > will have less fragments compared to the original result. Again, my
> > > point is not that we always keep 100% the old behavior, then there is
> > > no room for improvement.
> > >
> > > My point is that, are we doing the best we can in that situation,
> > > regardless how unlikely it is.
> > >
> > > >
> > > > It may just discard the cluster just fine, then retry from free clusters.
> > > > Then everything is fine, that's the easy part.
> > >
> > > Ack.
> > >
> > > > But it might also fail, and interestingly, in the failure case we need
> > >
> > > Can you spell out the failure case you have in mind? Do you mean the
> > > discard did happen but another thread stole "the recently discarded
> > > then became free cluster"?
> > >
> > > Anyway, in such a case, the swap allocator should continue and find
> > > out we don't have things to discard now, it will continue to the
> > > "steal from other order > 0 list".
> > >
> > > > to try again as well. It might fail with a race with another discard,
> > > > in that case order 0 steal is still feasible. Or it fail with
> > > > get_swap_device_info (we have to release the device to return here),
> > > > in that case we should go back to the plist and try other devices.
> > >
> > > When stealing from the other order >0 list failed, we should try
> > > another device in the plist.
> > >
> > > >
> > > > This is doable but seems kind of fragile, we'll have something like
> > > > this in the folio_alloc_swap function:
> > > >
> > > > local_lock(&percpu_swap_cluster.lock);
> > > > if (!swap_alloc_fast(&entry, order))
> > > >     swap_alloc_slow(&entry, order, &discard_si);
> > > > local_unlock(&percpu_swap_cluster.lock);
> > > >
> > > > +if (discard_si) {
> > >
> > > I feel the discard logic should be inside the swap_alloc_slow().
> > > There is a  plist_for_each_entry_safe(), inside that loop to do the
> > > discard and retry().
> > > If I previously suggested it change in here, sorry I have changed my
> > > mind after reasoning the code a bit more.
> >
> > Actually now I have given it a bit more thought, one thing I realized
> > is that you might need to hold the percpu_swap_cluster lock all the
> > time during allocation. That might force you to do the release lock
> > and discard in the current position.
> >
> > If that is the case, then just making the small change in your patch
> > to allow hold waiting to discard before trying the fragmentation list
> > might be good enough.
> >
> > Chris
> >
>
> Thanks, I was composing a reply on this and just saw your new comment.
> I agree with this.

Hmm, it turns out modifying V1 to handle non-order 0 allocation
failure also has some minor issues. Every mTHP SWAP allocation failure
will have a slight higher overhead due to the discard check. V1 is
fine since it only checks discard for order 0, and order 0 alloc
failure is uncommon and usually means OOM already.

I'm not saying V1 is the final solution, but I think maybe we can just
keep V1 as it is? That's easier for a stable backport too, and this is
doing far better than what it was like. The sync discard was added in
2013 and the later added percpu cluster at the same year never treated
it carefully. And the discard during allocation after recent swap
allocator rework has been kind of broken for a while.

To optimize it further in a clean way, we have to reverse the
allocator's handling order of the plist and fast / slow path. Current
order is local_lock -> fast -> slow (plist).

We can walk the plist first, then do the fast / slow path: plist (or
maybe something faster than plist but handles the priority) ->
local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
too I think). That way we don't need to rewalk the plist after
releasing the local_lock and save a lot of trouble. I remember I
discussed with Youngjun on this sometime ago in the mail list, I know
things have changed a lot but some ideas seems are still similar. I
think his series is moving the percpu cluster into each device (si),
we can also move the local_lock there, which is just what I'm talking
about here.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-15 16:45             ` Kairui Song
@ 2025-10-21  6:48               ` Chris Li
  2025-10-21  8:44                 ` Kairui Song
  2025-10-21  7:34               ` YoungJun Park
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Li @ 2025-10-21  6:48 UTC (permalink / raw)
  To: Kairui Song
  Cc: YoungJun Park, linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Wed, Oct 15, 2025 at 9:46 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 2:24 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Wed, Oct 15, 2025 at 12:00 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > > > --- a/mm/swapfile.c
> > > > > > > +++ b/mm/swapfile.c
> > > > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > > > >                         goto done;
> > > > > > >         }
> > > > > > >
> > > > > > > -       /*
> > > > > > > -        * We don't have free cluster but have some clusters in discarding,
> > > > > > > -        * do discard now and reclaim them.
> > > > > > > -        */
> > > > > > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > > > -               goto new_cluster;
> > > > > >
> > > > > > Assume you follow my suggestion.
> > > > > > Change this to some function to detect if there is a pending discard
> > > > > > on this device. Return to the caller indicating that you need a
> > > > > > discard for this device that has a pending discard.
> > > > > > Add an output argument to indicate the discard device "discard" if needed.
> > > > >
> > > > > The problem I just realized is that, if we just bail out here, we are
> > > > > forbidding order 0 to steal if there is any discarding cluster. We
> > > > > just return here to let the caller handle the discard outside
> > > > > the lock.
> > > >
> > > > Oh, yes, there might be a bit of change in behavior. However I can't
> > > > see it is such a bad thing if we wait for the pending discard to
> > > > complete before stealing and fragmenting the existing folio list. We
> > > > will have less fragments compared to the original result. Again, my
> > > > point is not that we always keep 100% the old behavior, then there is
> > > > no room for improvement.
> > > >
> > > > My point is that, are we doing the best we can in that situation,
> > > > regardless how unlikely it is.
> > > >
> > > > >
> > > > > It may just discard the cluster just fine, then retry from free clusters.
> > > > > Then everything is fine, that's the easy part.
> > > >
> > > > Ack.
> > > >
> > > > > But it might also fail, and interestingly, in the failure case we need
> > > >
> > > > Can you spell out the failure case you have in mind? Do you mean the
> > > > discard did happen but another thread stole "the recently discarded
> > > > then became free cluster"?
> > > >
> > > > Anyway, in such a case, the swap allocator should continue and find
> > > > out we don't have things to discard now, it will continue to the
> > > > "steal from other order > 0 list".
> > > >
> > > > > to try again as well. It might fail with a race with another discard,
> > > > > in that case order 0 steal is still feasible. Or it fail with
> > > > > get_swap_device_info (we have to release the device to return here),
> > > > > in that case we should go back to the plist and try other devices.
> > > >
> > > > When stealing from the other order >0 list failed, we should try
> > > > another device in the plist.
> > > >
> > > > >
> > > > > This is doable but seems kind of fragile, we'll have something like
> > > > > this in the folio_alloc_swap function:
> > > > >
> > > > > local_lock(&percpu_swap_cluster.lock);
> > > > > if (!swap_alloc_fast(&entry, order))
> > > > >     swap_alloc_slow(&entry, order, &discard_si);
> > > > > local_unlock(&percpu_swap_cluster.lock);
> > > > >
> > > > > +if (discard_si) {
> > > >
> > > > I feel the discard logic should be inside the swap_alloc_slow().
> > > > There is a  plist_for_each_entry_safe(), inside that loop to do the
> > > > discard and retry().
> > > > If I previously suggested it change in here, sorry I have changed my
> > > > mind after reasoning the code a bit more.
> > >
> > > Actually now I have given it a bit more thought, one thing I realized
> > > is that you might need to hold the percpu_swap_cluster lock all the
> > > time during allocation. That might force you to do the release lock
> > > and discard in the current position.
> > >
> > > If that is the case, then just making the small change in your patch
> > > to allow hold waiting to discard before trying the fragmentation list
> > > might be good enough.
> > >
> > > Chris
> > >
> >
> > Thanks, I was composing a reply on this and just saw your new comment.
> > I agree with this.
>
> Hmm, it turns out modifying V1 to handle non-order 0 allocation
> failure also has some minor issues. Every mTHP SWAP allocation failure
> will have a slight higher overhead due to the discard check. V1 is
> fine since it only checks discard for order 0, and order 0 alloc
> failure is uncommon and usually means OOM already.
>
> I'm not saying V1 is the final solution, but I think maybe we can just
> keep V1 as it is? That's easier for a stable backport too, and this is

I am fine with that, assuming you need to adjust the presentation to
push V1 as hotfix. I can ack your newer  patch to adjust the
presentation.

> doing far better than what it was like. The sync discard was added in
> 2013 and the later added percpu cluster at the same year never treated
> it carefully. And the discard during allocation after recent swap
> allocator rework has been kind of broken for a while.
>
> To optimize it further in a clean way, we have to reverse the
> allocator's handling order of the plist and fast / slow path. Current
> order is local_lock -> fast -> slow (plist).

I like that. I think that is the eventual way to go. I want to see how
it integrates with the swap.tiers patches. If you let me pick, I would
go straight with this one for 6.19.

>
> We can walk the plist first, then do the fast / slow path: plist (or
> maybe something faster than plist but handles the priority) ->
> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
> too I think). That way we don't need to rewalk the plist after
> releasing the local_lock and save a lot of trouble. I remember I
> discussed with Youngjun on this sometime ago in the mail list, I know
> things have changed a lot but some ideas seems are still similar. I
> think his series is moving the percpu cluster into each device (si),
> we can also move the local_lock there, which is just what I'm talking
> about here.

Ack. We will need to see both patches to figure out how to integrate
them together. Right now we have two moving parts. More to the point
that we get the eventual patch sooner.

Chris


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-15 16:45             ` Kairui Song
  2025-10-21  6:48               ` Chris Li
@ 2025-10-21  7:34               ` YoungJun Park
  2025-10-24  4:00                 ` Kairui Song
  1 sibling, 1 reply; 32+ messages in thread
From: YoungJun Park @ 2025-10-21  7:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: Chris Li, linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

> > Thanks, I was composing a reply on this and just saw your new comment.
> > I agree with this.
> 
> Hmm, it turns out modifying V1 to handle non-order 0 allocation
> failure also has some minor issues. Every mTHP SWAP allocation failure
> will have a slight higher overhead due to the discard check. V1 is
> fine since it only checks discard for order 0, and order 0 alloc
> failure is uncommon and usually means OOM already.

Looking at the original proposed patch.

 +	spin_lock(&swap_avail_lock);
 +	plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
 +		spin_unlock(&swap_avail_lock);
 +		if (get_swap_device_info(si)) {
 +			if (si->flags & SWP_PAGE_DISCARD)
 +				ret = swap_do_scheduled_discard(si);
 +			put_swap_device(si);
 +		}
 +		if (ret)
 +			break;

if ret is true and we break, 
wouldn’t that cause spin_unlock to run without the lock being held?

 +		spin_lock(&swap_avail_lock);
 +	}
 +	spin_unlock(&swap_avail_lock); <- unlocked without lock grab.
 +
 +	return ret;
 +}

> I'm not saying V1 is the final solution, but I think maybe we can just
> keep V1 as it is? That's easier for a stable backport too, and this is
> doing far better than what it was like. The sync discard was added in
> 2013 and the later added percpu cluster at the same year never treated
> it carefully. And the discard during allocation after recent swap
> allocator rework has been kind of broken for a while.
> 
> To optimize it further in a clean way, we have to reverse the
> allocator's handling order of the plist and fast / slow path. Current
> order is local_lock -> fast -> slow (plist).
> We can walk the plist first, then do the fast / slow path: plist (or
> maybe something faster than plist but handles the priority) ->
> local_lock -> fast -> slow (bonus: this is more friendly to RT kernels

I think the idea is good, but when approaching it that way, 
I am curious about rotation handling.

In the current code, rotation is always done when traversing the plist in the slow path.
If we traverse the plist first, how should rotation be handled?

1. Do a naive rotation at plist traversal time. 
(But then fast path might allocate from an si we didn’t select.)
2. Rotate when allocating in the slow path. 
(But between releasing swap_avail_lock, we might access an si that wasn’t rotated.)

Both cases could break rotation behavior — what do you think?

> too I think). That way we don't need to rewalk the plist after
> releasing the local_lock and save a lot of trouble. I remember I
> discussed with Youngjun on this sometime ago in the mail list, I know

Recapping your earlier idea: cache only the swap device per cgroup in percpu, 
and keep the cluster inside the swap device.
Applied to swap tiers, cache only the percpu si per tier, 
and keep the cluster in the swap device.
This seems to fit well with your previous suggestion.

However, since we shifted from per-cgroup swap priority to swap tier, 
and will re-submit RFC for swap tier, we’ll need to revisit the discussion.

Youngjun Park

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-21  6:48               ` Chris Li
@ 2025-10-21  8:44                 ` Kairui Song
  0 siblings, 0 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-21  8:44 UTC (permalink / raw)
  To: Chris Li
  Cc: YoungJun Park, linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Tue, Oct 21, 2025 at 3:05 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Oct 15, 2025 at 9:46 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Wed, Oct 15, 2025 at 2:24 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Wed, Oct 15, 2025 at 12:00 PM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > On Tue, Oct 14, 2025 at 2:27 PM Chris Li <chrisl@kernel.org> wrote:
> > > > >
> > > > > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > > > > --- a/mm/swapfile.c
> > > > > > > > +++ b/mm/swapfile.c
> > > > > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > > > > >                         goto done;
> > > > > > > >         }
> > > > > > > >
> > > > > > > > -       /*
> > > > > > > > -        * We don't have free cluster but have some clusters in discarding,
> > > > > > > > -        * do discard now and reclaim them.
> > > > > > > > -        */
> > > > > > > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > > > > -               goto new_cluster;
> > > > > > >
> > > > > > > Assume you follow my suggestion.
> > > > > > > Change this to some function to detect if there is a pending discard
> > > > > > > on this device. Return to the caller indicating that you need a
> > > > > > > discard for this device that has a pending discard.
> > > > > > > Add an output argument to indicate the discard device "discard" if needed.
> > > > > >
> > > > > > The problem I just realized is that, if we just bail out here, we are
> > > > > > forbidding order 0 to steal if there is any discarding cluster. We
> > > > > > just return here to let the caller handle the discard outside
> > > > > > the lock.
> > > > >
> > > > > Oh, yes, there might be a bit of change in behavior. However I can't
> > > > > see it is such a bad thing if we wait for the pending discard to
> > > > > complete before stealing and fragmenting the existing folio list. We
> > > > > will have less fragments compared to the original result. Again, my
> > > > > point is not that we always keep 100% the old behavior, then there is
> > > > > no room for improvement.
> > > > >
> > > > > My point is that, are we doing the best we can in that situation,
> > > > > regardless how unlikely it is.
> > > > >
> > > > > >
> > > > > > It may just discard the cluster just fine, then retry from free clusters.
> > > > > > Then everything is fine, that's the easy part.
> > > > >
> > > > > Ack.
> > > > >
> > > > > > But it might also fail, and interestingly, in the failure case we need
> > > > >
> > > > > Can you spell out the failure case you have in mind? Do you mean the
> > > > > discard did happen but another thread stole "the recently discarded
> > > > > then became free cluster"?
> > > > >
> > > > > Anyway, in such a case, the swap allocator should continue and find
> > > > > out we don't have things to discard now, it will continue to the
> > > > > "steal from other order > 0 list".
> > > > >
> > > > > > to try again as well. It might fail with a race with another discard,
> > > > > > in that case order 0 steal is still feasible. Or it fail with
> > > > > > get_swap_device_info (we have to release the device to return here),
> > > > > > in that case we should go back to the plist and try other devices.
> > > > >
> > > > > When stealing from the other order >0 list failed, we should try
> > > > > another device in the plist.
> > > > >
> > > > > >
> > > > > > This is doable but seems kind of fragile, we'll have something like
> > > > > > this in the folio_alloc_swap function:
> > > > > >
> > > > > > local_lock(&percpu_swap_cluster.lock);
> > > > > > if (!swap_alloc_fast(&entry, order))
> > > > > >     swap_alloc_slow(&entry, order, &discard_si);
> > > > > > local_unlock(&percpu_swap_cluster.lock);
> > > > > >
> > > > > > +if (discard_si) {
> > > > >
> > > > > I feel the discard logic should be inside the swap_alloc_slow().
> > > > > There is a  plist_for_each_entry_safe(), inside that loop to do the
> > > > > discard and retry().
> > > > > If I previously suggested it change in here, sorry I have changed my
> > > > > mind after reasoning the code a bit more.
> > > >
> > > > Actually now I have given it a bit more thought, one thing I realized
> > > > is that you might need to hold the percpu_swap_cluster lock all the
> > > > time during allocation. That might force you to do the release lock
> > > > and discard in the current position.
> > > >
> > > > If that is the case, then just making the small change in your patch
> > > > to allow hold waiting to discard before trying the fragmentation list
> > > > might be good enough.
> > > >
> > > > Chris
> > > >
> > >
> > > Thanks, I was composing a reply on this and just saw your new comment.
> > > I agree with this.
> >
> > Hmm, it turns out modifying V1 to handle non-order 0 allocation
> > failure also has some minor issues. Every mTHP SWAP allocation failure
> > will have a slight higher overhead due to the discard check. V1 is
> > fine since it only checks discard for order 0, and order 0 alloc
> > failure is uncommon and usually means OOM already.
> >
> > I'm not saying V1 is the final solution, but I think maybe we can just
> > keep V1 as it is? That's easier for a stable backport too, and this is
>
> I am fine with that, assuming you need to adjust the presentation to
> push V1 as hotfix. I can ack your newer  patch to adjust the
> presentation.

Thanks, I'll update it then.

> > doing far better than what it was like. The sync discard was added in
> > 2013 and the later added percpu cluster at the same year never treated
> > it carefully. And the discard during allocation after recent swap
> > allocator rework has been kind of broken for a while.
> >
> > To optimize it further in a clean way, we have to reverse the
> > allocator's handling order of the plist and fast / slow path. Current
> > order is local_lock -> fast -> slow (plist).
>
> I like that. I think that is the eventual way to go. I want to see how
> it integrates with the swap.tiers patches. If you let me pick, I would
> go straight with this one for 6.19.
>
> >
> > We can walk the plist first, then do the fast / slow path: plist (or
> > maybe something faster than plist but handles the priority) ->
> > local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
> > too I think). That way we don't need to rewalk the plist after
> > releasing the local_lock and save a lot of trouble. I remember I
> > discussed with Youngjun on this sometime ago in the mail list, I know
> > things have changed a lot but some ideas seems are still similar. I
> > think his series is moving the percpu cluster into each device (si),
> > we can also move the local_lock there, which is just what I'm talking
> > about here.
>
> Ack. We will need to see both patches to figure out how to integrate
> them together. Right now we have two moving parts. More to the point
> that we get the eventual patch sooner.

BTW I found there are some minor cleanups needed, mostly trivial, I'll
include them in the next update I think.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
  2025-10-21  7:34               ` YoungJun Park
@ 2025-10-24  4:00                 ` Kairui Song
  0 siblings, 0 replies; 32+ messages in thread
From: Kairui Song @ 2025-10-24  4:00 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Chris Li, linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Baolin Wang, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Ying Huang, linux-kernel, stable

On Tue, Oct 21, 2025 at 3:34 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> > > Thanks, I was composing a reply on this and just saw your new comment.
> > > I agree with this.
> >
> > Hmm, it turns out modifying V1 to handle non-order 0 allocation
> > failure also has some minor issues. Every mTHP SWAP allocation failure
> > will have a slight higher overhead due to the discard check. V1 is
> > fine since it only checks discard for order 0, and order 0 alloc
> > failure is uncommon and usually means OOM already.
>
> Looking at the original proposed patch.
>
>  +      spin_lock(&swap_avail_lock);
>  +      plist_for_each_entry_safe(si, next, &swap_avail_heads[nid], avail_lists[nid]) {
>  +              spin_unlock(&swap_avail_lock);
>  +              if (get_swap_device_info(si)) {
>  +                      if (si->flags & SWP_PAGE_DISCARD)
>  +                              ret = swap_do_scheduled_discard(si);
>  +                      put_swap_device(si);
>  +              }
>  +              if (ret)
>  +                      break;
>
> if ret is true and we break,
> wouldn’t that cause spin_unlock to run without the lock being held?

Thanks for catching this! Right, I need to return directly instead of
break. I've fixed that.

>
>  +              spin_lock(&swap_avail_lock);
>  +      }
>  +      spin_unlock(&swap_avail_lock); <- unlocked without lock grab.
>  +
>  +      return ret;
>  +}
>
> > I'm not saying V1 is the final solution, but I think maybe we can just
> > keep V1 as it is? That's easier for a stable backport too, and this is
> > doing far better than what it was like. The sync discard was added in
> > 2013 and the later added percpu cluster at the same year never treated
> > it carefully. And the discard during allocation after recent swap
> > allocator rework has been kind of broken for a while.
> >
> > To optimize it further in a clean way, we have to reverse the
> > allocator's handling order of the plist and fast / slow path. Current
> > order is local_lock -> fast -> slow (plist).
> > We can walk the plist first, then do the fast / slow path: plist (or
> > maybe something faster than plist but handles the priority) ->
> > local_lock -> fast -> slow (bonus: this is more friendly to RT kernels
>
> I think the idea is good, but when approaching it that way,
> I am curious about rotation handling.
>
> In the current code, rotation is always done when traversing the plist in the slow path.
> If we traverse the plist first, how should rotation be handled?

That's a very good question, things always get tricky when it comes to
the details...

> 1. Do a naive rotation at plist traversal time.
> (But then fast path might allocate from an si we didn’t select.)
> 2. Rotate when allocating in the slow path.
> (But between releasing swap_avail_lock, we might access an si that wasn’t rotated.)
>
> Both cases could break rotation behavior — what do you think?

I think cluster level rotating is better, it prevents things from
going too fragmented and spreads the workload between devices in a
helpful way, but just my guess.

We can change the rotation behavior if the test shows some other
strategy is better.

Maybe we'll need something with a better design, like a alloc counter
for rotation. And if we look at the plist before the fast path we may
need to do some optimization for the plist lock too...


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-10-24  4:01 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-06 20:02 [PATCH 0/4] mm, swap: misc cleanup and bugfix Kairui Song
2025-10-06 20:02 ` [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation Kairui Song
2025-10-07 23:52   ` Nhat Pham
2025-10-08 20:54   ` Chris Li
2025-10-09 15:32     ` Kairui Song
2025-10-09 16:58       ` Chris Li
2025-10-12 16:49     ` Kairui Song
2025-10-14 21:27       ` Chris Li
2025-10-15  2:55         ` Chris Li
2025-10-15  6:24           ` Kairui Song
2025-10-15 16:45             ` Kairui Song
2025-10-21  6:48               ` Chris Li
2025-10-21  8:44                 ` Kairui Song
2025-10-21  7:34               ` YoungJun Park
2025-10-24  4:00                 ` Kairui Song
2025-10-06 20:02 ` [PATCH 2/4] mm, swap: rename helper for setup bad slots Kairui Song
2025-10-07 23:47   ` Nhat Pham
2025-10-08 10:25   ` David Hildenbrand
2025-10-08 20:58   ` Chris Li
2025-10-06 20:02 ` [PATCH 3/4] mm, swap: cleanup swap entry allocation parameter Kairui Song
2025-10-06 20:07   ` Kairui Song
2025-10-07 23:49     ` Nhat Pham
2025-10-08 10:26       ` David Hildenbrand
2025-10-08 20:59   ` Chris Li
2025-10-14  3:12   ` Baolin Wang
2025-10-06 20:02 ` [PATCH 4/4] mm/migrate, swap: drop usage of folio_index Kairui Song
2025-10-07 23:48   ` Nhat Pham
2025-10-08  1:20     ` Andrew Morton
2025-10-09 15:33     ` Kairui Song
2025-10-08 21:03   ` Chris Li
2025-10-14  3:15   ` Baolin Wang
2025-10-07 22:20 ` [PATCH 0/4] mm, swap: misc cleanup and bugfix Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox