linux-mm.kvack.org archive mirror
* [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map
@ 2026-02-17 20:06 Kairui Song via B4 Relay
  2026-02-17 20:06 ` [PATCH v3 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator Kairui Song via B4 Relay
                   ` (12 more replies)
  0 siblings, 13 replies; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

This series is based on phase II, which is still in mm-unstable.

This series removes the static swap_map and uses the swap table to track
the swap count directly. This saves roughly 30% of the memory used for
static swap metadata: for example, 256MB of memory is saved when mounting
a 1TB swap device. Performance is slightly better too, since the double
update of the swap table and swap_map is now gone.

Test results:

Mounting a swap device:
=======================
Mount a 1TB brd device as SWAP, just to verify the memory saving:

`free -m` before:
               total        used        free      shared  buff/cache   available
Mem:            1465        1051         417           1          61         413
Swap:        1054435           0     1054435

`free -m` after:
               total        used        free      shared  buff/cache   available
Mem:            1465         795         672           1          62         670
Swap:        1054435           0     1054435

Idle memory usage is reduced by ~256MB, just as expected. Following this
design, we should be able to save another ~512MB in a later phase.
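
For reference, the ~256MB figure matches the size of the swap_map being
removed, which costs one byte per swap slot (a rough sanity check, not an
exact accounting of all swap metadata):

    1TB / 4KB per slot    = 268,435,456 slots
    268,435,456 * 1 byte  = 256MB of swap_map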

Build kernel test:
==================
Test using ZSWAP with NVME SWAP, make -j48, defconfig, in an x86_64 VM
with 5G RAM, under global pressure, avg of 32 test runs:

                Before            After
System time:    1038.97s          1013.75s (-2.4%)

Test using ZRAM as SWAP, make -j12, tinyconfig, in an ARM64 VM with 1.5G
RAM, under global pressure, avg of 32 test runs:

                Before            After
System time:    67.75s            66.65s (-1.6%)

The result is slightly better.

Redis / Valkey benchmark:
=========================
Test using ZRAM as SWAP, in an ARM64 VM with 1.5G RAM, under global
pressure, avg of 64 test runs:

Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 472705.71 RPS               369451.68 RPS
After:  481197.93 RPS (+1.8%)       374922.32 RPS (+1.5%)

In conclusion, performance is better in all cases, and memory usage is
much lower.

The swap cgroup array will also be merged into the swap table in a later
phase, saving the remaining ~60% of the static swap metadata and making
all swap metadata dynamic. The improved API for swap operations also
reduces lock contention and makes more batched operations possible.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v3:
- Use unsigned int instead of unsigned long for extended map as
  suggested by [Youngjun Park].
- Update a few stale comments, and add back the alloc failure warning.
- Link to v2: https://lore.kernel.org/r/20260128-swap-table-p3-v2-0-fe0b67ef0215@tencent.com

Changes in v2:
- Fix a build error for ARC with 40-bit PAE addresses: adjust macros to
  shrink SWP_TB_COUNT_BITS if needed, and trigger a build error if that
  field is too small. There should be no code change for 64-bit builds.
- Fix build warning of unused variables.
- SWP_TB_COUNT_MAX should be ((1 << SWP_TB_COUNT_BITS) - 1), not ((1 <<
  SWP_TB_COUNT_BITS) - 2). No behavior change, just don't waste usable
  bits and reduce the chance of a slower extended table path.
- Add a missing NULL check in swap_extend_table_try_free.
- Fix a typecast error in the swapoff path to silence a static analyzer
  warning.
- Stress tested setups with SWP_TB_COUNT_BITS == 2; they look fine.
- Link to v1:
  https://lore.kernel.org/r/20260126-swap-table-p3-v1-0-a74155fab9b0@tencent.com

---
Kairui Song (12):
      mm, swap: protect si->swap_file properly and use as a mount indicator
      mm, swap: clean up swapon process and locking
      mm, swap: remove redundant arguments and locking for enabling a device
      mm, swap: consolidate bad slots setup and make it more robust
      mm/workingset: leave highest bits empty for anon shadow
      mm, swap: implement helpers for reserving data in the swap table
      mm, swap: mark bad slots in swap table directly
      mm, swap: simplify swap table sanity range check
      mm, swap: use the swap table to track the swap count
      mm, swap: no need to truncate the scan border
      mm, swap: simplify checking if a folio is swapped
      mm, swap: no need to clear the shadow explicitly

 include/linux/swap.h |   28 +-
 mm/memory.c          |    2 +-
 mm/swap.h            |   22 +-
 mm/swap_state.c      |   72 ++--
 mm/swap_table.h      |  138 ++++++-
 mm/swapfile.c        | 1121 +++++++++++++++++++++-----------------------------
 mm/workingset.c      |   49 ++-
 7 files changed, 667 insertions(+), 765 deletions(-)
---
base-commit: d9982f38eb6e9a0cb6bdd1116cc87f75a1084aad
change-id: 20251216-swap-table-p3-8de73fee7b5f

Best regards,
-- 
Kairui Song <kasong@tencent.com>





* [PATCH v3 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  6:36   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 02/12] mm, swap: clean up swapon process and locking Kairui Song via B4 Relay
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

/proc/swaps uses si->swap_map as the indicator to check if the swap
device is mounted. swap_map will be removed soon, so change it to use
si->swap_file instead because:

- si->swap_file is the only dynamic content that /proc/swaps is
  interested in. Previously, it was checking si->swap_map just to ensure
  si->swap_file is available. si->swap_map is set under mutex
  protection, and after si->swap_file is set, so having si->swap_map set
  guarantees si->swap_file is set.

- Checking si->flags doesn't work here. SWP_WRITEOK is cleared during
  swapoff, but /proc/swaps is supposed to show the device under swapoff
  too to report the swapoff progress. And SWP_USED is set even if the
  device hasn't been properly set up.

We could add another flag, but the easier way is to just check
si->swap_file directly. So protect the setting of si->swap_file with the
mutex, and set si->swap_file only when the swap device is truly enabled.

/proc/swaps is only interested in si->swap_file and reading a few static
fields. Only si->swap_file needs protection; reading the other static
fields is always fine.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 32e0e7545ab8..25dfe992538d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -110,6 +110,7 @@ struct swap_info_struct *swap_info[MAX_SWAPFILES];
 
 static struct kmem_cache *swap_table_cachep;
 
+/* Protects si->swap_file for /proc/swaps usage */
 static DEFINE_MUTEX(swapon_mutex);
 
 static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
@@ -2532,7 +2533,8 @@ static void drain_mmlist(void)
 /*
  * Free all of a swapdev's extent information
  */
-static void destroy_swap_extents(struct swap_info_struct *sis)
+static void destroy_swap_extents(struct swap_info_struct *sis,
+				 struct file *swap_file)
 {
 	while (!RB_EMPTY_ROOT(&sis->swap_extent_root)) {
 		struct rb_node *rb = sis->swap_extent_root.rb_node;
@@ -2543,7 +2545,6 @@ static void destroy_swap_extents(struct swap_info_struct *sis)
 	}
 
 	if (sis->flags & SWP_ACTIVATED) {
-		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 
 		sis->flags &= ~SWP_ACTIVATED;
@@ -2626,9 +2627,9 @@ EXPORT_SYMBOL_GPL(add_swap_extent);
  * Typically it is in the 1-4 megabyte range.  So we can have hundreds of
  * extents in the rbtree. - akpm.
  */
-static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
+static int setup_swap_extents(struct swap_info_struct *sis,
+			      struct file *swap_file, sector_t *span)
 {
-	struct file *swap_file = sis->swap_file;
 	struct address_space *mapping = swap_file->f_mapping;
 	struct inode *inode = mapping->host;
 	int ret;
@@ -2646,7 +2647,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
 		sis->flags |= SWP_ACTIVATED;
 		if ((sis->flags & SWP_FS_OPS) &&
 		    sio_pool_init() != 0) {
-			destroy_swap_extents(sis);
+			destroy_swap_extents(sis, swap_file);
 			return -ENOMEM;
 		}
 		return ret;
@@ -2862,7 +2863,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	flush_work(&p->reclaim_work);
 	flush_percpu_swap_cluster(p);
 
-	destroy_swap_extents(p);
+	destroy_swap_extents(p, p->swap_file);
 	if (p->flags & SWP_CONTINUED)
 		free_swap_count_continuations(p);
 
@@ -2952,7 +2953,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
 		return SEQ_START_TOKEN;
 
 	for (type = 0; (si = swap_type_to_info(type)); type++) {
-		if (!(si->flags & SWP_USED) || !si->swap_map)
+		if (!(si->swap_file))
 			continue;
 		if (!--l)
 			return si;
@@ -2973,7 +2974,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
 
 	++(*pos);
 	for (; (si = swap_type_to_info(type)); type++) {
-		if (!(si->flags & SWP_USED) || !si->swap_map)
+		if (!(si->swap_file))
 			continue;
 		return si;
 	}
@@ -3390,7 +3391,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap;
 	}
 
-	si->swap_file = swap_file;
 	mapping = swap_file->f_mapping;
 	dentry = swap_file->f_path.dentry;
 	inode = mapping->host;
@@ -3440,7 +3440,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	si->max = maxpages;
 	si->pages = maxpages - 1;
-	nr_extents = setup_swap_extents(si, &span);
+	nr_extents = setup_swap_extents(si, swap_file, &span);
 	if (nr_extents < 0) {
 		error = nr_extents;
 		goto bad_swap_unlock_inode;
@@ -3549,6 +3549,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	prio = DEF_SWAP_PRIO;
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio = swap_flags & SWAP_FLAG_PRIO_MASK;
+
+	si->swap_file = swap_file;
 	enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
 
 	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s\n",
@@ -3573,10 +3575,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-	destroy_swap_extents(si);
+	destroy_swap_extents(si, swap_file);
 	swap_cgroup_swapoff(si->type);
 	spin_lock(&swap_lock);
-	si->swap_file = NULL;
 	si->flags = 0;
 	spin_unlock(&swap_lock);
 	vfree(swap_map);

-- 
2.52.0





* [PATCH v3 02/12] mm, swap: clean up swapon process and locking
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
  2026-02-17 20:06 ` [PATCH v3 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  6:45   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 03/12] mm, swap: remove redundant arguments and locking for enabling a device Kairui Song via B4 Relay
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

Slightly clean up the swapon process. Add comments about what swap_lock
protects, introduce and rename helpers that wrap the swap_map and
cluster_info setup, and do that setup outside of swap_lock.

This lock protection is not needed for the swap_map and cluster_info
setup because all swap users must either hold the percpu ref or hold a
stable allocated swap entry (e.g., by locking a folio in the swap cache)
before accessing them. So before the swap device is exposed by
enable_swap_info, nothing can use the swap device's map or clusters.

So we are safe to allocate and set up swap data freely first, then
expose the swap device and set the SWP_WRITEOK flag.
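
A rough sketch of the resulting swapon ordering (simplified from the diff
below; error handling and unrelated steps omitted):

	setup_swap_map(si, swap_header, maxpages);           /* no lock needed  */
	setup_swap_clusters_info(si, swap_header, maxpages); /* not exposed yet */
	swap_cgroup_swapon(si->type, maxpages);
	...
	si->swap_file = swap_file;
	enable_swap_info(si, prio, zeromap);    /* sets SWP_WRITEOK, exposes si */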

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 87 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 48 insertions(+), 39 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 25dfe992538d..8fc35b316ade 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -65,6 +65,13 @@ static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
 			 enum swap_cluster_flags new_flags);
 
+/*
+ * Protects the swap_info array and the SWP_USED flag. swap_info contains
+ * lazily allocated & freed swap device info structs, and SWP_USED indicates
+ * which device is used; ~SWP_USED devices can be reused.
+ *
+ * Also protects swap_active_head, total_swap_pages, and the SWP_WRITEOK flag.
+ */
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
@@ -2657,8 +2664,6 @@ static int setup_swap_extents(struct swap_info_struct *sis,
 }
 
 static void setup_swap_info(struct swap_info_struct *si, int prio,
-			    unsigned char *swap_map,
-			    struct swap_cluster_info *cluster_info,
 			    unsigned long *zeromap)
 {
 	si->prio = prio;
@@ -2668,8 +2673,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
 	 */
 	si->list.prio = -si->prio;
 	si->avail_list.prio = -si->prio;
-	si->swap_map = swap_map;
-	si->cluster_info = cluster_info;
 	si->zeromap = zeromap;
 }
 
@@ -2687,13 +2690,11 @@ static void _enable_swap_info(struct swap_info_struct *si)
 }
 
 static void enable_swap_info(struct swap_info_struct *si, int prio,
-				unsigned char *swap_map,
-				struct swap_cluster_info *cluster_info,
-				unsigned long *zeromap)
+			     unsigned long *zeromap)
 {
 	spin_lock(&swap_lock);
 	spin_lock(&si->lock);
-	setup_swap_info(si, prio, swap_map, cluster_info, zeromap);
+	setup_swap_info(si, prio, zeromap);
 	spin_unlock(&si->lock);
 	spin_unlock(&swap_lock);
 	/*
@@ -2711,7 +2712,7 @@ static void reinsert_swap_info(struct swap_info_struct *si)
 {
 	spin_lock(&swap_lock);
 	spin_lock(&si->lock);
-	setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap);
+	setup_swap_info(si, si->prio, si->zeromap);
 	_enable_swap_info(si);
 	spin_unlock(&si->lock);
 	spin_unlock(&swap_lock);
@@ -2735,8 +2736,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	}
 }
 
-static void free_cluster_info(struct swap_cluster_info *cluster_info,
-			      unsigned long maxpages)
+static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
+				   unsigned long maxpages)
 {
 	struct swap_cluster_info *ci;
 	int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
@@ -2894,7 +2895,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
-	free_cluster_info(cluster_info, maxpages);
+	free_swap_cluster_info(cluster_info, maxpages);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
 
@@ -3243,10 +3244,15 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 
 static int setup_swap_map(struct swap_info_struct *si,
 			  union swap_header *swap_header,
-			  unsigned char *swap_map,
 			  unsigned long maxpages)
 {
 	unsigned long i;
+	unsigned char *swap_map;
+
+	swap_map = vzalloc(maxpages);
+	si->swap_map = swap_map;
+	if (!swap_map)
+		return -ENOMEM;
 
 	swap_map[0] = SWAP_MAP_BAD; /* omit header page */
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
@@ -3267,9 +3273,9 @@ static int setup_swap_map(struct swap_info_struct *si,
 	return 0;
 }
 
-static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
-						union swap_header *swap_header,
-						unsigned long maxpages)
+static int setup_swap_clusters_info(struct swap_info_struct *si,
+				    union swap_header *swap_header,
+				    unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
@@ -3339,10 +3345,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		}
 	}
 
-	return cluster_info;
+	si->cluster_info = cluster_info;
+	return 0;
 err:
-	free_cluster_info(cluster_info, maxpages);
-	return ERR_PTR(err);
+	free_swap_cluster_info(cluster_info, maxpages);
+	return err;
 }
 
 SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
@@ -3358,9 +3365,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	int nr_extents;
 	sector_t span;
 	unsigned long maxpages;
-	unsigned char *swap_map = NULL;
 	unsigned long *zeromap = NULL;
-	struct swap_cluster_info *cluster_info = NULL;
 	struct folio *folio = NULL;
 	struct inode *inode = NULL;
 	bool inced_nr_rotate_swap = false;
@@ -3371,6 +3376,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	/*
+	 * Allocate or reuse an existing !SWP_USED swap_info. The returned
+	 * si will stay in a dying state, so nothing will access its content
+	 * until enable_swap_info resurrects its percpu ref and exposes it.
+	 */
 	si = alloc_swap_info();
 	if (IS_ERR(si))
 		return PTR_ERR(si);
@@ -3453,18 +3463,17 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	maxpages = si->max;
 
-	/* OK, set up the swap map and apply the bad block list */
-	swap_map = vzalloc(maxpages);
-	if (!swap_map) {
-		error = -ENOMEM;
+	/* Set up the swap map and apply the bad block list */
+	error = setup_swap_map(si, swap_header, maxpages);
+	if (error)
 		goto bad_swap_unlock_inode;
-	}
 
-	error = swap_cgroup_swapon(si->type, maxpages);
+	/* Set up the swap cluster info */
+	error = setup_swap_clusters_info(si, swap_header, maxpages);
 	if (error)
 		goto bad_swap_unlock_inode;
 
-	error = setup_swap_map(si, swap_header, swap_map, maxpages);
+	error = swap_cgroup_swapon(si->type, maxpages);
 	if (error)
 		goto bad_swap_unlock_inode;
 
@@ -3492,13 +3501,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		inced_nr_rotate_swap = true;
 	}
 
-	cluster_info = setup_clusters(si, swap_header, maxpages);
-	if (IS_ERR(cluster_info)) {
-		error = PTR_ERR(cluster_info);
-		cluster_info = NULL;
-		goto bad_swap_unlock_inode;
-	}
-
 	if ((swap_flags & SWAP_FLAG_DISCARD) &&
 	    si->bdev && bdev_max_discard_sectors(si->bdev)) {
 		/*
@@ -3551,7 +3553,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		prio = swap_flags & SWAP_FLAG_PRIO_MASK;
 
 	si->swap_file = swap_file;
-	enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
+
+	/* Sets SWP_WRITEOK, resurrect the percpu ref, expose the swap device */
+	enable_swap_info(si, prio, zeromap);
 
 	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s\n",
 		K(si->pages), name->name, si->prio, nr_extents,
@@ -3577,13 +3581,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	inode = NULL;
 	destroy_swap_extents(si, swap_file);
 	swap_cgroup_swapoff(si->type);
+	vfree(si->swap_map);
+	si->swap_map = NULL;
+	free_swap_cluster_info(si->cluster_info, si->max);
+	si->cluster_info = NULL;
+	/*
+	 * Clear the SWP_USED flag after all resources are freed so
+	 * alloc_swap_info can reuse this si safely.
+	 */
 	spin_lock(&swap_lock);
 	si->flags = 0;
 	spin_unlock(&swap_lock);
-	vfree(swap_map);
 	kvfree(zeromap);
-	if (cluster_info)
-		free_cluster_info(cluster_info, maxpages);
 	if (inced_nr_rotate_swap)
 		atomic_dec(&nr_rotate_swap);
 	if (swap_file)

-- 
2.52.0





* [PATCH v3 03/12] mm, swap: remove redundant arguments and locking for enabling a device
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
  2026-02-17 20:06 ` [PATCH v3 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator Kairui Song via B4 Relay
  2026-02-17 20:06 ` [PATCH v3 02/12] mm, swap: clean up swapon process and locking Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  6:48   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 04/12] mm, swap: consolidate bad slots setup and make it more robust Kairui Song via B4 Relay
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

There is no need to repeatedly pass the zeromap and priority values.
zeromap is similar to cluster_info and swap_map, which are only used
once the swap device is exposed. And the prio value is currently only
read after it is set, and only used for list insertion when the device
is exposed or for swap info display.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 48 ++++++++++++++++++------------------------------
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8fc35b316ade..fb0d48681c48 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2663,19 +2663,6 @@ static int setup_swap_extents(struct swap_info_struct *sis,
 	return generic_swapfile_activate(sis, swap_file, span);
 }
 
-static void setup_swap_info(struct swap_info_struct *si, int prio,
-			    unsigned long *zeromap)
-{
-	si->prio = prio;
-	/*
-	 * the plist prio is negated because plist ordering is
-	 * low-to-high, while swap ordering is high-to-low
-	 */
-	si->list.prio = -si->prio;
-	si->avail_list.prio = -si->prio;
-	si->zeromap = zeromap;
-}
-
 static void _enable_swap_info(struct swap_info_struct *si)
 {
 	atomic_long_add(si->pages, &nr_swap_pages);
@@ -2689,17 +2676,12 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	add_to_avail_list(si, true);
 }
 
-static void enable_swap_info(struct swap_info_struct *si, int prio,
-			     unsigned long *zeromap)
+/*
+ * Called after the swap device is ready, resurrect its percpu ref, it's now
+ * safe to reference it. Add it to the list to expose it to the allocator.
+ */
+static void enable_swap_info(struct swap_info_struct *si)
 {
-	spin_lock(&swap_lock);
-	spin_lock(&si->lock);
-	setup_swap_info(si, prio, zeromap);
-	spin_unlock(&si->lock);
-	spin_unlock(&swap_lock);
-	/*
-	 * Finished initializing swap device, now it's safe to reference it.
-	 */
 	percpu_ref_resurrect(&si->users);
 	spin_lock(&swap_lock);
 	spin_lock(&si->lock);
@@ -2712,7 +2694,6 @@ static void reinsert_swap_info(struct swap_info_struct *si)
 {
 	spin_lock(&swap_lock);
 	spin_lock(&si->lock);
-	setup_swap_info(si, si->prio, si->zeromap);
 	_enable_swap_info(si);
 	spin_unlock(&si->lock);
 	spin_unlock(&swap_lock);
@@ -3365,7 +3346,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	int nr_extents;
 	sector_t span;
 	unsigned long maxpages;
-	unsigned long *zeromap = NULL;
 	struct folio *folio = NULL;
 	struct inode *inode = NULL;
 	bool inced_nr_rotate_swap = false;
@@ -3481,9 +3461,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	 * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
 	 * be above MAX_PAGE_ORDER incase of a large swap file.
 	 */
-	zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long),
-				    GFP_KERNEL | __GFP_ZERO);
-	if (!zeromap) {
+	si->zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!si->zeromap) {
 		error = -ENOMEM;
 		goto bad_swap_unlock_inode;
 	}
@@ -3552,10 +3532,17 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (swap_flags & SWAP_FLAG_PREFER)
 		prio = swap_flags & SWAP_FLAG_PRIO_MASK;
 
+	/*
+	 * The plist prio is negated because plist ordering is
+	 * low-to-high, while swap ordering is high-to-low
+	 */
+	si->prio = prio;
+	si->list.prio = -si->prio;
+	si->avail_list.prio = -si->prio;
 	si->swap_file = swap_file;
 
 	/* Sets SWP_WRITEOK, resurrect the percpu ref, expose the swap device */
-	enable_swap_info(si, prio, zeromap);
+	enable_swap_info(si);
 
 	pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s\n",
 		K(si->pages), name->name, si->prio, nr_extents,
@@ -3585,6 +3572,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	si->swap_map = NULL;
 	free_swap_cluster_info(si->cluster_info, si->max);
 	si->cluster_info = NULL;
+	kvfree(si->zeromap);
+	si->zeromap = NULL;
 	/*
 	 * Clear the SWP_USED flag after all resources are freed so
 	 * alloc_swap_info can reuse this si safely.
@@ -3592,7 +3581,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	spin_lock(&swap_lock);
 	si->flags = 0;
 	spin_unlock(&swap_lock);
-	kvfree(zeromap);
 	if (inced_nr_rotate_swap)
 		atomic_dec(&nr_rotate_swap);
 	if (swap_file)

-- 
2.52.0





* [PATCH v3 04/12] mm, swap: consolidate bad slots setup and make it more robust
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (2 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 03/12] mm, swap: remove redundant arguments and locking for enabling a device Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  6:51   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 05/12] mm/workingset: leave highest bits empty for anon shadow Kairui Song via B4 Relay
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

In preparation for using the swap table to track bad slots directly,
move the bad slot setup to one place, so the swap_map marking and the
cluster counter update are done together.

While at it, provide more informative logs and a more robust fallback if
any bad slot info looks incorrect.

This fixes a potential issue where a malformed swap file may cause a
cluster to be unusable after swapon, and provides a more verbose warning
for such a malformed swap file.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 68 +++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index fb0d48681c48..91c1fa804185 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -743,13 +743,37 @@ static void relocate_cluster(struct swap_info_struct *si,
  * slot. The cluster will not be added to the free cluster list, and its
  * usage counter will be increased by 1. Only used for initialization.
  */
-static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
-				       unsigned long offset)
+static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
+				       struct swap_cluster_info *cluster_info,
+				       unsigned int offset, bool mask)
 {
 	unsigned long idx = offset / SWAPFILE_CLUSTER;
 	struct swap_table *table;
 	struct swap_cluster_info *ci;
 
+	/* si->max may have been shrunk by swap_activate() */
+	if (offset >= si->max && !mask) {
+		pr_debug("Ignoring bad slot %u (max: %u)\n", offset, si->max);
+		return 0;
+	}
+	/*
+	 * Account it, skipping the header slot: si->pages is initialized as
+	 * si->max - 1. Also skip the masking of the last cluster;
+	 * si->pages doesn't include that part.
+	 */
+	if (offset && !mask)
+		si->pages -= 1;
+	if (!si->pages) {
+		pr_warn("Empty swap-file\n");
+		return -EINVAL;
+	}
+	/* Check for duplicated bad swap slots. */
+	if (si->swap_map[offset]) {
+		pr_warn("Duplicated bad slot offset %d\n", offset);
+		return -EINVAL;
+	}
+
+	si->swap_map[offset] = SWAP_MAP_BAD;
 	ci = cluster_info + idx;
 	if (!ci->table) {
 		table = swap_table_alloc(GFP_KERNEL);
@@ -3227,30 +3251,12 @@ static int setup_swap_map(struct swap_info_struct *si,
 			  union swap_header *swap_header,
 			  unsigned long maxpages)
 {
-	unsigned long i;
 	unsigned char *swap_map;
 
 	swap_map = vzalloc(maxpages);
 	si->swap_map = swap_map;
 	if (!swap_map)
 		return -ENOMEM;
-
-	swap_map[0] = SWAP_MAP_BAD; /* omit header page */
-	for (i = 0; i < swap_header->info.nr_badpages; i++) {
-		unsigned int page_nr = swap_header->info.badpages[i];
-		if (page_nr == 0 || page_nr > swap_header->info.last_page)
-			return -EINVAL;
-		if (page_nr < maxpages) {
-			swap_map[page_nr] = SWAP_MAP_BAD;
-			si->pages--;
-		}
-	}
-
-	if (!si->pages) {
-		pr_warn("Empty swap-file\n");
-		return -EINVAL;
-	}
-
 	return 0;
 }
 
@@ -3281,26 +3287,28 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 	}
 
 	/*
-	 * Mark unusable pages as unavailable. The clusters aren't
-	 * marked free yet, so no list operations are involved yet.
-	 *
-	 * See setup_swap_map(): header page, bad pages,
-	 * and the EOF part of the last cluster.
+	 * Mark unusable pages (header page, bad pages, and the EOF part of
+	 * the last cluster) as unavailable. The clusters aren't marked free
+	 * yet, so no list operations are involved yet.
 	 */
-	err = swap_cluster_setup_bad_slot(cluster_info, 0);
+	err = swap_cluster_setup_bad_slot(si, cluster_info, 0, false);
 	if (err)
 		goto err;
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 
-		if (page_nr >= maxpages)
-			continue;
-		err = swap_cluster_setup_bad_slot(cluster_info, page_nr);
+		if (!page_nr || page_nr > swap_header->info.last_page) {
+			pr_warn("Bad slot offset is out of border: %d (last_page: %d)\n",
+				page_nr, swap_header->info.last_page);
+			err = -EINVAL;
+			goto err;
+		}
+		err = swap_cluster_setup_bad_slot(si, cluster_info, page_nr, false);
 		if (err)
 			goto err;
 	}
 	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
-		err = swap_cluster_setup_bad_slot(cluster_info, i);
+		err = swap_cluster_setup_bad_slot(si, cluster_info, i, true);
 		if (err)
 			goto err;
 	}

-- 
2.52.0





* [PATCH v3 05/12] mm/workingset: leave highest bits empty for anon shadow
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (3 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 04/12] mm, swap: consolidate bad slots setup and make it more robust Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  6:56   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 06/12] mm, swap: implement helpers for reserving data in the swap table Kairui Song via B4 Relay
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

The swap table entry will need 4 bits reserved for the swap count in the
shadow, so the anon shadow should keep its leading 4 bits as 0.

This should be OK for the foreseeable future. Take 52 bits of physical
address space as an example: with 4K pages, there would be at most 40
bits of addressable pages. Currently, we have 36 bits available (64 - 1
- 16 - 10 - 1, where XA_VALUE takes 1 bit for the marker,
MEM_CGROUP_ID_SHIFT takes 16 bits, NODES_SHIFT takes <=10 bits, and the
WORKINGSET flag takes 1 bit).

So in the worst case, we previously needed to pack the 40 bits of address
into a 36-bit field using a 64K bucket (bucket_order = 4). After this, the
bucket grows to 1M, which should be fine: on such large machines, the
working set size will be way larger than the bucket size.
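
A quick worked version of that calculation (same 52-bit, 4K-page example
as above, with NODES_SHIFT at its 10-bit maximum):

    file timestamp bits: 64 - 1 - 16 - 10 - 1 = 36
         bucket_order = 40 - 36 = 4  ->  2^4 pages  =  64K per bucket
    anon timestamp bits: 36 - 4 (SWAP_COUNT)  = 32
         bucket_order = 40 - 32 = 8  ->  2^8 pages  =   1M per bucket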

For MGLRU's gen number tracking, this should be more than enough, since
the gen number (max_seq) increments much more slowly than the eviction
counter (nonresident_age).

After all, both the refault distance and the gen distance are only hints
that can tolerate inaccuracy just fine.

The 4 bits can be shrunk to 3, or extended to a higher value later if
needed.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_table.h |  4 ++++
 mm/workingset.c | 49 ++++++++++++++++++++++++++++++-------------------
 2 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/mm/swap_table.h b/mm/swap_table.h
index ea244a57a5b7..10e11d1f3b04 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -12,6 +12,7 @@ struct swap_table {
 };
 
 #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
+#define SWP_TB_COUNT_BITS		4
 
 /*
  * A swap table entry represents the status of a swap slot on a swap
@@ -22,6 +23,9 @@ struct swap_table {
  * (shadow), or NULL.
  */
 
+/* Macro for shadow offset calculation */
+#define SWAP_COUNT_SHIFT	SWP_TB_COUNT_BITS
+
 /*
  * Helpers for casting one type of info into a swap table entry.
  */
diff --git a/mm/workingset.c b/mm/workingset.c
index 13422d304715..37a94979900f 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -16,6 +16,7 @@
 #include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include "swap_table.h"
 #include "internal.h"
 
 /*
@@ -184,7 +185,9 @@
 #define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
 			 WORKINGSET_SHIFT + NODES_SHIFT + \
 			 MEM_CGROUP_ID_SHIFT)
+#define EVICTION_SHIFT_ANON	(EVICTION_SHIFT + SWAP_COUNT_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
+#define EVICTION_MASK_ANON	(~0UL >> EVICTION_SHIFT_ANON)
 
 /*
  * Eviction timestamps need to be able to cover the full range of
@@ -194,12 +197,12 @@
  * that case, we have to sacrifice granularity for distance, and group
  * evictions into coarser buckets by shaving off lower timestamp bits.
  */
-static unsigned int bucket_order __read_mostly;
+static unsigned int bucket_order[ANON_AND_FILE] __read_mostly;
 
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
-			 bool workingset)
+			 bool workingset, bool file)
 {
-	eviction &= EVICTION_MASK;
+	eviction &= file ? EVICTION_MASK : EVICTION_MASK_ANON;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
 	eviction = (eviction << WORKINGSET_SHIFT) | workingset;
@@ -244,7 +247,8 @@ static void *lru_gen_eviction(struct folio *folio)
 	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
-	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH >
+		     BITS_PER_LONG - max(EVICTION_SHIFT, EVICTION_SHIFT_ANON));
 
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
@@ -254,7 +258,7 @@ static void *lru_gen_eviction(struct folio *folio)
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
 
-	return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset);
+	return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset, type);
 }
 
 /*
@@ -262,7 +266,7 @@ static void *lru_gen_eviction(struct folio *folio)
  * Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
  */
 static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
-				unsigned long *token, bool *workingset)
+				unsigned long *token, bool *workingset, bool file)
 {
 	int memcg_id;
 	unsigned long max_seq;
@@ -275,7 +279,7 @@ static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
 	*lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	max_seq = READ_ONCE((*lruvec)->lrugen.max_seq);
-	max_seq &= EVICTION_MASK >> LRU_REFS_WIDTH;
+	max_seq &= (file ? EVICTION_MASK : EVICTION_MASK_ANON) >> LRU_REFS_WIDTH;
 
 	return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
 }
@@ -293,7 +297,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 	rcu_read_lock();
 
-	recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset);
+	recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset, type);
 	if (lruvec != folio_lruvec(folio))
 		goto unlock;
 
@@ -331,7 +335,7 @@ static void *lru_gen_eviction(struct folio *folio)
 }
 
 static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
-				unsigned long *token, bool *workingset)
+				unsigned long *token, bool *workingset, bool file)
 {
 	return false;
 }
@@ -381,6 +385,7 @@ void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 {
 	struct pglist_data *pgdat = folio_pgdat(folio);
+	int file = folio_is_file_lru(folio);
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	int memcgid;
@@ -397,10 +402,10 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_private_id(lruvec_memcg(lruvec));
 	eviction = atomic_long_read(&lruvec->nonresident_age);
-	eviction >>= bucket_order;
+	eviction >>= bucket_order[file];
 	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 	return pack_shadow(memcgid, pgdat, eviction,
-				folio_test_workingset(folio));
+			   folio_test_workingset(folio), file);
 }
 
 /**
@@ -431,14 +436,15 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
 		bool recent;
 
 		rcu_read_lock();
-		recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction, workingset);
+		recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction,
+					     workingset, file);
 		rcu_read_unlock();
 		return recent;
 	}
 
 	rcu_read_lock();
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
-	eviction <<= bucket_order;
+	eviction <<= bucket_order[file];
 
 	/*
 	 * Look up the memcg associated with the stored ID. It might
@@ -495,7 +501,8 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
 	 */
-	refault_distance = (refault - eviction) & EVICTION_MASK;
+	refault_distance = ((refault - eviction) &
+			    (file ? EVICTION_MASK : EVICTION_MASK_ANON));
 
 	/*
 	 * Compare the distance to the existing workingset size. We
@@ -780,8 +787,8 @@ static struct lock_class_key shadow_nodes_key;
 
 static int __init workingset_init(void)
 {
+	unsigned int timestamp_bits, timestamp_bits_anon;
 	struct shrinker *workingset_shadow_shrinker;
-	unsigned int timestamp_bits;
 	unsigned int max_order;
 	int ret = -ENOMEM;
 
@@ -794,11 +801,15 @@ static int __init workingset_init(void)
 	 * double the initial memory by using totalram_pages as-is.
 	 */
 	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+	timestamp_bits_anon = BITS_PER_LONG - EVICTION_SHIFT_ANON;
 	max_order = fls_long(totalram_pages() - 1);
-	if (max_order > timestamp_bits)
-		bucket_order = max_order - timestamp_bits;
-	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
-	       timestamp_bits, max_order, bucket_order);
+	if (max_order > (BITS_PER_LONG - EVICTION_SHIFT))
+		bucket_order[WORKINGSET_FILE] = max_order - timestamp_bits;
+	if (max_order > timestamp_bits_anon)
+		bucket_order[WORKINGSET_ANON] = max_order - timestamp_bits_anon;
+	pr_info("workingset: timestamp_bits=%d (anon: %d) max_order=%d bucket_order=%u (anon: %d)\n",
+		timestamp_bits, timestamp_bits_anon, max_order,
+		bucket_order[WORKINGSET_FILE], bucket_order[WORKINGSET_ANON]);
 
 	workingset_shadow_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
 						    SHRINKER_MEMCG_AWARE,

-- 
2.52.0





* [PATCH v3 06/12] mm, swap: implement helpers for reserving data in the swap table
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (4 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 05/12] mm/workingset: leave highest bits empty for anon shadow Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  7:00   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 07/12] mm, swap: mark bad slots in swap table directly Kairui Song via B4 Relay
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

To prepare for using the swap table as the unified swap layer, introduce
macros and helpers for storing multiple kinds of data in a swap table
entry.

From now on, we store the PFN in the swap table to make space for extra
counting bits (SWAP_COUNT). Shadows are still stored as they are, since
the SWAP_COUNT is not used yet.

Also, rename shadow_swp_to_tb to shadow_to_swp_tb. That's a spelling
error, not really worth a separate fix.

No behaviour change yet, just prepare the API.
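
As a rough illustration of the intended round-trip with the new helpers
(not part of the patch; the nonzero count values are only examples here,
since callers still pass 0 for now):

	unsigned long swp_tb;

	swp_tb = folio_to_swp_tb(folio, 1);    /* cached slot, count 1      */
	swp_tb_is_folio(swp_tb);               /* true                      */
	swp_tb_to_folio(swp_tb);               /* returns the folio         */
	swp_tb_get_count(swp_tb);              /* 1                         */

	swp_tb = shadow_to_swp_tb(shadow, 2);  /* swapped-out slot, count 2 */
	swp_tb_is_shadow(swp_tb);              /* true                      */
	swp_tb_to_shadow(swp_tb);              /* returns the XA_VALUE      */
	swp_tb_get_count(swp_tb);              /* 2                         */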

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c |   6 +--
 mm/swap_table.h | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 124 insertions(+), 13 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6d0eef7470be..e213ee35c1d2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -148,7 +148,7 @@ void __swap_cache_add_folio(struct swap_cluster_info *ci,
 	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
 
-	new_tb = folio_to_swp_tb(folio);
+	new_tb = folio_to_swp_tb(folio, 0);
 	ci_start = swp_cluster_offset(entry);
 	ci_off = ci_start;
 	ci_end = ci_start + nr_pages;
@@ -249,7 +249,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
 
 	si = __swap_entry_to_info(entry);
-	new_tb = shadow_swp_to_tb(shadow);
+	new_tb = shadow_to_swp_tb(shadow, 0);
 	ci_start = swp_cluster_offset(entry);
 	ci_end = ci_start + nr_pages;
 	ci_off = ci_start;
@@ -331,7 +331,7 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
 	VM_WARN_ON_ONCE(!entry.val);
 
 	/* Swap cache still stores N entries instead of a high-order entry */
-	new_tb = folio_to_swp_tb(new);
+	new_tb = folio_to_swp_tb(new, 0);
 	do {
 		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
 		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 10e11d1f3b04..10762ac5f4f5 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -12,17 +12,72 @@ struct swap_table {
 };
 
 #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-#define SWP_TB_COUNT_BITS		4
 
 /*
  * A swap table entry represents the status of a swap slot on a swap
  * (physical or virtual) device. The swap table in each cluster is a
  * 1:1 map of the swap slots in this cluster.
  *
- * Each swap table entry could be a pointer (folio), a XA_VALUE
- * (shadow), or NULL.
+ * Swap table entry type and bits layouts:
+ *
+ * NULL:     |---------------- 0 ---------------| - Free slot
+ * Shadow:   | SWAP_COUNT |---- SHADOW_VAL ---|1| - Swapped out slot
+ * PFN:      | SWAP_COUNT |------ PFN -------|10| - Cached slot
+ * Pointer:  |----------- Pointer ----------|100| - (Unused)
+ * Bad:      |------------- 1 -------------|1000| - Bad slot
+ *
+ * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long.
+ *
+ * Usages:
+ *
+ * - NULL: Swap slot is unused, could be allocated.
+ *
+ * - Shadow: Swap slot is used and not cached (usually swapped out). It reuses
+ *   the XA_VALUE format to be compatible with working set shadows. SHADOW_VAL
+ *   part might be all 0 if the working set shadow info is absent. In such a case,
+ *   we still want to keep the shadow format as a placeholder.
+ *
+ *   Memcg ID is embedded in SHADOW_VAL.
+ *
+ * - PFN: Swap slot is in use, and cached. Memcg info is recorded on the page
+ *   struct.
+ *
+ * - Pointer: Unused yet. `0b100` is reserved for potential pointer usage
+ *   because only the lower three bits can be used as a marker for 8 bytes
+ *   aligned pointers.
+ *
+ * - Bad: Swap slot is reserved, protects swap header or holes on swap devices.
  */
 
+#if defined(MAX_POSSIBLE_PHYSMEM_BITS)
+#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
+#elif defined(MAX_PHYSMEM_BITS)
+#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#else
+#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
+#endif
+
+/* NULL Entry, all 0 */
+#define SWP_TB_NULL		0UL
+
+/* Swapped out: shadow */
+#define SWP_TB_SHADOW_MARK	0b1UL
+
+/* Cached: PFN */
+#define SWP_TB_PFN_BITS		(SWAP_CACHE_PFN_BITS + SWP_TB_PFN_MARK_BITS)
+#define SWP_TB_PFN_MARK		0b10UL
+#define SWP_TB_PFN_MARK_BITS	2
+#define SWP_TB_PFN_MARK_MASK	(BIT(SWP_TB_PFN_MARK_BITS) - 1)
+
+/* SWAP_COUNT part for PFN or shadow, the width can be shrunk or extended */
+#define SWP_TB_COUNT_BITS      min(4, BITS_PER_LONG - SWP_TB_PFN_BITS)
+#define SWP_TB_COUNT_MASK      (~((~0UL) >> SWP_TB_COUNT_BITS))
+#define SWP_TB_COUNT_SHIFT     (BITS_PER_LONG - SWP_TB_COUNT_BITS)
+#define SWP_TB_COUNT_MAX       ((1 << SWP_TB_COUNT_BITS) - 1)
+
+/* Bad slot: ends with 0b1000 and the rest of the bits are all 1 */
+#define SWP_TB_BAD		((~0UL) << 3)
+
 /* Macro for shadow offset calculation */
 #define SWAP_COUNT_SHIFT	SWP_TB_COUNT_BITS
 
@@ -35,18 +90,47 @@ static inline unsigned long null_to_swp_tb(void)
 	return 0;
 }
 
-static inline unsigned long folio_to_swp_tb(struct folio *folio)
+static inline unsigned long __count_to_swp_tb(unsigned char count)
 {
+	/*
+	 * At least three values are needed to distinguish free (0),
+	 * used (count > 0 && count < SWP_TB_COUNT_MAX), and
+	 * overflow (count == SWP_TB_COUNT_MAX).
+	 */
+	BUILD_BUG_ON(SWP_TB_COUNT_MAX < 2 || SWP_TB_COUNT_BITS < 2);
+	VM_WARN_ON(count > SWP_TB_COUNT_MAX);
+	return ((unsigned long)count) << SWP_TB_COUNT_SHIFT;
+}
+
+static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned int count)
+{
+	unsigned long swp_tb;
+
 	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
-	return (unsigned long)folio;
+	BUILD_BUG_ON(SWAP_CACHE_PFN_BITS >
+		     (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_COUNT_BITS));
+
+	swp_tb = (pfn << SWP_TB_PFN_MARK_BITS) | SWP_TB_PFN_MARK;
+	VM_WARN_ON_ONCE(swp_tb & SWP_TB_COUNT_MASK);
+
+	return swp_tb | __count_to_swp_tb(count);
+}
+
+static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned int count)
+{
+	return pfn_to_swp_tb(folio_pfn(folio), count);
 }
 
-static inline unsigned long shadow_swp_to_tb(void *shadow)
+static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned int count)
 {
 	BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
 		     BITS_PER_BYTE * sizeof(unsigned long));
+	BUILD_BUG_ON((unsigned long)xa_mk_value(0) != SWP_TB_SHADOW_MARK);
+
 	VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
-	return (unsigned long)shadow;
+	VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_COUNT_MASK));
+
+	return (unsigned long)shadow | __count_to_swp_tb(count) | SWP_TB_SHADOW_MARK;
 }
 
 /*
@@ -59,7 +143,7 @@ static inline bool swp_tb_is_null(unsigned long swp_tb)
 
 static inline bool swp_tb_is_folio(unsigned long swp_tb)
 {
-	return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
+	return ((swp_tb & SWP_TB_PFN_MARK_MASK) == SWP_TB_PFN_MARK);
 }
 
 static inline bool swp_tb_is_shadow(unsigned long swp_tb)
@@ -67,19 +151,44 @@ static inline bool swp_tb_is_shadow(unsigned long swp_tb)
 	return xa_is_value((void *)swp_tb);
 }
 
+static inline bool swp_tb_is_bad(unsigned long swp_tb)
+{
+	return swp_tb == SWP_TB_BAD;
+}
+
+static inline bool swp_tb_is_countable(unsigned long swp_tb)
+{
+	return (swp_tb_is_shadow(swp_tb) || swp_tb_is_folio(swp_tb) ||
+		swp_tb_is_null(swp_tb));
+}
+
 /*
  * Helpers for retrieving info from swap table.
  */
 static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
 {
 	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
-	return (void *)swp_tb;
+	return pfn_folio((swp_tb & ~SWP_TB_COUNT_MASK) >> SWP_TB_PFN_MARK_BITS);
 }
 
 static inline void *swp_tb_to_shadow(unsigned long swp_tb)
 {
 	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
-	return (void *)swp_tb;
+	/* No shift needed, xa_value is stored as it is in the lower bits. */
+	return (void *)(swp_tb & ~SWP_TB_COUNT_MASK);
+}
+
+static inline unsigned char __swp_tb_get_count(unsigned long swp_tb)
+{
+	VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+	return ((swp_tb & SWP_TB_COUNT_MASK) >> SWP_TB_COUNT_SHIFT);
+}
+
+static inline int swp_tb_get_count(unsigned long swp_tb)
+{
+	if (swp_tb_is_countable(swp_tb))
+		return __swp_tb_get_count(swp_tb);
+	return -EINVAL;
 }
 
 /*
@@ -124,6 +233,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
 	atomic_long_t *table;
 	unsigned long swp_tb;
 
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+
 	rcu_read_lock();
 	table = rcu_dereference(ci->table);
 	swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();

-- 
2.52.0





* [PATCH v3 07/12] mm, swap: mark bad slots in swap table directly
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (5 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 06/12] mm, swap: implement helpers for reserving data in the swap table Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  7:01   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 08/12] mm, swap: simplify swap table sanity range check Kairui Song via B4 Relay
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

In preparation for deprecating swap_map, also mark bad slots in the swap
table when setting SWAP_MAP_BAD in swap_map. Also, refine the swap table
sanity check on freeing to adapt to the bad slot change. For swapoff,
the bad slot count must match the cluster usage count, as nothing should
touch bad slots, and they contribute to the cluster usage count at
swapon. For ordinary swap table freeing, the swap table of a cluster with
bad slots should never be freed, since the cluster usage count never
reaches zero.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 56 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 91c1fa804185..18bacf16cd26 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -454,16 +454,37 @@ static void swap_table_free(struct swap_table *table)
 		 swap_table_free_folio_rcu_cb);
 }
 
+/*
+ * Sanity check to ensure nothing leaked, and the specified range is empty.
+ * One special case is that bad slots can't be freed, so check the number of
+ * bad slots for swapoff, and non-swapoff path must never free bad slots.
+ */
+static void swap_cluster_assert_empty(struct swap_cluster_info *ci, bool swapoff)
+{
+	unsigned int ci_off = 0, ci_end = SWAPFILE_CLUSTER;
+	unsigned long swp_tb;
+	int bad_slots = 0;
+
+	if (!IS_ENABLED(CONFIG_DEBUG_VM) && !swapoff)
+		return;
+
+	do {
+		swp_tb = __swap_table_get(ci, ci_off);
+		if (swp_tb_is_bad(swp_tb))
+			bad_slots++;
+		else
+			WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
+	} while (++ci_off < ci_end);
+
+	WARN_ON_ONCE(bad_slots != (swapoff ? ci->count : 0));
+}
+
 static void swap_cluster_free_table(struct swap_cluster_info *ci)
 {
-	unsigned int ci_off;
 	struct swap_table *table;
 
 	/* Only empty cluster's table is allow to be freed  */
 	lockdep_assert_held(&ci->lock);
-	VM_WARN_ON_ONCE(!cluster_is_empty(ci));
-	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
-		VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
 	table = (void *)rcu_dereference_protected(ci->table, true);
 	rcu_assign_pointer(ci->table, NULL);
 
@@ -567,6 +588,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
+	swap_cluster_assert_empty(ci, false);
 	swap_cluster_free_table(ci);
 	move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
@@ -747,9 +769,11 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
 				       struct swap_cluster_info *cluster_info,
 				       unsigned int offset, bool mask)
 {
+	unsigned int ci_off = offset % SWAPFILE_CLUSTER;
 	unsigned long idx = offset / SWAPFILE_CLUSTER;
-	struct swap_table *table;
 	struct swap_cluster_info *ci;
+	struct swap_table *table;
+	int ret = 0;
 
 	/* si->max may have been shrunk by swap_activate() */
 	if (offset >= si->max && !mask) {
@@ -767,13 +791,7 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
 		pr_warn("Empty swap-file\n");
 		return -EINVAL;
 	}
-	/* Check for duplicated bad swap slots. */
-	if (si->swap_map[offset]) {
-		pr_warn("Duplicated bad slot offset %d\n", offset);
-		return -EINVAL;
-	}
 
-	si->swap_map[offset] = SWAP_MAP_BAD;
 	ci = cluster_info + idx;
 	if (!ci->table) {
 		table = swap_table_alloc(GFP_KERNEL);
@@ -781,13 +799,21 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
 			return -ENOMEM;
 		rcu_assign_pointer(ci->table, table);
 	}
-
-	ci->count++;
+	spin_lock(&ci->lock);
+	/* Check for duplicated bad swap slots. */
+	if (__swap_table_xchg(ci, ci_off, SWP_TB_BAD) != SWP_TB_NULL) {
+		pr_warn("Duplicated bad slot offset %d\n", offset);
+		ret = -EINVAL;
+	} else {
+		si->swap_map[offset] = SWAP_MAP_BAD;
+		ci->count++;
+	}
+	spin_unlock(&ci->lock);
 
 	WARN_ON(ci->count > SWAPFILE_CLUSTER);
 	WARN_ON(ci->flags);
 
-	return 0;
+	return ret;
 }
 
 /*
@@ -2754,7 +2780,7 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
 		/* Cluster with bad marks count will have a remaining table */
 		spin_lock(&ci->lock);
 		if (rcu_dereference_protected(ci->table, true)) {
-			ci->count = 0;
+			swap_cluster_assert_empty(ci, true);
 			swap_cluster_free_table(ci);
 		}
 		spin_unlock(&ci->lock);

-- 
2.52.0





* [PATCH v3 08/12] mm, swap: simplify swap table sanity range check
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (6 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 07/12] mm, swap: mark bad slots in swap table directly Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  7:02   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 09/12] mm, swap: use the swap table to track the swap count Kairui Song via B4 Relay
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

The newly introduced helper, which checks for bad slots and cluster
emptiness, covers the older sanity check just fine, with a more rigorous
condition. So merge them.
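
For reference, the merged helper's call pattern after this patch, as it
appears in the hunks below:

  /* Whole cluster, e.g. when a cluster is about to be freed: */
  swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, false);

  /* A sub-range, e.g. right after freeing a batch of entries: */
  swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, nr_pages, false);

  /* At swapoff, where remaining bad slots are expected: */
  swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, true);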

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 35 +++++++++--------------------------
 1 file changed, 9 insertions(+), 26 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 18bacf16cd26..9057fb3e4eed 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -459,9 +459,11 @@ static void swap_table_free(struct swap_table *table)
  * One special case is that bad slots can't be freed, so check the number of
  * bad slots for swapoff, and non-swapoff path must never free bad slots.
  */
-static void swap_cluster_assert_empty(struct swap_cluster_info *ci, bool swapoff)
+static void swap_cluster_assert_empty(struct swap_cluster_info *ci,
+				      unsigned int ci_off, unsigned int nr,
+				      bool swapoff)
 {
-	unsigned int ci_off = 0, ci_end = SWAPFILE_CLUSTER;
+	unsigned int ci_end = ci_off + nr;
 	unsigned long swp_tb;
 	int bad_slots = 0;
 
@@ -588,7 +590,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	swap_cluster_assert_empty(ci, false);
+	swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, false);
 	swap_cluster_free_table(ci);
 	move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
@@ -898,26 +900,6 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 	return true;
 }
 
-/*
- * Currently, the swap table is not used for count tracking, just
- * do a sanity check here to ensure nothing leaked, so the swap
- * table should be empty upon freeing.
- */
-static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
-				unsigned int start, unsigned int nr)
-{
-	unsigned int ci_off = start % SWAPFILE_CLUSTER;
-	unsigned int ci_end = ci_off + nr;
-	unsigned long swp_tb;
-
-	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
-		do {
-			swp_tb = __swap_table_get(ci, ci_off);
-			VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
-		} while (++ci_off < ci_end);
-	}
-}
-
 static bool cluster_alloc_range(struct swap_info_struct *si,
 				struct swap_cluster_info *ci,
 				struct folio *folio,
@@ -943,13 +925,14 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
 	if (likely(folio)) {
 		order = folio_order(folio);
 		nr_pages = 1 << order;
+		swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, nr_pages, false);
 		__swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
 	} else if (IS_ENABLED(CONFIG_HIBERNATION)) {
 		order = 0;
 		nr_pages = 1;
 		WARN_ON_ONCE(si->swap_map[offset]);
 		si->swap_map[offset] = 1;
-		swap_cluster_assert_table_empty(ci, offset, 1);
+		swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, 1, false);
 	} else {
 		/* Allocation without folio is only possible with hibernation */
 		WARN_ON_ONCE(1);
@@ -1768,7 +1751,7 @@ void swap_entries_free(struct swap_info_struct *si,
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
-	swap_cluster_assert_table_empty(ci, offset, nr_pages);
+	swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, nr_pages, false);
 
 	if (!ci->count)
 		free_cluster(si, ci);
@@ -2780,7 +2763,7 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
 		/* Cluster with bad marks count will have a remaining table */
 		spin_lock(&ci->lock);
 		if (rcu_dereference_protected(ci->table, true)) {
-			swap_cluster_assert_empty(ci, true);
+			swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, true);
 			swap_cluster_free_table(ci);
 		}
 		spin_unlock(&ci->lock);

-- 
2.52.0





* [PATCH v3 09/12] mm, swap: use the swap table to track the swap count
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (7 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 08/12] mm, swap: simplify swap table sanity range check Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-18 10:40   ` kernel test robot
  2026-02-17 20:06 ` [PATCH v3 10/12] mm, swap: no need to truncate the scan border Kairui Song via B4 Relay
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

Now that all the infrastructure is ready, switch to using the swap table
only. This is unfortunately a large patch because the whole old counting
mechanism, especially SWP_CONTINUED, has to be removed and replaced by
the new mechanism in one go, with no intermediate steps available.

The swap table is capable of holding up to SWP_TB_COUNT_MAX - 1 counts
in the higher bits of each table entry, so using that, the swap_map can
be completely dropped.

swap_map also had a per-entry limit of SWAP_MAP_MAX. Any value beyond
that limit required a COUNT_CONTINUED page. COUNT_CONTINUED is a bit
complex to maintain, so for the swap table, a simpler approach is used:
when the count goes beyond SWP_TB_COUNT_MAX - 1, the cluster will have
an extend_table allocated, which is a swap cluster-sized array of
unsigned int. The counting is offloaded there until the count drops
below SWP_TB_COUNT_MAX again.
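
To make the overflow scheme concrete, below is a minimal userspace model
of it (illustration only: sketch_dup, sketch_put, TB_COUNT_MAX and struct
slot are made-up names, and the real helpers introduced in the diff also
handle extend_table allocation and its failure):

  #include <assert.h>

  #define TB_COUNT_MAX 7                    /* stand-in for SWP_TB_COUNT_MAX */

  struct slot {
          unsigned int tb_count;            /* count bits in the table entry */
          unsigned int extend;              /* per-slot extend_table value */
  };

  static void sketch_dup(struct slot *s)
  {
          if (s->tb_count < TB_COUNT_MAX - 1) {
                  s->tb_count++;                  /* common case */
          } else if (s->tb_count == TB_COUNT_MAX - 1) {
                  s->tb_count = TB_COUNT_MAX;     /* pin entry at the marker */
                  s->extend = TB_COUNT_MAX;       /* offload the real count */
          } else {
                  s->extend++;                    /* counting continues here */
          }
  }

  static void sketch_put(struct slot *s)
  {
          if (s->tb_count < TB_COUNT_MAX) {
                  assert(s->tb_count > 0);
                  s->tb_count--;
          } else if (--s->extend == TB_COUNT_MAX - 1) {
                  s->extend = 0;                  /* fits in the entry again */
                  s->tb_count = TB_COUNT_MAX - 1;
          }
  }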

Both the swap table and the extend table are cluster-based, so they
exhibit good performance and sparsity.

To make the switch from swap_map to swap table clean, this commit cleans
up and introduces a new set of functions based on the swap table design,
for manipulating swap counts:

- __swap_cluster_dup_entry, __swap_cluster_put_entry,
  __swap_cluster_alloc_entry, __swap_cluster_free_entry:

  Increase/decrease the count of a swap slot, or allocate / free a swap
  slot. These are the internal routines that do the counting work based
  on the swap table and handle all the complexities. Callers must hold
  the cluster lock while calling them.

  All swap count-related update operations are wrapped by these four
  helpers.

- swap_dup_entries_cluster, swap_put_entries_cluster:

  Increase/decrease the swap count of one or more swap slots within the
  same cluster. These two helpers serve as the common routines for
  folio_dup_swap & swap_dup_entry_direct, or
  folio_put_swap & swap_put_entries_direct.

Convert all existing callers to these helpers. This simplifies the count
tracking a lot, and the swap_map is gone.
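
As a rough sketch of the intended calling convention (not part of the
patch: sketch_dup_one is a made-up name, and for a single slot it is
essentially what swap_dup_entries_cluster does; the -ENOMEM retry that
allocates the extend table is omitted):

  static int sketch_dup_one(struct swap_info_struct *si, swp_entry_t entry)
  {
          struct swap_cluster_info *ci;
          int err;

          /* The *_entries_cluster wrappers take the cluster lock themselves... */
          ci = swap_cluster_lock(si, swp_offset(entry));
          /* ...while the __swap_cluster_* helpers expect it to be held. */
          err = __swap_cluster_dup_entry(ci, swp_cluster_offset(entry));
          swap_cluster_unlock(ci);
          return err;
  }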

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  28 +-
 mm/memory.c          |   2 +-
 mm/swap.h            |  14 +-
 mm/swap_state.c      |  53 ++--
 mm/swap_table.h      |   5 +
 mm/swapfile.c        | 790 +++++++++++++++++++--------------------------------
 6 files changed, 334 insertions(+), 558 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..0effe3cc50f5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -208,7 +208,6 @@ enum {
 	SWP_DISCARDABLE = (1 << 2),	/* blkdev support discard */
 	SWP_DISCARDING	= (1 << 3),	/* now discarding a free cluster */
 	SWP_SOLIDSTATE	= (1 << 4),	/* blkdev seeks are cheap */
-	SWP_CONTINUED	= (1 << 5),	/* swap_map has count continuation */
 	SWP_BLKDEV	= (1 << 6),	/* its a block device */
 	SWP_ACTIVATED	= (1 << 7),	/* set after swap_activate success */
 	SWP_FS_OPS	= (1 << 8),	/* swapfile operations go through fs */
@@ -223,16 +222,6 @@ enum {
 #define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/* Bit flag in swap_map */
-#define COUNT_CONTINUED	0x80	/* Flag swap_map continuation for full count */
-
-/* Special value in first swap_map */
-#define SWAP_MAP_MAX	0x3e	/* Max count */
-#define SWAP_MAP_BAD	0x3f	/* Note page is bad */
-
-/* Special value in each swap_map continuation */
-#define SWAP_CONT_MAX	0x7f	/* Max count */
-
 /*
  * The first page in the swap file is the swap header, which is always marked
  * bad to prevent it from being allocated as an entry. This also prevents the
@@ -264,8 +253,7 @@ struct swap_info_struct {
 	signed short	prio;		/* swap priority of this type */
 	struct plist_node list;		/* entry in swap_active_head */
 	signed char	type;		/* strange name for an index */
-	unsigned int	max;		/* extent of the swap_map */
-	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
+	unsigned int	max;		/* size of this swap device */
 	unsigned long *zeromap;		/* kvmalloc'ed bitmap to track zero pages */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
 	struct list_head free_clusters; /* free clusters list */
@@ -284,18 +272,14 @@ struct swap_info_struct {
 	struct completion comp;		/* seldom referenced */
 	spinlock_t lock;		/*
 					 * protect map scan related fields like
-					 * swap_map, inuse_pages and all cluster
-					 * lists. other fields are only changed
+					 * inuse_pages and all cluster lists.
+					 * Other fields are only changed
 					 * at swapon/swapoff, so are protected
 					 * by swap_lock. changing flags need
 					 * hold this lock and swap_lock. If
 					 * both locks need hold, hold swap_lock
 					 * first.
 					 */
-	spinlock_t cont_lock;		/*
-					 * protect swap count continuation page
-					 * list.
-					 */
 	struct work_struct discard_work; /* discard worker */
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
@@ -451,7 +435,6 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
@@ -517,11 +500,6 @@ static inline void free_swap_cache(struct folio *folio)
 {
 }
 
-static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
-{
-	return 0;
-}
-
 static inline int swap_dup_entry_direct(swp_entry_t ent)
 {
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 9ee60d87969b..81f2a3e1919a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1346,7 +1346,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 
 	if (ret == -EIO) {
 		VM_WARN_ON_ONCE(!entry.val);
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+		if (swap_retry_table_alloc(entry, GFP_KERNEL) < 0) {
 			ret = -ENOMEM;
 			goto out;
 		}
diff --git a/mm/swap.h b/mm/swap.h
index bfafa637c458..0a91e21e92b1 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -37,6 +37,7 @@ struct swap_cluster_info {
 	u8 flags;
 	u8 order;
 	atomic_long_t __rcu *table;	/* Swap table entries, see mm/swap_table.h */
+	unsigned int *extend_table;	/* For large swap count, protected by ci->lock */
 	struct list_head list;
 };
 
@@ -183,6 +184,8 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
 	spin_unlock_irq(&ci->lock);
 }
 
+extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
+
 /*
  * Below are the core routines for doing swap for a folio.
  * All helpers requires the folio to be locked, and a locked folio
@@ -206,9 +209,9 @@ int folio_dup_swap(struct folio *folio, struct page *subpage);
 void folio_put_swap(struct folio *folio, struct page *subpage);
 
 /* For internal use */
-extern void swap_entries_free(struct swap_info_struct *si,
-			      struct swap_cluster_info *ci,
-			      unsigned long offset, unsigned int nr_pages);
+extern void __swap_cluster_free_entries(struct swap_info_struct *si,
+					struct swap_cluster_info *ci,
+					unsigned int ci_off, unsigned int nr_pages);
 
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
@@ -446,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
 	return 0;
 }
 
+static inline int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
+{
+	return -EINVAL;
+}
+
 static inline bool swap_cache_has_folio(swp_entry_t entry)
 {
 	return false;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e213ee35c1d2..e7618ffe6d70 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -140,21 +140,20 @@ void *swap_cache_get_shadow(swp_entry_t entry)
 void __swap_cache_add_folio(struct swap_cluster_info *ci,
 			    struct folio *folio, swp_entry_t entry)
 {
-	unsigned long new_tb;
-	unsigned int ci_start, ci_off, ci_end;
+	unsigned int ci_off = swp_cluster_offset(entry), ci_end;
 	unsigned long nr_pages = folio_nr_pages(folio);
+	unsigned long pfn = folio_pfn(folio);
+	unsigned long old_tb;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
 
-	new_tb = folio_to_swp_tb(folio, 0);
-	ci_start = swp_cluster_offset(entry);
-	ci_off = ci_start;
-	ci_end = ci_start + nr_pages;
+	ci_end = ci_off + nr_pages;
 	do {
-		VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
-		__swap_table_set(ci, ci_off, new_tb);
+		old_tb = __swap_table_get(ci, ci_off);
+		VM_WARN_ON_ONCE(swp_tb_is_folio(old_tb));
+		__swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb)));
 	} while (++ci_off < ci_end);
 
 	folio_ref_add(folio, nr_pages);
@@ -183,14 +182,13 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 	unsigned long old_tb;
 	struct swap_info_struct *si;
 	struct swap_cluster_info *ci;
-	unsigned int ci_start, ci_off, ci_end, offset;
+	unsigned int ci_start, ci_off, ci_end;
 	unsigned long nr_pages = folio_nr_pages(folio);
 
 	si = __swap_entry_to_info(entry);
 	ci_start = swp_cluster_offset(entry);
 	ci_end = ci_start + nr_pages;
 	ci_off = ci_start;
-	offset = swp_offset(entry);
 	ci = swap_cluster_lock(si, swp_offset(entry));
 	if (unlikely(!ci->table)) {
 		err = -ENOENT;
@@ -202,13 +200,12 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 			err = -EEXIST;
 			goto failed;
 		}
-		if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
+		if (unlikely(!__swp_tb_get_count(old_tb))) {
 			err = -ENOENT;
 			goto failed;
 		}
 		if (swp_tb_is_shadow(old_tb))
 			shadow = swp_tb_to_shadow(old_tb);
-		offset++;
 	} while (++ci_off < ci_end);
 	__swap_cache_add_folio(ci, folio, entry);
 	swap_cluster_unlock(ci);
@@ -237,8 +234,9 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 			    swp_entry_t entry, void *shadow)
 {
+	int count;
+	unsigned long old_tb;
 	struct swap_info_struct *si;
-	unsigned long old_tb, new_tb;
 	unsigned int ci_start, ci_off, ci_end;
 	bool folio_swapped = false, need_free = false;
 	unsigned long nr_pages = folio_nr_pages(folio);
@@ -249,20 +247,20 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
 
 	si = __swap_entry_to_info(entry);
-	new_tb = shadow_to_swp_tb(shadow, 0);
 	ci_start = swp_cluster_offset(entry);
 	ci_end = ci_start + nr_pages;
 	ci_off = ci_start;
 	do {
-		/* If shadow is NULL, we sets an empty shadow */
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
+		old_tb = __swap_table_get(ci, ci_off);
 		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
 			     swp_tb_to_folio(old_tb) != folio);
-		if (__swap_count(swp_entry(si->type,
-				 swp_offset(entry) + ci_off - ci_start)))
+		count = __swp_tb_get_count(old_tb);
+		if (count)
 			folio_swapped = true;
 		else
 			need_free = true;
+		/* If shadow is NULL, we set an empty shadow. */
+		__swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow, count));
 	} while (++ci_off < ci_end);
 
 	folio->swap.val = 0;
@@ -271,13 +269,13 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
 
 	if (!folio_swapped) {
-		swap_entries_free(si, ci, swp_offset(entry), nr_pages);
+		__swap_cluster_free_entries(si, ci, ci_start, nr_pages);
 	} else if (need_free) {
+		ci_off = ci_start;
 		do {
-			if (!__swap_count(entry))
-				swap_entries_free(si, ci, swp_offset(entry), 1);
-			entry.val++;
-		} while (--nr_pages);
+			if (!__swp_tb_get_count(__swap_table_get(ci, ci_off)))
+				__swap_cluster_free_entries(si, ci, ci_off, 1);
+		} while (++ci_off < ci_end);
 	}
 }
 
@@ -324,17 +322,18 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
 	unsigned long nr_pages = folio_nr_pages(new);
 	unsigned int ci_off = swp_cluster_offset(entry);
 	unsigned int ci_end = ci_off + nr_pages;
-	unsigned long old_tb, new_tb;
+	unsigned long pfn = folio_pfn(new);
+	unsigned long old_tb;
 
 	VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
 	VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
 	VM_WARN_ON_ONCE(!entry.val);
 
 	/* Swap cache still stores N entries instead of a high-order entry */
-	new_tb = folio_to_swp_tb(new, 0);
 	do {
-		old_tb = __swap_table_xchg(ci, ci_off, new_tb);
+		old_tb = __swap_table_get(ci, ci_off);
 		WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
+		__swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb)));
 	} while (++ci_off < ci_end);
 
 	/*
@@ -368,7 +367,7 @@ void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
 	ci_end = ci_off + nr_ents;
 	do {
 		old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
-		WARN_ON_ONCE(swp_tb_is_folio(old));
+		WARN_ON_ONCE(swp_tb_is_folio(old) || swp_tb_get_count(old));
 	} while (++ci_off < ci_end);
 }
 
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 10762ac5f4f5..8415ffbe2b9c 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -191,6 +191,11 @@ static inline int swp_tb_get_count(unsigned long swp_tb)
 	return -EINVAL;
 }
 
+static inline unsigned long __swp_tb_mk_count(unsigned long swp_tb, int count)
+{
+	return ((swp_tb & ~SWP_TB_COUNT_MASK) | __count_to_swp_tb(count));
+}
+
 /*
  * Helpers for accessing or modifying the swap table of a cluster,
  * the swap cluster must be locked.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9057fb3e4eed..801d8092be51 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -51,15 +51,8 @@
 #include "swap_table.h"
 #include "swap.h"
 
-static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
-				 unsigned char);
-static void free_swap_count_continuations(struct swap_info_struct *);
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
-static void swap_put_entry_locked(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci,
-				  unsigned long offset);
 static bool folio_swapcache_freeable(struct folio *folio);
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
@@ -182,22 +175,19 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
 /* Reclaim the swap entry if swap is getting full */
 #define TTRS_FULL		0x4
 
-static bool swap_only_has_cache(struct swap_info_struct *si,
-				struct swap_cluster_info *ci,
+static bool swap_only_has_cache(struct swap_cluster_info *ci,
 				unsigned long offset, int nr_pages)
 {
 	unsigned int ci_off = offset % SWAPFILE_CLUSTER;
-	unsigned char *map = si->swap_map + offset;
-	unsigned char *map_end = map + nr_pages;
+	unsigned int ci_end = ci_off + nr_pages;
 	unsigned long swp_tb;
 
 	do {
 		swp_tb = __swap_table_get(ci, ci_off);
 		VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb));
-		if (*map)
+		if (swp_tb_get_count(swp_tb))
 			return false;
-		++ci_off;
-	} while (++map < map_end);
+	} while (++ci_off < ci_end);
 
 	return true;
 }
@@ -256,7 +246,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * reference or pending writeback, and can't be allocated to others.
 	 */
 	ci = swap_cluster_lock(si, offset);
-	need_reclaim = swap_only_has_cache(si, ci, offset, nr_pages);
+	need_reclaim = swap_only_has_cache(ci, offset, nr_pages);
 	swap_cluster_unlock(ci);
 	if (!need_reclaim)
 		goto out_unlock;
@@ -479,6 +469,7 @@ static void swap_cluster_assert_empty(struct swap_cluster_info *ci,
 	} while (++ci_off < ci_end);
 
 	WARN_ON_ONCE(bad_slots != (swapoff ? ci->count : 0));
+	WARN_ON_ONCE(nr == SWAPFILE_CLUSTER && ci->extend_table);
 }
 
 static void swap_cluster_free_table(struct swap_cluster_info *ci)
@@ -807,7 +798,6 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
 		pr_warn("Duplicated bad slot offset %d\n", offset);
 		ret = -EINVAL;
 	} else {
-		si->swap_map[offset] = SWAP_MAP_BAD;
 		ci->count++;
 	}
 	spin_unlock(&ci->lock);
@@ -829,18 +819,16 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 {
 	unsigned int nr_pages = 1 << order;
 	unsigned long offset = start, end = start + nr_pages;
-	unsigned char *map = si->swap_map;
 	unsigned long swp_tb;
 
 	spin_unlock(&ci->lock);
 	do {
-		if (READ_ONCE(map[offset]))
-			break;
 		swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
-		if (swp_tb_is_folio(swp_tb)) {
+		if (swp_tb_get_count(swp_tb))
+			break;
+		if (swp_tb_is_folio(swp_tb))
 			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0)
 				break;
-		}
 	} while (++offset < end);
 	spin_lock(&ci->lock);
 
@@ -864,7 +852,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 	 */
 	for (offset = start; offset < end; offset++) {
 		swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
-		if (map[offset] || !swp_tb_is_null(swp_tb))
+		if (!swp_tb_is_null(swp_tb))
 			return false;
 	}
 
@@ -876,37 +864,35 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 			       unsigned long offset, unsigned int nr_pages,
 			       bool *need_reclaim)
 {
-	unsigned long end = offset + nr_pages;
-	unsigned char *map = si->swap_map;
+	unsigned int ci_off = offset % SWAPFILE_CLUSTER;
+	unsigned int ci_end = ci_off + nr_pages;
 	unsigned long swp_tb;
 
-	if (cluster_is_empty(ci))
-		return true;
-
 	do {
-		if (map[offset])
-			return false;
-		swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
-		if (swp_tb_is_folio(swp_tb)) {
+		swp_tb = __swap_table_get(ci, ci_off);
+		if (swp_tb_is_null(swp_tb))
+			continue;
+		if (swp_tb_is_folio(swp_tb) && !__swp_tb_get_count(swp_tb)) {
 			if (!vm_swap_full())
 				return false;
 			*need_reclaim = true;
-		} else {
-			/* A entry with no count and no cache must be null */
-			VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
+			continue;
 		}
-	} while (++offset < end);
+		/* Slot with zero count can only be NULL or folio */
+		VM_WARN_ON(!swp_tb_get_count(swp_tb));
+		return false;
+	} while (++ci_off < ci_end);
 
 	return true;
 }
 
-static bool cluster_alloc_range(struct swap_info_struct *si,
-				struct swap_cluster_info *ci,
-				struct folio *folio,
-				unsigned int offset)
+static bool __swap_cluster_alloc_entries(struct swap_info_struct *si,
+					 struct swap_cluster_info *ci,
+					 struct folio *folio,
+					 unsigned int ci_off)
 {
-	unsigned long nr_pages;
 	unsigned int order;
+	unsigned long nr_pages;
 
 	lockdep_assert_held(&ci->lock);
 
@@ -925,14 +911,15 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
 	if (likely(folio)) {
 		order = folio_order(folio);
 		nr_pages = 1 << order;
-		swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, nr_pages, false);
-		__swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
+		swap_cluster_assert_empty(ci, ci_off, nr_pages, false);
+		__swap_cache_add_folio(ci, folio, swp_entry(si->type,
+							    ci_off + cluster_offset(si, ci)));
 	} else if (IS_ENABLED(CONFIG_HIBERNATION)) {
 		order = 0;
 		nr_pages = 1;
-		WARN_ON_ONCE(si->swap_map[offset]);
-		si->swap_map[offset] = 1;
-		swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, 1, false);
+		swap_cluster_assert_empty(ci, ci_off, 1, false);
+		/* Sets a fake shadow as placeholder */
+		__swap_table_set(ci, ci_off, shadow_to_swp_tb(NULL, 1));
 	} else {
 		/* Allocation without folio is only possible with hibernation */
 		WARN_ON_ONCE(1);
@@ -983,7 +970,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 			if (!ret)
 				continue;
 		}
-		if (!cluster_alloc_range(si, ci, folio, offset))
+		if (!__swap_cluster_alloc_entries(si, ci, folio, offset % SWAPFILE_CLUSTER))
 			break;
 		found = offset;
 		offset += nr_pages;
@@ -1030,7 +1017,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	long to_scan = 1;
 	unsigned long offset, end;
 	struct swap_cluster_info *ci;
-	unsigned char *map = si->swap_map;
+	unsigned long swp_tb;
 	int nr_reclaim;
 
 	if (force)
@@ -1042,8 +1029,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 		to_scan--;
 
 		while (offset < end) {
-			if (!READ_ONCE(map[offset]) &&
-			    swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
+			swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+			if (swp_tb_is_folio(swp_tb) && !__swp_tb_get_count(swp_tb)) {
 				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY);
@@ -1452,40 +1439,127 @@ static bool swap_sync_discard(void)
 	return false;
 }
 
+static int swap_extend_table_alloc(struct swap_info_struct *si,
+				   struct swap_cluster_info *ci, gfp_t gfp)
+{
+	void *table;
+
+	table = kzalloc(sizeof(ci->extend_table[0]) * SWAPFILE_CLUSTER, gfp);
+	if (!table)
+		return -ENOMEM;
+
+	spin_lock(&ci->lock);
+	if (!ci->extend_table)
+		ci->extend_table = table;
+	else
+		kfree(table);
+	spin_unlock(&ci->lock);
+	return 0;
+}
+
+int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
+{
+	int ret;
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+
+	si = get_swap_device(entry);
+	if (!si)
+		return 0;
+
+	ci = __swap_offset_to_cluster(si, offset);
+	ret = swap_extend_table_alloc(si, ci, gfp);
+
+	put_swap_device(si);
+	return ret;
+}
+
+static void swap_extend_table_try_free(struct swap_cluster_info *ci)
+{
+	unsigned long i;
+	bool can_free = true;
+
+	if (!ci->extend_table)
+		return;
+
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		if (ci->extend_table[i])
+			can_free = false;
+	}
+
+	if (can_free) {
+		kfree(ci->extend_table);
+		ci->extend_table = NULL;
+	}
+}
+
+/* Decrease the swap count of one slot, without freeing it */
+static void __swap_cluster_put_entry(struct swap_cluster_info *ci,
+				    unsigned int ci_off)
+{
+	int count;
+	unsigned long swp_tb;
+
+	lockdep_assert_held(&ci->lock);
+	swp_tb = __swap_table_get(ci, ci_off);
+	count = __swp_tb_get_count(swp_tb);
+
+	VM_WARN_ON_ONCE(count <= 0);
+	VM_WARN_ON_ONCE(count > SWP_TB_COUNT_MAX);
+
+	if (count == SWP_TB_COUNT_MAX) {
+		count = ci->extend_table[ci_off];
+		/* Overflow starts with SWP_TB_COUNT_MAX */
+		VM_WARN_ON_ONCE(count < SWP_TB_COUNT_MAX);
+		count--;
+		if (count == (SWP_TB_COUNT_MAX - 1)) {
+			ci->extend_table[ci_off] = 0;
+			__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, count));
+			swap_extend_table_try_free(ci);
+		} else {
+			ci->extend_table[ci_off] = count;
+		}
+	} else {
+		__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, --count));
+	}
+}
+
 /**
- * swap_put_entries_cluster - Decrease the swap count of a set of slots.
+ * swap_put_entries_cluster - Decrease the swap count of slots within one cluster
  * @si: The swap device.
- * @start: start offset of slots.
+ * @offset: start offset of slots.
  * @nr: number of slots.
- * @reclaim_cache: if true, also reclaim the swap cache.
+ * @reclaim_cache: if true, also reclaim the swap cache if slots are freed.
  *
  * This helper decreases the swap count of a set of slots and tries to
  * batch free them. Also reclaims the swap cache if @reclaim_cache is true.
- * Context: The caller must ensure that all slots belong to the same
- * cluster and their swap count doesn't go underflow.
+ *
+ * Context: The specified slots must be pinned by existing swap count or swap
+ * cache reference, so they won't be released until this helper returns.
  */
 static void swap_put_entries_cluster(struct swap_info_struct *si,
-				     unsigned long start, int nr,
+				     pgoff_t offset, int nr,
 				     bool reclaim_cache)
 {
-	unsigned long offset = start, end = start + nr;
-	unsigned long batch_start = SWAP_ENTRY_INVALID;
 	struct swap_cluster_info *ci;
+	unsigned int ci_off, ci_end;
+	pgoff_t end = offset + nr;
 	bool need_reclaim = false;
 	unsigned int nr_reclaimed;
 	unsigned long swp_tb;
-	unsigned int count;
+	int ci_batch = -1;
 
 	ci = swap_cluster_lock(si, offset);
+	ci_off = offset % SWAPFILE_CLUSTER;
+	ci_end = ci_off + nr;
 	do {
-		swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
-		count = si->swap_map[offset];
-		VM_WARN_ON(count < 1 || count == SWAP_MAP_BAD);
-		if (count == 1) {
+		swp_tb = __swap_table_get(ci, ci_off);
+		if (swp_tb_get_count(swp_tb) == 1) {
 			/* count == 1 and non-cached slots will be batch freed. */
 			if (!swp_tb_is_folio(swp_tb)) {
-				if (!batch_start)
-					batch_start = offset;
+				if (ci_batch == -1)
+					ci_batch = ci_off;
 				continue;
 			}
 			/* count will be 0 after put, slot can be reclaimed */
@@ -1497,21 +1571,20 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
 		 * slots will be freed when folio is removed from swap cache
 		 * (__swap_cache_del_folio).
 		 */
-		swap_put_entry_locked(si, ci, offset);
-		if (batch_start) {
-			swap_entries_free(si, ci, batch_start, offset - batch_start);
-			batch_start = SWAP_ENTRY_INVALID;
+		__swap_cluster_put_entry(ci, ci_off);
+		if (ci_batch != -1) {
+			__swap_cluster_free_entries(si, ci, ci_batch, ci_off - ci_batch);
+			ci_batch = -1;
 		}
-	} while (++offset < end);
+	} while (++ci_off < ci_end);
 
-	if (batch_start)
-		swap_entries_free(si, ci, batch_start, offset - batch_start);
+	if (ci_batch != -1)
+		__swap_cluster_free_entries(si, ci, ci_batch, ci_off - ci_batch);
 	swap_cluster_unlock(ci);
 
 	if (!need_reclaim || !reclaim_cache)
 		return;
 
-	offset = start;
 	do {
 		nr_reclaimed = __try_to_reclaim_swap(si, offset,
 						     TTRS_UNMAPPED | TTRS_FULL);
@@ -1521,6 +1594,92 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
 	} while (offset < end);
 }
 
+/* Increase the swap count of one slot. */
+static int __swap_cluster_dup_entry(struct swap_cluster_info *ci,
+				    unsigned int ci_off)
+{
+	int count;
+	unsigned long swp_tb;
+
+	lockdep_assert_held(&ci->lock);
+	swp_tb = __swap_table_get(ci, ci_off);
+	/* Bad or special slots can't be handled */
+	if (WARN_ON_ONCE(swp_tb_is_bad(swp_tb)))
+		return -EINVAL;
+	count = __swp_tb_get_count(swp_tb);
+	/* Must be either cached or have a count already */
+	if (WARN_ON_ONCE(!count && !swp_tb_is_folio(swp_tb)))
+		return -ENOENT;
+
+	if (likely(count < (SWP_TB_COUNT_MAX - 1))) {
+		__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, count + 1));
+		VM_WARN_ON_ONCE(ci->extend_table && ci->extend_table[ci_off]);
+	} else if (count == (SWP_TB_COUNT_MAX - 1)) {
+		if (ci->extend_table) {
+			VM_WARN_ON_ONCE(ci->extend_table[ci_off]);
+			ci->extend_table[ci_off] = SWP_TB_COUNT_MAX;
+			__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, SWP_TB_COUNT_MAX));
+		} else {
+			return -ENOMEM;
+		}
+	} else if (count == SWP_TB_COUNT_MAX) {
+		VM_WARN_ON_ONCE(ci->extend_table[ci_off] >=
+				(BIT(BITS_PER_TYPE(ci->extend_table[0]))) - 1);
+		++ci->extend_table[ci_off];
+	} else {
+		/* Never happens unless counting went wrong */
+		WARN_ON_ONCE(1);
+	}
+
+	return 0;
+}
+
+/**
+ * swap_dup_entries_cluster - Increase the swap count of slots within one cluster.
+ * @si: The swap device.
+ * @offset: start offset of slots.
+ * @nr: number of slots.
+ *
+ * Context: The specified slots must be pinned by existing swap count or swap
+ * cache reference, so they won't be released until this helper returns.
+ * Return: 0 on success, -ENOMEM if the swap count is maxed out (SWP_TB_COUNT_MAX)
+ * and the extend table could not be allocated, -EINVAL if any entry is bad.
+ */
+static int swap_dup_entries_cluster(struct swap_info_struct *si,
+				    pgoff_t offset, int nr)
+{
+	int err;
+	struct swap_cluster_info *ci;
+	unsigned int ci_start, ci_off, ci_end;
+
+	ci_start = offset % SWAPFILE_CLUSTER;
+	ci_end = ci_start + nr;
+	ci_off = ci_start;
+	ci = swap_cluster_lock(si, offset);
+restart:
+	do {
+		err = __swap_cluster_dup_entry(ci, ci_off);
+		if (unlikely(err)) {
+			if (err == -ENOMEM) {
+				spin_unlock(&ci->lock);
+				err = swap_extend_table_alloc(si, ci, GFP_ATOMIC);
+				spin_lock(&ci->lock);
+				if (!err)
+					goto restart;
+			}
+			goto failed;
+		}
+	} while (++ci_off < ci_end);
+	swap_cluster_unlock(ci);
+	return 0;
+failed:
+	while (ci_off-- > ci_start)
+		__swap_cluster_put_entry(ci, ci_off);
+	swap_extend_table_try_free(ci);
+	swap_cluster_unlock(ci);
+	return err;
+}
+
 /**
  * folio_alloc_swap - allocate swap space for a folio
  * @folio: folio we want to move to swap
@@ -1589,13 +1748,10 @@ int folio_alloc_swap(struct folio *folio)
  * Context: Caller must ensure the folio is locked and in the swap cache.
  * NOTE: The caller also has to ensure there is no raced call to
  * swap_put_entries_direct on its swap entry before this helper returns, or
- * the swap map may underflow. Currently, we only accept @subpage == NULL
- * for shmem due to the limitation of swap continuation: shmem always
- * duplicates the swap entry only once, so there is no such issue for it.
+ * the swap count may underflow.
  */
 int folio_dup_swap(struct folio *folio, struct page *subpage)
 {
-	int err = 0;
 	swp_entry_t entry = folio->swap;
 	unsigned long nr_pages = folio_nr_pages(folio);
 
@@ -1607,10 +1763,8 @@ int folio_dup_swap(struct folio *folio, struct page *subpage)
 		nr_pages = 1;
 	}
 
-	while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
-		err = add_swap_count_continuation(entry, GFP_ATOMIC);
-
-	return err;
+	return swap_dup_entries_cluster(swap_entry_to_info(entry),
+					swp_offset(entry), nr_pages);
 }
 
 /**
@@ -1639,28 +1793,6 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
 	swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
 }
 
-static void swap_put_entry_locked(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci,
-				  unsigned long offset)
-{
-	unsigned char count;
-
-	count = si->swap_map[offset];
-	if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
-		if (count == COUNT_CONTINUED) {
-			if (swap_count_continued(si, offset, count))
-				count = SWAP_MAP_MAX | COUNT_CONTINUED;
-			else
-				count = SWAP_MAP_MAX;
-		} else
-			count--;
-	}
-
-	WRITE_ONCE(si->swap_map[offset], count);
-	if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)))
-		swap_entries_free(si, ci, offset, 1);
-}
-
 /*
  * When we get a swap entry, if there aren't some other ways to
  * prevent swapoff, such as the folio in swap cache is locked, RCU
@@ -1727,31 +1859,30 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 }
 
 /*
- * Drop the last ref of swap entries, caller have to ensure all entries
- * belong to the same cgroup and cluster.
+ * Free a set of swap slots whose swap count has dropped to zero, or will be
+ * zero after putting the last ref (saves one __swap_cluster_put_entry call).
  */
-void swap_entries_free(struct swap_info_struct *si,
-		       struct swap_cluster_info *ci,
-		       unsigned long offset, unsigned int nr_pages)
+void __swap_cluster_free_entries(struct swap_info_struct *si,
+				 struct swap_cluster_info *ci,
+				 unsigned int ci_start, unsigned int nr_pages)
 {
-	swp_entry_t entry = swp_entry(si->type, offset);
-	unsigned char *map = si->swap_map + offset;
-	unsigned char *map_end = map + nr_pages;
+	unsigned long old_tb;
+	unsigned int ci_off = ci_start, ci_end = ci_start + nr_pages;
+	unsigned long offset = cluster_offset(si, ci) + ci_start;
 
-	/* It should never free entries across different clusters */
-	VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
-	VM_BUG_ON(cluster_is_empty(ci));
-	VM_BUG_ON(ci->count < nr_pages);
+	VM_WARN_ON(ci->count < nr_pages);
 
 	ci->count -= nr_pages;
 	do {
-		VM_WARN_ON(*map > 1);
-		*map = 0;
-	} while (++map < map_end);
+		old_tb = __swap_table_get(ci, ci_off);
+		/* Release the last ref, or after swap cache is dropped */
+		VM_WARN_ON(!swp_tb_is_shadow(old_tb) || __swp_tb_get_count(old_tb) > 1);
+		__swap_table_set(ci, ci_off, null_to_swp_tb());
+	} while (++ci_off < ci_end);
 
-	mem_cgroup_uncharge_swap(entry, nr_pages);
+	mem_cgroup_uncharge_swap(swp_entry(si->type, offset), nr_pages);
 	swap_range_free(si, offset, nr_pages);
-	swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, nr_pages, false);
+	swap_cluster_assert_empty(ci, ci_start, nr_pages, false);
 
 	if (!ci->count)
 		free_cluster(si, ci);
@@ -1761,10 +1892,10 @@ void swap_entries_free(struct swap_info_struct *si,
 
 int __swap_count(swp_entry_t entry)
 {
-	struct swap_info_struct *si = __swap_entry_to_info(entry);
-	pgoff_t offset = swp_offset(entry);
+	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
+	unsigned int ci_off = swp_cluster_offset(entry);
 
-	return si->swap_map[offset];
+	return swp_tb_get_count(__swap_table_get(ci, ci_off));
 }
 
 /**
@@ -1776,81 +1907,62 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
 {
 	pgoff_t offset = swp_offset(entry);
 	struct swap_cluster_info *ci;
-	int count;
+	unsigned long swp_tb;
 
 	ci = swap_cluster_lock(si, offset);
-	count = si->swap_map[offset];
+	swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
 	swap_cluster_unlock(ci);
 
-	return count && count != SWAP_MAP_BAD;
+	return swp_tb_get_count(swp_tb) > 0;
 }
 
 /*
  * How many references to @entry are currently swapped out?
- * This considers COUNT_CONTINUED so it returns exact answer.
+ * This returns the exact answer.
  */
 int swp_swapcount(swp_entry_t entry)
 {
-	int count, tmp_count, n;
 	struct swap_info_struct *si;
 	struct swap_cluster_info *ci;
-	struct page *page;
-	pgoff_t offset;
-	unsigned char *map;
+	unsigned long swp_tb;
+	int count;
 
 	si = get_swap_device(entry);
 	if (!si)
 		return 0;
 
-	offset = swp_offset(entry);
-
-	ci = swap_cluster_lock(si, offset);
-
-	count = si->swap_map[offset];
-	if (!(count & COUNT_CONTINUED))
-		goto out;
-
-	count &= ~COUNT_CONTINUED;
-	n = SWAP_MAP_MAX + 1;
-
-	page = vmalloc_to_page(si->swap_map + offset);
-	offset &= ~PAGE_MASK;
-	VM_BUG_ON(page_private(page) != SWP_CONTINUED);
-
-	do {
-		page = list_next_entry(page, lru);
-		map = kmap_local_page(page);
-		tmp_count = map[offset];
-		kunmap_local(map);
-
-		count += (tmp_count & ~COUNT_CONTINUED) * n;
-		n *= (SWAP_CONT_MAX + 1);
-	} while (tmp_count & COUNT_CONTINUED);
-out:
+	ci = swap_cluster_lock(si, swp_offset(entry));
+	swp_tb = __swap_table_get(ci, swp_cluster_offset(entry));
+	count = swp_tb_get_count(swp_tb);
+	if (count == SWP_TB_COUNT_MAX)
+		count = ci->extend_table[swp_cluster_offset(entry)];
 	swap_cluster_unlock(ci);
 	put_swap_device(si);
-	return count;
+
+	return count < 0 ? 0 : count;
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 					 swp_entry_t entry, int order)
 {
 	struct swap_cluster_info *ci;
-	unsigned char *map = si->swap_map;
 	unsigned int nr_pages = 1 << order;
 	unsigned long roffset = swp_offset(entry);
 	unsigned long offset = round_down(roffset, nr_pages);
+	unsigned int ci_off;
 	int i;
 	bool ret = false;
 
 	ci = swap_cluster_lock(si, offset);
 	if (nr_pages == 1) {
-		if (map[roffset])
+		ci_off = roffset % SWAPFILE_CLUSTER;
+		if (swp_tb_get_count(__swap_table_get(ci, ci_off)))
 			ret = true;
 		goto unlock_out;
 	}
 	for (i = 0; i < nr_pages; i++) {
-		if (map[offset + i]) {
+		ci_off = (offset + i) % SWAPFILE_CLUSTER;
+		if (swp_tb_get_count(__swap_table_get(ci, ci_off))) {
 			ret = true;
 			break;
 		}
@@ -2016,7 +2128,8 @@ void swap_free_hibernation_slot(swp_entry_t entry)
 		return;
 
 	ci = swap_cluster_lock(si, offset);
-	swap_put_entry_locked(si, ci, offset);
+	__swap_cluster_put_entry(ci, offset % SWAPFILE_CLUSTER);
+	__swap_cluster_free_entries(si, ci, offset % SWAPFILE_CLUSTER, 1);
 	swap_cluster_unlock(ci);
 
 	/* In theory readahead might add it to the swap cache by accident */
@@ -2242,13 +2355,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned int type)
 {
 	pte_t *pte = NULL;
-	struct swap_info_struct *si;
 
-	si = swap_info[type];
 	do {
 		struct folio *folio;
-		unsigned long offset;
-		unsigned char swp_count;
+		unsigned long swp_tb;
 		softleaf_t entry;
 		int ret;
 		pte_t ptent;
@@ -2267,7 +2377,6 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		if (swp_type(entry) != type)
 			continue;
 
-		offset = swp_offset(entry);
 		pte_unmap(pte);
 		pte = NULL;
 
@@ -2284,8 +2393,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 						&vmf);
 		}
 		if (!folio) {
-			swp_count = READ_ONCE(si->swap_map[offset]);
-			if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
+			swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+						swp_cluster_offset(entry));
+			if (swp_tb_get_count(swp_tb) <= 0)
 				continue;
 			return -ENOMEM;
 		}
@@ -2413,7 +2523,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
 }
 
 /*
- * Scan swap_map from current position to next entry still in use.
+ * Scan swap table from current position to next entry still in use.
  * Return 0 if there are no inuse entries after prev till end of
  * the map.
  */
@@ -2422,7 +2532,6 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 {
 	unsigned int i;
 	unsigned long swp_tb;
-	unsigned char count;
 
 	/*
 	 * No need for swap_lock here: we're just looking
@@ -2431,12 +2540,9 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 	 * allocations from this area (while holding swap_lock).
 	 */
 	for (i = prev + 1; i < si->max; i++) {
-		count = READ_ONCE(si->swap_map[i]);
 		swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
 					i % SWAPFILE_CLUSTER);
-		if (count == SWAP_MAP_BAD)
-			continue;
-		if (count || swp_tb_is_folio(swp_tb))
+		if (!swp_tb_is_null(swp_tb) && !swp_tb_is_bad(swp_tb))
 			break;
 		if ((i % LATENCY_LIMIT) == 0)
 			cond_resched();
@@ -2796,7 +2902,6 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
-	unsigned char *swap_map;
 	unsigned long *zeromap;
 	struct swap_cluster_info *cluster_info;
 	struct file *swap_file, *victim;
@@ -2879,8 +2984,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	flush_percpu_swap_cluster(p);
 
 	destroy_swap_extents(p, p->swap_file);
-	if (p->flags & SWP_CONTINUED)
-		free_swap_count_continuations(p);
 
 	if (!(p->flags & SWP_SOLIDSTATE))
 		atomic_dec(&nr_rotate_swap);
@@ -2892,8 +2995,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
-	swap_map = p->swap_map;
-	p->swap_map = NULL;
 	zeromap = p->zeromap;
 	p->zeromap = NULL;
 	maxpages = p->max;
@@ -2907,7 +3008,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
-	vfree(swap_map);
 	kvfree(zeromap);
 	free_swap_cluster_info(cluster_info, maxpages);
 	/* Destroy swap account information */
@@ -3129,7 +3229,6 @@ static struct swap_info_struct *alloc_swap_info(void)
 		kvfree(defer);
 	}
 	spin_lock_init(&p->lock);
-	spin_lock_init(&p->cont_lock);
 	atomic_long_set(&p->inuse_pages, SWAP_USAGE_OFFLIST_BIT);
 	init_completion(&p->comp);
 
@@ -3256,19 +3355,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 	return maxpages;
 }
 
-static int setup_swap_map(struct swap_info_struct *si,
-			  union swap_header *swap_header,
-			  unsigned long maxpages)
-{
-	unsigned char *swap_map;
-
-	swap_map = vzalloc(maxpages);
-	si->swap_map = swap_map;
-	if (!swap_map)
-		return -ENOMEM;
-	return 0;
-}
-
 static int setup_swap_clusters_info(struct swap_info_struct *si,
 				    union swap_header *swap_header,
 				    unsigned long maxpages)
@@ -3460,11 +3546,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	maxpages = si->max;
 
-	/* Setup the swap map and apply bad block */
-	error = setup_swap_map(si, swap_header, maxpages);
-	if (error)
-		goto bad_swap_unlock_inode;
-
 	/* Set up the swap cluster info */
 	error = setup_swap_clusters_info(si, swap_header, maxpages);
 	if (error)
@@ -3585,8 +3666,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	inode = NULL;
 	destroy_swap_extents(si, swap_file);
 	swap_cgroup_swapoff(si->type);
-	vfree(si->swap_map);
-	si->swap_map = NULL;
 	free_swap_cluster_info(si->cluster_info, si->max);
 	si->cluster_info = NULL;
 	kvfree(si->zeromap);
@@ -3629,322 +3708,29 @@ void si_swapinfo(struct sysinfo *val)
 	spin_unlock(&swap_lock);
 }
 
-/*
- * Verify that nr swap entries are valid and increment their swap map counts.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swap-mapped reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int swap_dup_entries(struct swap_info_struct *si,
-			    struct swap_cluster_info *ci,
-			    unsigned long offset,
-			    unsigned char usage, int nr)
-{
-	int i;
-	unsigned char count;
-
-	for (i = 0; i < nr; i++) {
-		count = si->swap_map[offset + i];
-		/*
-		 * For swapin out, allocator never allocates bad slots. for
-		 * swapin, readahead is guarded by swap_entry_swapped.
-		 */
-		if (WARN_ON(count == SWAP_MAP_BAD))
-			return -ENOENT;
-		/*
-		 * Swap count duplication must be guarded by either swap cache folio (from
-		 * folio_dup_swap) or external lock of existing entry (from swap_dup_entry_direct).
-		 */
-		if (WARN_ON(!count &&
-			    !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))))
-			return -ENOENT;
-		if (WARN_ON((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX))
-			return -EINVAL;
-	}
-
-	for (i = 0; i < nr; i++) {
-		count = si->swap_map[offset + i];
-		if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
-			count += usage;
-		else if (swap_count_continued(si, offset + i, count))
-			count = COUNT_CONTINUED;
-		else {
-			/*
-			 * Don't need to rollback changes, because if
-			 * usage == 1, there must be nr == 1.
-			 */
-			return -ENOMEM;
-		}
-
-		WRITE_ONCE(si->swap_map[offset + i], count);
-	}
-
-	return 0;
-}
-
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
-{
-	int err;
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = swap_entry_to_info(entry);
-	if (WARN_ON_ONCE(!si)) {
-		pr_err("%s%08lx\n", Bad_file, entry.val);
-		return -EINVAL;
-	}
-
-	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-	ci = swap_cluster_lock(si, offset);
-	err = swap_dup_entries(si, ci, offset, usage, nr);
-	swap_cluster_unlock(ci);
-	return err;
-}
-
 /*
  * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
  * @entry: first swap entry from which we want to increase the refcount.
  *
- * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
- * but could not be atomically allocated.  Returns 0, just as if it succeeded,
- * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
- * might occur if a page table entry has got corrupted.
+ * Returns 0 for success, or -ENOMEM if the extend table is required
+ * but could not be atomically allocated.  Returns -EINVAL if the swap
+ * entry is invalid, which might occur if a page table entry has got
+ * corrupted.
  *
  * Context: Caller must ensure there is no race condition on the reference
  * owner. e.g., locking the PTL of a PTE containing the entry being increased.
  */
 int swap_dup_entry_direct(swp_entry_t entry)
-{
-	int err = 0;
-	while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
-		err = add_swap_count_continuation(entry, GFP_ATOMIC);
-	return err;
-}
-
-/*
- * add_swap_count_continuation - called when a swap count is duplicated
- * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
- * page of the original vmalloc'ed swap_map, to hold the continuation count
- * (for that entry and for its neighbouring PAGE_SIZE swap entries).  Called
- * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
- *
- * These continuation pages are seldom referenced: the common paths all work
- * on the original swap_map, only referring to a continuation page when the
- * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
- *
- * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
- * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
- * can be called after dropping locks.
- */
-int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 {
 	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	struct page *head;
-	struct page *page;
-	struct page *list_page;
-	pgoff_t offset;
-	unsigned char count;
-	int ret = 0;
-
-	/*
-	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
-	 * for latency not to zero a page while GFP_ATOMIC and holding locks.
-	 */
-	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
-
-	si = get_swap_device(entry);
-	if (!si) {
-		/*
-		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap device may be swapoff
-		 */
-		goto outer;
-	}
-
-	offset = swp_offset(entry);
-
-	ci = swap_cluster_lock(si, offset);
-
-	count = si->swap_map[offset];
-
-	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
-		/*
-		 * The higher the swap count, the more likely it is that tasks
-		 * will race to add swap count continuation: we need to avoid
-		 * over-provisioning.
-		 */
-		goto out;
-	}
-
-	if (!page) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	head = vmalloc_to_page(si->swap_map + offset);
-	offset &= ~PAGE_MASK;
-
-	spin_lock(&si->cont_lock);
-	/*
-	 * Page allocation does not initialize the page's lru field,
-	 * but it does always reset its private field.
-	 */
-	if (!page_private(head)) {
-		BUG_ON(count & COUNT_CONTINUED);
-		INIT_LIST_HEAD(&head->lru);
-		set_page_private(head, SWP_CONTINUED);
-		si->flags |= SWP_CONTINUED;
-	}
-
-	list_for_each_entry(list_page, &head->lru, lru) {
-		unsigned char *map;
-
-		/*
-		 * If the previous map said no continuation, but we've found
-		 * a continuation page, free our allocation and use this one.
-		 */
-		if (!(count & COUNT_CONTINUED))
-			goto out_unlock_cont;
-
-		map = kmap_local_page(list_page) + offset;
-		count = *map;
-		kunmap_local(map);
-
-		/*
-		 * If this continuation count now has some space in it,
-		 * free our allocation and use this one.
-		 */
-		if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
-			goto out_unlock_cont;
-	}
 
-	list_add_tail(&page->lru, &head->lru);
-	page = NULL;			/* now it's attached, don't free it */
-out_unlock_cont:
-	spin_unlock(&si->cont_lock);
-out:
-	swap_cluster_unlock(ci);
-	put_swap_device(si);
-outer:
-	if (page)
-		__free_page(page);
-	return ret;
-}
-
-/*
- * swap_count_continued - when the original swap_map count is incremented
- * from SWAP_MAP_MAX, check if there is already a continuation page to carry
- * into, carry if so, or else fail until a new continuation page is allocated;
- * when the original swap_map count is decremented from 0 with continuation,
- * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or caller of swap_put_entry_locked()
- * holds cluster lock.
- */
-static bool swap_count_continued(struct swap_info_struct *si,
-				 pgoff_t offset, unsigned char count)
-{
-	struct page *head;
-	struct page *page;
-	unsigned char *map;
-	bool ret;
-
-	head = vmalloc_to_page(si->swap_map + offset);
-	if (page_private(head) != SWP_CONTINUED) {
-		BUG_ON(count & COUNT_CONTINUED);
-		return false;		/* need to add count continuation */
-	}
-
-	spin_lock(&si->cont_lock);
-	offset &= ~PAGE_MASK;
-	page = list_next_entry(head, lru);
-	map = kmap_local_page(page) + offset;
-
-	if (count == SWAP_MAP_MAX)	/* initial increment from swap_map */
-		goto init_map;		/* jump over SWAP_CONT_MAX checks */
-
-	if (count == (SWAP_MAP_MAX | COUNT_CONTINUED)) { /* incrementing */
-		/*
-		 * Think of how you add 1 to 999
-		 */
-		while (*map == (SWAP_CONT_MAX | COUNT_CONTINUED)) {
-			kunmap_local(map);
-			page = list_next_entry(page, lru);
-			BUG_ON(page == head);
-			map = kmap_local_page(page) + offset;
-		}
-		if (*map == SWAP_CONT_MAX) {
-			kunmap_local(map);
-			page = list_next_entry(page, lru);
-			if (page == head) {
-				ret = false;	/* add count continuation */
-				goto out;
-			}
-			map = kmap_local_page(page) + offset;
-init_map:		*map = 0;		/* we didn't zero the page */
-		}
-		*map += 1;
-		kunmap_local(map);
-		while ((page = list_prev_entry(page, lru)) != head) {
-			map = kmap_local_page(page) + offset;
-			*map = COUNT_CONTINUED;
-			kunmap_local(map);
-		}
-		ret = true;			/* incremented */
-
-	} else {				/* decrementing */
-		/*
-		 * Think of how you subtract 1 from 1000
-		 */
-		BUG_ON(count != COUNT_CONTINUED);
-		while (*map == COUNT_CONTINUED) {
-			kunmap_local(map);
-			page = list_next_entry(page, lru);
-			BUG_ON(page == head);
-			map = kmap_local_page(page) + offset;
-		}
-		BUG_ON(*map == 0);
-		*map -= 1;
-		if (*map == 0)
-			count = 0;
-		kunmap_local(map);
-		while ((page = list_prev_entry(page, lru)) != head) {
-			map = kmap_local_page(page) + offset;
-			*map = SWAP_CONT_MAX | count;
-			count = COUNT_CONTINUED;
-			kunmap_local(map);
-		}
-		ret = count == COUNT_CONTINUED;
+	si = swap_entry_to_info(entry);
+	if (WARN_ON_ONCE(!si)) {
+		pr_err("%s%08lx\n", Bad_file, entry.val);
+		return -EINVAL;
 	}
-out:
-	spin_unlock(&si->cont_lock);
-	return ret;
-}
 
-/*
- * free_swap_count_continuations - swapoff free all the continuation pages
- * appended to the swap_map, after swap_map is quiesced, before vfree'ing it.
- */
-static void free_swap_count_continuations(struct swap_info_struct *si)
-{
-	pgoff_t offset;
-
-	for (offset = 0; offset < si->max; offset += PAGE_SIZE) {
-		struct page *head;
-		head = vmalloc_to_page(si->swap_map + offset);
-		if (page_private(head)) {
-			struct page *page, *next;
-
-			list_for_each_entry_safe(page, next, &head->lru, lru) {
-				list_del(&page->lru);
-				__free_page(page);
-			}
-		}
-	}
+	return swap_dup_entries_cluster(si, swp_offset(entry), 1);
 }
 
 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)

-- 
2.52.0





* [PATCH v3 10/12] mm, swap: no need to truncate the scan border
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (8 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 09/12] mm, swap: use the swap table to track the swap count Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  7:10   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 11/12] mm, swap: simplify checking if a folio is swapped Kairui Song via B4 Relay
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

swap_map was sized to fit the device exactly, so the last cluster might
not be fully covered, and the allocator had to check the scan border to
avoid going out of bounds. But the swap table is a fixed-size, per-cluster
table, and the slots beyond the device size are marked as bad slots. The
allocator can simply scan all slots as usual, and any bad slots will be
skipped.
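
As a concrete illustration, assuming SWAPFILE_CLUSTER is 512 purely for
the sake of this example:

  /*
   * A device with si->max == 1000 spans two clusters, each with a full
   * 512-entry table.  Slots 1000..1023 of cluster 1 exist in the table
   * but are marked bad at swapon time, so scanning the whole range
   * [512, 1024) is safe and the tail slots are simply skipped.
   */
  unsigned long nr_clusters = DIV_ROUND_UP(1000, 512);    /* == 2 */
  unsigned long tail_bad = nr_clusters * 512 - 1000;      /* == 24 */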

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h     | 2 +-
 mm/swapfile.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 0a91e21e92b1..cc410b94e91a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -85,7 +85,7 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
 		struct swap_info_struct *si, pgoff_t offset)
 {
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-	VM_WARN_ON_ONCE(offset >= si->max);
+	VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
 	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 801d8092be51..df2b88c6c67b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -945,8 +945,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 {
 	unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
 	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
-	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int order = likely(folio) ? folio_order(folio) : 0;
+	unsigned long end = start + SWAPFILE_CLUSTER;
 	unsigned int nr_pages = 1 << order;
 	bool need_reclaim, ret, usable;
 

-- 
2.52.0




^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 11/12] mm, swap: simplify checking if a folio is swapped
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (9 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 10/12] mm, swap: no need to truncate the scan border Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  7:18   ` Chris Li
  2026-02-17 20:06 ` [PATCH v3 12/12] mm, swap: no need to clear the shadow explicitly Kairui Song via B4 Relay
  2026-02-17 20:10 ` [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

Clean up and simplify how we check if a folio is swapped. The helper
already requires the folio to be locked and in the swap cache. That's
enough to keep the swap cluster from being freed, so there is no need
to take any other lock to avoid a UAF.

Besides, the swap operations have been cleaned up and defined to be
mostly folio based, and now the only place a folio can have any of its
swap slots' count raised from 0 to 1 is folio_dup_swap, which also
requires the folio lock. So while we hold the folio lock here, a folio
can't change its swap status from not swapped (all swap slots have a
count of 0) to swapped (any slot has a swap count larger than 0).

So this helper can't return a false negative if we simply rely on the
folio lock to stabilize the cluster.

This helper is only used to decide whether we can and should release
the swap cache, so false positives are completely harmless, and they
were already possible before: depending on the timing, a racing thread
could release the swap count right after the ci lock was dropped and
before the helper returned. In any case, the worst that can happen is
that we leave a clean swap cache folio behind, and it will still be
reclaimed under pressure just fine.

In conclusion, the check can be made much simpler and lockless. Also,
rename it to folio_maybe_swapped to reflect the design.
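
For illustration only: a hypothetical caller-side wrapper (the name
below is made up; folio_maybe_swapped and swap_cache_del_folio are the
helpers appearing in this patch) showing why the lockless check is
enough and why a false positive is harmless:

static bool try_drop_clean_swap_cache(struct folio *folio)
{
        /*
         * The folio is locked and in the swap cache. Only folio_dup_swap
         * can raise a slot's count from 0 to 1, and it needs the folio
         * lock too, so a lockless read can't miss a swapped folio. A
         * racing put only yields a false positive, which just means a
         * clean swap cache entry is kept around a bit longer.
         */
        if (folio_maybe_swapped(folio))
                return false;

        swap_cache_del_folio(folio);
        return true;
}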

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h     |  5 ++--
 mm/swapfile.c | 82 ++++++++++++++++++++++++++++++++---------------------------
 2 files changed, 48 insertions(+), 39 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index cc410b94e91a..9728e6a944b2 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -195,12 +195,13 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
  *
  * folio_alloc_swap(): the entry point for a folio to be swapped
  * out. It allocates swap slots and pins the slots with swap cache.
- * The slots start with a swap count of zero.
+ * The slots start with a swap count of zero. The slots are pinned
+ * by the swap cache reference, which doesn't contribute to the swap count.
  *
  * folio_dup_swap(): increases the swap count of a folio, usually
  * during it gets unmapped and a swap entry is installed to replace
  * it (e.g., swap entry in page table). A swap slot with swap
- * count == 0 should only be increasd by this helper.
+ * count == 0 can only be increased by this helper.
  *
  * folio_put_swap(): does the opposite thing of folio_dup_swap().
  */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index df2b88c6c67b..dab5e726855b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1743,7 +1743,11 @@ int folio_alloc_swap(struct folio *folio)
  * @subpage: if not NULL, only increase the swap count of this subpage.
  *
  * Typically called when the folio is unmapped and have its swap entry to
- * take its palce.
+ * take its place: swap entries allocated to a folio have count == 0 and are
+ * pinned by the swap cache, and the swap cache pin doesn't increase the swap
+ * count. This helper sets the initial count == 1 and increases it further as
+ * the folio is unmapped and swap entries referencing the slots are generated
+ * to replace the folio.
  *
  * Context: Caller must ensure the folio is locked and in the swap cache.
  * NOTE: The caller also has to ensure there is no raced call to
@@ -1942,49 +1946,44 @@ int swp_swapcount(swp_entry_t entry)
 	return count < 0 ? 0 : count;
 }
 
-static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry, int order)
+/*
+ * folio_maybe_swapped - Test if a folio covers any swap slot with count > 0.
+ *
+ * Check if a folio is swapped. Holding the folio lock ensures the folio won't
+ * go from not-swapped to swapped because the initial swap count increment can
+ * only be done by folio_dup_swap, which also needs the folio lock. But a concurrent
+ * decrease of swap count is possible through swap_put_entries_direct, so this
+ * may return a false positive.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ */
+static bool folio_maybe_swapped(struct folio *folio)
 {
+	swp_entry_t entry = folio->swap;
 	struct swap_cluster_info *ci;
-	unsigned int nr_pages = 1 << order;
-	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, nr_pages);
-	unsigned int ci_off;
-	int i;
+	unsigned int ci_off, ci_end;
 	bool ret = false;
 
-	ci = swap_cluster_lock(si, offset);
-	if (nr_pages == 1) {
-		ci_off = roffset % SWAPFILE_CLUSTER;
-		if (swp_tb_get_count(__swap_table_get(ci, ci_off)))
-			ret = true;
-		goto unlock_out;
-	}
-	for (i = 0; i < nr_pages; i++) {
-		ci_off = (offset + i) % SWAPFILE_CLUSTER;
-		if (swp_tb_get_count(__swap_table_get(ci, ci_off))) {
-			ret = true;
-			break;
-		}
-	}
-unlock_out:
-	swap_cluster_unlock(ci);
-	return ret;
-}
-
-static bool folio_swapped(struct folio *folio)
-{
-	swp_entry_t entry = folio->swap;
-	struct swap_info_struct *si;
-
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 
-	si = __swap_entry_to_info(entry);
-	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
-		return swap_entry_swapped(si, entry);
+	ci = __swap_entry_to_cluster(entry);
+	ci_off = swp_cluster_offset(entry);
+	ci_end = ci_off + folio_nr_pages(folio);
+	/*
+	 * Extra locking not needed, folio lock ensures its swap entries
+	 * won't be released, the backing data won't be gone either.
+	 */
+	rcu_read_lock();
+	do {
+		if (__swp_tb_get_count(__swap_table_get(ci, ci_off))) {
+			ret = true;
+			break;
+		}
+	} while (++ci_off < ci_end);
+	rcu_read_unlock();
 
-	return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
+	return ret;
 }
 
 static bool folio_swapcache_freeable(struct folio *folio)
@@ -2030,7 +2029,7 @@ bool folio_free_swap(struct folio *folio)
 {
 	if (!folio_swapcache_freeable(folio))
 		return false;
-	if (folio_swapped(folio))
+	if (folio_maybe_swapped(folio))
 		return false;
 
 	swap_cache_del_folio(folio);
@@ -3719,6 +3718,8 @@ void si_swapinfo(struct sysinfo *val)
  *
  * Context: Caller must ensure there is no race condition on the reference
  * owner. e.g., locking the PTL of a PTE containing the entry being increased.
+ * Also the swap entry must have a count >= 1. Otherwise folio_dup_swap should
+ * be used.
  */
 int swap_dup_entry_direct(swp_entry_t entry)
 {
@@ -3730,6 +3731,13 @@ int swap_dup_entry_direct(swp_entry_t entry)
 		return -EINVAL;
 	}
 
+	/*
+	 * The caller must be increasing the swap count from a direct
+	 * reference of the swap slot (e.g. a swap entry in page table).
+	 * So the swap count must be >= 1.
+	 */
+	VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry));
+
 	return swap_dup_entries_cluster(si, swp_offset(entry), 1);
 }
 

-- 
2.52.0




^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3 12/12] mm, swap: no need to clear the shadow explicitly
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (10 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 11/12] mm, swap: simplify checking if a folio is swapped Kairui Song via B4 Relay
@ 2026-02-17 20:06 ` Kairui Song via B4 Relay
  2026-02-19  7:19   ` Chris Li
  2026-02-17 20:10 ` [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song
  12 siblings, 1 reply; 28+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-17 20:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li, Kairui Song

From: Kairui Song <kasong@tencent.com>

Since we no longer bypass the swap cache, every swap-in clears the
swap shadow by inserting the folio into the swap table. The only place
that may still seem to need explicit shadow freeing is when swap slots
are freed directly without a folio (swap_put_entries_direct). But with
the swap table, that is not needed either: freeing a slot sets the
table entry to NULL, which erases the shadow just fine.

So delete all explicit shadow clearing; it's no longer needed. Also
rearrange the freeing path slightly.
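
A minimal sketch of why this works (the function below is hypothetical;
the table helpers are the ones used elsewhere in this series): freeing
a slot already stores a NULL table entry, and a NULL entry is exactly
the "no shadow" state, so a separate clearing pass adds nothing.

static void free_one_slot(struct swap_cluster_info *ci, unsigned int ci_off)
{
        unsigned long old;

        /* A NULL entry means both "slot is free" and "no shadow stored". */
        old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
        WARN_ON_ONCE(swp_tb_is_folio(old));     /* must not still be cached */
}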

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       |  1 -
 mm/swap_state.c | 21 ---------------------
 mm/swapfile.c   |  2 --
 3 files changed, 24 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 9728e6a944b2..a77016f2423b 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -290,7 +290,6 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci,
 			    struct folio *folio, swp_entry_t entry, void *shadow);
 void __swap_cache_replace_folio(struct swap_cluster_info *ci,
 				struct folio *old, struct folio *new);
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
 
 void show_swap_cache_info(void);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e7618ffe6d70..32d9d877bda8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -350,27 +350,6 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
 	}
 }
 
-/**
- * __swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
- * @entry: The starting index entry.
- * @nr_ents: How many slots need to be cleared.
- *
- * Context: Caller must ensure the range is valid, all in one single cluster,
- * not occupied by any folio, and lock the cluster.
- */
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
-{
-	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
-	unsigned int ci_off = swp_cluster_offset(entry), ci_end;
-	unsigned long old;
-
-	ci_end = ci_off + nr_ents;
-	do {
-		old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
-		WARN_ON_ONCE(swp_tb_is_folio(old) || swp_tb_get_count(old));
-	} while (++ci_off < ci_end);
-}
-
 /*
  * If we are the only user, then try to free up the swap cache.
  *
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dab5e726855b..802efa37b33f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1287,7 +1287,6 @@ static void swap_range_alloc(struct swap_info_struct *si,
 static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			    unsigned int nr_entries)
 {
-	unsigned long begin = offset;
 	unsigned long end = offset + nr_entries - 1;
 	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
 	unsigned int i;
@@ -1312,7 +1311,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
 	}
-	__swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
 
 	/*
 	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0

-- 
2.52.0




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map
  2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
                   ` (11 preceding siblings ...)
  2026-02-17 20:06 ` [PATCH v3 12/12] mm, swap: no need to clear the shadow explicitly Kairui Song via B4 Relay
@ 2026-02-17 20:10 ` Kairui Song
  12 siblings, 0 replies; 28+ messages in thread
From: Kairui Song @ 2026-02-17 20:10 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Chris Li

On Wed, Feb 18, 2026 at 4:07 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> This series is based on phase II which is still in mm-unstable.

Correction: It's based on mm-unstable, but phase II is already in
mm-stable. Sorry for the confusion.

It is also mergeable into mm-new, tested with no conflicts.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 09/12] mm, swap: use the swap table to track the swap count
  2026-02-17 20:06 ` [PATCH v3 09/12] mm, swap: use the swap table to track the swap count Kairui Song via B4 Relay
@ 2026-02-18 10:40   ` kernel test robot
  2026-02-18 12:22     ` Kairui Song
  0 siblings, 1 reply; 28+ messages in thread
From: kernel test robot @ 2026-02-18 10:40 UTC (permalink / raw)
  To: Kairui Song via B4 Relay, linux-mm
  Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Johannes Weiner,
	David Hildenbrand, Lorenzo Stoakes, Youngjun Park, linux-kernel,
	Chris Li, Kairui Song

Hi Kairui,

kernel test robot noticed the following build warnings:

[auto build test WARNING on d9982f38eb6e9a0cb6bdd1116cc87f75a1084aad]

url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song-via-B4-Relay/mm-swap-protect-si-swap_file-properly-and-use-as-a-mount-indicator/20260218-040852
base:   d9982f38eb6e9a0cb6bdd1116cc87f75a1084aad
patch link:    https://lore.kernel.org/r/20260218-swap-table-p3-v3-9-f4e34be021a7%40tencent.com
patch subject: [PATCH v3 09/12] mm, swap: use the swap table to track the swap count
config: i386-buildonly-randconfig-001-20260218 (https://download.01.org/0day-ci/archive/20260218/202602181835.58TEynxc-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260218/202602181835.58TEynxc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602181835.58TEynxc-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/swapfile.c:1627:6: warning: shift count >= width of type [-Wshift-count-overflow]
    1626 |                 VM_WARN_ON_ONCE(ci->extend_table[ci_off] >=
         |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1627 |                                 (BIT(BITS_PER_TYPE(ci->extend_table[0]))) - 1);
         |                                 ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/vdso/bits.h:7:26: note: expanded from macro 'BIT'
       7 | #define BIT(nr)                 (UL(1) << (nr))
         |                                        ^
   include/linux/mmdebug.h:123:50: note: expanded from macro 'VM_WARN_ON_ONCE'
     123 | #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
         |                                     ~~~~~~~~~~~~~^~~~~
   include/asm-generic/bug.h:120:25: note: expanded from macro 'WARN_ON_ONCE'
     120 |         int __ret_warn_on = !!(condition);                              \
         |                                ^~~~~~~~~
   1 warning generated.


vim +1627 mm/swapfile.c

  1596	
  1597	/* Increase the swap count of one slot. */
  1598	static int __swap_cluster_dup_entry(struct swap_cluster_info *ci,
  1599					    unsigned int ci_off)
  1600	{
  1601		int count;
  1602		unsigned long swp_tb;
  1603	
  1604		lockdep_assert_held(&ci->lock);
  1605		swp_tb = __swap_table_get(ci, ci_off);
  1606		/* Bad or special slots can't be handled */
  1607		if (WARN_ON_ONCE(swp_tb_is_bad(swp_tb)))
  1608			return -EINVAL;
  1609		count = __swp_tb_get_count(swp_tb);
  1610		/* Must be either cached or have a count already */
  1611		if (WARN_ON_ONCE(!count && !swp_tb_is_folio(swp_tb)))
  1612			return -ENOENT;
  1613	
  1614		if (likely(count < (SWP_TB_COUNT_MAX - 1))) {
  1615			__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, count + 1));
  1616			VM_WARN_ON_ONCE(ci->extend_table && ci->extend_table[ci_off]);
  1617		} else if (count == (SWP_TB_COUNT_MAX - 1)) {
  1618			if (ci->extend_table) {
  1619				VM_WARN_ON_ONCE(ci->extend_table[ci_off]);
  1620				ci->extend_table[ci_off] = SWP_TB_COUNT_MAX;
  1621				__swap_table_set(ci, ci_off, __swp_tb_mk_count(swp_tb, SWP_TB_COUNT_MAX));
  1622			} else {
  1623				return -ENOMEM;
  1624			}
  1625		} else if (count == SWP_TB_COUNT_MAX) {
  1626			VM_WARN_ON_ONCE(ci->extend_table[ci_off] >=
> 1627					(BIT(BITS_PER_TYPE(ci->extend_table[0]))) - 1);
  1628			++ci->extend_table[ci_off];
  1629		} else {
  1630			/* Never happens unless counting went wrong */
  1631			WARN_ON_ONCE(1);
  1632		}
  1633	
  1634		return 0;
  1635	}
  1636	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 09/12] mm, swap: use the swap table to track the swap count
  2026-02-18 10:40   ` kernel test robot
@ 2026-02-18 12:22     ` Kairui Song
  2026-02-19  7:06       ` Chris Li
  0 siblings, 1 reply; 28+ messages in thread
From: Kairui Song @ 2026-02-18 12:22 UTC (permalink / raw)
  To: linux-mm, kernel test robot
  Cc: Kairui Song via B4 Relay, llvm, oe-kbuild-all, Andrew Morton,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Johannes Weiner,
	David Hildenbrand, Lorenzo Stoakes, Youngjun Park, linux-kernel,
	Chris Li, Kairui Song

On Wed, Feb 18, 2026 at 06:40:16PM +0800, kernel test robot wrote:
> Hi Kairui,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on d9982f38eb6e9a0cb6bdd1116cc87f75a1084aad]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song-via-B4-Relay/mm-swap-protect-si-swap_file-properly-and-use-as-a-mount-indicator/20260218-040852
> base:   d9982f38eb6e9a0cb6bdd1116cc87f75a1084aad
> patch link:    https://lore.kernel.org/r/20260218-swap-table-p3-v3-9-f4e34be021a7%40tencent.com
> patch subject: [PATCH v3 09/12] mm, swap: use the swap table to track the swap count
> config: i386-buildonly-randconfig-001-20260218 (https://download.01.org/0day-ci/archive/20260218/202602181835.58TEynxc-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260218/202602181835.58TEynxc-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202602181835.58TEynxc-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> >> mm/swapfile.c:1627:6: warning: shift count >= width of type [-Wshift-count-overflow]
>     1626 |                 VM_WARN_ON_ONCE(ci->extend_table[ci_off] >=
>          |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>     1627 |                                 (BIT(BITS_PER_TYPE(ci->extend_table[0]))) - 1);
>          |                                 ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    include/vdso/bits.h:7:26: note: expanded from macro 'BIT'
>        7 | #define BIT(nr)                 (UL(1) << (nr))
>          |                                        ^
>    include/linux/mmdebug.h:123:50: note: expanded from macro 'VM_WARN_ON_ONCE'
>      123 | #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
>          |                                     ~~~~~~~~~~~~~^~~~~
>    include/asm-generic/bug.h:120:25: note: expanded from macro 'WARN_ON_ONCE'
>      120 |         int __ret_warn_on = !!(condition);                              \
>          |                                ^~~~~~~~~
>    1 warning generated.

Nice catch from the bot. It's a newly added sanity check in v3, in
case the swap count ever grows larger than UINT_MAX, which should
never happen.

I really should just use the existing helper macro for that:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 801d8092be51..34b38255f72a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1624,7 +1624,7 @@ static int __swap_cluster_dup_entry(struct swap_cluster_info *ci,
                }
        } else if (count == SWP_TB_COUNT_MAX) {
                VM_WARN_ON_ONCE(ci->extend_table[ci_off] >=
-                               (BIT(BITS_PER_TYPE(ci->extend_table[0]))) - 1);
+                               type_max(typeof(ci->extend_table[0])));
                ++ci->extend_table[ci_off];
        } else {
                /* Never happens unless counting went wrong */
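
For completeness, a tiny standalone illustration (plain userspace C,
not kernel code) of why the original expression warns on 32-bit and
why comparing against the type's maximum needs no shift at all:

/*
 * On i386, unsigned long is 32 bits, so BIT(BITS_PER_TYPE(unsigned int))
 * expands to 1UL << 32; shifting by the full width of the type is
 * undefined, hence -Wshift-count-overflow. type_max() avoids the shift.
 */
#include <limits.h>
#include <stdio.h>

int main(void)
{
        unsigned int ext = UINT_MAX;    /* stand-in for ci->extend_table[ci_off] */

        if (ext >= UINT_MAX)            /* overflow check without any shift */
                printf("extended swap count is saturated\n");
        return 0;
}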


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator
  2026-02-17 20:06 ` [PATCH v3 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator Kairui Song via B4 Relay
@ 2026-02-19  6:36   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  6:36 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> /proc/swaps uses si->swap_map as the indicator to check if the swap
> device is mounted. swap_map will be removed soon, so change it to use
> si->swap_file instead because:
>
> - si->swap_file is exactly the only dynamic content that /proc/swaps is
>   interested in. Previously, it was checking si->swap_map just to ensure
>   si->swap_file is available. si->swap_map is set under mutex
>   protection, and after si->swap_file is set, so having si->swap_map set
>   guarantees si->swap_file is set.
>
> - Checking si->flags doesn't work here. SWP_WRITEOK is cleared during
>   swapoff, but /proc/swaps is supposed to show the device under swapoff
>   too to report the swapoff progress. And SWP_USED is set even if the
>   device hasn't been properly set up.
>
> We can have another flag, but the easier way is to just check
> si->swap_file directly. So protect si->swap_file setting with mutext,
> and set si->swap_file only when the swap device is truly enabled.
>
> /proc/swaps only interested in si->swap_file and a few static data
> reading. Only si->swap_file needs protection. Reading other static
> fields is always fine.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 25 +++++++++++++------------
>  1 file changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 32e0e7545ab8..25dfe992538d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -110,6 +110,7 @@ struct swap_info_struct *swap_info[MAX_SWAPFILES];
>
>  static struct kmem_cache *swap_table_cachep;
>
> +/* Protects si->swap_file for /proc/swaps usage */
>  static DEFINE_MUTEX(swapon_mutex);
>
>  static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
> @@ -2532,7 +2533,8 @@ static void drain_mmlist(void)
>  /*
>   * Free all of a swapdev's extent information
>   */
> -static void destroy_swap_extents(struct swap_info_struct *sis)
> +static void destroy_swap_extents(struct swap_info_struct *sis,
> +                                struct file *swap_file)
>  {
>         while (!RB_EMPTY_ROOT(&sis->swap_extent_root)) {
>                 struct rb_node *rb = sis->swap_extent_root.rb_node;
> @@ -2543,7 +2545,6 @@ static void destroy_swap_extents(struct swap_info_struct *sis)
>         }
>
>         if (sis->flags & SWP_ACTIVATED) {
> -               struct file *swap_file = sis->swap_file;
>                 struct address_space *mapping = swap_file->f_mapping;
>
>                 sis->flags &= ~SWP_ACTIVATED;
> @@ -2626,9 +2627,9 @@ EXPORT_SYMBOL_GPL(add_swap_extent);
>   * Typically it is in the 1-4 megabyte range.  So we can have hundreds of
>   * extents in the rbtree. - akpm.
>   */
> -static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
> +static int setup_swap_extents(struct swap_info_struct *sis,
> +                             struct file *swap_file, sector_t *span)
>  {
> -       struct file *swap_file = sis->swap_file;
>         struct address_space *mapping = swap_file->f_mapping;
>         struct inode *inode = mapping->host;
>         int ret;
> @@ -2646,7 +2647,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
>                 sis->flags |= SWP_ACTIVATED;
>                 if ((sis->flags & SWP_FS_OPS) &&
>                     sio_pool_init() != 0) {
> -                       destroy_swap_extents(sis);
> +                       destroy_swap_extents(sis, swap_file);
>                         return -ENOMEM;
>                 }
>                 return ret;
> @@ -2862,7 +2863,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         flush_work(&p->reclaim_work);
>         flush_percpu_swap_cluster(p);
>
> -       destroy_swap_extents(p);
> +       destroy_swap_extents(p, p->swap_file);
>         if (p->flags & SWP_CONTINUED)
>                 free_swap_count_continuations(p);
>
> @@ -2952,7 +2953,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
>                 return SEQ_START_TOKEN;
>
>         for (type = 0; (si = swap_type_to_info(type)); type++) {
> -               if (!(si->flags & SWP_USED) || !si->swap_map)
> +               if (!(si->swap_file))
>                         continue;
>                 if (!--l)
>                         return si;
> @@ -2973,7 +2974,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
>
>         ++(*pos);
>         for (; (si = swap_type_to_info(type)); type++) {
> -               if (!(si->flags & SWP_USED) || !si->swap_map)
> +               if (!(si->swap_file))
>                         continue;
>                 return si;
>         }
> @@ -3390,7 +3391,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 goto bad_swap;
>         }
>
> -       si->swap_file = swap_file;
>         mapping = swap_file->f_mapping;
>         dentry = swap_file->f_path.dentry;
>         inode = mapping->host;
> @@ -3440,7 +3440,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>
>         si->max = maxpages;
>         si->pages = maxpages - 1;
> -       nr_extents = setup_swap_extents(si, &span);
> +       nr_extents = setup_swap_extents(si, swap_file, &span);
>         if (nr_extents < 0) {
>                 error = nr_extents;
>                 goto bad_swap_unlock_inode;
> @@ -3549,6 +3549,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         prio = DEF_SWAP_PRIO;
>         if (swap_flags & SWAP_FLAG_PREFER)
>                 prio = swap_flags & SWAP_FLAG_PRIO_MASK;
> +
> +       si->swap_file = swap_file;
>         enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
>
>         pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s\n",
> @@ -3573,10 +3575,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         kfree(si->global_cluster);
>         si->global_cluster = NULL;
>         inode = NULL;
> -       destroy_swap_extents(si);
> +       destroy_swap_extents(si, swap_file);
>         swap_cgroup_swapoff(si->type);
>         spin_lock(&swap_lock);
> -       si->swap_file = NULL;
>         si->flags = 0;
>         spin_unlock(&swap_lock);
>         vfree(swap_map);
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 02/12] mm, swap: clean up swapon process and locking
  2026-02-17 20:06 ` [PATCH v3 02/12] mm, swap: clean up swapon process and locking Kairui Song via B4 Relay
@ 2026-02-19  6:45   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  6:45 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Slightly clean up the swapon process. Add comments about what swap_lock
> protects, introduce and rename helpers that wrap swap_map and
> cluster_info setup, and do it outside of the swap_lock lock.
>
> This lock protection is not needed for swap_map and cluster_info setup
> because all swap users must either hold the percpu ref or hold a stable
> allocated swap entry (e.g., locking a folio in the swap cache) before
> accessing. So before the swap device is exposed by enable_swap_info,
> nothing would use the swap device's map or cluster.
>
> So we are safe to allocate and set up swap data freely first, then
> expose the swap device and set the SWP_WRITEOK flag.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 87 ++++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 25dfe992538d..8fc35b316ade 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -65,6 +65,13 @@ static void move_cluster(struct swap_info_struct *si,
>                          struct swap_cluster_info *ci, struct list_head *list,
>                          enum swap_cluster_flags new_flags);
>
> +/*
> + * Protects the swap_info array, and the SWP_USED flag. swap_info contains
> + * lazily allocated & freed swap device info struts, and SWP_USED indicates

Is "struts" a typo for "struct"?

Chris

> + * which device is used, ~SWP_USED devices and can be reused.
> + *
> + * Also protects swap_active_head total_swap_pages, and the SWP_WRITEOK flag.
> + */
>  static DEFINE_SPINLOCK(swap_lock);
>  static unsigned int nr_swapfiles;
>  atomic_long_t nr_swap_pages;
> @@ -2657,8 +2664,6 @@ static int setup_swap_extents(struct swap_info_struct *sis,
>  }
>
>  static void setup_swap_info(struct swap_info_struct *si, int prio,
> -                           unsigned char *swap_map,
> -                           struct swap_cluster_info *cluster_info,
>                             unsigned long *zeromap)
>  {
>         si->prio = prio;
> @@ -2668,8 +2673,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
>          */
>         si->list.prio = -si->prio;
>         si->avail_list.prio = -si->prio;
> -       si->swap_map = swap_map;
> -       si->cluster_info = cluster_info;
>         si->zeromap = zeromap;
>  }
>
> @@ -2687,13 +2690,11 @@ static void _enable_swap_info(struct swap_info_struct *si)
>  }
>
>  static void enable_swap_info(struct swap_info_struct *si, int prio,
> -                               unsigned char *swap_map,
> -                               struct swap_cluster_info *cluster_info,
> -                               unsigned long *zeromap)
> +                            unsigned long *zeromap)
>  {
>         spin_lock(&swap_lock);
>         spin_lock(&si->lock);
> -       setup_swap_info(si, prio, swap_map, cluster_info, zeromap);
> +       setup_swap_info(si, prio, zeromap);
>         spin_unlock(&si->lock);
>         spin_unlock(&swap_lock);
>         /*
> @@ -2711,7 +2712,7 @@ static void reinsert_swap_info(struct swap_info_struct *si)
>  {
>         spin_lock(&swap_lock);
>         spin_lock(&si->lock);
> -       setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap);
> +       setup_swap_info(si, si->prio, si->zeromap);
>         _enable_swap_info(si);
>         spin_unlock(&si->lock);
>         spin_unlock(&swap_lock);
> @@ -2735,8 +2736,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
>         }
>  }
>
> -static void free_cluster_info(struct swap_cluster_info *cluster_info,
> -                             unsigned long maxpages)
> +static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
> +                                  unsigned long maxpages)
>  {
>         struct swap_cluster_info *ci;
>         int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> @@ -2894,7 +2895,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         p->global_cluster = NULL;
>         vfree(swap_map);
>         kvfree(zeromap);
> -       free_cluster_info(cluster_info, maxpages);
> +       free_swap_cluster_info(cluster_info, maxpages);
>         /* Destroy swap account information */
>         swap_cgroup_swapoff(p->type);
>
> @@ -3243,10 +3244,15 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>
>  static int setup_swap_map(struct swap_info_struct *si,
>                           union swap_header *swap_header,
> -                         unsigned char *swap_map,
>                           unsigned long maxpages)
>  {
>         unsigned long i;
> +       unsigned char *swap_map;
> +
> +       swap_map = vzalloc(maxpages);
> +       si->swap_map = swap_map;
> +       if (!swap_map)
> +               return -ENOMEM;
>
>         swap_map[0] = SWAP_MAP_BAD; /* omit header page */
>         for (i = 0; i < swap_header->info.nr_badpages; i++) {
> @@ -3267,9 +3273,9 @@ static int setup_swap_map(struct swap_info_struct *si,
>         return 0;
>  }
>
> -static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> -                                               union swap_header *swap_header,
> -                                               unsigned long maxpages)
> +static int setup_swap_clusters_info(struct swap_info_struct *si,
> +                                   union swap_header *swap_header,
> +                                   unsigned long maxpages)
>  {
>         unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>         struct swap_cluster_info *cluster_info;
> @@ -3339,10 +3345,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>                 }
>         }
>
> -       return cluster_info;
> +       si->cluster_info = cluster_info;
> +       return 0;
>  err:
> -       free_cluster_info(cluster_info, maxpages);
> -       return ERR_PTR(err);
> +       free_swap_cluster_info(cluster_info, maxpages);
> +       return err;
>  }
>
>  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> @@ -3358,9 +3365,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         int nr_extents;
>         sector_t span;
>         unsigned long maxpages;
> -       unsigned char *swap_map = NULL;
>         unsigned long *zeromap = NULL;
> -       struct swap_cluster_info *cluster_info = NULL;
>         struct folio *folio = NULL;
>         struct inode *inode = NULL;
>         bool inced_nr_rotate_swap = false;
> @@ -3371,6 +3376,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (!capable(CAP_SYS_ADMIN))
>                 return -EPERM;
>
> +       /*
> +        * Allocate or reuse existing !SWP_USED swap_info. The returned
> +        * si will stay in a dying status, so nothing will access its content
> +        * until enable_swap_info resurrects its percpu ref and expose it.
> +        */
>         si = alloc_swap_info();
>         if (IS_ERR(si))
>                 return PTR_ERR(si);
> @@ -3453,18 +3463,17 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>
>         maxpages = si->max;
>
> -       /* OK, set up the swap map and apply the bad block list */
> -       swap_map = vzalloc(maxpages);
> -       if (!swap_map) {
> -               error = -ENOMEM;
> +       /* Setup the swap map and apply bad block */
> +       error = setup_swap_map(si, swap_header, maxpages);
> +       if (error)
>                 goto bad_swap_unlock_inode;
> -       }
>
> -       error = swap_cgroup_swapon(si->type, maxpages);
> +       /* Set up the swap cluster info */
> +       error = setup_swap_clusters_info(si, swap_header, maxpages);
>         if (error)
>                 goto bad_swap_unlock_inode;
>
> -       error = setup_swap_map(si, swap_header, swap_map, maxpages);
> +       error = swap_cgroup_swapon(si->type, maxpages);
>         if (error)
>                 goto bad_swap_unlock_inode;
>
> @@ -3492,13 +3501,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 inced_nr_rotate_swap = true;
>         }
>
> -       cluster_info = setup_clusters(si, swap_header, maxpages);
> -       if (IS_ERR(cluster_info)) {
> -               error = PTR_ERR(cluster_info);
> -               cluster_info = NULL;
> -               goto bad_swap_unlock_inode;
> -       }
> -
>         if ((swap_flags & SWAP_FLAG_DISCARD) &&
>             si->bdev && bdev_max_discard_sectors(si->bdev)) {
>                 /*
> @@ -3551,7 +3553,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 prio = swap_flags & SWAP_FLAG_PRIO_MASK;
>
>         si->swap_file = swap_file;
> -       enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
> +
> +       /* Sets SWP_WRITEOK, resurrect the percpu ref, expose the swap device */
> +       enable_swap_info(si, prio, zeromap);
>
>         pr_info("Adding %uk swap on %s.  Priority:%d extents:%d across:%lluk %s%s%s%s\n",
>                 K(si->pages), name->name, si->prio, nr_extents,
> @@ -3577,13 +3581,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         inode = NULL;
>         destroy_swap_extents(si, swap_file);
>         swap_cgroup_swapoff(si->type);
> +       vfree(si->swap_map);
> +       si->swap_map = NULL;
> +       free_swap_cluster_info(si->cluster_info, si->max);
> +       si->cluster_info = NULL;
> +       /*
> +        * Clear the SWP_USED flag after all resources are freed so
> +        * alloc_swap_info can reuse this si safely.
> +        */
>         spin_lock(&swap_lock);
>         si->flags = 0;
>         spin_unlock(&swap_lock);
> -       vfree(swap_map);
>         kvfree(zeromap);
> -       if (cluster_info)
> -               free_cluster_info(cluster_info, maxpages);
>         if (inced_nr_rotate_swap)
>                 atomic_dec(&nr_rotate_swap);
>         if (swap_file)
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 03/12] mm, swap: remove redundant arguments and locking for enabling a device
  2026-02-17 20:06 ` [PATCH v3 03/12] mm, swap: remove redundant arguments and locking for enabling a device Kairui Song via B4 Relay
@ 2026-02-19  6:48   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  6:48 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> There is no need to repeatedly pass zero map and priority values.
> zeromap is similar to cluster info and swap_map, which are only used
> once the swap device is exposed. And the prio values are currently
> read only once set, and only used for the list insertion upon expose
> or swap info display.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>


Acked-by: Chris Li <chrisl@kernel.org>

Chris


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 04/12] mm, swap: consolidate bad slots setup and make it more robust
  2026-02-17 20:06 ` [PATCH v3 04/12] mm, swap: consolidate bad slots setup and make it more robust Kairui Song via B4 Relay
@ 2026-02-19  6:51   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  6:51 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> In preparation for using the swap table to track bad slots directly,
> move the bad slot setup to one place, set up the swap_map mark, and
> cluster counter update together.
>
> While at it, provide more informative logs and a more robust fallback if
> any bad slot info looks incorrect.
>
> Fixes a potential issue that a malformed swap file may cause the cluster
> to be unusable upon swapon, and provides a more verbose warning on a
> malformed swap file
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Chris Li <chrisl@kernel.org>

Chris

> ---
>  mm/swapfile.c | 68 +++++++++++++++++++++++++++++++++--------------------------
>  1 file changed, 38 insertions(+), 30 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index fb0d48681c48..91c1fa804185 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -743,13 +743,37 @@ static void relocate_cluster(struct swap_info_struct *si,
>   * slot. The cluster will not be added to the free cluster list, and its
>   * usage counter will be increased by 1. Only used for initialization.
>   */
> -static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> -                                      unsigned long offset)
> +static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
> +                                      struct swap_cluster_info *cluster_info,
> +                                      unsigned int offset, bool mask)
>  {
>         unsigned long idx = offset / SWAPFILE_CLUSTER;
>         struct swap_table *table;
>         struct swap_cluster_info *ci;
>
> +       /* si->max may got shrunk by swap swap_activate() */
> +       if (offset >= si->max && !mask) {
> +               pr_debug("Ignoring bad slot %u (max: %u)\n", offset, si->max);
> +               return 0;
> +       }
> +       /*
> +        * Account it, skip header slot: si->pages is initiated as
> +        * si->max - 1. Also skip the masking of last cluster,
> +        * si->pages doesn't include that part.
> +        */
> +       if (offset && !mask)
> +               si->pages -= 1;
> +       if (!si->pages) {
> +               pr_warn("Empty swap-file\n");
> +               return -EINVAL;
> +       }
> +       /* Check for duplicated bad swap slots. */
> +       if (si->swap_map[offset]) {
> +               pr_warn("Duplicated bad slot offset %d\n", offset);
> +               return -EINVAL;
> +       }
> +
> +       si->swap_map[offset] = SWAP_MAP_BAD;
>         ci = cluster_info + idx;
>         if (!ci->table) {
>                 table = swap_table_alloc(GFP_KERNEL);
> @@ -3227,30 +3251,12 @@ static int setup_swap_map(struct swap_info_struct *si,
>                           union swap_header *swap_header,
>                           unsigned long maxpages)
>  {
> -       unsigned long i;
>         unsigned char *swap_map;
>
>         swap_map = vzalloc(maxpages);
>         si->swap_map = swap_map;
>         if (!swap_map)
>                 return -ENOMEM;
> -
> -       swap_map[0] = SWAP_MAP_BAD; /* omit header page */
> -       for (i = 0; i < swap_header->info.nr_badpages; i++) {
> -               unsigned int page_nr = swap_header->info.badpages[i];
> -               if (page_nr == 0 || page_nr > swap_header->info.last_page)
> -                       return -EINVAL;
> -               if (page_nr < maxpages) {
> -                       swap_map[page_nr] = SWAP_MAP_BAD;
> -                       si->pages--;
> -               }
> -       }
> -
> -       if (!si->pages) {
> -               pr_warn("Empty swap-file\n");
> -               return -EINVAL;
> -       }
> -
>         return 0;
>  }
>
> @@ -3281,26 +3287,28 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
>         }
>
>         /*
> -        * Mark unusable pages as unavailable. The clusters aren't
> -        * marked free yet, so no list operations are involved yet.
> -        *
> -        * See setup_swap_map(): header page, bad pages,
> -        * and the EOF part of the last cluster.
> +        * Mark unusable pages (header page, bad pages, and the EOF part of
> +        * the last cluster) as unavailable. The clusters aren't marked free
> +        * yet, so no list operations are involved yet.
>          */
> -       err = swap_cluster_setup_bad_slot(cluster_info, 0);
> +       err = swap_cluster_setup_bad_slot(si, cluster_info, 0, false);
>         if (err)
>                 goto err;
>         for (i = 0; i < swap_header->info.nr_badpages; i++) {
>                 unsigned int page_nr = swap_header->info.badpages[i];
>
> -               if (page_nr >= maxpages)
> -                       continue;
> -               err = swap_cluster_setup_bad_slot(cluster_info, page_nr);
> +               if (!page_nr || page_nr > swap_header->info.last_page) {
> +                       pr_warn("Bad slot offset is out of border: %d (last_page: %d)\n",
> +                               page_nr, swap_header->info.last_page);
> +                       err = -EINVAL;
> +                       goto err;
> +               }
> +               err = swap_cluster_setup_bad_slot(si, cluster_info, page_nr, false);
>                 if (err)
>                         goto err;
>         }
>         for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
> -               err = swap_cluster_setup_bad_slot(cluster_info, i);
> +               err = swap_cluster_setup_bad_slot(si, cluster_info, i, true);
>                 if (err)
>                         goto err;
>         }
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 05/12] mm/workingset: leave highest bits empty for anon shadow
  2026-02-17 20:06 ` [PATCH v3 05/12] mm/workingset: leave highest bits empty for anon shadow Kairui Song via B4 Relay
@ 2026-02-19  6:56   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  6:56 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap table entry will need 4 bits reserved for swap count in the shadow,
> so the anon shadow should have its leading 4 bits remain 0.
>
> This should be OK for the foreseeable future. Take 52 bits of physical
> address space as an example: for 4K pages, there would be at most 40
> bits for addressable pages. Currently, we have 36 bits available (64 - 1
> - 16 - 10 - 1, where XA_VALUE takes 1 bit for marker,
> MEM_CGROUP_ID_SHIFT takes 16 bits, NODES_SHIFT takes <=10 bits,
> WORKINGSET flags takes 1 bit).
>
> So in the worst case, we previously need to pack the 40 bits of address
> in 36 bits fields using a 64K bucket (bucket_order = 4). After this, the
> bucket will be increased to 1M. Which should be fine, as on such large
> machines, the working set size will be way larger than the bucket size.
>
> And for MGLRU's gen number tracking, it should be even more than enough,
> MGLRU's gen number (max_seq) increment is much slower compared to the
> eviction counter (nonresident_age).
>
> And after all, either the refault distance or the gen distance is only a
> hint that can tolerate inaccuracy just fine.
>
> And the 4 bits can be shrunk to 3, or extended to a higher value if
> needed later.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Chris Li <chrisl@kernel.org>

> ---
>  mm/swap_table.h |  4 ++++
>  mm/workingset.c | 49 ++++++++++++++++++++++++++++++-------------------
>  2 files changed, 34 insertions(+), 19 deletions(-)
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index ea244a57a5b7..10e11d1f3b04 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -12,6 +12,7 @@ struct swap_table {
>  };
>
>  #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
> +#define SWP_TB_COUNT_BITS              4
>
>  /*
>   * A swap table entry represents the status of a swap slot on a swap
> @@ -22,6 +23,9 @@ struct swap_table {
>   * (shadow), or NULL.
>   */
>
> +/* Macro for shadow offset calculation */
> +#define SWAP_COUNT_SHIFT       SWP_TB_COUNT_BITS
> +
>  /*
>   * Helpers for casting one type of info into a swap table entry.
>   */
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 13422d304715..37a94979900f 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -16,6 +16,7 @@
>  #include <linux/dax.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
> +#include "swap_table.h"
>  #include "internal.h"
>
>  /*
> @@ -184,7 +185,9 @@
>  #define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) +  \
>                          WORKINGSET_SHIFT + NODES_SHIFT + \
>                          MEM_CGROUP_ID_SHIFT)
> +#define EVICTION_SHIFT_ANON    (EVICTION_SHIFT + SWAP_COUNT_SHIFT)
>  #define EVICTION_MASK  (~0UL >> EVICTION_SHIFT)
> +#define EVICTION_MASK_ANON     (~0UL >> EVICTION_SHIFT_ANON)
>
>  /*
>   * Eviction timestamps need to be able to cover the full range of
> @@ -194,12 +197,12 @@
>   * that case, we have to sacrifice granularity for distance, and group
>   * evictions into coarser buckets by shaving off lower timestamp bits.
>   */
> -static unsigned int bucket_order __read_mostly;
> +static unsigned int bucket_order[ANON_AND_FILE] __read_mostly;
>
>  static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
> -                        bool workingset)
> +                        bool workingset, bool file)
>  {
> -       eviction &= EVICTION_MASK;
> +       eviction &= file ? EVICTION_MASK : EVICTION_MASK_ANON;
>         eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>         eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
>         eviction = (eviction << WORKINGSET_SHIFT) | workingset;
> @@ -244,7 +247,8 @@ static void *lru_gen_eviction(struct folio *folio)
>         struct mem_cgroup *memcg = folio_memcg(folio);
>         struct pglist_data *pgdat = folio_pgdat(folio);
>
> -       BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
> +       BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH >
> +                    BITS_PER_LONG - max(EVICTION_SHIFT, EVICTION_SHIFT_ANON));
>
>         lruvec = mem_cgroup_lruvec(memcg, pgdat);
>         lrugen = &lruvec->lrugen;
> @@ -254,7 +258,7 @@ static void *lru_gen_eviction(struct folio *folio)
>         hist = lru_hist_from_seq(min_seq);
>         atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
>
> -       return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset);
> +       return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset, type);
>  }
>
>  /*
> @@ -262,7 +266,7 @@ static void *lru_gen_eviction(struct folio *folio)
>   * Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
>   */
>  static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
> -                               unsigned long *token, bool *workingset)
> +                               unsigned long *token, bool *workingset, bool file)
>  {
>         int memcg_id;
>         unsigned long max_seq;
> @@ -275,7 +279,7 @@ static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
>         *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
>         max_seq = READ_ONCE((*lruvec)->lrugen.max_seq);
> -       max_seq &= EVICTION_MASK >> LRU_REFS_WIDTH;
> +       max_seq &= (file ? EVICTION_MASK : EVICTION_MASK_ANON) >> LRU_REFS_WIDTH;

Nit pick, I saw you use this expression more than once:
"file ? EVICTION_MASK : EVICTION_MASK_ANON"

Maybe make it an inline function or macro?

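One possible shape for such a helper, purely illustrative (the name is
made up, the masks are the ones defined in this patch):

static inline unsigned long lru_eviction_mask(bool file)
{
        return file ? EVICTION_MASK : EVICTION_MASK_ANON;
}
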
>
>         return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
>  }
> @@ -293,7 +297,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
>
>         rcu_read_lock();
>
> -       recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset);
> +       recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset, type);
>         if (lruvec != folio_lruvec(folio))
>                 goto unlock;
>
> @@ -331,7 +335,7 @@ static void *lru_gen_eviction(struct folio *folio)
>  }
>
>  static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
> -                               unsigned long *token, bool *workingset)
> +                               unsigned long *token, bool *workingset, bool file)
>  {
>         return false;
>  }
> @@ -381,6 +385,7 @@ void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
>  void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
>  {
>         struct pglist_data *pgdat = folio_pgdat(folio);
> +       int file = folio_is_file_lru(folio);
>         unsigned long eviction;
>         struct lruvec *lruvec;
>         int memcgid;
> @@ -397,10 +402,10 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
>         /* XXX: target_memcg can be NULL, go through lruvec */
>         memcgid = mem_cgroup_private_id(lruvec_memcg(lruvec));
>         eviction = atomic_long_read(&lruvec->nonresident_age);
> -       eviction >>= bucket_order;
> +       eviction >>= bucket_order[file];
>         workingset_age_nonresident(lruvec, folio_nr_pages(folio));
>         return pack_shadow(memcgid, pgdat, eviction,
> -                               folio_test_workingset(folio));
> +                          folio_test_workingset(folio), file);
>  }
>
>  /**
> @@ -431,14 +436,15 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
>                 bool recent;
>
>                 rcu_read_lock();
> -               recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction, workingset);
> +               recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction,
> +                                            workingset, file);
>                 rcu_read_unlock();
>                 return recent;
>         }
>
>         rcu_read_lock();
>         unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
> -       eviction <<= bucket_order;
> +       eviction <<= bucket_order[file];
>
>         /*
>          * Look up the memcg associated with the stored ID. It might
> @@ -495,7 +501,8 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
>          * longest time, so the occasional inappropriate activation
>          * leading to pressure on the active list is not a problem.
>          */
> -       refault_distance = (refault - eviction) & EVICTION_MASK;
> +       refault_distance = ((refault - eviction) &
> +                           (file ? EVICTION_MASK : EVICTION_MASK_ANON));

Here too.

Chris

>
>         /*
>          * Compare the distance to the existing workingset size. We
> @@ -780,8 +787,8 @@ static struct lock_class_key shadow_nodes_key;
>
>  static int __init workingset_init(void)
>  {
> +       unsigned int timestamp_bits, timestamp_bits_anon;
>         struct shrinker *workingset_shadow_shrinker;
> -       unsigned int timestamp_bits;
>         unsigned int max_order;
>         int ret = -ENOMEM;
>
> @@ -794,11 +801,15 @@ static int __init workingset_init(void)
>          * double the initial memory by using totalram_pages as-is.
>          */
>         timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
> +       timestamp_bits_anon = BITS_PER_LONG - EVICTION_SHIFT_ANON;
>         max_order = fls_long(totalram_pages() - 1);
> -       if (max_order > timestamp_bits)
> -               bucket_order = max_order - timestamp_bits;
> -       pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
> -              timestamp_bits, max_order, bucket_order);
> +       if (max_order > (BITS_PER_LONG - EVICTION_SHIFT))
> +               bucket_order[WORKINGSET_FILE] = max_order - timestamp_bits;
> +       if (max_order > timestamp_bits_anon)
> +               bucket_order[WORKINGSET_ANON] = max_order - timestamp_bits_anon;
> +       pr_info("workingset: timestamp_bits=%d (anon: %d) max_order=%d bucket_order=%u (anon: %d)\n",
> +               timestamp_bits, timestamp_bits_anon, max_order,
> +               bucket_order[WORKINGSET_FILE], bucket_order[WORKINGSET_ANON]);
>
>         workingset_shadow_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
>                                                     SHRINKER_MEMCG_AWARE,
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 06/12] mm, swap: implement helpers for reserving data in the swap table
  2026-02-17 20:06 ` [PATCH v3 06/12] mm, swap: implement helpers for reserving data in the swap table Kairui Song via B4 Relay
@ 2026-02-19  7:00   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  7:00 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> To prepare for using the swap table as the unified swap layer, introduce
> macros and helpers for storing multiple kinds of data in a swap table
> entry.
>
> From now on, we store the PFN in the swap table to make space for
> extra counting bits (SWAP_COUNT). Shadows are still stored as they are,
> since SWAP_COUNT is not used yet.
>
> Also, rename shadow_swp_to_tb to shadow_to_swp_tb. That's a spelling
> error, not really worth a separate fix.
>
> No behaviour change yet; this just prepares the API.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Chris Li <chrisl@kernel.org>

Chris

> ---
>  mm/swap_state.c |   6 +--
>  mm/swap_table.h | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 124 insertions(+), 13 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 6d0eef7470be..e213ee35c1d2 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -148,7 +148,7 @@ void __swap_cache_add_folio(struct swap_cluster_info *ci,
>         VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
>         VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
>
> -       new_tb = folio_to_swp_tb(folio);
> +       new_tb = folio_to_swp_tb(folio, 0);
>         ci_start = swp_cluster_offset(entry);
>         ci_off = ci_start;
>         ci_end = ci_start + nr_pages;
> @@ -249,7 +249,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
>         VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
>
>         si = __swap_entry_to_info(entry);
> -       new_tb = shadow_swp_to_tb(shadow);
> +       new_tb = shadow_to_swp_tb(shadow, 0);
>         ci_start = swp_cluster_offset(entry);
>         ci_end = ci_start + nr_pages;
>         ci_off = ci_start;
> @@ -331,7 +331,7 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
>         VM_WARN_ON_ONCE(!entry.val);
>
>         /* Swap cache still stores N entries instead of a high-order entry */
> -       new_tb = folio_to_swp_tb(new);
> +       new_tb = folio_to_swp_tb(new, 0);
>         do {
>                 old_tb = __swap_table_xchg(ci, ci_off, new_tb);
>                 WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index 10e11d1f3b04..10762ac5f4f5 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -12,17 +12,72 @@ struct swap_table {
>  };
>
>  #define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
> -#define SWP_TB_COUNT_BITS              4
>
>  /*
>   * A swap table entry represents the status of a swap slot on a swap
>   * (physical or virtual) device. The swap table in each cluster is a
>   * 1:1 map of the swap slots in this cluster.
>   *
> - * Each swap table entry could be a pointer (folio), a XA_VALUE
> - * (shadow), or NULL.
> + * Swap table entry type and bits layouts:
> + *
> + * NULL:     |---------------- 0 ---------------| - Free slot
> + * Shadow:   | SWAP_COUNT |---- SHADOW_VAL ---|1| - Swapped out slot
> + * PFN:      | SWAP_COUNT |------ PFN -------|10| - Cached slot
> + * Pointer:  |----------- Pointer ----------|100| - (Unused)
> + * Bad:      |------------- 1 -------------|1000| - Bad slot
> + *
> + * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long.
> + *
> + * Usages:
> + *
> + * - NULL: Swap slot is unused, could be allocated.
> + *
> + * - Shadow: Swap slot is used and not cached (usually swapped out). It reuses
> + *   the XA_VALUE format to be compatible with working set shadows. SHADOW_VAL
> + *   part might be all 0 if the working set shadow info is absent. In such
> + *   a case, we still want to keep the shadow format as a placeholder.
> + *
> + *   Memcg ID is embedded in SHADOW_VAL.
> + *
> + * - PFN: Swap slot is in use, and cached. Memcg info is recorded on the page
> + *   struct.
> + *
> + * - Pointer: Unused yet. `0b100` is reserved for potential pointer usage
> + *   because only the lower three bits can be used as a marker for 8 bytes
> + *   aligned pointers.
> + *
> + * - Bad: Swap slot is reserved, protects swap header or holes on swap devices.
>   */
>
> +#if defined(MAX_POSSIBLE_PHYSMEM_BITS)
> +#define SWAP_CACHE_PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
> +#elif defined(MAX_PHYSMEM_BITS)
> +#define SWAP_CACHE_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> +#else
> +#define SWAP_CACHE_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
> +#endif
> +
> +/* NULL Entry, all 0 */
> +#define SWP_TB_NULL            0UL
> +
> +/* Swapped out: shadow */
> +#define SWP_TB_SHADOW_MARK     0b1UL
> +
> +/* Cached: PFN */
> +#define SWP_TB_PFN_BITS                (SWAP_CACHE_PFN_BITS + SWP_TB_PFN_MARK_BITS)
> +#define SWP_TB_PFN_MARK                0b10UL
> +#define SWP_TB_PFN_MARK_BITS   2
> +#define SWP_TB_PFN_MARK_MASK   (BIT(SWP_TB_PFN_MARK_BITS) - 1)
> +
> +/* SWAP_COUNT part for PFN or shadow, the width can be shrunk or extended */
> +#define SWP_TB_COUNT_BITS      min(4, BITS_PER_LONG - SWP_TB_PFN_BITS)
> +#define SWP_TB_COUNT_MASK      (~((~0UL) >> SWP_TB_COUNT_BITS))
> +#define SWP_TB_COUNT_SHIFT     (BITS_PER_LONG - SWP_TB_COUNT_BITS)
> +#define SWP_TB_COUNT_MAX       ((1 << SWP_TB_COUNT_BITS) - 1)
> +
> +/* Bad slot: ends with 0b1000 and the rest of the bits are all 1 */
> +#define SWP_TB_BAD             ((~0UL) << 3)
> +
>  /* Macro for shadow offset calculation */
>  #define SWAP_COUNT_SHIFT       SWP_TB_COUNT_BITS
>
> @@ -35,18 +90,47 @@ static inline unsigned long null_to_swp_tb(void)
>         return 0;
>  }
>
> -static inline unsigned long folio_to_swp_tb(struct folio *folio)
> +static inline unsigned long __count_to_swp_tb(unsigned char count)
>  {
> +       /*
> +        * At least three values are needed to distinguish free (0),
> +        * used (count > 0 && count < SWP_TB_COUNT_MAX), and
> +        * overflow (count == SWP_TB_COUNT_MAX).
> +        */
> +       BUILD_BUG_ON(SWP_TB_COUNT_MAX < 2 || SWP_TB_COUNT_BITS < 2);
> +       VM_WARN_ON(count > SWP_TB_COUNT_MAX);
> +       return ((unsigned long)count) << SWP_TB_COUNT_SHIFT;
> +}
> +
> +static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned int count)
> +{
> +       unsigned long swp_tb;
> +
>         BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
> -       return (unsigned long)folio;
> +       BUILD_BUG_ON(SWAP_CACHE_PFN_BITS >
> +                    (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_COUNT_BITS));
> +
> +       swp_tb = (pfn << SWP_TB_PFN_MARK_BITS) | SWP_TB_PFN_MARK;
> +       VM_WARN_ON_ONCE(swp_tb & SWP_TB_COUNT_MASK);
> +
> +       return swp_tb | __count_to_swp_tb(count);
> +}
> +
> +static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned int count)
> +{
> +       return pfn_to_swp_tb(folio_pfn(folio), count);
>  }
>
> -static inline unsigned long shadow_swp_to_tb(void *shadow)
> +static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned int count)
>  {
>         BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
>                      BITS_PER_BYTE * sizeof(unsigned long));
> +       BUILD_BUG_ON((unsigned long)xa_mk_value(0) != SWP_TB_SHADOW_MARK);
> +
>         VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
> -       return (unsigned long)shadow;
> +       VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_COUNT_MASK));
> +
> +       return (unsigned long)shadow | __count_to_swp_tb(count) | SWP_TB_SHADOW_MARK;
>  }
>
>  /*
> @@ -59,7 +143,7 @@ static inline bool swp_tb_is_null(unsigned long swp_tb)
>
>  static inline bool swp_tb_is_folio(unsigned long swp_tb)
>  {
> -       return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
> +       return ((swp_tb & SWP_TB_PFN_MARK_MASK) == SWP_TB_PFN_MARK);
>  }
>
>  static inline bool swp_tb_is_shadow(unsigned long swp_tb)
> @@ -67,19 +151,44 @@ static inline bool swp_tb_is_shadow(unsigned long swp_tb)
>         return xa_is_value((void *)swp_tb);
>  }
>
> +static inline bool swp_tb_is_bad(unsigned long swp_tb)
> +{
> +       return swp_tb == SWP_TB_BAD;
> +}
> +
> +static inline bool swp_tb_is_countable(unsigned long swp_tb)
> +{
> +       return (swp_tb_is_shadow(swp_tb) || swp_tb_is_folio(swp_tb) ||
> +               swp_tb_is_null(swp_tb));
> +}
> +
>  /*
>   * Helpers for retrieving info from swap table.
>   */
>  static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
>  {
>         VM_WARN_ON(!swp_tb_is_folio(swp_tb));
> -       return (void *)swp_tb;
> +       return pfn_folio((swp_tb & ~SWP_TB_COUNT_MASK) >> SWP_TB_PFN_MARK_BITS);
>  }
>
>  static inline void *swp_tb_to_shadow(unsigned long swp_tb)
>  {
>         VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
> -       return (void *)swp_tb;
> +       /* No shift needed, xa_value is stored as it is in the lower bits. */
> +       return (void *)(swp_tb & ~SWP_TB_COUNT_MASK);
> +}
> +
> +static inline unsigned char __swp_tb_get_count(unsigned long swp_tb)
> +{
> +       VM_WARN_ON(!swp_tb_is_countable(swp_tb));
> +       return ((swp_tb & SWP_TB_COUNT_MASK) >> SWP_TB_COUNT_SHIFT);
> +}
> +
> +static inline int swp_tb_get_count(unsigned long swp_tb)
> +{
> +       if (swp_tb_is_countable(swp_tb))
> +               return __swp_tb_get_count(swp_tb);
> +       return -EINVAL;
>  }
>
>  /*
> @@ -124,6 +233,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
>         atomic_long_t *table;
>         unsigned long swp_tb;
>
> +       VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> +
>         rcu_read_lock();
>         table = rcu_dereference(ci->table);
>         swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
>
> --
> 2.52.0
>
>
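For reference, a worked example of the entry layout above, assuming
BITS_PER_LONG == 64 and the default 4 count bits (the concrete values
are illustrative only, not taken from the patch):

	/* Pack pfn 0x12345 as a cached (PFN) slot with a swap count of 3. */
	unsigned long swp_tb;

	swp_tb  = (0x12345UL << 2) | 0x2UL;	/* PFN plus the 0b10 mark */
	swp_tb |= 3UL << (64 - 4);		/* SWAP_COUNT in the top 4 bits */
	/* swp_tb == 0x3000000000048d16: count 3, pfn 0x12345, mark 0b10 */

On a 32-bit build with a large physical address space (e.g. ARC with
40-bit PAE), SWP_TB_PFN_BITS approaches BITS_PER_LONG and the min() in
SWP_TB_COUNT_BITS shrinks the count field, with the BUILD_BUG_ON keeping
it at two bits minimum.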


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 07/12] mm, swap: mark bad slots in swap table directly
  2026-02-17 20:06 ` [PATCH v3 07/12] mm, swap: mark bad slots in swap table directly Kairui Song via B4 Relay
@ 2026-02-19  7:01   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  7:01 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> In preparation for deprecating swap_map, mark bad slots in the swap table
> too when setting SWAP_MAP_BAD in swap_map. Also, refine the swap table
> sanity check on freeing to adapt to the bad slot change. For swapoff,
> the bad slot count must match the cluster usage count, since nothing
> should touch bad slots and they contribute to the cluster usage count on
> swapon. For ordinary swap table freeing, the swap table of a cluster
> with bad slots should never be freed, since its usage count never
> reaches zero.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Chris Li <chrisl@kernel.org>

Chris

> ---
>  mm/swapfile.c | 56 +++++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 41 insertions(+), 15 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 91c1fa804185..18bacf16cd26 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -454,16 +454,37 @@ static void swap_table_free(struct swap_table *table)
>                  swap_table_free_folio_rcu_cb);
>  }
>
> +/*
> + * Sanity check to ensure nothing leaked, and the specified range is empty.
> + * One special case is that bad slots can't be freed, so check the number of
> + * bad slots for swapoff, and non-swapoff path must never free bad slots.
> + */
> +static void swap_cluster_assert_empty(struct swap_cluster_info *ci, bool swapoff)
> +{
> +       unsigned int ci_off = 0, ci_end = SWAPFILE_CLUSTER;
> +       unsigned long swp_tb;
> +       int bad_slots = 0;
> +
> +       if (!IS_ENABLED(CONFIG_DEBUG_VM) && !swapoff)
> +               return;
> +
> +       do {
> +               swp_tb = __swap_table_get(ci, ci_off);
> +               if (swp_tb_is_bad(swp_tb))
> +                       bad_slots++;
> +               else
> +                       WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
> +       } while (++ci_off < ci_end);
> +
> +       WARN_ON_ONCE(bad_slots != (swapoff ? ci->count : 0));
> +}
> +
>  static void swap_cluster_free_table(struct swap_cluster_info *ci)
>  {
> -       unsigned int ci_off;
>         struct swap_table *table;
>
>         /* Only empty cluster's table is allow to be freed  */
>         lockdep_assert_held(&ci->lock);
> -       VM_WARN_ON_ONCE(!cluster_is_empty(ci));
> -       for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
> -               VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
>         table = (void *)rcu_dereference_protected(ci->table, true);
>         rcu_assign_pointer(ci->table, NULL);
>
> @@ -567,6 +588,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> +       swap_cluster_assert_empty(ci, false);
>         swap_cluster_free_table(ci);
>         move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
>         ci->order = 0;
> @@ -747,9 +769,11 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
>                                        struct swap_cluster_info *cluster_info,
>                                        unsigned int offset, bool mask)
>  {
> +       unsigned int ci_off = offset % SWAPFILE_CLUSTER;
>         unsigned long idx = offset / SWAPFILE_CLUSTER;
> -       struct swap_table *table;
>         struct swap_cluster_info *ci;
> +       struct swap_table *table;
> +       int ret = 0;
>
>         /* si->max may got shrunk by swap swap_activate() */
>         if (offset >= si->max && !mask) {
> @@ -767,13 +791,7 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
>                 pr_warn("Empty swap-file\n");
>                 return -EINVAL;
>         }
> -       /* Check for duplicated bad swap slots. */
> -       if (si->swap_map[offset]) {
> -               pr_warn("Duplicated bad slot offset %d\n", offset);
> -               return -EINVAL;
> -       }
>
> -       si->swap_map[offset] = SWAP_MAP_BAD;
>         ci = cluster_info + idx;
>         if (!ci->table) {
>                 table = swap_table_alloc(GFP_KERNEL);
> @@ -781,13 +799,21 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
>                         return -ENOMEM;
>                 rcu_assign_pointer(ci->table, table);
>         }
> -
> -       ci->count++;
> +       spin_lock(&ci->lock);
> +       /* Check for duplicated bad swap slots. */
> +       if (__swap_table_xchg(ci, ci_off, SWP_TB_BAD) != SWP_TB_NULL) {
> +               pr_warn("Duplicated bad slot offset %d\n", offset);
> +               ret = -EINVAL;
> +       } else {
> +               si->swap_map[offset] = SWAP_MAP_BAD;
> +               ci->count++;
> +       }
> +       spin_unlock(&ci->lock);
>
>         WARN_ON(ci->count > SWAPFILE_CLUSTER);
>         WARN_ON(ci->flags);
>
> -       return 0;
> +       return ret;
>  }
>
>  /*
> @@ -2754,7 +2780,7 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
>                 /* Cluster with bad marks count will have a remaining table */
>                 spin_lock(&ci->lock);
>                 if (rcu_dereference_protected(ci->table, true)) {
> -                       ci->count = 0;
> +                       swap_cluster_assert_empty(ci, true);
>                         swap_cluster_free_table(ci);
>                 }
>                 spin_unlock(&ci->lock);
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 08/12] mm, swap: simplify swap table sanity range check
  2026-02-17 20:06 ` [PATCH v3 08/12] mm, swap: simplify swap table sanity range check Kairui Song via B4 Relay
@ 2026-02-19  7:02   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  7:02 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The newly introduced helper, which checks both bad slots and the
> emptiness of a cluster, covers the older sanity check just fine with a
> more rigorous condition. So merge them.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Chris Li <chrisl@kernel.org>

Chris

> ---
>  mm/swapfile.c | 35 +++++++++--------------------------
>  1 file changed, 9 insertions(+), 26 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 18bacf16cd26..9057fb3e4eed 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -459,9 +459,11 @@ static void swap_table_free(struct swap_table *table)
>   * One special case is that bad slots can't be freed, so check the number of
>   * bad slots for swapoff, and non-swapoff path must never free bad slots.
>   */
> -static void swap_cluster_assert_empty(struct swap_cluster_info *ci, bool swapoff)
> +static void swap_cluster_assert_empty(struct swap_cluster_info *ci,
> +                                     unsigned int ci_off, unsigned int nr,
> +                                     bool swapoff)
>  {
> -       unsigned int ci_off = 0, ci_end = SWAPFILE_CLUSTER;
> +       unsigned int ci_end = ci_off + nr;
>         unsigned long swp_tb;
>         int bad_slots = 0;
>
> @@ -588,7 +590,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> -       swap_cluster_assert_empty(ci, false);
> +       swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, false);
>         swap_cluster_free_table(ci);
>         move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
>         ci->order = 0;
> @@ -898,26 +900,6 @@ static bool cluster_scan_range(struct swap_info_struct *si,
>         return true;
>  }
>
> -/*
> - * Currently, the swap table is not used for count tracking, just
> - * do a sanity check here to ensure nothing leaked, so the swap
> - * table should be empty upon freeing.
> - */
> -static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
> -                               unsigned int start, unsigned int nr)
> -{
> -       unsigned int ci_off = start % SWAPFILE_CLUSTER;
> -       unsigned int ci_end = ci_off + nr;
> -       unsigned long swp_tb;
> -
> -       if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> -               do {
> -                       swp_tb = __swap_table_get(ci, ci_off);
> -                       VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
> -               } while (++ci_off < ci_end);
> -       }
> -}
> -
>  static bool cluster_alloc_range(struct swap_info_struct *si,
>                                 struct swap_cluster_info *ci,
>                                 struct folio *folio,
> @@ -943,13 +925,14 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
>         if (likely(folio)) {
>                 order = folio_order(folio);
>                 nr_pages = 1 << order;
> +               swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, nr_pages, false);
>                 __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
>         } else if (IS_ENABLED(CONFIG_HIBERNATION)) {
>                 order = 0;
>                 nr_pages = 1;
>                 WARN_ON_ONCE(si->swap_map[offset]);
>                 si->swap_map[offset] = 1;
> -               swap_cluster_assert_table_empty(ci, offset, 1);
> +               swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, 1, false);
>         } else {
>                 /* Allocation without folio is only possible with hibernation */
>                 WARN_ON_ONCE(1);
> @@ -1768,7 +1751,7 @@ void swap_entries_free(struct swap_info_struct *si,
>
>         mem_cgroup_uncharge_swap(entry, nr_pages);
>         swap_range_free(si, offset, nr_pages);
> -       swap_cluster_assert_table_empty(ci, offset, nr_pages);
> +       swap_cluster_assert_empty(ci, offset % SWAPFILE_CLUSTER, nr_pages, false);
>
>         if (!ci->count)
>                 free_cluster(si, ci);
> @@ -2780,7 +2763,7 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
>                 /* Cluster with bad marks count will have a remaining table */
>                 spin_lock(&ci->lock);
>                 if (rcu_dereference_protected(ci->table, true)) {
> -                       swap_cluster_assert_empty(ci, true);
> +                       swap_cluster_assert_empty(ci, 0, SWAPFILE_CLUSTER, true);
>                         swap_cluster_free_table(ci);
>                 }
>                 spin_unlock(&ci->lock);
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 09/12] mm, swap: use the swap table to track the swap count
  2026-02-18 12:22     ` Kairui Song
@ 2026-02-19  7:06       ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  7:06 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, kernel test robot, Kairui Song via B4 Relay, llvm,
	oe-kbuild-all, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel, Kairui Song

On Wed, Feb 18, 2026 at 4:23 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Feb 18, 2026 at 06:40:16PM +0800, kernel test robot wrote:
> > Hi Kairui,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on d9982f38eb6e9a0cb6bdd1116cc87f75a1084aad]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/Kairui-Song-via-B4-Relay/mm-swap-protect-si-swap_file-properly-and-use-as-a-mount-indicator/20260218-040852
> > base:   d9982f38eb6e9a0cb6bdd1116cc87f75a1084aad
> > patch link:    https://lore.kernel.org/r/20260218-swap-table-p3-v3-9-f4e34be021a7%40tencent.com
> > patch subject: [PATCH v3 09/12] mm, swap: use the swap table to track the swap count
> > config: i386-buildonly-randconfig-001-20260218 (https://download.01.org/0day-ci/archive/20260218/202602181835.58TEynxc-lkp@intel.com/config)
> > compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260218/202602181835.58TEynxc-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202602181835.58TEynxc-lkp@intel.com/
> >
> > All warnings (new ones prefixed by >>):
> >
> > >> mm/swapfile.c:1627:6: warning: shift count >= width of type [-Wshift-count-overflow]
> >     1626 |                 VM_WARN_ON_ONCE(ci->extend_table[ci_off] >=
> >          |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >     1627 |                                 (BIT(BITS_PER_TYPE(ci->extend_table[0]))) - 1);
> >          |                                 ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >    include/vdso/bits.h:7:26: note: expanded from macro 'BIT'
> >        7 | #define BIT(nr)                 (UL(1) << (nr))
> >          |                                        ^
> >    include/linux/mmdebug.h:123:50: note: expanded from macro 'VM_WARN_ON_ONCE'
> >      123 | #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
> >          |                                     ~~~~~~~~~~~~~^~~~~
> >    include/asm-generic/bug.h:120:25: note: expanded from macro 'WARN_ON_ONCE'
> >      120 |         int __ret_warn_on = !!(condition);                              \
> >          |                                ^~~~~~~~~
> >    1 warning generated.
>
> Nice catch from the bot. It's a newly added sanity check in v3 for the
> case where the swap count grows larger than UINT_MAX, which should
> never happen, but better safe than sorry.
>
> I really should just use the existing helper macro for that:

With this fixup applied:

Acked-by: Chris Li <chrisl@kernel.org>

Chris

>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 801d8092be51..34b38255f72a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1624,7 +1624,7 @@ static int __swap_cluster_dup_entry(struct swap_cluster_info *ci,
>                 }
>         } else if (count == SWP_TB_COUNT_MAX) {
>                 VM_WARN_ON_ONCE(ci->extend_table[ci_off] >=
> -                               (BIT(BITS_PER_TYPE(ci->extend_table[0]))) - 1);
> +                               type_max(typeof(ci->extend_table[0])));
>                 ++ci->extend_table[ci_off];
>         } else {
>                 /* Never happens unless counting went wrong */
>
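As an aside on why the warning fired: the extend_table entries are
unsigned int in v3, so the old check expands to roughly this
(illustrative expansion, not the exact kernel macros):

	/*
	 * BITS_PER_TYPE(unsigned int) == 32, so on i386 BIT(32) becomes
	 * UL(1) << 32 -- a shift by the full width of unsigned long,
	 * which is exactly what -Wshift-count-overflow complains about.
	 */
	VM_WARN_ON_ONCE(ci->extend_table[ci_off] >= (1UL << 32) - 1);

type_max(typeof(ci->extend_table[0])) evaluates to the type's maximum
value without any full-width shift, so the check keeps its intent and
the warning goes away.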


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 10/12] mm, swap: no need to truncate the scan border
  2026-02-17 20:06 ` [PATCH v3 10/12] mm, swap: no need to truncate the scan border Kairui Song via B4 Relay
@ 2026-02-19  7:10   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  7:10 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> swap_map was statically sized to match the device, so the last cluster
> might not be fully covered, and the allocator had to check the scan
> border to avoid going out of bounds. But now each cluster has a
> fixed-size swap table, and the slots beyond the device size are marked
> as bad slots. The allocator can simply scan all slots as usual, and any
> bad slots will be skipped.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Chris Li <chrisl@kernel.org>

Chris

> ---
>  mm/swap.h     | 2 +-
>  mm/swapfile.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 0a91e21e92b1..cc410b94e91a 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -85,7 +85,7 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
>                 struct swap_info_struct *si, pgoff_t offset)
>  {
>         VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> -       VM_WARN_ON_ONCE(offset >= si->max);
> +       VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
>         return &si->cluster_info[offset / SWAPFILE_CLUSTER];
>  }
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 801d8092be51..df2b88c6c67b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -945,8 +945,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  {
>         unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
>         unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> -       unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
>         unsigned int order = likely(folio) ? folio_order(folio) : 0;
> +       unsigned long end = start + SWAPFILE_CLUSTER;
>         unsigned int nr_pages = 1 << order;
>         bool need_reclaim, ret, usable;
>
>
> --
> 2.52.0
>
>
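A quick worked example of the relaxed bound (the numbers are made up;
the real SWAPFILE_CLUSTER depends on the configuration):

	si->max = 1000, SWAPFILE_CLUSTER = 512
	roundup(si->max, SWAPFILE_CLUSTER) = 1024

Offsets 0..999 are usable, while 1000..1023 sit in the last cluster and
are marked as bad slots at swapon. The allocator may now legitimately
look at them and skip them, so the sanity check has to accept any offset
below the cluster-aligned 1024 rather than below si->max.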


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 11/12] mm, swap: simplify checking if a folio is swapped
  2026-02-17 20:06 ` [PATCH v3 11/12] mm, swap: simplify checking if a folio is swapped Kairui Song via B4 Relay
@ 2026-02-19  7:18   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  7:18 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Clean up and simplify how we check if a folio is swapped. The helper
> already requires the folio to be locked and in the swap cache. That's
> enough to keep the swap cluster from being freed, so there is no need to
> take any other lock to avoid a UAF.
>
> Besides, swap operations are now defined to be mostly folio based, and
> the only place a folio can have any of its swap slots' count increased
> from 0 to 1 is folio_dup_swap, which also requires the folio lock.
>
> So while we hold the folio lock here, a folio can't change its swap
> status from not swapped (all swap slots have a count of 0) to swapped
> (some slot has a swap count larger than 0).
>
> So there won't be any false negatives from this helper if we simply
> depend on the folio lock to stabilize the cluster.
>
> We only use this helper to determine whether we can and should release
> the swap cache, so false positives are completely harmless, and they
> were already possible before: depending on the timing, a racing thread
> could release the swap count right after the ci lock was dropped and
> before this helper returned. In any case, the worst that can happen is
> that we leave a clean swap cache folio behind, and it will still be
> reclaimed under pressure just fine.
>
> In conclusion, the check can be made much simpler and lockless. Also,
> rename it to folio_maybe_swapped to reflect the design.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

> ---
>  mm/swap.h     |  5 ++--
>  mm/swapfile.c | 82 ++++++++++++++++++++++++++++++++---------------------------
>  2 files changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index cc410b94e91a..9728e6a944b2 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -195,12 +195,13 @@ extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
>   *
>   * folio_alloc_swap(): the entry point for a folio to be swapped
>   * out. It allocates swap slots and pins the slots with swap cache.
> - * The slots start with a swap count of zero.
> + * The slots start with a swap count of zero. The slots are pinned
> + * by swap cache reference which doesn't contribute to swap count.
>   *
>   * folio_dup_swap(): increases the swap count of a folio, usually
>   * during it gets unmapped and a swap entry is installed to replace
>   * it (e.g., swap entry in page table). A swap slot with swap
> - * count == 0 should only be increasd by this helper.
> + * count == 0 can only be increased by this helper.
>   *
>   * folio_put_swap(): does the opposite thing of folio_dup_swap().
>   */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index df2b88c6c67b..dab5e726855b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1743,7 +1743,11 @@ int folio_alloc_swap(struct folio *folio)
>   * @subpage: if not NULL, only increase the swap count of this subpage.
>   *
>   * Typically called when the folio is unmapped and have its swap entry to
> - * take its palce.
> + * take its place: Swap entries allocated to a folio has count == 0 and pinned
> + * by swap cache. The swap cache pin doesn't increase the swap count. This
> + * helper sets the initial count == 1 and increases the count as the folio is
> + * unmapped and swap entries referencing the slots are generated to replace
> + * the folio.
>   *
>   * Context: Caller must ensure the folio is locked and in the swap cache.
>   * NOTE: The caller also has to ensure there is no raced call to
> @@ -1942,49 +1946,44 @@ int swp_swapcount(swp_entry_t entry)
>         return count < 0 ? 0 : count;
>  }
>
> -static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
> -                                        swp_entry_t entry, int order)
> +/*
> + * folio_maybe_swapped - Test if a folio covers any swap slot with count > 0.
> + *
> + * Check if a folio is swapped. Holding the folio lock ensures the folio won't
> + * go from not-swapped to swapped because the initial swap count increment can
> + * only be done by folio_dup_swap, which also locks the folio. But a concurrent
> + * decrease of swap count is possible through swap_put_entries_direct, so this
> + * may return a false positive.
> + *
> + * Context: Caller must ensure the folio is locked and in the swap cache.
> + */
> +static bool folio_maybe_swapped(struct folio *folio)
>  {
> +       swp_entry_t entry = folio->swap;
>         struct swap_cluster_info *ci;
> -       unsigned int nr_pages = 1 << order;
> -       unsigned long roffset = swp_offset(entry);
> -       unsigned long offset = round_down(roffset, nr_pages);
> -       unsigned int ci_off;
> -       int i;
> +       unsigned int ci_off, ci_end;
>         bool ret = false;
>
> -       ci = swap_cluster_lock(si, offset);
> -       if (nr_pages == 1) {
> -               ci_off = roffset % SWAPFILE_CLUSTER;
> -               if (swp_tb_get_count(__swap_table_get(ci, ci_off)))
> -                       ret = true;
> -               goto unlock_out;
> -       }
> -       for (i = 0; i < nr_pages; i++) {
> -               ci_off = (offset + i) % SWAPFILE_CLUSTER;
> -               if (swp_tb_get_count(__swap_table_get(ci, ci_off))) {
> -                       ret = true;
> -                       break;
> -               }
> -       }
> -unlock_out:
> -       swap_cluster_unlock(ci);
> -       return ret;
> -}
> -
> -static bool folio_swapped(struct folio *folio)
> -{
> -       swp_entry_t entry = folio->swap;
> -       struct swap_info_struct *si;
> -
>         VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
>         VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
>
> -       si = __swap_entry_to_info(entry);
> -       if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
> -               return swap_entry_swapped(si, entry);
> +       ci = __swap_entry_to_cluster(entry);
> +       ci_off = swp_cluster_offset(entry);
> +       ci_end = ci_off + folio_nr_pages(folio);
> +       /*
> +        * Extra locking not needed, folio lock ensures its swap entries
> +        * won't be released, the backing data won't be gone either.
> +        */
> +       rcu_read_lock();
> +       do {
> +               if (__swp_tb_get_count(__swap_table_get(ci, ci_off))) {
> +                       ret = true;
> +                       break;
> +               }
> +       } while (++ci_off < ci_end);
> +       rcu_read_unlock();
>
> -       return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
> +       return ret;
>  }
>
>  static bool folio_swapcache_freeable(struct folio *folio)
> @@ -2030,7 +2029,7 @@ bool folio_free_swap(struct folio *folio)
>  {
>         if (!folio_swapcache_freeable(folio))
>                 return false;
> -       if (folio_swapped(folio))
> +       if (folio_maybe_swapped(folio))
>                 return false;
>
>         swap_cache_del_folio(folio);
> @@ -3719,6 +3718,8 @@ void si_swapinfo(struct sysinfo *val)
>   *
>   * Context: Caller must ensure there is no race condition on the reference
>   * owner. e.g., locking the PTL of a PTE containing the entry being increased.
> + * Also the swap entry must have a count >= 1. Otherwise folio_dup_swap should
> + * be used.
>   */
>  int swap_dup_entry_direct(swp_entry_t entry)
>  {
> @@ -3730,6 +3731,13 @@ int swap_dup_entry_direct(swp_entry_t entry)
>                 return -EINVAL;
>         }
>
> +       /*
> +        * The caller must be increasing the swap count from a direct
> +        * reference of the swap slot (e.g. a swap entry in page table).
> +        * So the swap count must be >= 1.
> +        */
> +       VM_WARN_ON_ONCE(!swap_entry_swapped(si, entry));
> +
>         return swap_dup_entries_cluster(si, swp_offset(entry), 1);
>  }
>
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 12/12] mm, swap: no need to clear the shadow explicitly
  2026-02-17 20:06 ` [PATCH v3 12/12] mm, swap: no need to clear the shadow explicitly Kairui Song via B4 Relay
@ 2026-02-19  7:19   ` Chris Li
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Li @ 2026-02-19  7:19 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Johannes Weiner, David Hildenbrand, Lorenzo Stoakes,
	Youngjun Park, linux-kernel

On Tue, Feb 17, 2026 at 12:06 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Since we no longer bypass the swap cache, every swap-in clears the swap
> shadow by inserting the folio into the swap table. The only place that
> may still seem to need explicit shadow clearing is when swap slots are
> freed directly without a folio (swap_put_entries_direct). But with the
> swap table, that is not needed either: freeing a slot sets the table
> entry to NULL, which erases the shadow just fine.
>
> So just delete all explicit shadow clearing; it's no longer needed.
> Also, rearrange the freeing accordingly.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Acked-by: Chris Li <chrisl@kernel.org>

Chris

> ---
>  mm/swap.h       |  1 -
>  mm/swap_state.c | 21 ---------------------
>  mm/swapfile.c   |  2 --
>  3 files changed, 24 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 9728e6a944b2..a77016f2423b 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -290,7 +290,6 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci,
>                             struct folio *folio, swp_entry_t entry, void *shadow);
>  void __swap_cache_replace_folio(struct swap_cluster_info *ci,
>                                 struct folio *old, struct folio *new);
> -void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
>
>  void show_swap_cache_info(void);
>  void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e7618ffe6d70..32d9d877bda8 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -350,27 +350,6 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
>         }
>  }
>
> -/**
> - * __swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
> - * @entry: The starting index entry.
> - * @nr_ents: How many slots need to be cleared.
> - *
> - * Context: Caller must ensure the range is valid, all in one single cluster,
> - * not occupied by any folio, and lock the cluster.
> - */
> -void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
> -{
> -       struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
> -       unsigned int ci_off = swp_cluster_offset(entry), ci_end;
> -       unsigned long old;
> -
> -       ci_end = ci_off + nr_ents;
> -       do {
> -               old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
> -               WARN_ON_ONCE(swp_tb_is_folio(old) || swp_tb_get_count(old));
> -       } while (++ci_off < ci_end);
> -}
> -
>  /*
>   * If we are the only user, then try to free up the swap cache.
>   *
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index dab5e726855b..802efa37b33f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1287,7 +1287,6 @@ static void swap_range_alloc(struct swap_info_struct *si,
>  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>                             unsigned int nr_entries)
>  {
> -       unsigned long begin = offset;
>         unsigned long end = offset + nr_entries - 1;
>         void (*swap_slot_free_notify)(struct block_device *, unsigned long);
>         unsigned int i;
> @@ -1312,7 +1311,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>                         swap_slot_free_notify(si->bdev, offset);
>                 offset++;
>         }
> -       __swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
>
>         /*
>          * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
>
> --
> 2.52.0
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2026-02-19  7:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-17 20:06 [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song via B4 Relay
2026-02-17 20:06 ` [PATCH v3 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator Kairui Song via B4 Relay
2026-02-19  6:36   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 02/12] mm, swap: clean up swapon process and locking Kairui Song via B4 Relay
2026-02-19  6:45   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 03/12] mm, swap: remove redundant arguments and locking for enabling a device Kairui Song via B4 Relay
2026-02-19  6:48   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 04/12] mm, swap: consolidate bad slots setup and make it more robust Kairui Song via B4 Relay
2026-02-19  6:51   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 05/12] mm/workingset: leave highest bits empty for anon shadow Kairui Song via B4 Relay
2026-02-19  6:56   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 06/12] mm, swap: implement helpers for reserving data in the swap table Kairui Song via B4 Relay
2026-02-19  7:00   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 07/12] mm, swap: mark bad slots in swap table directly Kairui Song via B4 Relay
2026-02-19  7:01   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 08/12] mm, swap: simplify swap table sanity range check Kairui Song via B4 Relay
2026-02-19  7:02   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 09/12] mm, swap: use the swap table to track the swap count Kairui Song via B4 Relay
2026-02-18 10:40   ` kernel test robot
2026-02-18 12:22     ` Kairui Song
2026-02-19  7:06       ` Chris Li
2026-02-17 20:06 ` [PATCH v3 10/12] mm, swap: no need to truncate the scan border Kairui Song via B4 Relay
2026-02-19  7:10   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 11/12] mm, swap: simplify checking if a folio is swapped Kairui Song via B4 Relay
2026-02-19  7:18   ` Chris Li
2026-02-17 20:06 ` [PATCH v3 12/12] mm, swap: no need to clear the shadow explicitly Kairui Song via B4 Relay
2026-02-19  7:19   ` Chris Li
2026-02-17 20:10 ` [PATCH v3 00/12] mm, swap: swap table phase III: remove swap_map Kairui Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox