linux-mm.kvack.org archive mirror
* [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I)
@ 2025-09-05 19:13 Kairui Song
  2025-09-05 19:13 ` [PATCH v2 01/15] docs/mm: add document for swap table Kairui Song
                   ` (14 more replies)
  0 siblings, 15 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This is the first phase of the bigger series implementing the basic
infrastructure for the Swap Table idea proposed at the LSF/MM/BPF
topic "Integrate swap cache, swap maps with swap allocator" [1].
To give credit where it is due, this is based on Chris Li's
idea and a prototype of using cluster-sized atomic arrays to
implement the swap cache.

Phase I contains 15 patches: it introduces the swap table infrastructure
and uses it as the swap cache backend. This brings an up to ~5-20%
performance gain in throughput, RPS, or build time for the benchmark and
workload tests below. The speedup comes from lower contention on swap
cache access and a shallower swap cache lookup path. The per-cluster
granularity is also much finer than the 64M address space split, which
is removed in this phase. It also unifies and cleans up the swap code base.

Each swap cluster dynamically allocates its swap table, an atomic array
covering every swap slot in the cluster. It replaces the XArray-backed
swap cache. In phase I, the statically allocated swap_map still co-exists
with the swap table, so the average memory usage is about the same as
before; a few exceptional test cases show about 1% higher memory usage.
In the following phases of the series, swap_map will be merged into the
swap table without additional memory allocation, resulting in a net memory
reduction compared to the original swap cache.
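
To illustrate the idea, below is a minimal, self-contained userspace sketch
of the concept (not the kernel implementation; struct cluster and
CLUSTER_SLOTS are made up for illustration). Each cluster owns a lazily
allocated array of atomic slots, and a cache lookup becomes a single array
read instead of a tree walk:

    /* Conceptual sketch only -- not kernel code. */
    #include <stdatomic.h>
    #include <stdlib.h>

    #define CLUSTER_SLOTS 512      /* one cluster covers 512 swap slots */

    struct cluster {
            _Atomic(void *) *table;        /* lazily allocated swap table */
    };

    /* Lookup: index directly into the per-cluster array, no tree walk. */
    static void *cluster_cache_lookup(struct cluster *c, unsigned int slot)
    {
            if (!c->table)
                    return NULL;   /* this cluster has no cached folios */
            return atomic_load(&c->table[slot % CLUSTER_SLOTS]);
    }

    /* Store: an atomic write, done under the cluster lock (omitted here). */
    static void cluster_cache_store(struct cluster *c, unsigned int slot,
                                    void *val)
    {
            if (!c->table)
                    c->table = calloc(CLUSTER_SLOTS, sizeof(*c->table));
            atomic_store(&c->table[slot % CLUSTER_SLOTS], val);
    }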

Testing shows that phase I brings a significant performance improvement,
on machines ranging from an 8c/1G ARM box to 48c96t/128G x86_64 servers,
in many practical workloads.

The full picture, with a summary, can be found at [2]. An older, larger
series of 28 patches was posted at [3].

vm-scalability test:
==================
Test with:
usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap)
                           Before:         After:
System time:               219.12s         158.16s        (-27.82%)
Sum Throughput:            4767.13 MB/s    6128.59 MB/s   (+28.55%)
Single process Throughput: 150.21 MB/s     196.52 MB/s    (+30.83%)
Free latency:              175047.58 us    131411.87 us   (-24.92%)

usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
PMEM as swap)
                           Before:         After:
System time:               356.16s         284.68s      (-20.06%)
Sum Throughput:            4648.35 MB/s    5453.52 MB/s (+17.32%)
Single process Throughput: 141.63 MB/s     168.35 MB/s  (+18.86%)
Free latency:              499907.71 us    484977.03 us (-2.99%)

This shows an improvement of more than 20% in most readings.

Build kernel test:
==================
The following result matrix is from building the kernel with defconfig on
tmpfs with ZSWAP / ZRAM, using different memory pressures and setups.
Sys and real time are measured in seconds; less is better
(user time is almost identical, as expected):

 -j<NR> / Mem  | Sys before / after  | Real before / after
Using 16G ZRAM with memcg limit:
     6  / 192M | 9686 / 9472  -2.21% | 2130  / 2096   -1.59%
     12 / 256M | 6610 / 6451  -2.41% |  827  /  812   -1.81%
     24 / 384M | 5938 / 5701  -3.37% |  414  /  405   -2.17%
     48 / 768M | 4696 / 4409  -6.11% |  188  /  182   -3.19%
With 64k folio:
     24 / 512M | 4222 / 4162  -1.42% |  326  /  321   -1.53%
     48 / 1G   | 3688 / 3622  -1.79% |  151  /  149   -1.32%
With ZSWAP with 3G memcg (using higher limit due to kmem account):
     48 / 3G   |  603 /  581  -3.65% |  81   /   80   -1.23%

Testing under extremely high global memory and scheduling pressure: using
ZSWAP with 32G NVMEs in a 48c VM that has 4G memory and no memcg limit,
where system components already take up about 1.5G, running make -j48 to
build defconfig:

Before:  sys time: 2069.53s            real time: 135.76s
After:   sys time: 2021.13s (-2.34%)   real time: 134.23s (-1.12%)

On another 48c VM with 4G memory, using 16G ZRAM as swap, testing make
-j48 with the same config:

Before:  sys time: 1756.96s            real time: 111.01s
After:   sys time: 1715.90s (-2.34%)   real time: 109.51s (-1.35%)

All cases are faster to some degree, with no regression even under
extremely heavy global memory pressure.

Redis / Valkey bench:
=====================
The test machine is an ARM64 VM with 1536M memory and 12 cores. Redis
is set to use 2500M memory, and the ZRAM swapfile size is set to 5G:

Testing with:
redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get

                no BGSAVE                with BGSAVE
Before:         487576.06 RPS            280016.02 RPS
After:          487541.76 RPS (-0.01%)   300155.32 RPS (+7.19%)

Testing with:
redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get
                no BGSAVE                with BGSAVE
Before:         466789.59 RPS            281213.92 RPS
After:          466402.89 RPS (-0.08%)   298411.84 RPS (+6.12%)

With BGSAVE enabled, most Redis memory will have a swap count > 1, so the
swap cache is heavily used, and we see about a 6% performance gain. The
no-BGSAVE case is very slightly slower (<0.1%) due to the higher memory
pressure from the co-existence of swap_map and the swap table. This will
be optimized into a net gain, with up to a 20% gain in the BGSAVE case,
in the following phases.

HDD swap is also about 40% faster in the usemem test because an old
contention workaround has been removed.

Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]

Suggested-by: Chris Li <chrisl@kernel.org>

Chris Li (1):
  docs/mm: add document for swap table

Kairui Song (14):
  mm, swap: use unified helper for swap cache look up
  mm, swap: fix swap cache index error when retrying reclaim
  mm, swap: check page poison flag after locking it
  mm, swap: always lock and check the swap cache folio before use
  mm, swap: rename and move some swap cluster definition and helpers
  mm, swap: tidy up swap device and cluster info helpers
  mm/shmem, swap: remove redundant error handling for replacing folio
  mm, swap: cleanup swap cache API and add kerneldoc
  mm, swap: wrap swap cache replacement with a helper
  mm, swap: use the swap table for the swap cache and switch API
  mm, swap: mark swap address space ro and add context debug check
  mm, swap: remove contention workaround for swap cache
  mm, swap: implement dynamic allocation of swap table
  mm, swap: use a single page for swap table when the size fits

 Documentation/mm/swap-table.rst |  72 +++++
 MAINTAINERS                     |   2 +
 include/linux/swap.h            |  42 ---
 mm/filemap.c                    |   2 +-
 mm/huge_memory.c                |  16 +-
 mm/memory-failure.c             |   2 +-
 mm/memory.c                     |  27 +-
 mm/migrate.c                    |  28 +-
 mm/mincore.c                    |   3 +-
 mm/page_io.c                    |  12 +-
 mm/shmem.c                      |  58 ++--
 mm/swap.h                       | 307 ++++++++++++++++++---
 mm/swap_state.c                 | 447 +++++++++++++++++--------------
 mm/swap_table.h                 | 130 +++++++++
 mm/swapfile.c                   | 455 +++++++++++++++++++++-----------
 mm/userfaultfd.c                |   5 +-
 mm/vmscan.c                     |  20 +-
 mm/zswap.c                      |   9 +-
 18 files changed, 1103 insertions(+), 534 deletions(-)
 create mode 100644 Documentation/mm/swap-table.rst
 create mode 100644 mm/swap_table.h

-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-05 23:58   ` Chris Li
  2025-09-08 12:35   ` Baoquan He
  2025-09-05 19:13 ` [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up Kairui Song
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

From: Chris Li <chrisl@kernel.org>

Swap table is the new swap cache.

Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
 MAINTAINERS                     |  1 +
 2 files changed, 73 insertions(+)
 create mode 100644 Documentation/mm/swap-table.rst

diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
new file mode 100644
index 000000000000..929cd91aa984
--- /dev/null
+++ b/Documentation/mm/swap-table.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
+
+==========
+Swap Table
+==========
+
+Swap table implements the swap cache as a per-cluster swap cache value array.
+
+Swap Entry
+----------
+
+A swap entry contains the information required to serve an anonymous page
+fault.
+
+Swap entry is encoded as two parts: swap type and swap offset.
+
+The swap type indicates which swap device to use.
+The swap offset is the offset of the swap file to read the page data from.
+
+Swap Cache
+----------
+
+Swap cache is a map for looking up folios using a swap entry as the key. The
+result value can have three possible types, depending on which stage this
+swap entry is in.
+
+1. NULL: This swap entry is not used.
+
+2. folio: A folio has been allocated and bound to this swap entry. This is
+   a transient state during swap-out or swap-in. The page data can be in
+   the folio, in the swap file, or both.
+
+3. shadow: The shadow contains the working set information of the swapped-out
+   folio. This is the normal state for a swapped-out page.
+
+Swap Table
+----------
+
+The previous swap cache was implemented with an XArray. The XArray is a tree
+structure, so each lookup has to walk through multiple nodes. Can we do better?
+
+Notice that most of the time when we look up the swap cache, we are either
+in the swap-in or the swap-out path, where we should already have the swap
+cluster, which contains the swap entry.
+
+If we have a per-cluster array storing the swap cache values of that cluster,
+a swap cache lookup within the cluster becomes a very simple array lookup.
+
+We give such a per-cluster swap cache value array a name: the swap table.
+
+Each swap cluster contains 512 entries, so a swap table stores one cluster's
+worth of swap cache values, which is exactly one page. This is no
+coincidence, because the cluster size is determined by the huge page size.
+The swap table holds an array of pointers, and each pointer has the same
+size as a PTE, so the swap table matches the size of the second-to-last
+level page table page: exactly one page.
+
+With the swap table, swap cache lookup achieves better locality and is
+both simpler and faster.
+
+Locking
+-------
+
+Swap table modification requires taking the cluster lock. If a folio
+is being added to or removed from the swap table, the folio must be
+locked prior to the cluster lock. After adding or removing is done, the
+folio shall be unlocked.
+
+Swap table lookup is protected by RCU and atomic reads. If the lookup
+returns a folio, the user must lock the folio before use.
diff --git a/MAINTAINERS b/MAINTAINERS
index ec19be6c9917..1c8292c0318d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16219,6 +16219,7 @@ R:	Barry Song <baohua@kernel.org>
 R:	Chris Li <chrisl@kernel.org>
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	Documentation/mm/swap-table.rst
 F:	include/linux/swap.h
 F:	include/linux/swapfile.h
 F:	include/linux/swapops.h
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
  2025-09-05 19:13 ` [PATCH v2 01/15] docs/mm: add document for swap table Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-05 23:59   ` Chris Li
  2025-09-08 11:43   ` David Hildenbrand
  2025-09-05 19:13 ` [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim Kairui Song
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The swap cache lookup helper swap_cache_get_folio currently does
readahead updates as well, so callers that are not doing swapin from any
VMA or mapping are forced to reuse filemap helpers instead, and have to
access the swap cache space directly.

So decouple the readahead update from the swap cache lookup. Move the
readahead update part into a standalone helper, and let callers invoke
that helper if they do readahead. Then convert all swap cache lookups
to use swap_cache_get_folio.
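
For example, a condensed view of the do_swap_page() conversion in the diff
below (error handling and surrounding logic omitted), showing the new
calling pattern:

    folio = swap_cache_get_folio(entry);
    if (folio)
            /* Only swapin paths that do readahead need this call. */
            swap_update_readahead(folio, vma, vmf->address);

Callers that never do readahead simply skip the update helper.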

After this commit, there are only three special cases left that access the
swap cache space directly: huge memory splitting, migration, and shmem
replacement, because they need to lock the XArray. The following commits
will wrap their swap cache accesses with special helpers too.

Worth noting: dropbehind is currently not supported for anon folios, so
we will never see a dropbehind folio in the swap cache. The unified
helper can be updated later to handle that.

While at it, add proper kerneldoc for the touched helpers.

No functional change.

Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
 mm/memory.c      |   6 ++-
 mm/mincore.c     |   3 +-
 mm/shmem.c       |   4 +-
 mm/swap.h        |  13 ++++--
 mm/swap_state.c  | 109 +++++++++++++++++++++++++----------------------
 mm/swapfile.c    |  11 +++--
 mm/userfaultfd.c |   5 +--
 7 files changed, 81 insertions(+), 70 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..10ef528a5f44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(!si))
 		goto out;
 
-	folio = swap_cache_get_folio(entry, vma, vmf->address);
-	if (folio)
+	folio = swap_cache_get_folio(entry);
+	if (folio) {
+		swap_update_readahead(folio, vma, vmf->address);
 		page = folio_file_page(folio, swp_offset(entry));
+	}
 	swapcache = folio;
 
 	if (!folio) {
diff --git a/mm/mincore.c b/mm/mincore.c
index 2f3e1816a30d..8ec4719370e1 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
 		if (!si)
 			return 0;
 	}
-	folio = filemap_get_entry(swap_address_space(entry),
-				  swap_cache_index(entry));
+	folio = swap_cache_get_folio(entry);
 	if (shmem)
 		put_swap_device(si);
 	/* The swap cache space contains either folio, shadow or NULL */
diff --git a/mm/shmem.c b/mm/shmem.c
index 2df26f4d6e60..4e27e8e5da3b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 
 	/* Look it up and read it in.. */
-	folio = swap_cache_get_folio(swap, NULL, 0);
+	folio = swap_cache_get_folio(swap);
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			/* Direct swapin skipping swap cache & readahead */
@@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(fault_mm, PGMAJFAULT);
 		}
+	} else {
+		swap_update_readahead(folio, NULL, 0);
 	}
 
 	if (order > folio_order(folio)) {
diff --git a/mm/swap.h b/mm/swap.h
index 1ae44d4193b1..efb6d7ff9f30 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
 void clear_shadow_from_swap_cache(int type, unsigned long begin,
 				  unsigned long end);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr);
+struct folio *swap_cache_get_folio(swp_entry_t entry);
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
@@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 		struct mempolicy *mpol, pgoff_t ilx);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
 		struct vm_fault *vmf);
+void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
 	return NULL;
 }
 
+static inline void swap_update_readahead(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+}
+
 static inline int swap_writeout(struct folio *folio,
 		struct swap_iocb **swap_plug)
 {
@@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
 {
 }
 
-static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 99513b74b5d8..68ec531d0f2b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -69,6 +69,27 @@ void show_swap_cache_info(void)
 	printk("Total swap = %lukB\n", K(total_swap_pages));
 }
 
+/**
+ * swap_cache_get_folio - Looks up a folio in the swap cache.
+ * @entry: swap entry used for the lookup.
+ *
+ * A found folio will be returned unlocked and with its refcount increased.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap device
+ * with reference count or locks.
+ * Return: Returns the found folio on success, NULL otherwise. The caller
+ * must lock and check if the folio still matches the swap entry before
+ * use.
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry)
+{
+	struct folio *folio = filemap_get_folio(swap_address_space(entry),
+						swap_cache_index(entry));
+	if (IS_ERR(folio))
+		return NULL;
+	return folio;
+}
+
 void *get_shadow_from_swap_cache(swp_entry_t entry)
 {
 	struct address_space *address_space = swap_address_space(entry);
@@ -272,55 +293,43 @@ static inline bool swap_use_vma_readahead(void)
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
 }
 
-/*
- * Lookup a swap entry in the swap cache. A found folio will be returned
- * unlocked and with its refcount incremented - we rely on the kernel
- * lock getting page table operations atomic even if we drop the folio
- * lock before returning.
- *
- * Caller must lock the swap device or hold a reference to keep it valid.
+/**
+ * swap_update_readahead - Update the readahead statistics of VMA or globally.
+ * @folio: the swap cache folio that just got hit.
+ * @vma: the VMA that should be updated, could be NULL for global update.
+ * @addr: the addr that triggered the swapin, ignored if @vma is NULL.
  */
-struct folio *swap_cache_get_folio(swp_entry_t entry,
-		struct vm_area_struct *vma, unsigned long addr)
+void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
+			   unsigned long addr)
 {
-	struct folio *folio;
-
-	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
-	if (!IS_ERR(folio)) {
-		bool vma_ra = swap_use_vma_readahead();
-		bool readahead;
+	bool readahead, vma_ra = swap_use_vma_readahead();
 
-		/*
-		 * At the moment, we don't support PG_readahead for anon THP
-		 * so let's bail out rather than confusing the readahead stat.
-		 */
-		if (unlikely(folio_test_large(folio)))
-			return folio;
-
-		readahead = folio_test_clear_readahead(folio);
-		if (vma && vma_ra) {
-			unsigned long ra_val;
-			int win, hits;
-
-			ra_val = GET_SWAP_RA_VAL(vma);
-			win = SWAP_RA_WIN(ra_val);
-			hits = SWAP_RA_HITS(ra_val);
-			if (readahead)
-				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
-			atomic_long_set(&vma->swap_readahead_info,
-					SWAP_RA_VAL(addr, win, hits));
-		}
-
-		if (readahead) {
-			count_vm_event(SWAP_RA_HIT);
-			if (!vma || !vma_ra)
-				atomic_inc(&swapin_readahead_hits);
-		}
-	} else {
-		folio = NULL;
+	/*
+	 * At the moment, we don't support PG_readahead for anon THP
+	 * so let's bail out rather than confusing the readahead stat.
+	 */
+	if (unlikely(folio_test_large(folio)))
+		return;
+
+	readahead = folio_test_clear_readahead(folio);
+	if (vma && vma_ra) {
+		unsigned long ra_val;
+		int win, hits;
+
+		ra_val = GET_SWAP_RA_VAL(vma);
+		win = SWAP_RA_WIN(ra_val);
+		hits = SWAP_RA_HITS(ra_val);
+		if (readahead)
+			hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+		atomic_long_set(&vma->swap_readahead_info,
+				SWAP_RA_VAL(addr, win, hits));
 	}
 
-	return folio;
+	if (readahead) {
+		count_vm_event(SWAP_RA_HIT);
+		if (!vma || !vma_ra)
+			atomic_inc(&swapin_readahead_hits);
+	}
 }
 
 struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
@@ -336,14 +345,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	*new_page_allocated = false;
 	for (;;) {
 		int err;
-		/*
-		 * First check the swap cache.  Since this is normally
-		 * called after swap_cache_get_folio() failed, re-calling
-		 * that would confuse statistics.
-		 */
-		folio = filemap_get_folio(swap_address_space(entry),
-					  swap_cache_index(entry));
-		if (!IS_ERR(folio))
+
+		/* Check the swap cache in case the folio is already there */
+		folio = swap_cache_get_folio(entry);
+		if (folio)
 			goto got_folio;
 
 		/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7ffabbe65ef..4b8ab2cb49ca 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 				 unsigned long offset, unsigned long flags)
 {
 	swp_entry_t entry = swp_entry(si->type, offset);
-	struct address_space *address_space = swap_address_space(entry);
 	struct swap_cluster_info *ci;
 	struct folio *folio;
 	int ret, nr_pages;
 	bool need_reclaim;
 
 again:
-	folio = filemap_get_folio(address_space, swap_cache_index(entry));
-	if (IS_ERR(folio))
+	folio = swap_cache_get_folio(entry);
+	if (!folio)
 		return 0;
 
 	nr_pages = folio_nr_pages(folio);
@@ -2131,7 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		pte_unmap(pte);
 		pte = NULL;
 
-		folio = swap_cache_get_folio(entry, vma, addr);
+		folio = swap_cache_get_folio(entry);
 		if (!folio) {
 			struct vm_fault vmf = {
 				.vma = vma,
@@ -2357,8 +2356,8 @@ static int try_to_unuse(unsigned int type)
 	       (i = find_next_to_unuse(si, i)) != 0) {
 
 		entry = swp_entry(type, i);
-		folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
-		if (IS_ERR(folio))
+		folio = swap_cache_get_folio(entry);
+		if (!folio)
 			continue;
 
 		/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 50aaa8dcd24c..af61b95c89e4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1489,9 +1489,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
 		 * separately to allow proper handling.
 		 */
 		if (!src_folio)
-			folio = filemap_get_folio(swap_address_space(entry),
-					swap_cache_index(entry));
-		if (!IS_ERR_OR_NULL(folio)) {
+			folio = swap_cache_get_folio(entry);
+		if (folio) {
 			if (folio_test_large(folio)) {
 				ret = -EBUSY;
 				folio_put(folio);
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
  2025-09-05 19:13 ` [PATCH v2 01/15] docs/mm: add document for swap table Kairui Song
  2025-09-05 19:13 ` [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-05 22:40   ` Nhat Pham
                     ` (3 more replies)
  2025-09-05 19:13 ` [PATCH v2 04/15] mm, swap: check page poison flag after locking it Kairui Song
                   ` (11 subsequent siblings)
  14 siblings, 4 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The allocator reclaims cached slots while scanning. Currently, it tries
again if the reclaim finds a folio that has already been removed from the
swap cache due to a race, but the retried lookup uses the wrong index.
It won't cause any OOB issue since the swap cache index is truncated upon
lookup, but it may lead to reclaiming an irrelevant folio.

This should not cause a measurable issue, but we should fix it.

Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4b8ab2cb49ca..4c63fc62f4cb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -240,13 +240,13 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * Offset could point to the middle of a large folio, or folio
 	 * may no longer point to the expected offset before it's locked.
 	 */
-	entry = folio->swap;
-	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
+	if (offset < swp_offset(folio->swap) ||
+	    offset >= swp_offset(folio->swap) + nr_pages) {
 		folio_unlock(folio);
 		folio_put(folio);
 		goto again;
 	}
-	offset = swp_offset(entry);
+	offset = swp_offset(folio->swap);
 
 	need_reclaim = ((flags & TTRS_ANYWAY) ||
 			((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 04/15] mm, swap: check page poison flag after locking it
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (2 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06  2:00   ` Chris Li
  2025-09-08 12:11   ` David Hildenbrand
  2025-09-05 19:13 ` [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use Kairui Song
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Instead of checking the poison flag only in the fast swap cache lookup
path, always check the poison flag after locking a swap cache folio.

There are two reasons to do so.

The folio is unstable and could be removed from the swap cache at any
time, so it's entirely possible that the folio is no longer the backing
folio of the swap entry and is instead an irrelevant poisoned folio. We
might then mistakenly kill a faulting process.

It's also entirely possible, or even common, for the slow swapin path
(swapin_readahead) to bring in a cached folio. That cached folio could be
poisoned, too. Checking the poison flag only in the fast path will miss
such folios.

The race window is tiny, so it's very unlikely to happen, though.
While at it, also add an unlikely() annotation.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 10ef528a5f44..94a5928e8ace 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4661,10 +4661,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out;
 
 	folio = swap_cache_get_folio(entry);
-	if (folio) {
+	if (folio)
 		swap_update_readahead(folio, vma, vmf->address);
-		page = folio_file_page(folio, swp_offset(entry));
-	}
 	swapcache = folio;
 
 	if (!folio) {
@@ -4735,20 +4733,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
-		page = folio_file_page(folio, swp_offset(entry));
-	} else if (PageHWPoison(page)) {
-		/*
-		 * hwpoisoned dirty swapcache pages are kept for killing
-		 * owner processes (which may be unknown at hwpoison time)
-		 */
-		ret = VM_FAULT_HWPOISON;
-		goto out_release;
 	}
 
 	ret |= folio_lock_or_retry(folio, vmf);
 	if (ret & VM_FAULT_RETRY)
 		goto out_release;
 
+	page = folio_file_page(folio, swp_offset(entry));
 	if (swapcache) {
 		/*
 		 * Make sure folio_free_swap() or swapoff did not release the
@@ -4761,6 +4752,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			     page_swap_entry(page).val != entry.val))
 			goto out_page;
 
+		if (unlikely(PageHWPoison(page))) {
+			/*
+			 * hwpoisoned dirty swapcache pages are kept for killing
+			 * owner processes (which may be unknown at hwpoison time)
+			 */
+			ret = VM_FAULT_HWPOISON;
+			goto out_page;
+		}
+
 		/*
 		 * KSM sometimes has to copy on read faults, for example, if
 		 * folio->index of non-ksm folios would be nonlinear inside the
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (3 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 04/15] mm, swap: check page poison flag after locking it Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06  2:12   ` Chris Li
  2025-09-08 12:18   ` David Hildenbrand
  2025-09-05 19:13 ` [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Swap cache lookup only increases the reference count of the returned
folio. That's not enough to ensure a folio is stable in the swap
cache, so the folio could be removed from the swap cache at any
time. The caller should always lock and check the folio before using it.

We have just documented this in kerneldoc; now introduce a helper for
verifying a swap cache folio, with proper sanity checks.

Also, convert a few current users to this convention and the new helper
for easier debugging. They had no observable problems yet, only trivial
issues like wasted CPU cycles on swapoff or reclaim, and they would have
failed in some other harmless way, but it is still better to always follow
this convention to keep things robust and to make later commits easier.
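
In short, the convention is (a condensed sketch matching the
unuse_pte_range() change in the diff below; error paths shortened):

    folio = swap_cache_get_folio(entry);    /* only grabs a reference */
    if (!folio)
            return;
    folio_lock(folio);
    if (!folio_matches_swap_entry(folio, entry)) {
            /* Raced with removal from the swap cache, not our folio. */
            folio_unlock(folio);
            folio_put(folio);
            return;
    }
    /* The folio is now stable in the swap cache and safe to use. */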

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c     |  3 +--
 mm/swap.h       | 24 ++++++++++++++++++++++++
 mm/swap_state.c |  7 +++++--
 mm/swapfile.c   | 10 +++++++---
 4 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 94a5928e8ace..5808c4ef21b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4748,8 +4748,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		 * swapcache, we need to check that the page's swap has not
 		 * changed.
 		 */
-		if (unlikely(!folio_test_swapcache(folio) ||
-			     page_swap_entry(page).val != entry.val))
+		if (unlikely(!folio_matches_swap_entry(folio, entry)))
 			goto out_page;
 
 		if (unlikely(PageHWPoison(page))) {
diff --git a/mm/swap.h b/mm/swap.h
index efb6d7ff9f30..a69e18b12b45 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -52,6 +52,25 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
 	return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
 }
 
+/**
+ * folio_matches_swap_entry - Check if a folio matches a given swap entry.
+ * @folio: The folio.
+ * @entry: The swap entry to check against.
+ *
+ * Context: The caller should have the folio locked to ensure it's stable
+ * and nothing will move it in or out of the swap cache.
+ * Return: true or false.
+ */
+static inline bool folio_matches_swap_entry(const struct folio *folio,
+					    swp_entry_t entry)
+{
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	if (!folio_test_swapcache(folio))
+		return false;
+	VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, folio_nr_pages(folio)), folio);
+	return folio->swap.val == round_down(entry.val, folio_nr_pages(folio));
+}
+
 void show_swap_cache_info(void);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
 int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
@@ -144,6 +163,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
 	return 0;
 }
 
+static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry_t entry)
+{
+	return false;
+}
+
 static inline void show_swap_cache_info(void)
 {
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 68ec531d0f2b..9225d6b695ad 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -79,7 +79,7 @@ void show_swap_cache_info(void)
  * with reference count or locks.
  * Return: Returns the found folio on success, NULL otherwise. The caller
  * must lock and check if the folio still matches the swap entry before
- * use.
+ * use (e.g. with folio_matches_swap_entry).
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
@@ -346,7 +346,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	for (;;) {
 		int err;
 
-		/* Check the swap cache in case the folio is already there */
+		/*
+		 * Check the swap cache first, if a cached folio is found,
+		 * return it unlocked. The caller will lock and check it.
+		 */
 		folio = swap_cache_get_folio(entry);
 		if (folio)
 			goto got_folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4c63fc62f4cb..1bd90f17440f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -240,14 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * Offset could point to the middle of a large folio, or folio
 	 * may no longer point to the expected offset before it's locked.
 	 */
-	if (offset < swp_offset(folio->swap) ||
-	    offset >= swp_offset(folio->swap) + nr_pages) {
+	if (!folio_matches_swap_entry(folio, entry)) {
 		folio_unlock(folio);
 		folio_put(folio);
 		goto again;
 	}
 	offset = swp_offset(folio->swap);
-
 	need_reclaim = ((flags & TTRS_ANYWAY) ||
 			((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
 			((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
@@ -2150,6 +2148,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		}
 
 		folio_lock(folio);
+		if (!folio_matches_swap_entry(folio, entry)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			continue;
+		}
+
 		folio_wait_writeback(folio);
 		ret = unuse_pte(vma, pmd, addr, entry, folio);
 		if (ret < 0) {
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (4 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06  2:13   ` Chris Li
  2025-09-08  3:03   ` Baolin Wang
  2025-09-05 19:13 ` [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers Kairui Song
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

No feature change. Move the cluster-related definitions and helpers to
mm/swap.h, tidy them up, and add a "swap_" prefix to the cluster
lock/unlock helpers so they can be used outside of the swap files. While
at it, add kerneldoc.
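
Callers outside swapfile.c can then follow the usual pattern (a condensed
sketch based on the __try_to_reclaim_swap() hunk below):

    struct swap_cluster_info *ci;

    ci = swap_cluster_lock(si, offset);
    /* Cluster fields and its swap_map[] entries are stable here. */
    need_reclaim = swap_only_has_cache(si, offset, nr_pages);
    swap_cluster_unlock(ci);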

Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swap.h | 34 ----------------
 mm/swap.h            | 70 ++++++++++++++++++++++++++++++++
 mm/swapfile.c        | 97 +++++++++++++-------------------------------
 3 files changed, 99 insertions(+), 102 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 23452f014ca1..7e1fe4ff3d30 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -235,40 +235,6 @@ enum {
 /* Special value in each swap_map continuation */
 #define SWAP_CONT_MAX	0x7f	/* Max count */
 
-/*
- * We use this to track usage of a cluster. A cluster is a block of swap disk
- * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
- * free clusters are organized into a list. We fetch an entry from the list to
- * get a free cluster.
- *
- * The flags field determines if a cluster is free. This is
- * protected by cluster lock.
- */
-struct swap_cluster_info {
-	spinlock_t lock;	/*
-				 * Protect swap_cluster_info fields
-				 * other than list, and swap_info_struct->swap_map
-				 * elements corresponding to the swap cluster.
-				 */
-	u16 count;
-	u8 flags;
-	u8 order;
-	struct list_head list;
-};
-
-/* All on-list cluster must have a non-zero flag. */
-enum swap_cluster_flags {
-	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
-	CLUSTER_FLAG_FREE,
-	CLUSTER_FLAG_NONFULL,
-	CLUSTER_FLAG_FRAG,
-	/* Clusters with flags above are allocatable */
-	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
-	CLUSTER_FLAG_FULL,
-	CLUSTER_FLAG_DISCARD,
-	CLUSTER_FLAG_MAX,
-};
-
 /*
  * The first page in the swap file is the swap header, which is always marked
  * bad to prevent it from being allocated as an entry. This also prevents the
diff --git a/mm/swap.h b/mm/swap.h
index a69e18b12b45..39b27337bc0a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -7,10 +7,80 @@ struct swap_iocb;
 
 extern int page_cluster;
 
+#ifdef CONFIG_THP_SWAP
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#define swap_entry_order(order)	(order)
+#else
+#define SWAPFILE_CLUSTER	256
+#define swap_entry_order(order)	0
+#endif
+
+/*
+ * We use this to track usage of a cluster. A cluster is a block of swap disk
+ * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
+ * free clusters are organized into a list. We fetch an entry from the list to
+ * get a free cluster.
+ *
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
+ */
+struct swap_cluster_info {
+	spinlock_t lock;	/*
+				 * Protect swap_cluster_info fields
+				 * other than list, and swap_info_struct->swap_map
+				 * elements corresponding to the swap cluster.
+				 */
+	u16 count;
+	u8 flags;
+	u8 order;
+	struct list_head list;
+};
+
+/* All on-list cluster must have a non-zero flag. */
+enum swap_cluster_flags {
+	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
+	CLUSTER_FLAG_FREE,
+	CLUSTER_FLAG_NONFULL,
+	CLUSTER_FLAG_FRAG,
+	/* Clusters with flags above are allocatable */
+	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
+	CLUSTER_FLAG_FULL,
+	CLUSTER_FLAG_DISCARD,
+	CLUSTER_FLAG_MAX,
+};
+
 #ifdef CONFIG_SWAP
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
+static inline struct swap_cluster_info *swp_offset_cluster(
+		struct swap_info_struct *si, pgoff_t offset)
+{
+	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+}
+
+/**
+ * swap_cluster_lock - Lock and return the swap cluster of given offset.
+ * @si: swap device the cluster belongs to.
+ * @offset: the swap entry offset, pointing to a valid slot.
+ *
+ * Context: The caller must ensure the offset is in the valid range and
+ * protect the swap device with reference count or locks.
+ */
+static inline struct swap_cluster_info *swap_cluster_lock(
+		struct swap_info_struct *si, unsigned long offset)
+{
+	struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+
+	spin_lock(&ci->lock);
+	return ci;
+}
+
+static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
+{
+	spin_unlock(&ci->lock);
+}
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1bd90f17440f..547ad4bfe1d8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,9 +58,6 @@ static void swap_entries_free(struct swap_info_struct *si,
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
-					      unsigned long offset);
-static inline void unlock_cluster(struct swap_cluster_info *ci);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -257,9 +254,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * swap_map is HAS_CACHE only, which means the slots have no page table
 	 * reference or pending writeback, and can't be allocated to others.
 	 */
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 	need_reclaim = swap_only_has_cache(si, offset, nr_pages);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	if (!need_reclaim)
 		goto out_unlock;
 
@@ -384,19 +381,6 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
-#ifdef CONFIG_THP_SWAP
-#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
-
-#define swap_entry_order(order)	(order)
-#else
-#define SWAPFILE_CLUSTER	256
-
-/*
- * Define swap_entry_order() as constant to let compiler to optimize
- * out some code if !CONFIG_THP_SWAP
- */
-#define swap_entry_order(order)	0
-#endif
 #define LATENCY_LIMIT		256
 
 static inline bool cluster_is_empty(struct swap_cluster_info *info)
@@ -424,34 +408,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
 	return ci - si->cluster_info;
 }
 
-static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
-							  unsigned long offset)
-{
-	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
-}
-
 static inline unsigned int cluster_offset(struct swap_info_struct *si,
 					  struct swap_cluster_info *ci)
 {
 	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
 }
 
-static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
-						     unsigned long offset)
-{
-	struct swap_cluster_info *ci;
-
-	ci = offset_to_cluster(si, offset);
-	spin_lock(&ci->lock);
-
-	return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
-	spin_unlock(&ci->lock);
-}
-
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
 			 enum swap_cluster_flags new_flags)
@@ -807,7 +769,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 	}
 out:
 	relocate_cluster(si, ci);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	if (si->flags & SWP_SOLIDSTATE) {
 		this_cpu_write(percpu_swap_cluster.offset[order], next);
 		this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -874,7 +836,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 		if (ci->flags == CLUSTER_FLAG_NONE)
 			relocate_cluster(si, ci);
 
-		unlock_cluster(ci);
+		swap_cluster_unlock(ci);
 		if (to_scan <= 0)
 			break;
 	}
@@ -913,7 +875,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		if (offset == SWAP_ENTRY_INVALID)
 			goto new_cluster;
 
-		ci = lock_cluster(si, offset);
+		ci = swap_cluster_lock(si, offset);
 		/* Cluster could have been used by another order */
 		if (cluster_is_usable(ci, order)) {
 			if (cluster_is_empty(ci))
@@ -921,7 +883,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			found = alloc_swap_scan_cluster(si, ci, offset,
 							order, usage);
 		} else {
-			unlock_cluster(ci);
+			swap_cluster_unlock(ci);
 		}
 		if (found)
 			goto done;
@@ -1202,7 +1164,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
 	if (!si || !offset || !get_swap_device_info(si))
 		return false;
 
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 	if (cluster_is_usable(ci, order)) {
 		if (cluster_is_empty(ci))
 			offset = cluster_offset(si, ci);
@@ -1210,7 +1172,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
 		if (found)
 			*entry = swp_entry(si->type, found);
 	} else {
-		unlock_cluster(ci);
+		swap_cluster_unlock(ci);
 	}
 
 	put_swap_device(si);
@@ -1478,14 +1440,14 @@ static void swap_entries_put_cache(struct swap_info_struct *si,
 	unsigned long offset = swp_offset(entry);
 	struct swap_cluster_info *ci;
 
-	ci = lock_cluster(si, offset);
-	if (swap_only_has_cache(si, offset, nr))
+	ci = swap_cluster_lock(si, offset);
+	if (swap_only_has_cache(si, offset, nr)) {
 		swap_entries_free(si, ci, entry, nr);
-	else {
+	} else {
 		for (int i = 0; i < nr; i++, entry.val++)
 			swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
 	}
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 }
 
 static bool swap_entries_put_map(struct swap_info_struct *si,
@@ -1503,7 +1465,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
 	if (count != 1 && count != SWAP_MAP_SHMEM)
 		goto fallback;
 
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 	if (!swap_is_last_map(si, offset, nr, &has_cache)) {
 		goto locked_fallback;
 	}
@@ -1512,21 +1474,20 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
 	else
 		for (i = 0; i < nr; i++)
 			WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 
 	return has_cache;
 
 fallback:
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 locked_fallback:
 	for (i = 0; i < nr; i++, entry.val++) {
 		count = swap_entry_put_locked(si, ci, entry, 1);
 		if (count == SWAP_HAS_CACHE)
 			has_cache = true;
 	}
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	return has_cache;
-
 }
 
 /*
@@ -1576,7 +1537,7 @@ static void swap_entries_free(struct swap_info_struct *si,
 	unsigned char *map_end = map + nr_pages;
 
 	/* It should never free entries across different clusters */
-	VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
+	VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
 	VM_BUG_ON(cluster_is_empty(ci));
 	VM_BUG_ON(ci->count < nr_pages);
 
@@ -1651,9 +1612,9 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
 	struct swap_cluster_info *ci;
 	int count;
 
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 	count = swap_count(si->swap_map[offset]);
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	return !!count;
 }
 
@@ -1676,7 +1637,7 @@ int swp_swapcount(swp_entry_t entry)
 
 	offset = swp_offset(entry);
 
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 
 	count = swap_count(si->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
@@ -1699,7 +1660,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	return count;
 }
 
@@ -1714,7 +1675,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 	int i;
 	bool ret = false;
 
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 	if (nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
@@ -1727,7 +1688,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 		}
 	}
 unlock_out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	return ret;
 }
 
@@ -2660,8 +2621,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	BUG_ON(si->flags & SWP_WRITEOK);
 
 	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
-		ci = lock_cluster(si, offset);
-		unlock_cluster(ci);
+		ci = swap_cluster_lock(si, offset);
+		swap_cluster_unlock(ci);
 	}
 }
 
@@ -3577,7 +3538,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
 	VM_WARN_ON(usage == 1 && nr > 1);
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 
 	err = 0;
 	for (i = 0; i < nr; i++) {
@@ -3632,7 +3593,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	}
 
 unlock_out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	return err;
 }
 
@@ -3731,7 +3692,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 
 	offset = swp_offset(entry);
 
-	ci = lock_cluster(si, offset);
+	ci = swap_cluster_lock(si, offset);
 
 	count = swap_count(si->swap_map[offset]);
 
@@ -3791,7 +3752,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 out_unlock_cont:
 	spin_unlock(&si->cont_lock);
 out:
-	unlock_cluster(ci);
+	swap_cluster_unlock(ci);
 	put_swap_device(si);
 outer:
 	if (page)
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (5 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06  2:14   ` Chris Li
  2025-09-08 12:21   ` David Hildenbrand
  2025-09-05 19:13 ` [PATCH v2 08/15] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

swp_swap_info is the most commonly used helper for retrieving swap info.
It has an internal check that may lead to a NULL return value, but
almost none of its callers check the return value, making the internal
check pointless. In fact, most of these callers have already ensured the
entry is valid and never expect a NULL value.

Tidy this up and shorten the name. If the caller can make sure the
swap entry/type is valid and the device is pinned, use the newly introduced
__swap_entry_to_info/__swap_type_to_info instead. They have more debug
sanity checks and lower overhead, as they are inlined.

Callers that may expect a NULL value should use
swap_entry_to_info/swap_type_to_info instead.

No feature change. The rearranged code should have no effect; otherwise,
these callers would already have been hitting NULL dereference bugs. Only
some new sanity checks are added, so potential issues may show up in debug
builds.

The new helpers will be used frequently with the swap table later when
working with swap cache folios. A locked swap cache folio ensures the
entries are valid and stable, so these helpers are very helpful there.
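
For instance, with a locked swap cache folio (a condensed sketch mirroring
the page_io.c conversions below), the device can be resolved without a
NULL check:

    /* folio is locked and in the swap cache, so folio->swap is valid. */
    struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
    bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;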

Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
---
 include/linux/swap.h |  6 ------
 mm/page_io.c         | 12 ++++++------
 mm/swap.h            | 38 +++++++++++++++++++++++++++++++++-----
 mm/swap_state.c      |  4 ++--
 mm/swapfile.c        | 37 +++++++++++++++++++------------------
 5 files changed, 60 insertions(+), 37 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7e1fe4ff3d30..6db105383782 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -479,7 +479,6 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern int __swap_count(swp_entry_t entry);
 extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
-struct swap_info_struct *swp_swap_info(swp_entry_t entry);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
@@ -492,11 +491,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 }
 
 #else /* CONFIG_SWAP */
-static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
-	return NULL;
-}
-
 static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
 {
 	return NULL;
diff --git a/mm/page_io.c b/mm/page_io.c
index a2056a5ecb13..3c342db77ce3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,7 +204,7 @@ static bool is_folio_zero_filled(struct folio *folio)
 static void swap_zeromap_folio_set(struct folio *folio)
 {
 	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 	int nr_pages = folio_nr_pages(folio);
 	swp_entry_t entry;
 	unsigned int i;
@@ -223,7 +223,7 @@ static void swap_zeromap_folio_set(struct folio *folio)
 
 static void swap_zeromap_folio_clear(struct folio *folio)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 	swp_entry_t entry;
 	unsigned int i;
 
@@ -374,7 +374,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
 static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap_plug)
 {
 	struct swap_iocb *sio = swap_plug ? *swap_plug : NULL;
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 	struct file *swap_file = sis->swap_file;
 	loff_t pos = swap_dev_pos(folio->swap);
 
@@ -446,7 +446,7 @@ static void swap_writepage_bdev_async(struct folio *folio,
 
 void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 
 	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
 	/*
@@ -537,7 +537,7 @@ static bool swap_read_folio_zeromap(struct folio *folio)
 
 static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 	struct swap_iocb *sio = NULL;
 	loff_t pos = swap_dev_pos(folio->swap);
 
@@ -608,7 +608,7 @@ static void swap_read_folio_bdev_async(struct folio *folio,
 
 void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 	bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
 	bool workingset = folio_test_workingset(folio);
 	unsigned long pflags;
diff --git a/mm/swap.h b/mm/swap.h
index 39b27337bc0a..a65e72edb087 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -15,6 +15,8 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif
 
+extern struct swap_info_struct *swap_info[];
+
 /*
  * We use this to track usage of a cluster. A cluster is a block of swap disk
  * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
@@ -53,9 +55,29 @@ enum swap_cluster_flags {
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
-static inline struct swap_cluster_info *swp_offset_cluster(
+/*
+ * Callers of all helpers below must ensure the entry, type, or offset is
+ * valid, and protect the swap device with reference count or locks.
+ */
+static inline struct swap_info_struct *__swap_type_to_info(int type)
+{
+	struct swap_info_struct *si;
+
+	si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+	return si;
+}
+
+static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
+{
+	return __swap_type_to_info(swp_type(entry));
+}
+
+static inline struct swap_cluster_info *__swap_offset_to_cluster(
 		struct swap_info_struct *si, pgoff_t offset)
 {
+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+	VM_WARN_ON_ONCE(offset >= si->max);
 	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }
 
@@ -70,8 +92,9 @@ static inline struct swap_cluster_info *swp_offset_cluster(
 static inline struct swap_cluster_info *swap_cluster_lock(
 		struct swap_info_struct *si, unsigned long offset)
 {
-	struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
 
+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
 	spin_lock(&ci->lock);
 	return ci;
 }
@@ -167,7 +190,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
-	return swp_swap_info(folio->swap)->flags;
+	return __swap_entry_to_info(folio->swap)->flags;
 }
 
 /*
@@ -178,7 +201,7 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
 static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
 		bool *is_zeromap)
 {
-	struct swap_info_struct *sis = swp_swap_info(entry);
+	struct swap_info_struct *sis = __swap_entry_to_info(entry);
 	unsigned long start = swp_offset(entry);
 	unsigned long end = start + max_nr;
 	bool first_bit;
@@ -197,7 +220,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
 
 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 {
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	pgoff_t offset = swp_offset(entry);
 	int i;
 
@@ -216,6 +239,11 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 
 #else /* CONFIG_SWAP */
 struct swap_iocb;
+static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
+{
+	return NULL;
+}
+
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9225d6b695ad..0ad4f3b41f1b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -336,7 +336,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists)
 {
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct folio *folio;
 	struct folio *new_folio = NULL;
 	struct folio *result = NULL;
@@ -560,7 +560,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
 	unsigned long mask;
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
 	bool page_allocated;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 547ad4bfe1d8..367481d319cd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
 static struct plist_head *swap_avail_heads;
 static DEFINE_SPINLOCK(swap_avail_lock);
 
-static struct swap_info_struct *swap_info[MAX_SWAPFILES];
+struct swap_info_struct *swap_info[MAX_SWAPFILES];
 
 static DEFINE_MUTEX(swapon_mutex);
 
@@ -124,14 +124,20 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
 	.lock = INIT_LOCAL_LOCK(),
 };
 
-static struct swap_info_struct *swap_type_to_swap_info(int type)
+/* May return NULL on invalid type, caller must check for NULL return */
+static struct swap_info_struct *swap_type_to_info(int type)
 {
 	if (type >= MAX_SWAPFILES)
 		return NULL;
-
 	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
 }
 
+/* May return NULL on invalid entry, caller must check for NULL return */
+static struct swap_info_struct *swap_entry_to_info(swp_entry_t entry)
+{
+	return swap_type_to_info(swp_type(entry));
+}
+
 static inline unsigned char swap_count(unsigned char ent)
 {
 	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
@@ -341,7 +347,7 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
 
 sector_t swap_folio_sector(struct folio *folio)
 {
-	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 	struct swap_extent *se;
 	sector_t sector;
 	pgoff_t offset;
@@ -1299,7 +1305,7 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 
 	if (!entry.val)
 		goto out;
-	si = swp_swap_info(entry);
+	si = swap_entry_to_info(entry);
 	if (!si)
 		goto bad_nofile;
 	if (data_race(!(si->flags & SWP_USED)))
@@ -1414,7 +1420,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 
 	if (!entry.val)
 		goto out;
-	si = swp_swap_info(entry);
+	si = swap_entry_to_info(entry);
 	if (!si)
 		goto bad_nofile;
 	if (!get_swap_device_info(si))
@@ -1537,7 +1543,7 @@ static void swap_entries_free(struct swap_info_struct *si,
 	unsigned char *map_end = map + nr_pages;
 
 	/* It should never free entries across different clusters */
-	VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
+	VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
 	VM_BUG_ON(cluster_is_empty(ci));
 	VM_BUG_ON(ci->count < nr_pages);
 
@@ -1595,7 +1601,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 int __swap_count(swp_entry_t entry)
 {
-	struct swap_info_struct *si = swp_swap_info(entry);
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	pgoff_t offset = swp_offset(entry);
 
 	return swap_count(si->swap_map[offset]);
@@ -1826,7 +1832,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 
 swp_entry_t get_swap_page_of_type(int type)
 {
-	struct swap_info_struct *si = swap_type_to_swap_info(type);
+	struct swap_info_struct *si = swap_type_to_info(type);
 	unsigned long offset;
 	swp_entry_t entry = {0};
 
@@ -1907,7 +1913,7 @@ int find_first_swap(dev_t *device)
  */
 sector_t swapdev_block(int type, pgoff_t offset)
 {
-	struct swap_info_struct *si = swap_type_to_swap_info(type);
+	struct swap_info_struct *si = swap_type_to_info(type);
 	struct swap_extent *se;
 
 	if (!si || !(si->flags & SWP_WRITEOK))
@@ -2835,7 +2841,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
 	if (!l)
 		return SEQ_START_TOKEN;
 
-	for (type = 0; (si = swap_type_to_swap_info(type)); type++) {
+	for (type = 0; (si = swap_type_to_info(type)); type++) {
 		if (!(si->flags & SWP_USED) || !si->swap_map)
 			continue;
 		if (!--l)
@@ -2856,7 +2862,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
 		type = si->type + 1;
 
 	++(*pos);
-	for (; (si = swap_type_to_swap_info(type)); type++) {
+	for (; (si = swap_type_to_info(type)); type++) {
 		if (!(si->flags & SWP_USED) || !si->swap_map)
 			continue;
 		return si;
@@ -3529,7 +3535,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	unsigned char has_cache;
 	int err, i;
 
-	si = swp_swap_info(entry);
+	si = swap_entry_to_info(entry);
 	if (WARN_ON_ONCE(!si)) {
 		pr_err("%s%08lx\n", Bad_file, entry.val);
 		return -EINVAL;
@@ -3644,11 +3650,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
 	swap_entries_put_cache(si, entry, nr);
 }
 
-struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
-	return swap_type_to_swap_info(swp_type(entry));
-}
-
 /*
  * add_swap_count_continuation - called when a swap count is duplicated
  * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 08/15] mm/shmem, swap: remove redundant error handling for replacing folio
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (6 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-08  3:17   ` Baolin Wang
  2025-09-05 19:13 ` [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc Kairui Song
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Shmem may replace a folio in the swap cache if the cached one doesn't
fit the swapin's GFP zone. When doing so, shmem has already double
checked that the swap cache folio is locked, still has the swap cache
flag set, and contains the wanted swap entry. So it is impossible to
fail due to an XArray mismatch. There is even a comment for that.

Delete the defensive error handling path and add a WARN_ON instead:
if that ever happens, something has broken the basic principle of how
the swap cache works, and we should catch and fix it.

Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 mm/shmem.c | 42 ++++++++++++------------------------------
 1 file changed, 12 insertions(+), 30 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 4e27e8e5da3b..cc6a0007c7a6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1698,13 +1698,13 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		}
 
 		/*
-		 * The delete_from_swap_cache() below could be left for
+		 * The swap_cache_del_folio() below could be left for
 		 * shrink_folio_list()'s folio_free_swap() to dispose of;
 		 * but I'm a little nervous about letting this folio out of
 		 * shmem_writeout() in a hybrid half-tmpfs-half-swap state
 		 * e.g. folio_mapping(folio) might give an unexpected answer.
 		 */
-		delete_from_swap_cache(folio);
+		swap_cache_del_folio(folio);
 		goto redirty;
 	}
 	if (nr_pages > 1)
@@ -2082,7 +2082,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
 	new->swap = entry;
 
 	memcg1_swapin(entry, nr_pages);
-	shadow = get_shadow_from_swap_cache(entry);
+	shadow = swap_cache_get_shadow(entry);
 	if (shadow)
 		workingset_refault(new, shadow);
 	folio_add_lru(new);
@@ -2158,35 +2158,17 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	/* Swap cache still stores N entries instead of a high-order entry */
 	xa_lock_irq(&swap_mapping->i_pages);
 	for (i = 0; i < nr_pages; i++) {
-		void *item = xas_load(&xas);
-
-		if (item != old) {
-			error = -ENOENT;
-			break;
-		}
-
-		xas_store(&xas, new);
+		WARN_ON_ONCE(xas_store(&xas, new));
 		xas_next(&xas);
 	}
-	if (!error) {
-		mem_cgroup_replace_folio(old, new);
-		shmem_update_stats(new, nr_pages);
-		shmem_update_stats(old, -nr_pages);
-	}
 	xa_unlock_irq(&swap_mapping->i_pages);
 
-	if (unlikely(error)) {
-		/*
-		 * Is this possible?  I think not, now that our callers
-		 * check both the swapcache flag and folio->private
-		 * after getting the folio lock; but be defensive.
-		 * Reverse old to newpage for clear and free.
-		 */
-		old = new;
-	} else {
-		folio_add_lru(new);
-		*foliop = new;
-	}
+	mem_cgroup_replace_folio(old, new);
+	shmem_update_stats(new, nr_pages);
+	shmem_update_stats(old, -nr_pages);
+
+	folio_add_lru(new);
+	*foliop = new;
 
 	folio_clear_swapcache(old);
 	old->private = NULL;
@@ -2220,7 +2202,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
 	if (!skip_swapcache)
-		delete_from_swap_cache(folio);
+		swap_cache_del_folio(folio);
 	/*
 	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
 	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2459,7 +2441,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		folio->swap.val = 0;
 		swapcache_clear(si, swap, nr_pages);
 	} else {
-		delete_from_swap_cache(folio);
+		swap_cache_del_folio(folio);
 	}
 	folio_mark_dirty(folio);
 	swap_free_nr(swap, nr_pages);
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (7 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 08/15] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06  5:45   ` Chris Li
                     ` (3 more replies)
  2025-09-05 19:13 ` [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper Kairui Song
                   ` (5 subsequent siblings)
  14 siblings, 4 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

In preparation for replacing the swap cache backend with the swap table,
clean up and add proper kernel doc for all swap cache APIs. Now all swap
cache APIs are well-defined with consistent names.

No feature change, only renaming and documenting.
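
As an aside (not part of the diff below), a minimal usage sketch of the
renamed helpers, following the locking rules documented in the new
comment block in mm/swap.h. The function name is made up for
illustration; the helpers it calls are the real ones touched by this
series:

static struct folio *example_swapin_lookup(swp_entry_t entry)
{
	struct swap_info_struct *si;
	struct folio *folio = NULL;

	/* Pin the swap device so the entry stays valid across the lookup. */
	si = get_swap_device(entry);
	if (!si)
		return NULL;

	/* Lock-less lookup; returns a referenced folio or NULL. */
	folio = swap_cache_get_folio(entry);
	if (folio) {
		folio_lock(folio);
		/* The slot may have been freed and reused; re-check under lock. */
		if (!folio_matches_swap_entry(folio, entry)) {
			folio_unlock(folio);
			folio_put(folio);
			folio = NULL;
		}
	}
	put_swap_device(si);
	return folio;	/* locked and referenced, or NULL */
}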

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/filemap.c        |  2 +-
 mm/memory-failure.c |  2 +-
 mm/memory.c         |  2 +-
 mm/swap.h           | 48 ++++++++++++++-----------
 mm/swap_state.c     | 86 ++++++++++++++++++++++++++++++++-------------
 mm/swapfile.c       |  8 ++---
 mm/vmscan.c         |  2 +-
 mm/zswap.c          |  2 +-
 8 files changed, 98 insertions(+), 54 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 344ab106c21c..29ea56999a16 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4517,7 +4517,7 @@ static void filemap_cachestat(struct address_space *mapping,
 				 * invalidation, so there might not be
 				 * a shadow in the swapcache (yet).
 				 */
-				shadow = get_shadow_from_swap_cache(swp);
+				shadow = swap_cache_get_shadow(swp);
 				if (!shadow)
 					goto resched;
 			}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b93ab99ad3ef..922526533cd9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1199,7 +1199,7 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
 	struct folio *folio = page_folio(p);
 	int ret;
 
-	delete_from_swap_cache(folio);
+	swap_cache_del_folio(folio);
 
 	ret = delete_from_lru_cache(folio) ? MF_FAILED : MF_RECOVERED;
 	folio_unlock(folio);
diff --git a/mm/memory.c b/mm/memory.c
index 5808c4ef21b3..41e641823558 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4699,7 +4699,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 				memcg1_swapin(entry, nr_pages);
 
-				shadow = get_shadow_from_swap_cache(entry);
+				shadow = swap_cache_get_shadow(entry);
 				if (shadow)
 					workingset_refault(folio, shadow);
 
diff --git a/mm/swap.h b/mm/swap.h
index a65e72edb087..8b38577a4e04 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -164,17 +164,29 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
 	return folio->swap.val == round_down(entry.val, folio_nr_pages(folio));
 }
 
+/*
+ * All swap cache helpers below require the caller to ensure the swap entries
+ * used are valid and stabilize the device by any of the following ways:
+ * - Hold a reference by get_swap_device(): this ensures a single entry is
+ *   valid and increases the swap device's refcount.
+ * - Locking a folio in the swap cache: this ensures the folio's swap entries
+ *   are valid and pinned, also implies reference to the device.
+ * - Locking anything referencing the swap entry: e.g. PTL that protects
+ *   swap entries in the page table, similar to locking swap cache folio.
+ * - See the comment of get_swap_device() for more complex usage.
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry);
+void *swap_cache_get_shadow(swp_entry_t entry);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+			 gfp_t gfp, void **shadow);
+void swap_cache_del_folio(struct folio *folio);
+void __swap_cache_del_folio(struct folio *folio,
+			    swp_entry_t entry, void *shadow);
+void swap_cache_clear_shadow(int type, unsigned long begin,
+			     unsigned long end);
+
 void show_swap_cache_info(void);
-void *get_shadow_from_swap_cache(swp_entry_t entry);
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-		      gfp_t gfp, void **shadowp);
-void __delete_from_swap_cache(struct folio *folio,
-			      swp_entry_t entry, void *shadow);
-void delete_from_swap_cache(struct folio *folio);
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				  unsigned long end);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry);
 struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug);
@@ -302,28 +314,22 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
 	return NULL;
 }
 
-static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+static inline void *swap_cache_get_shadow(swp_entry_t entry)
 {
 	return NULL;
 }
 
-static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-					gfp_t gfp_mask, void **shadowp)
-{
-	return -1;
-}
-
-static inline void __delete_from_swap_cache(struct folio *folio,
-					swp_entry_t entry, void *shadow)
+static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
+				       gfp_t gfp, void **shadow)
 {
+	return -EINVAL;
 }
 
-static inline void delete_from_swap_cache(struct folio *folio)
+static inline void swap_cache_del_folio(struct folio *folio)
 {
 }
 
-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end)
+static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
 {
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0ad4f3b41f1b..f3a32a06a950 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -78,8 +78,8 @@ void show_swap_cache_info(void)
  * Context: Caller must ensure @entry is valid and protect the swap device
  * with reference count or locks.
  * Return: Returns the found folio on success, NULL otherwise. The caller
- * must lock and check if the folio still matches the swap entry before
- * use (e.g. with folio_matches_swap_entry).
+ * must lock and check if the folio still matches the swap entry before
+ * use (e.g., folio_matches_swap_entry).
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
@@ -90,7 +90,15 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
 	return folio;
 }
 
-void *get_shadow_from_swap_cache(swp_entry_t entry)
+/**
+ * swap_cache_get_shadow - Looks up a shadow in the swap cache.
+ * @entry: swap entry used for the lookup.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap device
+ * with reference count or locks.
+ * Return: Returns either NULL or an XA_VALUE (shadow).
+ */
+void *swap_cache_get_shadow(swp_entry_t entry)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swap_cache_index(entry);
@@ -102,12 +110,21 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
 	return NULL;
 }
 
-/*
- * add_to_swap_cache resembles filemap_add_folio on swapper_space,
- * but sets SwapCache flag and 'swap' instead of mapping and index.
+/**
+ * swap_cache_add_folio - Add a folio into the swap cache.
+ * @folio: The folio to be added.
+ * @entry: The swap entry corresponding to the folio.
+ * @gfp: gfp_mask for XArray node allocation.
+ * @shadowp: If a shadow is found, return the shadow.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap device
+ * with reference count or locks.
+ * The caller also needs to mark the corresponding swap_map slots with
+ * SWAP_HAS_CACHE to avoid race or conflict.
+ * Return: Returns 0 on success, error code otherwise.
  */
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
-			gfp_t gfp, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+			 gfp_t gfp, void **shadowp)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swap_cache_index(entry);
@@ -155,12 +172,20 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 	return xas_error(&xas);
 }
 
-/*
- * This must be called only on folios that have
- * been verified to be in the swap cache.
+/**
+ * __swap_cache_del_folio - Removes a folio from the swap cache.
+ * @folio: The folio.
+ * @entry: The first swap entry that the folio corresponds to.
+ * @shadow: shadow value to be filled in the swap cache.
+ *
+ * Removes a folio from the swap cache and fills a shadow in place.
+ * This won't put the folio's refcount. The caller has to do that.
+ *
+ * Context: Caller must hold the xa_lock, ensure the folio is
+ * locked and in the swap cache, using the index of @entry.
  */
-void __delete_from_swap_cache(struct folio *folio,
-			swp_entry_t entry, void *shadow)
+void __swap_cache_del_folio(struct folio *folio,
+			    swp_entry_t entry, void *shadow)
 {
 	struct address_space *address_space = swap_address_space(entry);
 	int i;
@@ -186,27 +211,40 @@ void __delete_from_swap_cache(struct folio *folio,
 	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
 }
 
-/*
- * This must be called only on folios that have
- * been verified to be in the swap cache and locked.
- * It will never put the folio into the free list,
- * the caller has a reference on the folio.
+/**
+ * swap_cache_del_folio - Removes a folio from the swap cache.
+ * @folio: The folio.
+ *
+ * Same as __swap_cache_del_folio, but handles lock and refcount. The
+ * caller must ensure the folio is either clean or has a swap count
+ * equal to zero, or it may cause data loss.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
  */
-void delete_from_swap_cache(struct folio *folio)
+void swap_cache_del_folio(struct folio *folio)
 {
 	swp_entry_t entry = folio->swap;
 	struct address_space *address_space = swap_address_space(entry);
 
 	xa_lock_irq(&address_space->i_pages);
-	__delete_from_swap_cache(folio, entry, NULL);
+	__swap_cache_del_folio(folio, entry, NULL);
 	xa_unlock_irq(&address_space->i_pages);
 
 	put_swap_folio(folio, entry);
 	folio_ref_sub(folio, folio_nr_pages(folio));
 }
 
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
-				unsigned long end)
+/**
+ * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
+ * @type: Indicates the swap device.
+ * @begin: Beginning offset of the range.
+ * @end: Ending offset of the range.
+ *
+ * Context: Caller must ensure the range is valid and hold a reference to
+ * the swap device.
+ */
+void swap_cache_clear_shadow(int type, unsigned long begin,
+			     unsigned long end)
 {
 	unsigned long curr = begin;
 	void *old;
@@ -393,7 +431,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			goto put_and_return;
 
 		/*
-		 * We might race against __delete_from_swap_cache(), and
+		 * We might race against __swap_cache_del_folio(), and
 		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
 		 * has not yet been cleared.  Or race against another
 		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
@@ -412,7 +450,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		goto fail_unlock;
 
 	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+	if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
 		goto fail_unlock;
 
 	memcg1_swapin(entry, 1);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 367481d319cd..731b541b1d33 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -266,7 +266,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	if (!need_reclaim)
 		goto out_unlock;
 
-	delete_from_swap_cache(folio);
+	swap_cache_del_folio(folio);
 	folio_set_dirty(folio);
 	ret = nr_pages;
 out_unlock:
@@ -1123,7 +1123,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
 	}
-	clear_shadow_from_swap_cache(si->type, begin, end);
+	swap_cache_clear_shadow(si->type, begin, end);
 
 	/*
 	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1288,7 +1288,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 	 * TODO: this could cause a theoretical memory reclaim
 	 * deadlock in the swap out path.
 	 */
-	if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
+	if (swap_cache_add_folio(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
 		goto out_free;
 
 	return 0;
@@ -1758,7 +1758,7 @@ bool folio_free_swap(struct folio *folio)
 	if (folio_swapped(folio))
 		return false;
 
-	delete_from_swap_cache(folio);
+	swap_cache_del_folio(folio);
 	folio_set_dirty(folio);
 	return true;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca9e1cd3cd68..c79c6806560b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -776,7 +776,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(folio, target_memcg);
-		__delete_from_swap_cache(folio, swap, shadow);
+		__swap_cache_del_folio(folio, swap, shadow);
 		memcg1_swapout(folio, swap);
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_folio(folio, swap);
diff --git a/mm/zswap.c b/mm/zswap.c
index c88ad61b232c..3dda4310099e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1069,7 +1069,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 out:
 	if (ret && ret != -EEXIST) {
-		delete_from_swap_cache(folio);
+		swap_cache_del_folio(folio);
 		folio_unlock(folio);
 	}
 	folio_put(folio);
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (8 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06  7:09   ` Chris Li
                     ` (2 more replies)
  2025-09-05 19:13 ` [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API Kairui Song
                   ` (4 subsequent siblings)
  14 siblings, 3 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

There are currently three swap cache users that are trying to replace an
existing folio with a new one: huge memory splitting, migration, and
shmem replacement. What they are doing is quite similar.

Introduce a common helper for this. In later commits, they can be easily
switched to use the swap table by updating this helper.

The newly added helper also makes the swap cache API better defined
and makes debugging easier.
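
For reference (a sketch, not a hunk from this patch): the common
pattern the helper captures, roughly as used by the three converted
call sites below. Both folios are locked, new->swap is already set to
the same entry as old->swap, and the swap mapping's xa_lock is held;
the shmem variable names are used here, the other callers differ only
in how they take the lock:

	xa_lock_irq(&swap_mapping->i_pages);
	__swap_cache_replace_folio(swap_mapping, new->swap, old, new);
	xa_unlock_irq(&swap_mapping->i_pages);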

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/huge_memory.c |  5 ++---
 mm/migrate.c     | 11 +++--------
 mm/shmem.c       | 10 ++--------
 mm/swap.h        |  3 +++
 mm/swap_state.c  | 32 ++++++++++++++++++++++++++++++++
 5 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26cedfcd7418..a4d192c8d794 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3798,9 +3798,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			 * NOTE: shmem in swap cache is not supported yet.
 			 */
 			if (swap_cache) {
-				__xa_store(&swap_cache->i_pages,
-					   swap_cache_index(new_folio->swap),
-					   new_folio, 0);
+				__swap_cache_replace_folio(swap_cache, new_folio->swap,
+							   folio, new_folio);
 				continue;
 			}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 8e435a078fc3..7e1d01aa8c85 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -566,7 +566,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	struct zone *oldzone, *newzone;
 	int dirty;
 	long nr = folio_nr_pages(folio);
-	long entries, i;
 
 	if (!mapping) {
 		/* Take off deferred split queue while frozen and memcg set */
@@ -615,9 +614,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	if (folio_test_swapcache(folio)) {
 		folio_set_swapcache(newfolio);
 		newfolio->private = folio_get_private(folio);
-		entries = nr;
-	} else {
-		entries = 1;
 	}
 
 	/* Move dirty while folio refs frozen and newfolio not yet exposed */
@@ -627,11 +623,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		folio_set_dirty(newfolio);
 	}
 
-	/* Swap cache still stores N entries instead of a high-order entry */
-	for (i = 0; i < entries; i++) {
+	if (folio_test_swapcache(folio))
+		__swap_cache_replace_folio(mapping, folio->swap, folio, newfolio);
+	else
 		xas_store(&xas, newfolio);
-		xas_next(&xas);
-	}
 
 	/*
 	 * Drop cache reference from old folio by unfreezing
diff --git a/mm/shmem.c b/mm/shmem.c
index cc6a0007c7a6..823ceae9dff8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2123,10 +2123,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	struct folio *new, *old = *foliop;
 	swp_entry_t entry = old->swap;
 	struct address_space *swap_mapping = swap_address_space(entry);
-	pgoff_t swap_index = swap_cache_index(entry);
-	XA_STATE(xas, &swap_mapping->i_pages, swap_index);
 	int nr_pages = folio_nr_pages(old);
-	int error = 0, i;
+	int error = 0;
 
 	/*
 	 * We have arrived here because our zones are constrained, so don't
@@ -2155,12 +2153,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	/* Swap cache still stores N entries instead of a high-order entry */
 	xa_lock_irq(&swap_mapping->i_pages);
-	for (i = 0; i < nr_pages; i++) {
-		WARN_ON_ONCE(xas_store(&xas, new));
-		xas_next(&xas);
-	}
+	__swap_cache_replace_folio(swap_mapping, entry, old, new);
 	xa_unlock_irq(&swap_mapping->i_pages);
 
 	mem_cgroup_replace_folio(old, new);
diff --git a/mm/swap.h b/mm/swap.h
index 8b38577a4e04..a139c9131244 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -182,6 +182,9 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
 void swap_cache_del_folio(struct folio *folio);
 void __swap_cache_del_folio(struct folio *folio,
 			    swp_entry_t entry, void *shadow);
+void __swap_cache_replace_folio(struct address_space *address_space,
+				swp_entry_t entry,
+				struct folio *old, struct folio *new);
 void swap_cache_clear_shadow(int type, unsigned long begin,
 			     unsigned long end);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f3a32a06a950..38f5f4cf565d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -234,6 +234,38 @@ void swap_cache_del_folio(struct folio *folio)
 	folio_ref_sub(folio, folio_nr_pages(folio));
 }
 
+/**
+ * __swap_cache_replace_folio - Replace a folio in the swap cache.
+ * @mapping: Swap mapping address space.
+ * @entry: The first swap entry that the new folio corresponds to.
+ * @old: The old folio to be replaced.
+ * @new: The new folio.
+ *
+ * Replace an existing folio in the swap cache with a new folio.
+ *
+ * Context: Caller must ensure both folios are locked, and lock the
+ * swap address_space that holds the entries to be replaced.
+ */
+void __swap_cache_replace_folio(struct address_space *mapping,
+				swp_entry_t entry,
+				struct folio *old, struct folio *new)
+{
+	unsigned long nr_pages = folio_nr_pages(new);
+	unsigned long offset = swap_cache_index(entry);
+	unsigned long end = offset + nr_pages;
+	XA_STATE(xas, &mapping->i_pages, offset);
+
+	VM_WARN_ON_ONCE(entry.val != new->swap.val);
+	VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
+	VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
+
+	/* Swap cache still stores N entries instead of a high-order entry */
+	do {
+		WARN_ON_ONCE(xas_store(&xas, new) != old);
+		xas_next(&xas);
+	} while (++offset < end);
+}
+
 /**
  * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
  * @type: Indicates the swap device.
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (9 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06 15:28   ` Chris Li
                     ` (3 more replies)
  2025-09-05 19:13 ` [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check Kairui Song
                   ` (3 subsequent siblings)
  14 siblings, 4 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Introduce the basic swap table infrastructure, which for now is just a
fixed-size flat array inside each swap cluster, with access wrappers.

Each cluster contains a swap table of 512 entries. Each table entry is
an opaque atomic long and can hold one of three types: a shadow
(XA_VALUE), a folio (pointer), or NULL.
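
As a rough model (illustrative only; the real encoding and helpers
live in the new mm/swap_table.h added below), a slot value can be
classified like this:

/*
 *   0                    - empty slot
 *   xa_value (bit 0 set) - workingset shadow, as stored in the old XArray
 *   other non-zero value - pointer to a folio in the swap cache
 */
static inline bool example_tb_is_shadow(unsigned long swp_tb)
{
	return xa_is_value((void *)swp_tb);
}

static inline bool example_tb_is_folio(unsigned long swp_tb)
{
	return swp_tb && !example_tb_is_shadow(swp_tb);
}

static inline struct folio *example_tb_to_folio(unsigned long swp_tb)
{
	return (struct folio *)swp_tb;
}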

In this first step, it only supports storing a folio or shadow, and it
is a drop-in replacement for the current swap cache. Convert all swap
cache users to use the new set of APIs. Chris Li has been suggesting
using a new infrastructure for the swap cache for better performance,
and that idea combines well with the swap table as the new backing
structure. Now the lock contention range is reduced to a 2M cluster
(512 slots covering 512 4K pages), which is much smaller than the 64M
address_space split (1 << SWAP_ADDRESS_SPACE_SHIFT = 16384 pages), and
the multiple address_space design can be dropped.

All the internal work is done with the swap_cache_get_* helpers. Swap
cache lookup is still lock-less as before, and the helpers' context
requirements are the same as the original swap cache helpers': they
still require a pin on the swap device to prevent the backing data
from being freed.

Swap cache updates are now protected by the swap cluster lock
instead of the XArray lock. This is mostly handled internally, but the
new __swap_cache_* helpers require the caller to lock the cluster, so
a few new cluster access and locking helpers are also introduced.
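
For example (sketch only, mirroring the converted swap_cache_del_folio()
in the hunks below), removing a folio now locks the cluster that owns
all of the folio's swap slots instead of taking the XArray lock:

	struct swap_cluster_info *ci;
	swp_entry_t entry = folio->swap;

	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
	__swap_cache_del_folio(ci, folio, entry, NULL);
	swap_cluster_unlock(ci);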

A fully cluster-based unified swap table can be implemented on top
of this to take care of all count tracking and synchronization work,
with dynamic allocation. It should reduce the memory usage while
making the performance even better.

Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 MAINTAINERS          |   1 +
 include/linux/swap.h |   2 -
 mm/huge_memory.c     |  13 +-
 mm/migrate.c         |  19 ++-
 mm/shmem.c           |   8 +-
 mm/swap.h            | 157 +++++++++++++++++------
 mm/swap_state.c      | 289 +++++++++++++++++++------------------------
 mm/swap_table.h      |  97 +++++++++++++++
 mm/swapfile.c        | 100 +++++++++++----
 mm/vmscan.c          |  20 ++-
 10 files changed, 458 insertions(+), 248 deletions(-)
 create mode 100644 mm/swap_table.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 1c8292c0318d..de402ca91a80 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16226,6 +16226,7 @@ F:	include/linux/swapops.h
 F:	mm/page_io.c
 F:	mm/swap.c
 F:	mm/swap.h
+F:	mm/swap_table.h
 F:	mm/swap_state.c
 F:	mm/swapfile.c
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6db105383782..2cb0458561ef 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -480,8 +480,6 @@ extern int __swap_count(swp_entry_t entry);
 extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 struct backing_dev_info;
-extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
-extern void exit_swap_address_space(unsigned int type);
 extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
 sector_t swap_folio_sector(struct folio *folio);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4d192c8d794..052e8fc7ee0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3720,7 +3720,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
-		struct address_space *swap_cache = NULL;
+		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
 		int expected_refs;
 
@@ -3764,8 +3764,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 				goto fail;
 			}
 
-			swap_cache = swap_address_space(folio->swap);
-			xa_lock(&swap_cache->i_pages);
+			ci = swap_cluster_lock_by_folio(folio);
 		}
 
 		/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
@@ -3797,8 +3796,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			 * Anonymous folio with swap cache.
 			 * NOTE: shmem in swap cache is not supported yet.
 			 */
-			if (swap_cache) {
-				__swap_cache_replace_folio(swap_cache, new_folio->swap,
+			if (ci) {
+				__swap_cache_replace_folio(ci, new_folio->swap,
 							   folio, new_folio);
 				continue;
 			}
@@ -3834,8 +3833,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 		unlock_page_lruvec(lruvec);
 
-		if (swap_cache)
-			xa_unlock(&swap_cache->i_pages);
+		if (ci)
+			swap_cluster_unlock(ci);
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 		ret = -EAGAIN;
diff --git a/mm/migrate.c b/mm/migrate.c
index 7e1d01aa8c85..ea177ef1fea9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -563,6 +563,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 		struct folio *newfolio, struct folio *folio, int expected_count)
 {
 	XA_STATE(xas, &mapping->i_pages, folio_index(folio));
+	struct swap_cluster_info *ci = NULL;
 	struct zone *oldzone, *newzone;
 	int dirty;
 	long nr = folio_nr_pages(folio);
@@ -591,9 +592,16 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	oldzone = folio_zone(folio);
 	newzone = folio_zone(newfolio);
 
-	xas_lock_irq(&xas);
+	if (folio_test_swapcache(folio))
+		ci = swap_cluster_lock_by_folio_irq(folio);
+	else
+		xas_lock_irq(&xas);
+
 	if (!folio_ref_freeze(folio, expected_count)) {
-		xas_unlock_irq(&xas);
+		if (ci)
+			swap_cluster_unlock(ci);
+		else
+			xas_unlock_irq(&xas);
 		return -EAGAIN;
 	}
 
@@ -624,7 +632,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	}
 
 	if (folio_test_swapcache(folio))
-		__swap_cache_replace_folio(mapping, folio->swap, folio, newfolio);
+		__swap_cache_replace_folio(ci, folio->swap, folio, newfolio);
 	else
 		xas_store(&xas, newfolio);
 
@@ -635,8 +643,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,
 	 */
 	folio_ref_unfreeze(folio, expected_count - nr);
 
-	xas_unlock(&xas);
 	/* Leave irq disabled to prevent preemption while updating stats */
+	if (ci)
+		swap_cluster_unlock(ci);
+	else
+		xas_unlock(&xas);
 
 	/*
 	 * If moved to a different zone then also account
diff --git a/mm/shmem.c b/mm/shmem.c
index 823ceae9dff8..21e795f18e78 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2120,9 +2120,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index,
 				struct vm_area_struct *vma)
 {
+	struct swap_cluster_info *ci;
 	struct folio *new, *old = *foliop;
 	swp_entry_t entry = old->swap;
-	struct address_space *swap_mapping = swap_address_space(entry);
 	int nr_pages = folio_nr_pages(old);
 	int error = 0;
 
@@ -2153,9 +2153,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	new->swap = entry;
 	folio_set_swapcache(new);
 
-	xa_lock_irq(&swap_mapping->i_pages);
-	__swap_cache_replace_folio(swap_mapping, entry, old, new);
-	xa_unlock_irq(&swap_mapping->i_pages);
+	ci = swap_cluster_lock_by_folio_irq(old);
+	__swap_cache_replace_folio(ci, entry, old, new);
+	swap_cluster_unlock(ci);
 
 	mem_cgroup_replace_folio(old, new);
 	shmem_update_stats(new, nr_pages);
diff --git a/mm/swap.h b/mm/swap.h
index a139c9131244..bf4e54f1f6b6 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -2,6 +2,7 @@
 #ifndef _MM_SWAP_H
 #define _MM_SWAP_H
 
+#include <linux/atomic.h> /* for atomic_long_t */
 struct mempolicy;
 struct swap_iocb;
 
@@ -35,6 +36,7 @@ struct swap_cluster_info {
 	u16 count;
 	u8 flags;
 	u8 order;
+	atomic_long_t *table;	/* Swap table entries, see mm/swap_table.h */
 	struct list_head list;
 };
 
@@ -55,6 +57,11 @@ enum swap_cluster_flags {
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
+static inline unsigned int swp_cluster_offset(swp_entry_t entry)
+{
+	return swp_offset(entry) % SWAPFILE_CLUSTER;
+}
+
 /*
  * Callers of all helpers below must ensure the entry, type, or offset is
  * valid, and protect the swap device with reference count or locks.
@@ -81,6 +88,25 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
 	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }
 
+static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
+{
+	return __swap_offset_to_cluster(__swap_entry_to_info(entry),
+					swp_offset(entry));
+}
+
+static __always_inline struct swap_cluster_info *__swap_cluster_lock(
+		struct swap_info_struct *si, unsigned long offset, bool irq)
+{
+	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+
+	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+	if (irq)
+		spin_lock_irq(&ci->lock);
+	else
+		spin_lock(&ci->lock);
+	return ci;
+}
+
 /**
  * swap_cluster_lock - Lock and return the swap cluster of given offset.
  * @si: swap device the cluster belongs to.
@@ -92,11 +118,48 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
 static inline struct swap_cluster_info *swap_cluster_lock(
 		struct swap_info_struct *si, unsigned long offset)
 {
-	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+	return __swap_cluster_lock(si, offset, false);
+}
 
-	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-	spin_lock(&ci->lock);
-	return ci;
+static inline struct swap_cluster_info *__swap_cluster_lock_by_folio(
+		const struct folio *folio, bool irq)
+{
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+	return __swap_cluster_lock(__swap_entry_to_info(folio->swap),
+				   swp_offset(folio->swap), irq);
+}
+
+/*
+ * swap_cluster_lock_by_folio - Locks the cluster that holds a folio's entries.
+ * @folio: The folio.
+ *
+ * This locks the swap cluster that contains a folio's swap entries. The
+ * swap entries of a folio are always in one single cluster, and a locked
+ * swap cache folio is enough to stabilize the entries and the swap device.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * Return: Pointer to the swap cluster.
+ */
+static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
+		const struct folio *folio)
+{
+	return __swap_cluster_lock_by_folio(folio, false);
+}
+
+/*
+ * swap_cluster_lock_by_folio_irq - Locks the cluster that holds a folio's entries.
+ * @folio: The folio.
+ *
+ * Same as swap_cluster_lock_by_folio but also disable IRQ.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * Return: Pointer to the swap cluster.
+ */
+static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
+		const struct folio *folio)
+{
+	return __swap_cluster_lock_by_folio(folio, true);
 }
 
 static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
@@ -104,6 +167,11 @@ static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
 	spin_unlock(&ci->lock);
 }
 
+static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
+{
+	spin_unlock_irq(&ci->lock);
+}
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
@@ -123,10 +191,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 #define SWAP_ADDRESS_SPACE_SHIFT	14
 #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
 #define SWAP_ADDRESS_SPACE_MASK		(SWAP_ADDRESS_SPACE_PAGES - 1)
-extern struct address_space *swapper_spaces[];
-#define swap_address_space(entry)			    \
-	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
-		>> SWAP_ADDRESS_SPACE_SHIFT])
+extern struct address_space swap_space;
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+	return &swap_space;
+}
 
 /*
  * Return the swap device position of the swap entry.
@@ -136,15 +205,6 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
 	return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
 }
 
-/*
- * Return the swap cache index of the swap entry.
- */
-static inline pgoff_t swap_cache_index(swp_entry_t entry)
-{
-	BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
-	return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
-}
-
 /**
  * folio_matches_swap_entry - Check if a folio matches a given swap entry.
  * @folio: The folio.
@@ -177,16 +237,15 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
-			 gfp_t gfp, void **shadow);
+void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
 void swap_cache_del_folio(struct folio *folio);
-void __swap_cache_del_folio(struct folio *folio,
-			    swp_entry_t entry, void *shadow);
-void __swap_cache_replace_folio(struct address_space *address_space,
-				swp_entry_t entry,
-				struct folio *old, struct folio *new);
-void swap_cache_clear_shadow(int type, unsigned long begin,
-			     unsigned long end);
+/* Below helpers require the caller to lock and pass in the swap cluster. */
+void __swap_cache_del_folio(struct swap_cluster_info *ci,
+			    struct folio *folio, swp_entry_t entry, void *shadow);
+void __swap_cache_replace_folio(struct swap_cluster_info *ci,
+				swp_entry_t entry, struct folio *old,
+				struct folio *new);
+void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
 
 void show_swap_cache_info(void);
 void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
@@ -254,6 +313,32 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 
 #else /* CONFIG_SWAP */
 struct swap_iocb;
+static inline struct swap_cluster_info *swap_cluster_lock(
+	struct swap_info_struct *si, pgoff_t offset, bool irq)
+{
+	return NULL;
+}
+
+static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
+		struct folio *folio)
+{
+	return NULL;
+}
+
+static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
+		struct folio *folio)
+{
+	return NULL;
+}
+
+static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
+{
+}
+
+static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
+{
+}
+
 static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
 {
 	return NULL;
@@ -271,11 +356,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
 	return NULL;
 }
 
-static inline pgoff_t swap_cache_index(swp_entry_t entry)
-{
-	return 0;
-}
-
 static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry_t entry)
 {
 	return false;
@@ -322,17 +402,22 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
 	return NULL;
 }
 
-static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
-				       gfp_t gfp, void **shadow)
+static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
 {
-	return -EINVAL;
 }
 
 static inline void swap_cache_del_folio(struct folio *folio)
 {
 }
 
-static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
+static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
+			    struct folio *folio, swp_entry_t entry, void *shadow)
+{
+}
+
+static inline void __swap_cache_replace_folio(
+		struct swap_cluster_info *ci, swp_entry_t entry,
+		struct folio *old, struct folio *new)
 {
 }
 
@@ -367,7 +452,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 static inline pgoff_t folio_index(struct folio *folio)
 {
 	if (unlikely(folio_test_swapcache(folio)))
-		return swap_cache_index(folio->swap);
+		return swp_offset(folio->swap);
 	return folio->index;
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 38f5f4cf565d..7147b390745f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,7 @@
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
 #include "internal.h"
+#include "swap_table.h"
 #include "swap.h"
 
 /*
@@ -36,8 +37,10 @@ static const struct address_space_operations swap_aops = {
 #endif
 };
 
-struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
-static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
+struct address_space swap_space __read_mostly = {
+	.a_ops = &swap_aops,
+};
+
 static bool enable_vma_readahead __read_mostly = true;
 
 #define SWAP_RA_ORDER_CEILING	5
@@ -83,11 +86,21 @@ void show_swap_cache_info(void)
  */
 struct folio *swap_cache_get_folio(swp_entry_t entry)
 {
-	struct folio *folio = filemap_get_folio(swap_address_space(entry),
-						swap_cache_index(entry));
-	if (IS_ERR(folio))
-		return NULL;
-	return folio;
+
+	unsigned long swp_tb;
+	struct folio *folio;
+
+	for (;;) {
+		swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
+					  swp_cluster_offset(entry));
+		if (!swp_tb_is_folio(swp_tb))
+			return NULL;
+		folio = swp_tb_to_folio(swp_tb);
+		if (likely(folio_try_get(folio)))
+			return folio;
+	}
+
+	return NULL;
 }
 
 /**
@@ -100,13 +113,13 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
  */
 void *swap_cache_get_shadow(swp_entry_t entry)
 {
-	struct address_space *address_space = swap_address_space(entry);
-	pgoff_t idx = swap_cache_index(entry);
-	void *shadow;
+	unsigned long swp_tb;
+
+	swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
+				  swp_cluster_offset(entry));
+	if (swp_tb_is_shadow(swp_tb))
+		return swp_tb_to_shadow(swp_tb);
 
-	shadow = xa_load(&address_space->i_pages, idx);
-	if (xa_is_value(shadow))
-		return shadow;
 	return NULL;
 }
 
@@ -123,57 +136,45 @@ void *swap_cache_get_shadow(swp_entry_t entry)
  * SWAP_HAS_CACHE to avoid race or conflict.
  * Return: Returns 0 on success, error code otherwise.
  */
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
-			 gfp_t gfp, void **shadowp)
+void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
 {
-	struct address_space *address_space = swap_address_space(entry);
-	pgoff_t idx = swap_cache_index(entry);
-	XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
-	unsigned long i, nr = folio_nr_pages(folio);
-	void *old;
-
-	xas_set_update(&xas, workingset_update_node);
-
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
-	VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
+	void *shadow = NULL;
+	unsigned long swp_tb, exist;
+	struct swap_cluster_info *ci;
+	unsigned int ci_start, ci_off, ci_end;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+
+	swp_tb = folio_to_swp_tb(folio);
+	ci_start = swp_cluster_offset(entry);
+	ci_end = ci_start + nr_pages;
+	ci_off = ci_start;
+	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+	do {
+		exist = __swap_table_xchg(ci, ci_off, swp_tb);
+		WARN_ON_ONCE(swp_tb_is_folio(exist));
+		if (swp_tb_is_shadow(exist))
+			shadow = swp_tb_to_shadow(exist);
+	} while (++ci_off < ci_end);
 
-	folio_ref_add(folio, nr);
+	folio_ref_add(folio, nr_pages);
 	folio_set_swapcache(folio);
 	folio->swap = entry;
+	swap_cluster_unlock(ci);
 
-	do {
-		xas_lock_irq(&xas);
-		xas_create_range(&xas);
-		if (xas_error(&xas))
-			goto unlock;
-		for (i = 0; i < nr; i++) {
-			VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
-			if (shadowp) {
-				old = xas_load(&xas);
-				if (xa_is_value(old))
-					*shadowp = old;
-			}
-			xas_store(&xas, folio);
-			xas_next(&xas);
-		}
-		address_space->nrpages += nr;
-		__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
-		__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
-unlock:
-		xas_unlock_irq(&xas);
-	} while (xas_nomem(&xas, gfp));
-
-	if (!xas_error(&xas))
-		return 0;
+	node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
 
-	folio_clear_swapcache(folio);
-	folio_ref_sub(folio, nr);
-	return xas_error(&xas);
+	if (shadowp)
+		*shadowp = shadow;
 }
 
 /**
  * __swap_cache_del_folio - Removes a folio from the swap cache.
+ * @ci: The locked swap cluster.
  * @folio: The folio.
  * @entry: The first swap entry that the folio corresponds to.
  * @shadow: shadow value to be filled in the swap cache.
@@ -181,34 +182,36 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
  * Removes a folio from the swap cache and fills a shadow in place.
  * This won't put the folio's refcount. The caller has to do that.
  *
- * Context: Caller must hold the xa_lock, ensure the folio is
- * locked and in the swap cache, using the index of @entry.
+ * Context: Caller must ensure the folio is locked and in the swap cache
+ * using the index of @entry, and lock the cluster that holds the entries.
  */
-void __swap_cache_del_folio(struct folio *folio,
+void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
 			    swp_entry_t entry, void *shadow)
 {
-	struct address_space *address_space = swap_address_space(entry);
-	int i;
-	long nr = folio_nr_pages(folio);
-	pgoff_t idx = swap_cache_index(entry);
-	XA_STATE(xas, &address_space->i_pages, idx);
-
-	xas_set_update(&xas, workingset_update_node);
-
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
-	VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
-
-	for (i = 0; i < nr; i++) {
-		void *entry = xas_store(&xas, shadow);
-		VM_BUG_ON_PAGE(entry != folio, entry);
-		xas_next(&xas);
-	}
+	unsigned long exist, swp_tb;
+	unsigned int ci_start, ci_off, ci_end;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+
+	swp_tb = shadow_swp_to_tb(shadow);
+	ci_start = swp_cluster_offset(entry);
+	ci_end = ci_start + nr_pages;
+	ci_off = ci_start;
+	do {
+		/* If shadow is NULL, we set an empty (null) entry */
+		exist = __swap_table_xchg(ci, ci_off, swp_tb);
+		WARN_ON_ONCE(!swp_tb_is_folio(exist) ||
+			     swp_tb_to_folio(exist) != folio);
+	} while (++ci_off < ci_end);
+
 	folio->swap.val = 0;
 	folio_clear_swapcache(folio);
-	address_space->nrpages -= nr;
-	__node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
-	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
+	node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
+	lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
 }
 
 /**
@@ -223,12 +226,12 @@ void __swap_cache_del_folio(struct folio *folio,
  */
 void swap_cache_del_folio(struct folio *folio)
 {
+	struct swap_cluster_info *ci;
 	swp_entry_t entry = folio->swap;
-	struct address_space *address_space = swap_address_space(entry);
 
-	xa_lock_irq(&address_space->i_pages);
-	__swap_cache_del_folio(folio, entry, NULL);
-	xa_unlock_irq(&address_space->i_pages);
+	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+	__swap_cache_del_folio(ci, folio, entry, NULL);
+	swap_cluster_unlock(ci);
 
 	put_swap_folio(folio, entry);
 	folio_ref_sub(folio, folio_nr_pages(folio));
@@ -236,7 +239,7 @@ void swap_cache_del_folio(struct folio *folio)
 
 /**
  * __swap_cache_replace_folio - Replace a folio in the swap cache.
- * @mapping: Swap mapping address space.
+ * @ci: The locked swap cluster.
  * @entry: The first swap entry that the new folio corresponds to.
  * @old: The old folio to be replaced.
  * @new: The new folio.
@@ -244,64 +247,58 @@ void swap_cache_del_folio(struct folio *folio)
  * Replace a existing folio in the swap cache with a new folio.
  *
  * Context: Caller must ensure both folios are locked, and lock the
- * swap address_space that holds the entries to be replaced.
+ * cluster that holds the entries to be replaced.
  */
-void __swap_cache_replace_folio(struct address_space *mapping,
-				swp_entry_t entry,
+void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
 				struct folio *old, struct folio *new)
 {
+	unsigned int ci_off = swp_cluster_offset(entry);
 	unsigned long nr_pages = folio_nr_pages(new);
-	unsigned long offset = swap_cache_index(entry);
-	unsigned long end = offset + nr_pages;
-	XA_STATE(xas, &mapping->i_pages, offset);
+	unsigned int ci_end = ci_off + nr_pages;
+	unsigned long exist, swp_tb;
 
 	VM_WARN_ON_ONCE(entry.val != new->swap.val);
 	VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
-
-	/* Swap cache still stores N entries instead of a high-order entry */
+	swp_tb = folio_to_swp_tb(new);
 	do {
-		WARN_ON_ONCE(xas_store(&xas, new) != old);
-		xas_next(&xas);
-	} while (++offset < end);
+		exist = __swap_table_xchg(ci, ci_off, swp_tb);
+		WARN_ON_ONCE(!swp_tb_is_folio(exist) || swp_tb_to_folio(exist) != old);
+	} while (++ci_off < ci_end);
+
+	/*
+	 * If the old folio is partially replaced (e.g., splitting a large
+	 * folio, the old folio is shrunk, and new split sub folios replace
+	 * the shrunk part), ensure the new folio doesn't overlap it.
+	 */
+	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
+	    folio_order(old) != folio_order(new)) {
+		ci_off = swp_cluster_offset(old->swap);
+		ci_end = ci_off + folio_nr_pages(old);
+		while (ci_off < ci_end)
+			WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off++)) != old);
+	}
 }
 
 /**
  * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
- * @type: Indicates the swap device.
- * @begin: Beginning offset of the range.
- * @end: Ending offset of the range.
+ * @entry: The starting index entry.
+ * @nr_ents: How many slots need to be cleared.
  *
- * Context: Caller must ensure the range is valid and hold a reference to
- * the swap device.
+ * Context: Caller must ensure the range is valid and not occupied by
+ * any folio, and protect the swap device with a reference count or locks.
  */
-void swap_cache_clear_shadow(int type, unsigned long begin,
-			     unsigned long end)
+void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
 {
-	unsigned long curr = begin;
-	void *old;
-
-	for (;;) {
-		swp_entry_t entry = swp_entry(type, curr);
-		unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
-		struct address_space *address_space = swap_address_space(entry);
-		XA_STATE(xas, &address_space->i_pages, index);
-
-		xas_set_update(&xas, workingset_update_node);
-
-		xa_lock_irq(&address_space->i_pages);
-		xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
-			if (!xa_is_value(old))
-				continue;
-			xas_store(&xas, NULL);
-		}
-		xa_unlock_irq(&address_space->i_pages);
+	struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
+	unsigned int ci_off = swp_cluster_offset(entry), ci_end;
+	unsigned long old;
 
-		/* search the next swapcache until we meet end */
-		curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
-		if (curr > end)
-			break;
-	}
+	ci_end = ci_off + nr_ents;
+	do {
+		old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
+		WARN_ON_ONCE(swp_tb_is_folio(old));
+	} while (++ci_off < ci_end);
 }
 
 /*
@@ -481,10 +478,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
 		goto fail_unlock;
 
-	/* May fail (-ENOMEM) if XArray node allocation failed. */
-	if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
-		goto fail_unlock;
-
+	swap_cache_add_folio(new_folio, entry, &shadow);
 	memcg1_swapin(entry, 1);
 
 	if (shadow)
@@ -676,41 +670,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	return folio;
 }
 
-int init_swap_address_space(unsigned int type, unsigned long nr_pages)
-{
-	struct address_space *spaces, *space;
-	unsigned int i, nr;
-
-	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
-	spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
-	if (!spaces)
-		return -ENOMEM;
-	for (i = 0; i < nr; i++) {
-		space = spaces + i;
-		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
-		atomic_set(&space->i_mmap_writable, 0);
-		space->a_ops = &swap_aops;
-		/* swap cache doesn't use writeback related tags */
-		mapping_set_no_writeback_tags(space);
-	}
-	nr_swapper_spaces[type] = nr;
-	swapper_spaces[type] = spaces;
-
-	return 0;
-}
-
-void exit_swap_address_space(unsigned int type)
-{
-	int i;
-	struct address_space *spaces = swapper_spaces[type];
-
-	for (i = 0; i < nr_swapper_spaces[type]; i++)
-		VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
-	kvfree(spaces);
-	nr_swapper_spaces[type] = 0;
-	swapper_spaces[type] = NULL;
-}
-
 static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
 			   unsigned long *end)
 {
@@ -883,7 +842,7 @@ static const struct attribute_group swap_attr_group = {
 	.attrs = swap_attrs,
 };
 
-static int __init swap_init_sysfs(void)
+static int __init swap_init(void)
 {
 	int err;
 	struct kobject *swap_kobj;
@@ -898,11 +857,13 @@ static int __init swap_init_sysfs(void)
 		pr_err("failed to register swap group\n");
 		goto delete_obj;
 	}
+	/* Swap cache writeback is LRU based, no tags for it */
+	mapping_set_no_writeback_tags(&swap_space);
 	return 0;
 
 delete_obj:
 	kobject_put(swap_kobj);
 	return err;
 }
-subsys_initcall(swap_init_sysfs);
+subsys_initcall(swap_init);
 #endif
diff --git a/mm/swap_table.h b/mm/swap_table.h
new file mode 100644
index 000000000000..e1f7cc009701
--- /dev/null
+++ b/mm/swap_table.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _MM_SWAP_TABLE_H
+#define _MM_SWAP_TABLE_H
+
+#include "swap.h"
+
+/*
+ * A swap table entry represents the status of a swap slot on a swap
+ * (physical or virtual) device. The swap table in each cluster is a
+ * 1:1 map of the swap slots in this cluster.
+ *
+ * Each swap table entry could be a pointer (folio), a XA_VALUE
+ * (shadow), or NULL.
+ */
+
+/*
+ * Helpers for casting one type of info into a swap table entry.
+ */
+static inline unsigned long null_to_swp_tb(void)
+{
+	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
+	return 0;
+}
+
+static inline unsigned long folio_to_swp_tb(struct folio *folio)
+{
+	BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
+	return (unsigned long)folio;
+}
+
+static inline unsigned long shadow_swp_to_tb(void *shadow)
+{
+	BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
+		     BITS_PER_BYTE * sizeof(unsigned long));
+	VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
+	return (unsigned long)shadow;
+}
+
+/*
+ * Helpers for swap table entry type checking.
+ */
+static inline bool swp_tb_is_null(unsigned long swp_tb)
+{
+	return !swp_tb;
+}
+
+static inline bool swp_tb_is_folio(unsigned long swp_tb)
+{
+	return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
+}
+
+static inline bool swp_tb_is_shadow(unsigned long swp_tb)
+{
+	return xa_is_value((void *)swp_tb);
+}
+
+/*
+ * Helpers for retrieving info from swap table.
+ */
+static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
+{
+	VM_WARN_ON(!swp_tb_is_folio(swp_tb));
+	return (void *)swp_tb;
+}
+
+static inline void *swp_tb_to_shadow(unsigned long swp_tb)
+{
+	VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
+	return (void *)swp_tb;
+}
+
+/*
+ * Helpers for accessing or modifying the swap table of a cluster,
+ * the swap cluster must be locked.
+ */
+static inline void __swap_table_set(struct swap_cluster_info *ci,
+				    unsigned int off, unsigned long swp_tb)
+{
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	atomic_long_set(&ci->table[off], swp_tb);
+}
+
+static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
+					      unsigned int off, unsigned long swp_tb)
+{
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	/* Ordering is guaranteed by cluster lock, relax */
+	return atomic_long_xchg_relaxed(&ci->table[off], swp_tb);
+}
+
+static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
+					     unsigned int off)
+{
+	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	return atomic_long_read(&ci->table[off]);
+}
+#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 731b541b1d33..cbb7d4c0773d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -46,6 +46,7 @@
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
 #include <linux/swap_cgroup.h>
+#include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
 
@@ -420,6 +421,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
 	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
 }
 
+static int swap_table_alloc_table(struct swap_cluster_info *ci)
+{
+	WARN_ON(ci->table);
+	ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
+	if (!ci->table)
+		return -ENOMEM;
+	return 0;
+}
+
+static void swap_cluster_free_table(struct swap_cluster_info *ci)
+{
+	unsigned int ci_off;
+	unsigned long swp_tb;
+
+	if (!ci->table)
+		return;
+
+	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
+		swp_tb = __swap_table_get(ci, ci_off);
+		if (!swp_tb_is_null(swp_tb))
+			pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
+				    swp_tb);
+	}
+
+	kfree(ci->table);
+	ci->table = NULL;
+}
+
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
 			 enum swap_cluster_flags new_flags)
@@ -702,6 +731,26 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 	return true;
 }
 
+/*
+ * Currently, the swap table is not used for count tracking. The swap
+ * table should be empty when entries are freed, so just do a sanity
+ * check here to ensure nothing leaked.
+ */
+static void cluster_table_check(struct swap_cluster_info *ci,
+				unsigned int start, unsigned int nr)
+{
+	unsigned int ci_off = start % SWAPFILE_CLUSTER;
+	unsigned int ci_end = ci_off + nr;
+	unsigned long swp_tb;
+
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		do {
+			swp_tb = __swap_table_get(ci, ci_off);
+			VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
+		} while (++ci_off < ci_end);
+	}
+}
+
 static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
 				unsigned int start, unsigned char usage,
 				unsigned int order)
@@ -721,6 +770,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 		ci->order = order;
 
 	memset(si->swap_map + start, usage, nr_pages);
+	cluster_table_check(ci, start, nr_pages);
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
@@ -1123,7 +1173,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 			swap_slot_free_notify(si->bdev, offset);
 		offset++;
 	}
-	swap_cache_clear_shadow(si->type, begin, end);
+	__swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
 
 	/*
 	 * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1280,16 +1330,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
 	if (!entry.val)
 		return -ENOMEM;
 
-	/*
-	 * XArray node allocations from PF_MEMALLOC contexts could
-	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
-	 * stops emergency reserves from being allocated.
-	 *
-	 * TODO: this could cause a theoretical memory reclaim
-	 * deadlock in the swap out path.
-	 */
-	if (swap_cache_add_folio(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
-		goto out_free;
+	swap_cache_add_folio(folio, entry, NULL);
 
 	return 0;
 
@@ -1555,6 +1596,7 @@ static void swap_entries_free(struct swap_info_struct *si,
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
+	cluster_table_check(ci, offset, nr_pages);
 
 	if (!ci->count)
 		free_cluster(si, ci);
@@ -2632,6 +2674,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	}
 }
 
+static void free_cluster_info(struct swap_cluster_info *cluster_info,
+			      unsigned long maxpages)
+{
+	int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
+
+	if (!cluster_info)
+		return;
+	for (i = 0; i < nr_clusters; i++)
+		swap_cluster_free_table(&cluster_info[i]);
+	kvfree(cluster_info);
+}
+
 /*
  * Called after swap device's reference count is dead, so
  * neither scan nor allocation will use it.
@@ -2766,12 +2820,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
-	p->max = 0;
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
 	zeromap = p->zeromap;
 	p->zeromap = NULL;
 	cluster_info = p->cluster_info;
+	free_cluster_info(cluster_info, p->max);
+	p->max = 0;
 	p->cluster_info = NULL;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
@@ -2782,10 +2837,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
-	kvfree(cluster_info);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
-	exit_swap_address_space(p->type);
 
 	inode = mapping->host;
 
@@ -3169,8 +3222,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	if (!cluster_info)
 		goto err;
 
-	for (i = 0; i < nr_clusters; i++)
+	for (i = 0; i < nr_clusters; i++) {
 		spin_lock_init(&cluster_info[i].lock);
+		if (swap_table_alloc_table(&cluster_info[i]))
+			goto err_free;
+	}
 
 	if (!(si->flags & SWP_SOLIDSTATE)) {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3231,9 +3287,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	}
 
 	return cluster_info;
-
 err_free:
-	kvfree(cluster_info);
+	free_cluster_info(cluster_info, maxpages);
 err:
 	return ERR_PTR(err);
 }
@@ -3427,13 +3482,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		}
 	}
 
-	error = init_swap_address_space(si->type, maxpages);
-	if (error)
-		goto bad_swap_unlock_inode;
-
 	error = zswap_swapon(si->type, maxpages);
 	if (error)
-		goto free_swap_address_space;
+		goto bad_swap_unlock_inode;
 
 	/*
 	 * Flush any pending IO and dirty mappings before we start using this
@@ -3468,8 +3519,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	goto out;
 free_swap_zswap:
 	zswap_swapoff(si->type);
-free_swap_address_space:
-	exit_swap_address_space(si->type);
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
@@ -3484,7 +3533,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	spin_unlock(&swap_lock);
 	vfree(swap_map);
 	kvfree(zeromap);
-	kvfree(cluster_info);
+	if (cluster_info)
+		free_cluster_info(cluster_info, maxpages);
 	if (inced_nr_rotate_swap)
 		atomic_dec(&nr_rotate_swap);
 	if (swap_file)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c79c6806560b..1d5335993313 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 {
 	int refcount;
 	void *shadow = NULL;
+	struct swap_cluster_info *ci;
 
 	BUG_ON(!folio_test_locked(folio));
 	BUG_ON(mapping != folio_mapping(folio));
 
-	if (!folio_test_swapcache(folio))
+	if (folio_test_swapcache(folio)) {
+		ci = swap_cluster_lock_by_folio_irq(folio);
+	} else {
 		spin_lock(&mapping->host->i_lock);
-	xa_lock_irq(&mapping->i_pages);
+		xa_lock_irq(&mapping->i_pages);
+	}
+
 	/*
 	 * The non racy check for a busy folio.
 	 *
@@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 
 		if (reclaimed && !mapping_exiting(mapping))
 			shadow = workingset_eviction(folio, target_memcg);
-		__swap_cache_del_folio(folio, swap, shadow);
+		__swap_cache_del_folio(ci, folio, swap, shadow);
 		memcg1_swapout(folio, swap);
-		xa_unlock_irq(&mapping->i_pages);
+		swap_cluster_unlock_irq(ci);
 		put_swap_folio(folio, swap);
 	} else {
 		void (*free_folio)(struct folio *);
@@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 	return 1;
 
 cannot_free:
-	xa_unlock_irq(&mapping->i_pages);
-	if (!folio_test_swapcache(folio))
+	if (folio_test_swapcache(folio)) {
+		swap_cluster_unlock_irq(ci);
+	} else {
+		xa_unlock_irq(&mapping->i_pages);
 		spin_unlock(&mapping->host->i_lock);
+	}
 	return 0;
 }
 
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (10 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06 15:35   ` Chris Li
  2025-09-08 13:10   ` David Hildenbrand
  2025-09-05 19:13 ` [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache Kairui Song
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The swap cache is now backed by the swap table, and the address space no
longer holds any mutable data. The swap cache is now protected by the
swap cluster lock instead of the XArray lock. All access to the swap
cache is wrapped by the swap cache helpers, which handle locking mostly
internally; only a few __swap_cache_* helpers require the caller to lock
the cluster themselves.

Worth noting that, unlike the XArray, the cluster lock is not IRQ safe.
The swap cache was already very different from the filemap and is now
completely separated from it: nothing needs to mark or change anything,
or run a writeback callback, from IRQ context.

So explicitly document this and add a debug check to avoid further
potential misuse. Also mark the swap cache address space as read-only to
prevent anyone from wrongly mixing filemap helpers with the swap cache.
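
For illustration, after this series a typical caller is expected to look
like the following condensed sketch (helper names as introduced in the
earlier patches), rather than taking xa_lock_irq() on a swap address
space; the new debug check in __swap_cluster_lock() is simply
VM_WARN_ON_ONCE(!in_task()):

	ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
	__swap_cache_del_folio(ci, folio, entry, NULL);
	swap_cluster_unlock(ci);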

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap.h       | 12 +++++++++++-
 mm/swap_state.c |  3 ++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index bf4e54f1f6b6..e48431a26f89 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -99,6 +99,16 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
 {
 	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
 
+	/*
+	 * Nothing modifies swap cache in an IRQ context. All access to
+	 * swap cache is wrapped by swap_cache_* helpers, and swap cache
+	 * writeback is handled outside of IRQs. Swapin or swapout never
+	 * occurs in IRQ, and neither does in-place split or replace.
+	 *
+	 * Besides, modifying swap cache requires synchronization with
+	 * swap_map, which was never IRQ safe.
+	 */
+	VM_WARN_ON_ONCE(!in_task());
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
 	if (irq)
 		spin_lock_irq(&ci->lock);
@@ -191,7 +201,7 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 #define SWAP_ADDRESS_SPACE_SHIFT	14
 #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
 #define SWAP_ADDRESS_SPACE_MASK		(SWAP_ADDRESS_SPACE_PAGES - 1)
-extern struct address_space swap_space;
+extern struct address_space swap_space __ro_after_init;
 static inline struct address_space *swap_address_space(swp_entry_t entry)
 {
 	return &swap_space;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7147b390745f..209d5e9e8d90 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -37,7 +37,8 @@ static const struct address_space_operations swap_aops = {
 #endif
 };
 
-struct address_space swap_space __read_mostly = {
+/* Set swap_space read-only, as the swap cache is handled by the swap table */
+struct address_space swap_space __ro_after_init = {
 	.a_ops = &swap_aops,
 };
 
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (11 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06 15:30   ` Chris Li
  2025-09-08 13:12   ` David Hildenbrand
  2025-09-05 19:13 ` [PATCH v2 14/15] mm, swap: implement dynamic allocation of swap table Kairui Song
  2025-09-05 19:13 ` [PATCH v2 15/15] mm, swap: use a single page for swap table when the size fits Kairui Song
  14 siblings, 2 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song,
	kernel test robot

From: Kairui Song <kasong@tencent.com>

Swap cluster setup shuffles the clusters on initialization. It was
helpful for avoiding contention on the swap cache spaces: the cluster
size (2M) was much smaller than each swap cache space (64M), so shuffling
the clusters meant the allocator would hand each CPU swap slots from
different swap cache spaces, reducing the chance of two CPUs using the
same swap cache space and hence reducing contention.
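
For scale (illustrative arithmetic, assuming 4K pages):

	SWAP_ADDRESS_SPACE_PAGES = 1 << 14 = 16384 pages = 64M per cache space
	SWAPFILE_CLUSTER         = 512 pages             = 2M per cluster

so 32 clusters used to share a single swap cache space and its XArray
lock.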

Now that the swap cache is managed per swap cluster, this shuffle is
pointless. Just remove it, and clean up the related macros.

This also improves HDD swap performance, as shuffling IO is a bad idea
for HDD, and now the shuffling is gone. Tests have shown a ~40%
performance gain for HDD [1]:

Doing sequential swap in of 8G data using 8 processes with usemem,
average of 3 test runs:

Before: 1270.91 KB/s per process
After:  1849.54 KB/s per process

Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1]
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
---
 mm/swap.h     |  4 ----
 mm/swapfile.c | 32 ++++++++------------------------
 mm/zswap.c    |  7 +++++--
 3 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index e48431a26f89..c4fb28845d77 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -197,10 +197,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
 void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 
 /* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT	14
-#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
-#define SWAP_ADDRESS_SPACE_MASK		(SWAP_ADDRESS_SPACE_PAGES - 1)
 extern struct address_space swap_space __ro_after_init;
 static inline struct address_space *swap_address_space(swp_entry_t entry)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cbb7d4c0773d..6b3b35a7ddd9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3202,21 +3202,14 @@ static int setup_swap_map(struct swap_info_struct *si,
 	return 0;
 }
 
-#define SWAP_CLUSTER_INFO_COLS						\
-	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
-#define SWAP_CLUSTER_SPACE_COLS						\
-	DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
-#define SWAP_CLUSTER_COLS						\
-	max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
-
 static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 						union swap_header *swap_header,
 						unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
-	unsigned long i, j, idx;
 	int err = -ENOMEM;
+	unsigned long i;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
@@ -3265,22 +3258,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
 	}
 
-	/*
-	 * Reduce false cache line sharing between cluster_info and
-	 * sharing same address space.
-	 */
-	for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
-		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
-			struct swap_cluster_info *ci;
-			idx = i * SWAP_CLUSTER_COLS + j;
-			ci = cluster_info + idx;
-			if (idx >= nr_clusters)
-				continue;
-			if (ci->count) {
-				ci->flags = CLUSTER_FLAG_NONFULL;
-				list_add_tail(&ci->list, &si->nonfull_clusters[0]);
-				continue;
-			}
+	for (i = 0; i < nr_clusters; i++) {
+		struct swap_cluster_info *ci = &cluster_info[i];
+
+		if (ci->count) {
+			ci->flags = CLUSTER_FLAG_NONFULL;
+			list_add_tail(&ci->list, &si->nonfull_clusters[0]);
+		} else {
 			ci->flags = CLUSTER_FLAG_FREE;
 			list_add_tail(&ci->list, &si->free_clusters);
 		}
diff --git a/mm/zswap.c b/mm/zswap.c
index 3dda4310099e..cba7077fda40 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -225,10 +225,13 @@ static bool zswap_has_pool;
 * helpers and fwd declarations
 **********************************/
 
+/* One zswap tree for each 64M of swap space */
+#define ZSWAP_ADDRESS_SPACE_SHIFT 14
+#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
 static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 {
 	return &zswap_trees[swp_type(swp)][swp_offset(swp)
-		>> SWAP_ADDRESS_SPACE_SHIFT];
+		>> ZSWAP_ADDRESS_SPACE_SHIFT];
 }
 
 #define zswap_pool_debug(msg, p)			\
@@ -1674,7 +1677,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
 	struct xarray *trees, *tree;
 	unsigned int nr, i;
 
-	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+	nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
 	trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
 	if (!trees) {
 		pr_err("alloc failed, zswap disabled for swap type %d\n", type);
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 14/15] mm, swap: implement dynamic allocation of swap table
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (12 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06 15:45   ` Chris Li
  2025-09-05 19:13 ` [PATCH v2 15/15] mm, swap: use a single page for swap table when the size fits Kairui Song
  14 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Now that the swap table is cluster based, a free cluster can free its
table, since no one should be modifying it.

There could still be speculative readers, such as swap cache lookups;
protect them by making the table RCU protected. Every swap table should
be filled with null entries before it is freed, so such readers will
either see a NULL pointer or a null-filled table being lazily freed.
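
A speculative reader then follows the usual RCU pattern, as in the
swap_table_get() helper added below (condensed sketch):

	rcu_read_lock();
	table = rcu_dereference(ci->table);
	swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
	rcu_read_unlock();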

On the allocation side, allocate the table when a cluster starts being
used by any allocation order.
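
The allocator runs in an atomic context, so the allocation path roughly
does the following (condensed sketch of swap_cluster_alloc_table() added
below; error handling and revalidation details elided):

	table = kmem_cache_zalloc(swap_table_cachep,
				  __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!table) {
		/* Drop ci->lock, the global cluster lock and the percpu
		 * local lock, then retry with a sleeping allocation. */
		table = kmem_cache_zalloc(swap_table_cachep,
					  __GFP_HIGH | GFP_KERNEL);
		/* Re-take the locks and re-check: we may have migrated to
		 * a CPU with a usable percpu cluster, or the cluster may
		 * have gained a table in the meantime. */
	}
	rcu_assign_pointer(ci->table, table);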

This way, we can reduce the memory usage of large swap devices
significantly.
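
For example (illustrative arithmetic, 8 bytes per slot): a fully static
layout costs 4K of swap table per 2M cluster, i.e. roughly 2G of swap
tables for a 1T swap device, while with dynamic allocation only the
clusters actually in use pay that cost.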

This idea of dynamically releasing unused swap cluster data was initially
suggested by Chris Li while proposing the cluster swap allocator, and it
suits the swap table idea very well.

Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
---
 mm/swap.h       |   2 +-
 mm/swap_state.c |   9 +--
 mm/swap_table.h |  37 ++++++++-
 mm/swapfile.c   | 202 ++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 199 insertions(+), 51 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index c4fb28845d77..caff4fe30fc5 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -36,7 +36,7 @@ struct swap_cluster_info {
 	u16 count;
 	u8 flags;
 	u8 order;
-	atomic_long_t *table;	/* Swap table entries, see mm/swap_table.h */
+	atomic_long_t __rcu *table;	/* Swap table entries, see mm/swap_table.h */
 	struct list_head list;
 };
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 209d5e9e8d90..dfe8f42fc309 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -92,8 +92,8 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
 	struct folio *folio;
 
 	for (;;) {
-		swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
-					  swp_cluster_offset(entry));
+		swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+					swp_cluster_offset(entry));
 		if (!swp_tb_is_folio(swp_tb))
 			return NULL;
 		folio = swp_tb_to_folio(swp_tb);
@@ -116,11 +116,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
 {
 	unsigned long swp_tb;
 
-	swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
-				  swp_cluster_offset(entry));
+	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+				swp_cluster_offset(entry));
 	if (swp_tb_is_shadow(swp_tb))
 		return swp_tb_to_shadow(swp_tb);
-
 	return NULL;
 }
 
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e1f7cc009701..52254e455304 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -2,8 +2,15 @@
 #ifndef _MM_SWAP_TABLE_H
 #define _MM_SWAP_TABLE_H
 
+#include <linux/rcupdate.h>
+#include <linux/atomic.h>
 #include "swap.h"
 
+/* A typical flat array in each cluster as swap table */
+struct swap_table {
+	atomic_long_t entries[SWAPFILE_CLUSTER];
+};
+
 /*
  * A swap table entry represents the status of a swap slot on a swap
  * (physical or virtual) device. The swap table in each cluster is a
@@ -76,22 +83,46 @@ static inline void *swp_tb_to_shadow(unsigned long swp_tb)
 static inline void __swap_table_set(struct swap_cluster_info *ci,
 				    unsigned int off, unsigned long swp_tb)
 {
+	atomic_long_t *table = rcu_dereference_protected(ci->table, true);
+
+	lockdep_assert_held(&ci->lock);
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
-	atomic_long_set(&ci->table[off], swp_tb);
+	atomic_long_set(&table[off], swp_tb);
 }
 
 static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
 					      unsigned int off, unsigned long swp_tb)
 {
+	atomic_long_t *table = rcu_dereference_protected(ci->table, true);
+
+	lockdep_assert_held(&ci->lock);
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
 	/* Ordering is guaranteed by cluster lock, relax */
-	return atomic_long_xchg_relaxed(&ci->table[off], swp_tb);
+	return atomic_long_xchg_relaxed(&table[off], swp_tb);
 }
 
 static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
 					     unsigned int off)
 {
+	atomic_long_t *table;
+
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
-	return atomic_long_read(&ci->table[off]);
+	table = rcu_dereference_check(ci->table, lockdep_is_held(&ci->lock));
+
+	return atomic_long_read(&table[off]);
+}
+
+static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
+					unsigned int off)
+{
+	atomic_long_t *table;
+	unsigned long swp_tb;
+
+	rcu_read_lock();
+	table = rcu_dereference(ci->table);
+	swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
+	rcu_read_unlock();
+
+	return swp_tb;
 }
 #endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6b3b35a7ddd9..49f93069faef 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -105,6 +105,8 @@ static DEFINE_SPINLOCK(swap_avail_lock);
 
 struct swap_info_struct *swap_info[MAX_SWAPFILES];
 
+static struct kmem_cache *swap_table_cachep;
+
 static DEFINE_MUTEX(swapon_mutex);
 
 static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
@@ -400,10 +402,17 @@ static inline bool cluster_is_discard(struct swap_cluster_info *info)
 	return info->flags == CLUSTER_FLAG_DISCARD;
 }
 
+static inline bool cluster_table_is_alloced(struct swap_cluster_info *ci)
+{
+	return rcu_dereference_protected(ci->table, lockdep_is_held(&ci->lock));
+}
+
 static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
 {
 	if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
 		return false;
+	if (!cluster_table_is_alloced(ci))
+		return false;
 	if (!order)
 		return true;
 	return cluster_is_empty(ci) || order == ci->order;
@@ -421,32 +430,98 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
 	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
 }
 
-static int swap_table_alloc_table(struct swap_cluster_info *ci)
+static void swap_cluster_free_table(struct swap_cluster_info *ci)
 {
-	WARN_ON(ci->table);
-	ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
-	if (!ci->table)
-		return -ENOMEM;
-	return 0;
+	unsigned int ci_off;
+	struct swap_table *table;
+
+	/* Only an empty cluster's table is allowed to be freed */
+	lockdep_assert_held(&ci->lock);
+	VM_WARN_ON_ONCE(!cluster_is_empty(ci));
+	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
+		VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
+	table = (void *)rcu_dereference_protected(ci->table, true);
+	rcu_assign_pointer(ci->table, NULL);
+
+	kmem_cache_free(swap_table_cachep, table);
 }
 
-static void swap_cluster_free_table(struct swap_cluster_info *ci)
+/*
+ * Allocate a swap table may need to sleep, which leads to migration,
+ * so attempt an atomic allocation first then fallback and handle
+ * potential race.
+ */
+static struct swap_cluster_info *
+swap_cluster_alloc_table(struct swap_info_struct *si,
+			 struct swap_cluster_info *ci,
+			 int order)
 {
-	unsigned int ci_off;
-	unsigned long swp_tb;
+	struct swap_cluster_info *pcp_ci;
+	struct swap_table *table;
+	unsigned long offset;
 
-	if (!ci->table)
-		return;
+	/*
+	 * Only cluster isolation from the allocator does table allocation.
+	 * Swap allocator uses a percpu cluster and holds the local lock.
+	 */
+	lockdep_assert_held(&ci->lock);
+	lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+
+	table = kmem_cache_zalloc(swap_table_cachep,
+				  __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	if (table) {
+		rcu_assign_pointer(ci->table, table);
+		return ci;
+	}
+
+	/*
+	 * Try a sleep allocation. Each isolated free cluster may cause
+	 * a sleep allocation, but there is a limited number of them, so
+	 * the potential recursive allocation should be limited.
+	 */
+	spin_unlock(&ci->lock);
+	if (!(si->flags & SWP_SOLIDSTATE))
+		spin_unlock(&si->global_cluster_lock);
+	local_unlock(&percpu_swap_cluster.lock);
+	table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
 
-	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
-		swp_tb = __swap_table_get(ci, ci_off);
-		if (!swp_tb_is_null(swp_tb))
-			pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
-				    swp_tb);
+	local_lock(&percpu_swap_cluster.lock);
+	if (!(si->flags & SWP_SOLIDSTATE))
+		spin_lock(&si->global_cluster_lock);
+	/*
+	 * Back to atomic context. First, check if we migrated to a new
+	 * CPU with a usable percpu cluster. If so, try using that instead.
+	 * No need to check this for spinning devices, as swap on them
+	 * is serialized by the global lock.
+	 *
+	 * The is_usable check is a bit rough, but ensures order 0 success.
+	 */
+	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
+	if ((si->flags & SWP_SOLIDSTATE) && offset) {
+		pcp_ci = swap_cluster_lock(si, offset);
+		if (cluster_is_usable(pcp_ci, order) &&
+		    pcp_ci->count < SWAPFILE_CLUSTER) {
+			ci = pcp_ci;
+			goto free_table;
+		}
+		swap_cluster_unlock(pcp_ci);
 	}
 
-	kfree(ci->table);
-	ci->table = NULL;
+	if (!table)
+		return NULL;
+
+	spin_lock(&ci->lock);
+	/* Nothing should have touched the dangling empty cluster. */
+	if (WARN_ON_ONCE(cluster_table_is_alloced(ci)))
+		goto free_table;
+
+	rcu_assign_pointer(ci->table, table);
+	return ci;
+
+free_table:
+	if (table)
+		kmem_cache_free(swap_table_cachep, table);
+	return ci;
 }
 
 static void move_cluster(struct swap_info_struct *si,
@@ -478,7 +553,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	lockdep_assert_held(&ci->lock);
+	swap_cluster_free_table(ci);
 	move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
@@ -493,15 +568,11 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
  * this returns NULL for an non-empty list.
  */
 static struct swap_cluster_info *isolate_lock_cluster(
-		struct swap_info_struct *si, struct list_head *list)
+		struct swap_info_struct *si, struct list_head *list, int order)
 {
-	struct swap_cluster_info *ci, *ret = NULL;
+	struct swap_cluster_info *ci, *found = NULL;
 
 	spin_lock(&si->lock);
-
-	if (unlikely(!(si->flags & SWP_WRITEOK)))
-		goto out;
-
 	list_for_each_entry(ci, list, list) {
 		if (!spin_trylock(&ci->lock))
 			continue;
@@ -513,13 +584,19 @@ static struct swap_cluster_info *isolate_lock_cluster(
 
 		list_del(&ci->list);
 		ci->flags = CLUSTER_FLAG_NONE;
-		ret = ci;
+		found = ci;
 		break;
 	}
-out:
 	spin_unlock(&si->lock);
 
-	return ret;
+	if (found && !cluster_table_is_alloced(found)) {
+		/* Only an empty free cluster's swap table can be freed. */
+		VM_WARN_ON_ONCE(list != &si->free_clusters);
+		VM_WARN_ON_ONCE(!cluster_is_empty(found));
+		return swap_cluster_alloc_table(si, found, order);
+	}
+
+	return found;
 }
 
 /*
@@ -652,17 +729,27 @@ static void relocate_cluster(struct swap_info_struct *si,
  * added to free cluster list and its usage counter will be increased by 1.
  * Only used for initialization.
  */
-static void inc_cluster_info_page(struct swap_info_struct *si,
+static int inc_cluster_info_page(struct swap_info_struct *si,
 	struct swap_cluster_info *cluster_info, unsigned long page_nr)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+	struct swap_table *table;
 	struct swap_cluster_info *ci;
 
 	ci = cluster_info + idx;
+	if (!ci->table) {
+		table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
+		if (!table)
+			return -ENOMEM;
+		rcu_assign_pointer(ci->table, table);
+	}
+
 	ci->count++;
 
 	VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
 	VM_BUG_ON(ci->flags);
+
+	return 0;
 }
 
 static bool cluster_reclaim_range(struct swap_info_struct *si,
@@ -844,7 +931,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
 	unsigned int found = SWAP_ENTRY_INVALID;
 
 	do {
-		struct swap_cluster_info *ci = isolate_lock_cluster(si, list);
+		struct swap_cluster_info *ci = isolate_lock_cluster(si, list, order);
 		unsigned long offset;
 
 		if (!ci)
@@ -869,7 +956,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	if (force)
 		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
-	while ((ci = isolate_lock_cluster(si, &si->full_clusters))) {
+	while ((ci = isolate_lock_cluster(si, &si->full_clusters, 0))) {
 		offset = cluster_offset(si, ci);
 		end = min(si->max, offset + SWAPFILE_CLUSTER);
 		to_scan--;
@@ -1017,6 +1104,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 done:
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
+
 	return found;
 }
 
@@ -1884,7 +1972,13 @@ swp_entry_t get_swap_page_of_type(int type)
 	/* This is called for allocating swap entry, not cache */
 	if (get_swap_device_info(si)) {
 		if (si->flags & SWP_WRITEOK) {
+			/*
+			 * Grab the local lock to be compliant
+			 * with the swap table allocation.
+			 */
+			local_lock(&percpu_swap_cluster.lock);
 			offset = cluster_alloc_swap_entry(si, 0, 1);
+			local_unlock(&percpu_swap_cluster.lock);
 			if (offset) {
 				entry = swp_entry(si->type, offset);
 				atomic_long_dec(&nr_swap_pages);
@@ -2677,12 +2771,21 @@ static void wait_for_allocation(struct swap_info_struct *si)
 static void free_cluster_info(struct swap_cluster_info *cluster_info,
 			      unsigned long maxpages)
 {
+	struct swap_cluster_info *ci;
 	int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
 	if (!cluster_info)
 		return;
-	for (i = 0; i < nr_clusters; i++)
-		swap_cluster_free_table(&cluster_info[i]);
+	for (i = 0; i < nr_clusters; i++) {
+		ci = cluster_info + i;
+		/* Clusters counting bad page marks will still have a table */
+		spin_lock(&ci->lock);
+		if (rcu_dereference_protected(ci->table, true)) {
+			ci->count = 0;
+			swap_cluster_free_table(ci);
+		}
+		spin_unlock(&ci->lock);
+	}
 	kvfree(cluster_info);
 }
 
@@ -2718,6 +2821,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	struct address_space *mapping;
 	struct inode *inode;
 	struct filename *pathname;
+	unsigned int maxpages;
 	int err, found = 0;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -2824,8 +2928,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->swap_map = NULL;
 	zeromap = p->zeromap;
 	p->zeromap = NULL;
+	maxpages = p->max;
 	cluster_info = p->cluster_info;
-	free_cluster_info(cluster_info, p->max);
 	p->max = 0;
 	p->cluster_info = NULL;
 	spin_unlock(&p->lock);
@@ -2837,6 +2941,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
+	free_cluster_info(cluster_info, maxpages);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
 
@@ -3215,11 +3320,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	if (!cluster_info)
 		goto err;
 
-	for (i = 0; i < nr_clusters; i++) {
+	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
-		if (swap_table_alloc_table(&cluster_info[i]))
-			goto err_free;
-	}
 
 	if (!(si->flags & SWP_SOLIDSTATE)) {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3238,16 +3340,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * See setup_swap_map(): header page, bad pages,
 	 * and the EOF part of the last cluster.
 	 */
-	inc_cluster_info_page(si, cluster_info, 0);
+	err = inc_cluster_info_page(si, cluster_info, 0);
+	if (err)
+		goto err;
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 
 		if (page_nr >= maxpages)
 			continue;
-		inc_cluster_info_page(si, cluster_info, page_nr);
+		err = inc_cluster_info_page(si, cluster_info, page_nr);
+		if (err)
+			goto err;
+	}
+	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
+		err = inc_cluster_info_page(si, cluster_info, i);
+		if (err)
+			goto err;
 	}
-	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
-		inc_cluster_info_page(si, cluster_info, i);
 
 	INIT_LIST_HEAD(&si->free_clusters);
 	INIT_LIST_HEAD(&si->full_clusters);
@@ -3961,6 +4070,15 @@ static int __init swapfile_init(void)
 
 	swapfile_maximum_size = arch_max_swapfile_size();
 
+	/*
+	 * Once a cluster is freed, its swap table content is read
+	 * only, and all swap cache readers (swap_cache_*) verify
+	 * the content before use. So it's safe to use an RCU slab here.
+	 */
+	swap_table_cachep = kmem_cache_create("swap_table",
+			    sizeof(struct swap_table),
+			    0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
+
 #ifdef CONFIG_MIGRATION
 	if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
 		swap_migration_ad_supported = true;
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v2 15/15] mm, swap: use a single page for swap table when the size fits
  2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
                   ` (13 preceding siblings ...)
  2025-09-05 19:13 ` [PATCH v2 14/15] mm, swap: implement dynamic allocation of swap table Kairui Song
@ 2025-09-05 19:13 ` Kairui Song
  2025-09-06 15:48   ` Chris Li
  14 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-05 19:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

We have a cluster size of 512 slots. Each slot consumes 8 bytes in the
swap table, so the swap table size of each cluster is exactly one page
(4K).

If that is the case, allocate one page directly and disable the slab
cache, to reduce the memory usage of the swap table and avoid
fragmentation.
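
That is (illustrative arithmetic, assuming 4K pages and 8-byte longs):

	sizeof(struct swap_table) == SWAPFILE_CLUSTER * sizeof(atomic_long_t)
	                          == 512 * 8 == 4096 == PAGE_SIZE

which is exactly what the SWP_TABLE_USE_PAGE check added below tests.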

Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
---
 mm/swap_table.h |  2 ++
 mm/swapfile.c   | 50 ++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/mm/swap_table.h b/mm/swap_table.h
index 52254e455304..ea244a57a5b7 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -11,6 +11,8 @@ struct swap_table {
 	atomic_long_t entries[SWAPFILE_CLUSTER];
 };
 
+#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
+
 /*
  * A swap table entry represents the status of a swap slot on a swap
  * (physical or virtual) device. The swap table in each cluster is a
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 49f93069faef..ab6e877b0644 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -430,6 +430,38 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
 	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
 }
 
+static struct swap_table *swap_table_alloc(gfp_t gfp)
+{
+	struct folio *folio;
+
+	if (!SWP_TABLE_USE_PAGE)
+		return kmem_cache_zalloc(swap_table_cachep, gfp);
+
+	folio = folio_alloc(gfp | __GFP_ZERO, 0);
+	if (folio)
+		return folio_address(folio);
+	return NULL;
+}
+
+static void swap_table_free_folio_rcu_cb(struct rcu_head *head)
+{
+	struct folio *folio;
+
+	folio = page_folio(container_of(head, struct page, rcu_head));
+	folio_put(folio);
+}
+
+static void swap_table_free(struct swap_table *table)
+{
+	if (!SWP_TABLE_USE_PAGE) {
+		kmem_cache_free(swap_table_cachep, table);
+		return;
+	}
+
+	call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head),
+		 swap_table_free_folio_rcu_cb);
+}
+
 static void swap_cluster_free_table(struct swap_cluster_info *ci)
 {
 	unsigned int ci_off;
@@ -443,7 +475,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
 	table = (void *)rcu_dereference_protected(ci->table, true);
 	rcu_assign_pointer(ci->table, NULL);
 
-	kmem_cache_free(swap_table_cachep, table);
+	swap_table_free(table);
 }
 
 /*
@@ -467,8 +499,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	lockdep_assert_held(&ci->lock);
 	lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
 
-	table = kmem_cache_zalloc(swap_table_cachep,
-				  __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
 	if (table) {
 		rcu_assign_pointer(ci->table, table);
 		return ci;
@@ -483,7 +514,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
 	local_unlock(&percpu_swap_cluster.lock);
-	table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
+	table = swap_table_alloc(__GFP_HIGH | GFP_KERNEL);
 
 	local_lock(&percpu_swap_cluster.lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
@@ -520,7 +551,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 
 free_table:
 	if (table)
-		kmem_cache_free(swap_table_cachep, table);
+		swap_table_free(table);
 	return ci;
 }
 
@@ -738,7 +769,7 @@ static int inc_cluster_info_page(struct swap_info_struct *si,
 
 	ci = cluster_info + idx;
 	if (!ci->table) {
-		table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
+		table = swap_table_alloc(GFP_KERNEL);
 		if (!table)
 			return -ENOMEM;
 		rcu_assign_pointer(ci->table, table);
@@ -4075,9 +4106,10 @@ static int __init swapfile_init(void)
 	 * only, and all swap cache readers (swap_cache_*) verify
 	 * the content before use. So it's safe to use an RCU slab here.
 	 */
-	swap_table_cachep = kmem_cache_create("swap_table",
-			    sizeof(struct swap_table),
-			    0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
+	if (!SWP_TABLE_USE_PAGE)
+		swap_table_cachep = kmem_cache_create("swap_table",
+				    sizeof(struct swap_table),
+				    0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
 
 #ifdef CONFIG_MIGRATION
 	if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
-- 
2.51.0



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 03/15] mm, swap: fix swap cahe index error when retrying reclaim
  2025-09-05 19:13 ` [PATCH v2 03/15] mm, swap: fix swap cahe index error when retrying reclaim Kairui Song
@ 2025-09-05 22:40   ` Nhat Pham
  2025-09-06  6:30     ` Kairui Song
  2025-09-06  1:51   ` Chris Li
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 80+ messages in thread
From: Nhat Pham @ 2025-09-05 22:40 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Kemeng Shi, Baolin Wang, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The allocator will reclaim cached slots while scanning. Currently, it
> will try again if the reclaim found a folio that is already removed from
> the swap cache due to a race. But the following lookup will be using the
> wrong index. It won't cause any OOB issue since the swap cache index is
> truncated upon lookup, but it may lead to reclaiming of an irrelevant

I mean if there is a race, folio->swap could literally be anything
right? Can the following happen: between the filemap_get_folio()
lookup and the locking, the folio can have its swap slot freed up,
then obtain a new swap slot, potentially from an entirely different
swapfile (i.e different swp_type(folio->swap)).

It is very unlikely, and in many setups you only have one swapfile. Still...

> folio.
>
> This should not cause a measurable issue, but we should fix it.
>
> Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
> Signed-off-by: Kairui Song <kasong@tencent.com>

Yeah that's pretty nuanced lol. It is unlikely to cause any issue
indeed - we're just occasionally swap-cache-reclaim some rando folio
haha.

Anyway, FWIW:

Acked-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-05 19:13 ` [PATCH v2 01/15] docs/mm: add document for swap table Kairui Song
@ 2025-09-05 23:58   ` Chris Li
  2025-09-06 13:31     ` Kairui Song
  2025-09-08 12:35   ` Baoquan He
  1 sibling, 1 reply; 80+ messages in thread
From: Chris Li @ 2025-09-05 23:58 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> From: Chris Li <chrisl@kernel.org>
>
> Swap table is the new swap cache.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
>  MAINTAINERS                     |  1 +
>  2 files changed, 73 insertions(+)
>  create mode 100644 Documentation/mm/swap-table.rst
>
> diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
> new file mode 100644
> index 000000000000..929cd91aa984
> --- /dev/null
> +++ b/Documentation/mm/swap-table.rst
> @@ -0,0 +1,72 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
> +
> +==========
> +Swap Table
> +==========
> +
> +Swap table implements swap cache as a per-cluster swap cache value array.
> +
> +Swap Entry
> +----------
> +
> +A swap entry contains the information required to serve the anonymous page
> +fault.
> +
> +Swap entry is encoded as two parts: swap type and swap offset.
> +
> +The swap type indicates which swap device to use.
> +The swap offset is the offset of the swap file to read the page data from.
> +
> +Swap Cache
> +----------
> +
> +Swap cache is a map to look up folios using swap entry as the key. The result
> +value can have three possible types depending on which stage of this swap entry
> +was in.
> +
> +1. NULL: This swap entry is not used.
> +
> +2. folio: A folio has been allocated and bound to this swap entry. This is
> +   the transient state of swap out or swap in. The folio data can be in
> +   the folio or swap file, or both.
> +
> +3. shadow: The shadow contains the working set information of the swap

I just noticed a typo here, should be "swapped out page"

> +   outed folio. This is the normal state for a swap outed page.

Same here. "swap outed page" -> "swapped out page"

Chris


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up
  2025-09-05 19:13 ` [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up Kairui Song
@ 2025-09-05 23:59   ` Chris Li
  2025-09-08 11:43   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-05 23:59 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The swap cache lookup helper swap_cache_get_folio currently does
> readahead updates as well, so callers that are not doing swapin from any
> VMA or mapping are forced to reuse filemap helpers instead, and have to
> access the swap cache space directly.
>
> So decouple readahead update with swap cache lookup. Move the readahead
> update part into a standalone helper. Let the caller call the readahead
> update helper if they do readahead. And convert all swap cache lookups
> to use swap_cache_get_folio.
>
> After this commit, there are only three special cases for accessing swap
> cache space now: huge memory splitting, migration, and shmem replacing,
> because they need to lock the XArray. The following commits will wrap
> their accesses to the swap cache too, with special helpers.
>
> And worth noting, currently dropbehind is not supported for anon folio,
> and we will never see a dropbehind folio in swap cache. The unified
> helper can be updated later to handle that.
>
> While at it, add proper kerneldoc for the touched helpers.
>
> No functional change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Barry Song <baohua@kernel.org>
> ---
>  mm/memory.c      |   6 ++-
>  mm/mincore.c     |   3 +-
>  mm/shmem.c       |   4 +-
>  mm/swap.h        |  13 ++++--
>  mm/swap_state.c  | 109 +++++++++++++++++++++++++----------------------
>  mm/swapfile.c    |  11 +++--
>  mm/userfaultfd.c |   5 +--
>  7 files changed, 81 insertions(+), 70 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d9de6c056179..10ef528a5f44 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (unlikely(!si))
>                 goto out;
>
> -       folio = swap_cache_get_folio(entry, vma, vmf->address);
> -       if (folio)
> +       folio = swap_cache_get_folio(entry);
> +       if (folio) {
> +               swap_update_readahead(folio, vma, vmf->address);
>                 page = folio_file_page(folio, swp_offset(entry));
> +       }
>         swapcache = folio;
>
>         if (!folio) {
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 2f3e1816a30d..8ec4719370e1 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
>                 if (!si)
>                         return 0;
>         }
> -       folio = filemap_get_entry(swap_address_space(entry),
> -                                 swap_cache_index(entry));
> +       folio = swap_cache_get_folio(entry);
>         if (shmem)
>                 put_swap_device(si);
>         /* The swap cache space contains either folio, shadow or NULL */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2df26f4d6e60..4e27e8e5da3b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>         }
>
>         /* Look it up and read it in.. */
> -       folio = swap_cache_get_folio(swap, NULL, 0);
> +       folio = swap_cache_get_folio(swap);
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
>                         /* Direct swapin skipping swap cache & readahead */
> @@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>                         count_vm_event(PGMAJFAULT);
>                         count_memcg_event_mm(fault_mm, PGMAJFAULT);
>                 }
> +       } else {
> +               swap_update_readahead(folio, NULL, 0);
>         }
>
>         if (order > folio_order(folio)) {
> diff --git a/mm/swap.h b/mm/swap.h
> index 1ae44d4193b1..efb6d7ff9f30 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>                                   unsigned long end);
>  void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> -               struct vm_area_struct *vma, unsigned long addr);
> +struct folio *swap_cache_get_folio(swp_entry_t entry);
>  struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                 struct vm_area_struct *vma, unsigned long addr,
>                 struct swap_iocb **plug);
> @@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>                 struct mempolicy *mpol, pgoff_t ilx);
>  struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
>                 struct vm_fault *vmf);
> +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> +                          unsigned long addr);
>
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> @@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>         return NULL;
>  }
>
> +static inline void swap_update_readahead(struct folio *folio,
> +               struct vm_area_struct *vma, unsigned long addr)
> +{
> +}
> +
>  static inline int swap_writeout(struct folio *folio,
>                 struct swap_iocb **swap_plug)
>  {
> @@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
>  {
>  }
>
> -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> -               struct vm_area_struct *vma, unsigned long addr)
> +static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
>  {
>         return NULL;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 99513b74b5d8..68ec531d0f2b 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -69,6 +69,27 @@ void show_swap_cache_info(void)
>         printk("Total swap = %lukB\n", K(total_swap_pages));
>  }
>
> +/**
> + * swap_cache_get_folio - Looks up a folio in the swap cache.
> + * @entry: swap entry used for the lookup.
> + *
> + * A found folio will be returned unlocked and with its refcount increased.
> + *
> + * Context: Caller must ensure @entry is valid and protect the swap device
> + * with reference count or locks.
> + * Return: Returns the found folio on success, NULL otherwise. The caller
> + * must lock and check if the folio still matches the swap entry before
> + * use.
> + */
> +struct folio *swap_cache_get_folio(swp_entry_t entry)
> +{
> +       struct folio *folio = filemap_get_folio(swap_address_space(entry),
> +                                               swap_cache_index(entry));
> +       if (IS_ERR(folio))
> +               return NULL;
> +       return folio;
> +}
> +
>  void *get_shadow_from_swap_cache(swp_entry_t entry)
>  {
>         struct address_space *address_space = swap_address_space(entry);
> @@ -272,55 +293,43 @@ static inline bool swap_use_vma_readahead(void)
>         return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
>  }
>
> -/*
> - * Lookup a swap entry in the swap cache. A found folio will be returned
> - * unlocked and with its refcount incremented - we rely on the kernel
> - * lock getting page table operations atomic even if we drop the folio
> - * lock before returning.
> - *
> - * Caller must lock the swap device or hold a reference to keep it valid.
> +/**
> + * swap_update_readahead - Update the readahead statistics of VMA or globally.
> + * @folio: the swap cache folio that just got hit.
> + * @vma: the VMA that should be updated, could be NULL for global update.
> + * @addr: the addr that triggered the swapin, ignored if @vma is NULL.
>   */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> -               struct vm_area_struct *vma, unsigned long addr)
> +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> +                          unsigned long addr)
>  {
> -       struct folio *folio;
> -
> -       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> -       if (!IS_ERR(folio)) {
> -               bool vma_ra = swap_use_vma_readahead();
> -               bool readahead;
> +       bool readahead, vma_ra = swap_use_vma_readahead();
>
> -               /*
> -                * At the moment, we don't support PG_readahead for anon THP
> -                * so let's bail out rather than confusing the readahead stat.
> -                */
> -               if (unlikely(folio_test_large(folio)))
> -                       return folio;
> -
> -               readahead = folio_test_clear_readahead(folio);
> -               if (vma && vma_ra) {
> -                       unsigned long ra_val;
> -                       int win, hits;
> -
> -                       ra_val = GET_SWAP_RA_VAL(vma);
> -                       win = SWAP_RA_WIN(ra_val);
> -                       hits = SWAP_RA_HITS(ra_val);
> -                       if (readahead)
> -                               hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> -                       atomic_long_set(&vma->swap_readahead_info,
> -                                       SWAP_RA_VAL(addr, win, hits));
> -               }
> -
> -               if (readahead) {
> -                       count_vm_event(SWAP_RA_HIT);
> -                       if (!vma || !vma_ra)
> -                               atomic_inc(&swapin_readahead_hits);
> -               }
> -       } else {
> -               folio = NULL;
> +       /*
> +        * At the moment, we don't support PG_readahead for anon THP
> +        * so let's bail out rather than confusing the readahead stat.
> +        */
> +       if (unlikely(folio_test_large(folio)))
> +               return;
> +
> +       readahead = folio_test_clear_readahead(folio);
> +       if (vma && vma_ra) {
> +               unsigned long ra_val;
> +               int win, hits;
> +
> +               ra_val = GET_SWAP_RA_VAL(vma);
> +               win = SWAP_RA_WIN(ra_val);
> +               hits = SWAP_RA_HITS(ra_val);
> +               if (readahead)
> +                       hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> +               atomic_long_set(&vma->swap_readahead_info,
> +                               SWAP_RA_VAL(addr, win, hits));
>         }
>
> -       return folio;
> +       if (readahead) {
> +               count_vm_event(SWAP_RA_HIT);
> +               if (!vma || !vma_ra)
> +                       atomic_inc(&swapin_readahead_hits);
> +       }
>  }
>
>  struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> @@ -336,14 +345,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>         *new_page_allocated = false;
>         for (;;) {
>                 int err;
> -               /*
> -                * First check the swap cache.  Since this is normally
> -                * called after swap_cache_get_folio() failed, re-calling
> -                * that would confuse statistics.
> -                */
> -               folio = filemap_get_folio(swap_address_space(entry),
> -                                         swap_cache_index(entry));
> -               if (!IS_ERR(folio))
> +
> +               /* Check the swap cache in case the folio is already there */
> +               folio = swap_cache_get_folio(entry);
> +               if (folio)
>                         goto got_folio;
>
>                 /*
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a7ffabbe65ef..4b8ab2cb49ca 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>                                  unsigned long offset, unsigned long flags)
>  {
>         swp_entry_t entry = swp_entry(si->type, offset);
> -       struct address_space *address_space = swap_address_space(entry);
>         struct swap_cluster_info *ci;
>         struct folio *folio;
>         int ret, nr_pages;
>         bool need_reclaim;
>
>  again:
> -       folio = filemap_get_folio(address_space, swap_cache_index(entry));
> -       if (IS_ERR(folio))
> +       folio = swap_cache_get_folio(entry);
> +       if (!folio)
>                 return 0;
>
>         nr_pages = folio_nr_pages(folio);
> @@ -2131,7 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                 pte_unmap(pte);
>                 pte = NULL;
>
> -               folio = swap_cache_get_folio(entry, vma, addr);
> +               folio = swap_cache_get_folio(entry);
>                 if (!folio) {
>                         struct vm_fault vmf = {
>                                 .vma = vma,
> @@ -2357,8 +2356,8 @@ static int try_to_unuse(unsigned int type)
>                (i = find_next_to_unuse(si, i)) != 0) {
>
>                 entry = swp_entry(type, i);
> -               folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> -               if (IS_ERR(folio))
> +               folio = swap_cache_get_folio(entry);
> +               if (!folio)
>                         continue;
>
>                 /*
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 50aaa8dcd24c..af61b95c89e4 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1489,9 +1489,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
>                  * separately to allow proper handling.
>                  */
>                 if (!src_folio)
> -                       folio = filemap_get_folio(swap_address_space(entry),
> -                                       swap_cache_index(entry));
> -               if (!IS_ERR_OR_NULL(folio)) {
> +                       folio = swap_cache_get_folio(entry);
> +               if (folio) {
>                         if (folio_test_large(folio)) {
>                                 ret = -EBUSY;
>                                 folio_put(folio);
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim
  2025-09-05 19:13 ` [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim Kairui Song
  2025-09-05 22:40   ` Nhat Pham
@ 2025-09-06  1:51   ` Chris Li
  2025-09-06  6:28     ` Kairui Song
  2025-09-08  3:08   ` Baolin Wang
  2025-09-08 11:45   ` David Hildenbrand
  3 siblings, 1 reply; 80+ messages in thread
From: Chris Li @ 2025-09-06  1:51 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Hi Kairui,

The patch looks obviously correct to me with some very minor nitpicks following.

Acked-by: Chris Li <chrisl@kernel.org>

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The allocator will reclaim cached slots while scanning. Currently, it
> will try again if the reclaim finds a folio that has already been
> removed from the swap cache due to a race, but the following lookup then
> uses the wrong index. It won't cause any OOB issue since the swap cache
> index is truncated upon lookup, but it may lead to reclaiming an
> irrelevant folio.
>
> This should not cause a measurable issue, but we should fix it.
>
> Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4b8ab2cb49ca..4c63fc62f4cb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -240,13 +240,13 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>          * Offset could point to the middle of a large folio, or folio
>          * may no longer point to the expected offset before it's locked.
>          */
> -       entry = folio->swap;
Nitpick:
This and the following lines dereference folio->swap and call
swp_offset() several times. You can use a few local variables to cache
the value in a register and make fewer function calls. I haven't checked
whether the compiler performs the same common subexpression elimination
here; a good compiler should. The following looks less busy and doesn't
rely on the compiler to optimize it for you:

           fe = folio->swap;
           eoffset = swp_offset(fe);
           if (offset < eoffset || offset >= eoffset + nr_pages) {
...
           }
           offset = eoffset;

This might generate better code due to fewer function calls. If the
compiler does a perfect job, the original code can generate the same
optimized code as well.

> -       if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> +       if (offset < swp_offset(folio->swap) ||
> +           offset >= swp_offset(folio->swap) + nr_pages) {
>                 folio_unlock(folio);
>                 folio_put(folio);
>                 goto again;
>         }
> -       offset = swp_offset(entry);
> +       offset = swp_offset(folio->swap);

So entry is only assigned once, at the top of the function, and never changed?

You can declare it const.
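
That is, a sketch of the suggestion, reusing the declaration already at
the top of the function:

           const swp_entry_t entry = swp_entry(si->type, offset);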

Chris

>
>         need_reclaim = ((flags & TTRS_ANYWAY) ||
>                         ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 04/15] mm, swap: check page poison flag after locking it
  2025-09-05 19:13 ` [PATCH v2 04/15] mm, swap: check page poison flag after locking it Kairui Song
@ 2025-09-06  2:00   ` Chris Li
  2025-09-08 12:11   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06  2:00 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Separating this patch out makes it easier to read for me. Thank you.
The mixed diff in the last V1 was very messy. In my last attempt I
missed that the out_page path will unlock the HWPoison page as well.
Now it is obviously correct to me.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Instead of checking the poison flag only in the fast swap cache lookup
> path, always check the poison flags after locking a swap cache folio.
>
> There are two reasons to do so.
>
> The folio is unstable and could be removed from the swap cache anytime,
> so it's totally possible that the folio is no longer the backing folio
> of a swap entry, and could be an irrelevant poisoned folio. We might
> mistakenly kill a faulting process.
>
> And it's entirely possible, or even common, for the slow swapin path
> (swapin_readahead) to bring in a cached folio. The cached folio could be
> poisoned, too. Only checking the poison flag in the fast path will miss
> such folios.
>
> The race window is tiny, so this is very unlikely to happen, though.
> While at it, also add an unlikely() annotation.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 10ef528a5f44..94a5928e8ace 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4661,10 +4661,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 goto out;
>
>         folio = swap_cache_get_folio(entry);
> -       if (folio) {
> +       if (folio)
>                 swap_update_readahead(folio, vma, vmf->address);
> -               page = folio_file_page(folio, swp_offset(entry));
> -       }
>         swapcache = folio;
>
>         if (!folio) {
> @@ -4735,20 +4733,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 ret = VM_FAULT_MAJOR;
>                 count_vm_event(PGMAJFAULT);
>                 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> -               page = folio_file_page(folio, swp_offset(entry));
> -       } else if (PageHWPoison(page)) {
> -               /*
> -                * hwpoisoned dirty swapcache pages are kept for killing
> -                * owner processes (which may be unknown at hwpoison time)
> -                */
> -               ret = VM_FAULT_HWPOISON;
> -               goto out_release;
>         }
>
>         ret |= folio_lock_or_retry(folio, vmf);
>         if (ret & VM_FAULT_RETRY)
>                 goto out_release;
>
> +       page = folio_file_page(folio, swp_offset(entry));
>         if (swapcache) {
>                 /*
>                  * Make sure folio_free_swap() or swapoff did not release the
> @@ -4761,6 +4752,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                              page_swap_entry(page).val != entry.val))
>                         goto out_page;
>
> +               if (unlikely(PageHWPoison(page))) {
> +                       /*
> +                        * hwpoisoned dirty swapcache pages are kept for killing
> +                        * owner processes (which may be unknown at hwpoison time)
> +                        */
> +                       ret = VM_FAULT_HWPOISON;
> +                       goto out_page;
> +               }
> +
>                 /*
>                  * KSM sometimes has to copy on read faults, for example, if
>                  * folio->index of non-ksm folios would be nonlinear inside the
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use
  2025-09-05 19:13 ` [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use Kairui Song
@ 2025-09-06  2:12   ` Chris Li
  2025-09-06  6:32     ` Kairui Song
  2025-09-08 12:18   ` David Hildenbrand
  1 sibling, 1 reply; 80+ messages in thread
From: Chris Li @ 2025-09-06  2:12 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Looks correct to me.

Acked-by: Chris Li <chrisl@kernel.org>

Some nitpicks follow:

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cache lookup only increases the reference count of the returned
> folio. That's not enough to ensure a folio is stable in the swap
> cache, so the folio could be removed from the swap cache at any
> time. The caller should always lock and check the folio before using it.
>
> We have just documented this in kerneldoc, now introduce a helper for swap
> cache folio verification with proper sanity checks.
>
> Also, sanitize a few current users to use this convention and the new
> helper for easier debugging. They have not shown observable problems
> yet, only trivial issues like wasted CPU cycles on swapoff or
> reclaiming. They would fail in some other way, but it is still better to
> always follow this convention to make things robust and make later
> commits easier to do.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/memory.c     |  3 +--
>  mm/swap.h       | 24 ++++++++++++++++++++++++
>  mm/swap_state.c |  7 +++++--
>  mm/swapfile.c   | 10 +++++++---
>  4 files changed, 37 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 94a5928e8ace..5808c4ef21b3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4748,8 +4748,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                  * swapcache, we need to check that the page's swap has not
>                  * changed.
>                  */
> -               if (unlikely(!folio_test_swapcache(folio) ||
> -                            page_swap_entry(page).val != entry.val))
> +               if (unlikely(!folio_matches_swap_entry(folio, entry)))
>                         goto out_page;
>
>                 if (unlikely(PageHWPoison(page))) {
> diff --git a/mm/swap.h b/mm/swap.h
> index efb6d7ff9f30..a69e18b12b45 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -52,6 +52,25 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
>         return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
>  }
>
> +/**
> + * folio_matches_swap_entry - Check if a folio matches a given swap entry.
> + * @folio: The folio.
> + * @entry: The swap entry to check against.
> + *
> + * Context: The caller should have the folio locked to ensure it's stable
> + * and nothing will move it in or out of the swap cache.
> + * Return: true or false.
> + */
> +static inline bool folio_matches_swap_entry(const struct folio *folio,
> +                                           swp_entry_t entry)
> +{
> +       VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +       if (!folio_test_swapcache(folio))
> +               return false;
> +       VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, folio_nr_pages(folio)), folio);

You should cache folio->swap.val in a local variable. Because of the
WARN_ON_ONCE, I think the compiler has no choice but to load it twice?
I haven't verified that myself.

There is no downside from the compiler's point of view in using more
local variables; the compiler generates an equivalent internal variable
anyway.

> +       return folio->swap.val == round_down(entry.val, folio_nr_pages(folio));

Same for folio_nr_pages(folio): you should cache it too. The function
will look less busy.
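
Something like this minimal sketch of the suggestion (untested, same
semantics intended):

           static inline bool folio_matches_swap_entry(const struct folio *folio,
                                                       swp_entry_t entry)
           {
                   unsigned long swap_val;
                   long nr_pages;

                   VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
                   if (!folio_test_swapcache(folio))
                           return false;

                   /* Evaluate each expression once and reuse the locals */
                   swap_val = folio->swap.val;
                   nr_pages = folio_nr_pages(folio);
                   VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(swap_val, nr_pages), folio);
                   return swap_val == round_down(entry.val, nr_pages);
           }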

Chris

> +}
> +
>  void show_swap_cache_info(void);
>  void *get_shadow_from_swap_cache(swp_entry_t entry);
>  int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> @@ -144,6 +163,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
>         return 0;
>  }
>
> +static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry_t entry)
> +{
> +       return false;
> +}
> +
>  static inline void show_swap_cache_info(void)
>  {
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 68ec531d0f2b..9225d6b695ad 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -79,7 +79,7 @@ void show_swap_cache_info(void)
>   * with reference count or locks.
>   * Return: Returns the found folio on success, NULL otherwise. The caller
>   * must lock and check if the folio still matches the swap entry before
> - * use.
> + * use (e.g. with folio_matches_swap_entry).
>   */
>  struct folio *swap_cache_get_folio(swp_entry_t entry)
>  {
> @@ -346,7 +346,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>         for (;;) {
>                 int err;
>
> -               /* Check the swap cache in case the folio is already there */
> +               /*
> +                * Check the swap cache first, if a cached folio is found,
> +                * return it unlocked. The caller will lock and check it.
> +                */
>                 folio = swap_cache_get_folio(entry);
>                 if (folio)
>                         goto got_folio;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4c63fc62f4cb..1bd90f17440f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -240,14 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>          * Offset could point to the middle of a large folio, or folio
>          * may no longer point to the expected offset before it's locked.
>          */
> -       if (offset < swp_offset(folio->swap) ||
> -           offset >= swp_offset(folio->swap) + nr_pages) {
> +       if (!folio_matches_swap_entry(folio, entry)) {
>                 folio_unlock(folio);
>                 folio_put(folio);
>                 goto again;
>         }
>         offset = swp_offset(folio->swap);
> -
>         need_reclaim = ((flags & TTRS_ANYWAY) ||
>                         ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
>                         ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
> @@ -2150,6 +2148,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>                 }
>
>                 folio_lock(folio);
> +               if (!folio_matches_swap_entry(folio, entry)) {
> +                       folio_unlock(folio);
> +                       folio_put(folio);
> +                       continue;
> +               }
> +
>                 folio_wait_writeback(folio);
>                 ret = unuse_pte(vma, pmd, addr, entry, folio);
>                 if (ret < 0) {
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers
  2025-09-05 19:13 ` [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
@ 2025-09-06  2:13   ` Chris Li
  2025-09-08  3:03   ` Baolin Wang
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06  2:13 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris
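
The renamed helpers keep the usual lock/operate/unlock pattern; a minimal
usage sketch based on the call sites in the diff below:

           struct swap_cluster_info *ci;

           ci = swap_cluster_lock(si, offset);
           /* inspect or modify per-cluster state, e.g. swap_map counts */
           swap_cluster_unlock(ci);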

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No feature change. Move cluster-related definitions and helpers to
> mm/swap.h, tidy them up, and add a "swap_" prefix to the cluster
> lock/unlock helpers so they can be used outside of the swap files. While
> at it, add kerneldoc.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/swap.h | 34 ----------------
>  mm/swap.h            | 70 ++++++++++++++++++++++++++++++++
>  mm/swapfile.c        | 97 +++++++++++++-------------------------------
>  3 files changed, 99 insertions(+), 102 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 23452f014ca1..7e1fe4ff3d30 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -235,40 +235,6 @@ enum {
>  /* Special value in each swap_map continuation */
>  #define SWAP_CONT_MAX  0x7f    /* Max count */
>
> -/*
> - * We use this to track usage of a cluster. A cluster is a block of swap disk
> - * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> - * free clusters are organized into a list. We fetch an entry from the list to
> - * get a free cluster.
> - *
> - * The flags field determines if a cluster is free. This is
> - * protected by cluster lock.
> - */
> -struct swap_cluster_info {
> -       spinlock_t lock;        /*
> -                                * Protect swap_cluster_info fields
> -                                * other than list, and swap_info_struct->swap_map
> -                                * elements corresponding to the swap cluster.
> -                                */
> -       u16 count;
> -       u8 flags;
> -       u8 order;
> -       struct list_head list;
> -};
> -
> -/* All on-list cluster must have a non-zero flag. */
> -enum swap_cluster_flags {
> -       CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
> -       CLUSTER_FLAG_FREE,
> -       CLUSTER_FLAG_NONFULL,
> -       CLUSTER_FLAG_FRAG,
> -       /* Clusters with flags above are allocatable */
> -       CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
> -       CLUSTER_FLAG_FULL,
> -       CLUSTER_FLAG_DISCARD,
> -       CLUSTER_FLAG_MAX,
> -};
> -
>  /*
>   * The first page in the swap file is the swap header, which is always marked
>   * bad to prevent it from being allocated as an entry. This also prevents the
> diff --git a/mm/swap.h b/mm/swap.h
> index a69e18b12b45..39b27337bc0a 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -7,10 +7,80 @@ struct swap_iocb;
>
>  extern int page_cluster;
>
> +#ifdef CONFIG_THP_SWAP
> +#define SWAPFILE_CLUSTER       HPAGE_PMD_NR
> +#define swap_entry_order(order)        (order)
> +#else
> +#define SWAPFILE_CLUSTER       256
> +#define swap_entry_order(order)        0
> +#endif
> +
> +/*
> + * We use this to track usage of a cluster. A cluster is a block of swap disk
> + * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> + * free clusters are organized into a list. We fetch an entry from the list to
> + * get a free cluster.
> + *
> + * The flags field determines if a cluster is free. This is
> + * protected by cluster lock.
> + */
> +struct swap_cluster_info {
> +       spinlock_t lock;        /*
> +                                * Protect swap_cluster_info fields
> +                                * other than list, and swap_info_struct->swap_map
> +                                * elements corresponding to the swap cluster.
> +                                */
> +       u16 count;
> +       u8 flags;
> +       u8 order;
> +       struct list_head list;
> +};
> +
> +/* All on-list cluster must have a non-zero flag. */
> +enum swap_cluster_flags {
> +       CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
> +       CLUSTER_FLAG_FREE,
> +       CLUSTER_FLAG_NONFULL,
> +       CLUSTER_FLAG_FRAG,
> +       /* Clusters with flags above are allocatable */
> +       CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
> +       CLUSTER_FLAG_FULL,
> +       CLUSTER_FLAG_DISCARD,
> +       CLUSTER_FLAG_MAX,
> +};
> +
>  #ifdef CONFIG_SWAP
>  #include <linux/swapops.h> /* for swp_offset */
>  #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +static inline struct swap_cluster_info *swp_offset_cluster(
> +               struct swap_info_struct *si, pgoff_t offset)
> +{
> +       return &si->cluster_info[offset / SWAPFILE_CLUSTER];
> +}
> +
> +/**
> + * swap_cluster_lock - Lock and return the swap cluster of given offset.
> + * @si: swap device the cluster belongs to.
> + * @offset: the swap entry offset, pointing to a valid slot.
> + *
> + * Context: The caller must ensure the offset is in the valid range and
> + * protect the swap device with reference count or locks.
> + */
> +static inline struct swap_cluster_info *swap_cluster_lock(
> +               struct swap_info_struct *si, unsigned long offset)
> +{
> +       struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
> +
> +       spin_lock(&ci->lock);
> +       return ci;
> +}
> +
> +static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> +{
> +       spin_unlock(&ci->lock);
> +}
> +
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1bd90f17440f..547ad4bfe1d8 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -58,9 +58,6 @@ static void swap_entries_free(struct swap_info_struct *si,
>  static void swap_range_alloc(struct swap_info_struct *si,
>                              unsigned int nr_entries);
>  static bool folio_swapcache_freeable(struct folio *folio);
> -static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> -                                             unsigned long offset);
> -static inline void unlock_cluster(struct swap_cluster_info *ci);
>
>  static DEFINE_SPINLOCK(swap_lock);
>  static unsigned int nr_swapfiles;
> @@ -257,9 +254,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>          * swap_map is HAS_CACHE only, which means the slots have no page table
>          * reference or pending writeback, and can't be allocated to others.
>          */
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>         need_reclaim = swap_only_has_cache(si, offset, nr_pages);
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         if (!need_reclaim)
>                 goto out_unlock;
>
> @@ -384,19 +381,6 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>         }
>  }
>
> -#ifdef CONFIG_THP_SWAP
> -#define SWAPFILE_CLUSTER       HPAGE_PMD_NR
> -
> -#define swap_entry_order(order)        (order)
> -#else
> -#define SWAPFILE_CLUSTER       256
> -
> -/*
> - * Define swap_entry_order() as constant to let compiler to optimize
> - * out some code if !CONFIG_THP_SWAP
> - */
> -#define swap_entry_order(order)        0
> -#endif
>  #define LATENCY_LIMIT          256
>
>  static inline bool cluster_is_empty(struct swap_cluster_info *info)
> @@ -424,34 +408,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
>         return ci - si->cluster_info;
>  }
>
> -static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
> -                                                         unsigned long offset)
> -{
> -       return &si->cluster_info[offset / SWAPFILE_CLUSTER];
> -}
> -
>  static inline unsigned int cluster_offset(struct swap_info_struct *si,
>                                           struct swap_cluster_info *ci)
>  {
>         return cluster_index(si, ci) * SWAPFILE_CLUSTER;
>  }
>
> -static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> -                                                    unsigned long offset)
> -{
> -       struct swap_cluster_info *ci;
> -
> -       ci = offset_to_cluster(si, offset);
> -       spin_lock(&ci->lock);
> -
> -       return ci;
> -}
> -
> -static inline void unlock_cluster(struct swap_cluster_info *ci)
> -{
> -       spin_unlock(&ci->lock);
> -}
> -
>  static void move_cluster(struct swap_info_struct *si,
>                          struct swap_cluster_info *ci, struct list_head *list,
>                          enum swap_cluster_flags new_flags)
> @@ -807,7 +769,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>         }
>  out:
>         relocate_cluster(si, ci);
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         if (si->flags & SWP_SOLIDSTATE) {
>                 this_cpu_write(percpu_swap_cluster.offset[order], next);
>                 this_cpu_write(percpu_swap_cluster.si[order], si);
> @@ -874,7 +836,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
>                 if (ci->flags == CLUSTER_FLAG_NONE)
>                         relocate_cluster(si, ci);
>
> -               unlock_cluster(ci);
> +               swap_cluster_unlock(ci);
>                 if (to_scan <= 0)
>                         break;
>         }
> @@ -913,7 +875,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                 if (offset == SWAP_ENTRY_INVALID)
>                         goto new_cluster;
>
> -               ci = lock_cluster(si, offset);
> +               ci = swap_cluster_lock(si, offset);
>                 /* Cluster could have been used by another order */
>                 if (cluster_is_usable(ci, order)) {
>                         if (cluster_is_empty(ci))
> @@ -921,7 +883,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         found = alloc_swap_scan_cluster(si, ci, offset,
>                                                         order, usage);
>                 } else {
> -                       unlock_cluster(ci);
> +                       swap_cluster_unlock(ci);
>                 }
>                 if (found)
>                         goto done;
> @@ -1202,7 +1164,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
>         if (!si || !offset || !get_swap_device_info(si))
>                 return false;
>
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>         if (cluster_is_usable(ci, order)) {
>                 if (cluster_is_empty(ci))
>                         offset = cluster_offset(si, ci);
> @@ -1210,7 +1172,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
>                 if (found)
>                         *entry = swp_entry(si->type, found);
>         } else {
> -               unlock_cluster(ci);
> +               swap_cluster_unlock(ci);
>         }
>
>         put_swap_device(si);
> @@ -1478,14 +1440,14 @@ static void swap_entries_put_cache(struct swap_info_struct *si,
>         unsigned long offset = swp_offset(entry);
>         struct swap_cluster_info *ci;
>
> -       ci = lock_cluster(si, offset);
> -       if (swap_only_has_cache(si, offset, nr))
> +       ci = swap_cluster_lock(si, offset);
> +       if (swap_only_has_cache(si, offset, nr)) {
>                 swap_entries_free(si, ci, entry, nr);
> -       else {
> +       } else {
>                 for (int i = 0; i < nr; i++, entry.val++)
>                         swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
>         }
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>  }
>
>  static bool swap_entries_put_map(struct swap_info_struct *si,
> @@ -1503,7 +1465,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
>         if (count != 1 && count != SWAP_MAP_SHMEM)
>                 goto fallback;
>
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>         if (!swap_is_last_map(si, offset, nr, &has_cache)) {
>                 goto locked_fallback;
>         }
> @@ -1512,21 +1474,20 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
>         else
>                 for (i = 0; i < nr; i++)
>                         WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>
>         return has_cache;
>
>  fallback:
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>  locked_fallback:
>         for (i = 0; i < nr; i++, entry.val++) {
>                 count = swap_entry_put_locked(si, ci, entry, 1);
>                 if (count == SWAP_HAS_CACHE)
>                         has_cache = true;
>         }
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         return has_cache;
> -
>  }
>
>  /*
> @@ -1576,7 +1537,7 @@ static void swap_entries_free(struct swap_info_struct *si,
>         unsigned char *map_end = map + nr_pages;
>
>         /* It should never free entries across different clusters */
> -       VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
> +       VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
>         VM_BUG_ON(cluster_is_empty(ci));
>         VM_BUG_ON(ci->count < nr_pages);
>
> @@ -1651,9 +1612,9 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
>         struct swap_cluster_info *ci;
>         int count;
>
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>         count = swap_count(si->swap_map[offset]);
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         return !!count;
>  }
>
> @@ -1676,7 +1637,7 @@ int swp_swapcount(swp_entry_t entry)
>
>         offset = swp_offset(entry);
>
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>
>         count = swap_count(si->swap_map[offset]);
>         if (!(count & COUNT_CONTINUED))
> @@ -1699,7 +1660,7 @@ int swp_swapcount(swp_entry_t entry)
>                 n *= (SWAP_CONT_MAX + 1);
>         } while (tmp_count & COUNT_CONTINUED);
>  out:
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         return count;
>  }
>
> @@ -1714,7 +1675,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
>         int i;
>         bool ret = false;
>
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>         if (nr_pages == 1) {
>                 if (swap_count(map[roffset]))
>                         ret = true;
> @@ -1727,7 +1688,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
>                 }
>         }
>  unlock_out:
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         return ret;
>  }
>
> @@ -2660,8 +2621,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
>         BUG_ON(si->flags & SWP_WRITEOK);
>
>         for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
> -               ci = lock_cluster(si, offset);
> -               unlock_cluster(ci);
> +               ci = swap_cluster_lock(si, offset);
> +               swap_cluster_unlock(ci);
>         }
>  }
>
> @@ -3577,7 +3538,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>         offset = swp_offset(entry);
>         VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
>         VM_WARN_ON(usage == 1 && nr > 1);
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>
>         err = 0;
>         for (i = 0; i < nr; i++) {
> @@ -3632,7 +3593,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>         }
>
>  unlock_out:
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         return err;
>  }
>
> @@ -3731,7 +3692,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>
>         offset = swp_offset(entry);
>
> -       ci = lock_cluster(si, offset);
> +       ci = swap_cluster_lock(si, offset);
>
>         count = swap_count(si->swap_map[offset]);
>
> @@ -3791,7 +3752,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  out_unlock_cont:
>         spin_unlock(&si->cont_lock);
>  out:
> -       unlock_cluster(ci);
> +       swap_cluster_unlock(ci);
>         put_swap_device(si);
>  outer:
>         if (page)
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers
  2025-09-05 19:13 ` [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers Kairui Song
@ 2025-09-06  2:14   ` Chris Li
  2025-09-08 12:21   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06  2:14 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris
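
A minimal sketch of the calling convention described in the commit
message below (illustrative only; the error handling is just an example):

           struct swap_info_struct *si;

           /* Entry known valid and device pinned: no NULL check needed */
           si = __swap_entry_to_info(entry);

           /* Entry may be invalid: the caller must check for NULL */
           si = swap_entry_to_info(entry);
           if (!si)
                   return -EINVAL;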

On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> swp_swap_info is the most commonly used helper for retrieving swap info.
> It has an internal check that may lead to a NULL return value, but
> almost none of its callers check the return value, making the internal
> check pointless. In fact, most of these callers already ensure the
> entry is valid and never expect a NULL value.
>
> Tidy this up and shorten the name. If the caller can make sure the
> swap entry/type is valid and the device is pinned, use the newly introduced
> __swap_entry_to_info/__swap_type_to_info instead. They have more debug
> sanity checks and lower overhead as they are inlined.
>
> Callers that may expect a NULL value should use
> swap_entry_to_info/swap_type_to_info instead.
>
> No feature change. The rearranged code should have no effect, or it
> would already have been hitting NULL de-ref bugs. Only some new sanity
> checks are added, so potential issues may show up in debug builds.
>
> The new helpers will be frequently used with the swap table later when
> working with swap cache folios. A locked swap cache folio ensures the
> entries are valid and stable, so these helpers are very helpful.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Barry Song <baohua@kernel.org>
> ---
>  include/linux/swap.h |  6 ------
>  mm/page_io.c         | 12 ++++++------
>  mm/swap.h            | 38 +++++++++++++++++++++++++++++++++-----
>  mm/swap_state.c      |  4 ++--
>  mm/swapfile.c        | 37 +++++++++++++++++++------------------
>  5 files changed, 60 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7e1fe4ff3d30..6db105383782 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -479,7 +479,6 @@ extern sector_t swapdev_block(int, pgoff_t);
>  extern int __swap_count(swp_entry_t entry);
>  extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
> -struct swap_info_struct *swp_swap_info(swp_entry_t entry);
>  struct backing_dev_info;
>  extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
>  extern void exit_swap_address_space(unsigned int type);
> @@ -492,11 +491,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
>  }
>
>  #else /* CONFIG_SWAP */
> -static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
> -{
> -       return NULL;
> -}
> -
>  static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  {
>         return NULL;
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a2056a5ecb13..3c342db77ce3 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -204,7 +204,7 @@ static bool is_folio_zero_filled(struct folio *folio)
>  static void swap_zeromap_folio_set(struct folio *folio)
>  {
>         struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
> -       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
>         int nr_pages = folio_nr_pages(folio);
>         swp_entry_t entry;
>         unsigned int i;
> @@ -223,7 +223,7 @@ static void swap_zeromap_folio_set(struct folio *folio)
>
>  static void swap_zeromap_folio_clear(struct folio *folio)
>  {
> -       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
>         swp_entry_t entry;
>         unsigned int i;
>
> @@ -374,7 +374,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
>  static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap_plug)
>  {
>         struct swap_iocb *sio = swap_plug ? *swap_plug : NULL;
> -       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
>         struct file *swap_file = sis->swap_file;
>         loff_t pos = swap_dev_pos(folio->swap);
>
> @@ -446,7 +446,7 @@ static void swap_writepage_bdev_async(struct folio *folio,
>
>  void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
>  {
> -       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
>
>         VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
>         /*
> @@ -537,7 +537,7 @@ static bool swap_read_folio_zeromap(struct folio *folio)
>
>  static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
>  {
> -       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
>         struct swap_iocb *sio = NULL;
>         loff_t pos = swap_dev_pos(folio->swap);
>
> @@ -608,7 +608,7 @@ static void swap_read_folio_bdev_async(struct folio *folio,
>
>  void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>  {
> -       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
>         bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
>         bool workingset = folio_test_workingset(folio);
>         unsigned long pflags;
> diff --git a/mm/swap.h b/mm/swap.h
> index 39b27337bc0a..a65e72edb087 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -15,6 +15,8 @@ extern int page_cluster;
>  #define swap_entry_order(order)        0
>  #endif
>
> +extern struct swap_info_struct *swap_info[];
> +
>  /*
>   * We use this to track usage of a cluster. A cluster is a block of swap disk
>   * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> @@ -53,9 +55,29 @@ enum swap_cluster_flags {
>  #include <linux/swapops.h> /* for swp_offset */
>  #include <linux/blk_types.h> /* for bio_end_io_t */
>
> -static inline struct swap_cluster_info *swp_offset_cluster(
> +/*
> + * Callers of all helpers below must ensure the entry, type, or offset is
> + * valid, and protect the swap device with reference count or locks.
> + */
> +static inline struct swap_info_struct *__swap_type_to_info(int type)
> +{
> +       struct swap_info_struct *si;
> +
> +       si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
> +       VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> +       return si;
> +}
> +
> +static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
> +{
> +       return __swap_type_to_info(swp_type(entry));
> +}
> +
> +static inline struct swap_cluster_info *__swap_offset_to_cluster(
>                 struct swap_info_struct *si, pgoff_t offset)
>  {
> +       VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> +       VM_WARN_ON_ONCE(offset >= si->max);
>         return &si->cluster_info[offset / SWAPFILE_CLUSTER];
>  }
>
> @@ -70,8 +92,9 @@ static inline struct swap_cluster_info *swp_offset_cluster(
>  static inline struct swap_cluster_info *swap_cluster_lock(
>                 struct swap_info_struct *si, unsigned long offset)
>  {
> -       struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
> +       struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
>
> +       VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
>         spin_lock(&ci->lock);
>         return ci;
>  }
> @@ -167,7 +190,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
>
>  static inline unsigned int folio_swap_flags(struct folio *folio)
>  {
> -       return swp_swap_info(folio->swap)->flags;
> +       return __swap_entry_to_info(folio->swap)->flags;
>  }
>
>  /*
> @@ -178,7 +201,7 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
>  static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
>                 bool *is_zeromap)
>  {
> -       struct swap_info_struct *sis = swp_swap_info(entry);
> +       struct swap_info_struct *sis = __swap_entry_to_info(entry);
>         unsigned long start = swp_offset(entry);
>         unsigned long end = start + max_nr;
>         bool first_bit;
> @@ -197,7 +220,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
>
>  static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>  {
> -       struct swap_info_struct *si = swp_swap_info(entry);
> +       struct swap_info_struct *si = __swap_entry_to_info(entry);
>         pgoff_t offset = swp_offset(entry);
>         int i;
>
> @@ -216,6 +239,11 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>
>  #else /* CONFIG_SWAP */
>  struct swap_iocb;
> +static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
> +{
> +       return NULL;
> +}
> +
>  static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>  {
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 9225d6b695ad..0ad4f3b41f1b 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -336,7 +336,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                 struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
>                 bool skip_if_exists)
>  {
> -       struct swap_info_struct *si = swp_swap_info(entry);
> +       struct swap_info_struct *si = __swap_entry_to_info(entry);
>         struct folio *folio;
>         struct folio *new_folio = NULL;
>         struct folio *result = NULL;
> @@ -560,7 +560,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         unsigned long offset = entry_offset;
>         unsigned long start_offset, end_offset;
>         unsigned long mask;
> -       struct swap_info_struct *si = swp_swap_info(entry);
> +       struct swap_info_struct *si = __swap_entry_to_info(entry);
>         struct blk_plug plug;
>         struct swap_iocb *splug = NULL;
>         bool page_allocated;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 547ad4bfe1d8..367481d319cd 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
>  static struct plist_head *swap_avail_heads;
>  static DEFINE_SPINLOCK(swap_avail_lock);
>
> -static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> +struct swap_info_struct *swap_info[MAX_SWAPFILES];
>
>  static DEFINE_MUTEX(swapon_mutex);
>
> @@ -124,14 +124,20 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
>         .lock = INIT_LOCAL_LOCK(),
>  };
>
> -static struct swap_info_struct *swap_type_to_swap_info(int type)
> +/* May return NULL on invalid type, caller must check for NULL return */
> +static struct swap_info_struct *swap_type_to_info(int type)
>  {
>         if (type >= MAX_SWAPFILES)
>                 return NULL;
> -
>         return READ_ONCE(swap_info[type]); /* rcu_dereference() */
>  }
>
> +/* May return NULL on invalid entry, caller must check for NULL return */
> +static struct swap_info_struct *swap_entry_to_info(swp_entry_t entry)
> +{
> +       return swap_type_to_info(swp_type(entry));
> +}
> +
>  static inline unsigned char swap_count(unsigned char ent)
>  {
>         return ent & ~SWAP_HAS_CACHE;   /* may include COUNT_CONTINUED flag */
> @@ -341,7 +347,7 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
>
>  sector_t swap_folio_sector(struct folio *folio)
>  {
> -       struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +       struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
>         struct swap_extent *se;
>         sector_t sector;
>         pgoff_t offset;
> @@ -1299,7 +1305,7 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
>
>         if (!entry.val)
>                 goto out;
> -       si = swp_swap_info(entry);
> +       si = swap_entry_to_info(entry);
>         if (!si)
>                 goto bad_nofile;
>         if (data_race(!(si->flags & SWP_USED)))
> @@ -1414,7 +1420,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>
>         if (!entry.val)
>                 goto out;
> -       si = swp_swap_info(entry);
> +       si = swap_entry_to_info(entry);
>         if (!si)
>                 goto bad_nofile;
>         if (!get_swap_device_info(si))
> @@ -1537,7 +1543,7 @@ static void swap_entries_free(struct swap_info_struct *si,
>         unsigned char *map_end = map + nr_pages;
>
>         /* It should never free entries across different clusters */
> -       VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
> +       VM_BUG_ON(ci != __swap_offset_to_cluster(si, offset + nr_pages - 1));
>         VM_BUG_ON(cluster_is_empty(ci));
>         VM_BUG_ON(ci->count < nr_pages);
>
> @@ -1595,7 +1601,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>
>  int __swap_count(swp_entry_t entry)
>  {
> -       struct swap_info_struct *si = swp_swap_info(entry);
> +       struct swap_info_struct *si = __swap_entry_to_info(entry);
>         pgoff_t offset = swp_offset(entry);
>
>         return swap_count(si->swap_map[offset]);
> @@ -1826,7 +1832,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>
>  swp_entry_t get_swap_page_of_type(int type)
>  {
> -       struct swap_info_struct *si = swap_type_to_swap_info(type);
> +       struct swap_info_struct *si = swap_type_to_info(type);
>         unsigned long offset;
>         swp_entry_t entry = {0};
>
> @@ -1907,7 +1913,7 @@ int find_first_swap(dev_t *device)
>   */
>  sector_t swapdev_block(int type, pgoff_t offset)
>  {
> -       struct swap_info_struct *si = swap_type_to_swap_info(type);
> +       struct swap_info_struct *si = swap_type_to_info(type);
>         struct swap_extent *se;
>
>         if (!si || !(si->flags & SWP_WRITEOK))
> @@ -2835,7 +2841,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
>         if (!l)
>                 return SEQ_START_TOKEN;
>
> -       for (type = 0; (si = swap_type_to_swap_info(type)); type++) {
> +       for (type = 0; (si = swap_type_to_info(type)); type++) {
>                 if (!(si->flags & SWP_USED) || !si->swap_map)
>                         continue;
>                 if (!--l)
> @@ -2856,7 +2862,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
>                 type = si->type + 1;
>
>         ++(*pos);
> -       for (; (si = swap_type_to_swap_info(type)); type++) {
> +       for (; (si = swap_type_to_info(type)); type++) {
>                 if (!(si->flags & SWP_USED) || !si->swap_map)
>                         continue;
>                 return si;
> @@ -3529,7 +3535,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>         unsigned char has_cache;
>         int err, i;
>
> -       si = swp_swap_info(entry);
> +       si = swap_entry_to_info(entry);
>         if (WARN_ON_ONCE(!si)) {
>                 pr_err("%s%08lx\n", Bad_file, entry.val);
>                 return -EINVAL;
> @@ -3644,11 +3650,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
>         swap_entries_put_cache(si, entry, nr);
>  }
>
> -struct swap_info_struct *swp_swap_info(swp_entry_t entry)
> -{
> -       return swap_type_to_swap_info(swp_type(entry));
> -}
> -
>  /*
>   * add_swap_count_continuation - called when a swap count is duplicated
>   * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc
  2025-09-05 19:13 ` [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc Kairui Song
@ 2025-09-06  5:45   ` Chris Li
  2025-09-08  0:11   ` Barry Song
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06  5:45 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> In preparation for replacing the swap cache backend with the swap table,
> clean up and add proper kernel doc for all swap cache APIs. Now all swap
> cache APIs are well-defined with consistent names.
>
> No feature change, only renaming and documenting.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/filemap.c        |  2 +-
>  mm/memory-failure.c |  2 +-
>  mm/memory.c         |  2 +-
>  mm/swap.h           | 48 ++++++++++++++-----------
>  mm/swap_state.c     | 86 ++++++++++++++++++++++++++++++++-------------
>  mm/swapfile.c       |  8 ++---
>  mm/vmscan.c         |  2 +-
>  mm/zswap.c          |  2 +-
>  8 files changed, 98 insertions(+), 54 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 344ab106c21c..29ea56999a16 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -4517,7 +4517,7 @@ static void filemap_cachestat(struct address_space *mapping,
>                                  * invalidation, so there might not be
>                                  * a shadow in the swapcache (yet).
>                                  */
> -                               shadow = get_shadow_from_swap_cache(swp);
> +                               shadow = swap_cache_get_shadow(swp);
>                                 if (!shadow)
>                                         goto resched;
>                         }
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index b93ab99ad3ef..922526533cd9 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1199,7 +1199,7 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
>         struct folio *folio = page_folio(p);
>         int ret;
>
> -       delete_from_swap_cache(folio);
> +       swap_cache_del_folio(folio);
>
>         ret = delete_from_lru_cache(folio) ? MF_FAILED : MF_RECOVERED;
>         folio_unlock(folio);
> diff --git a/mm/memory.c b/mm/memory.c
> index 5808c4ef21b3..41e641823558 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4699,7 +4699,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
>                                 memcg1_swapin(entry, nr_pages);
>
> -                               shadow = get_shadow_from_swap_cache(entry);
> +                               shadow = swap_cache_get_shadow(entry);
>                                 if (shadow)
>                                         workingset_refault(folio, shadow);
>
> diff --git a/mm/swap.h b/mm/swap.h
> index a65e72edb087..8b38577a4e04 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -164,17 +164,29 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
>         return folio->swap.val == round_down(entry.val, folio_nr_pages(folio));
>  }
>
> +/*
> + * All swap cache helpers below require the caller to ensure the swap entries
> + * used are valid and stabilize the device by any of the following ways:
> + * - Hold a reference by get_swap_device(): this ensures a single entry is
> + *   valid and increases the swap device's refcount.
> + * - Locking a folio in the swap cache: this ensures the folio's swap entries
> + *   are valid and pinned, also implies reference to the device.
> + * - Locking anything referencing the swap entry: e.g. PTL that protects
> + *   swap entries in the page table, similar to locking swap cache folio.
> + * - See the comment of get_swap_device() for more complex usage.
> + */
> +struct folio *swap_cache_get_folio(swp_entry_t entry);
> +void *swap_cache_get_shadow(swp_entry_t entry);
> +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> +                        gfp_t gfp, void **shadow);
> +void swap_cache_del_folio(struct folio *folio);
> +void __swap_cache_del_folio(struct folio *folio,
> +                           swp_entry_t entry, void *shadow);
> +void swap_cache_clear_shadow(int type, unsigned long begin,
> +                            unsigned long end);
> +
>  void show_swap_cache_info(void);
> -void *get_shadow_from_swap_cache(swp_entry_t entry);
> -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> -                     gfp_t gfp, void **shadowp);
> -void __delete_from_swap_cache(struct folio *folio,
> -                             swp_entry_t entry, void *shadow);
> -void delete_from_swap_cache(struct folio *folio);
> -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> -                                 unsigned long end);
>  void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> -struct folio *swap_cache_get_folio(swp_entry_t entry);
>  struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                 struct vm_area_struct *vma, unsigned long addr,
>                 struct swap_iocb **plug);
> @@ -302,28 +314,22 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
>         return NULL;
>  }
>
> -static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
> +static inline void *swap_cache_get_shadow(swp_entry_t entry)
>  {
>         return NULL;
>  }
>
> -static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> -                                       gfp_t gfp_mask, void **shadowp)
> -{
> -       return -1;
> -}
> -
> -static inline void __delete_from_swap_cache(struct folio *folio,
> -                                       swp_entry_t entry, void *shadow)
> +static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
> +                                      gfp_t gfp, void **shadow)
>  {
> +       return -EINVAL;
>  }
>
> -static inline void delete_from_swap_cache(struct folio *folio)
> +static inline void swap_cache_del_folio(struct folio *folio)
>  {
>  }
>
> -static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
> -                               unsigned long end)
> +static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
>  {
>  }
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0ad4f3b41f1b..f3a32a06a950 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -78,8 +78,8 @@ void show_swap_cache_info(void)
>   * Context: Caller must ensure @entry is valid and protect the swap device
>   * with reference count or locks.
>   * Return: Returns the found folio on success, NULL otherwise. The caller
> - * must lock and check if the folio still matches the swap entry before
> - * use (e.g. with folio_matches_swap_entry).
> + * must lock and check if the folio still matches the swap entry before
> + * use (e.g., folio_matches_swap_entry).
>   */
>  struct folio *swap_cache_get_folio(swp_entry_t entry)
>  {
> @@ -90,7 +90,15 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
>         return folio;
>  }
>
> -void *get_shadow_from_swap_cache(swp_entry_t entry)
> +/**
> + * swap_cache_get_shadow - Looks up a shadow in the swap cache.
> + * @entry: swap entry used for the lookup.
> + *
> + * Context: Caller must ensure @entry is valid and protect the swap device
> + * with reference count or locks.
> + * Return: Returns either NULL or an XA_VALUE (shadow).
> + */
> +void *swap_cache_get_shadow(swp_entry_t entry)
>  {
>         struct address_space *address_space = swap_address_space(entry);
>         pgoff_t idx = swap_cache_index(entry);
> @@ -102,12 +110,21 @@ void *get_shadow_from_swap_cache(swp_entry_t entry)
>         return NULL;
>  }
>
> -/*
> - * add_to_swap_cache resembles filemap_add_folio on swapper_space,
> - * but sets SwapCache flag and 'swap' instead of mapping and index.
> +/**
> + * swap_cache_add_folio - Add a folio into the swap cache.
> + * @folio: The folio to be added.
> + * @entry: The swap entry corresponding to the folio.
> + * @gfp: gfp_mask for XArray node allocation.
> + * @shadowp: If a shadow is found, return the shadow.
> + *
> + * Context: Caller must ensure @entry is valid and protect the swap device
> + * with reference count or locks.
> + * The caller also needs to mark the corresponding swap_map slots with
> + * SWAP_HAS_CACHE to avoid race or conflict.
> + * Return: Returns 0 on success, error code otherwise.
>   */
> -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> -                       gfp_t gfp, void **shadowp)
> +int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> +                        gfp_t gfp, void **shadowp)
>  {
>         struct address_space *address_space = swap_address_space(entry);
>         pgoff_t idx = swap_cache_index(entry);
> @@ -155,12 +172,20 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
>         return xas_error(&xas);
>  }
>
> -/*
> - * This must be called only on folios that have
> - * been verified to be in the swap cache.
> +/**
> + * __swap_cache_del_folio - Removes a folio from the swap cache.
> + * @folio: The folio.
> + * @entry: The first swap entry that the folio corresponds to.
> + * @shadow: shadow value to be filled in the swap cache.
> + *
> + * Removes a folio from the swap cache and fills a shadow in place.
> + * This won't put the folio's refcount. The caller has to do that.
> + *
> + * Context: Caller must hold the xa_lock, ensure the folio is
> + * locked and in the swap cache, using the index of @entry.
>   */
> -void __delete_from_swap_cache(struct folio *folio,
> -                       swp_entry_t entry, void *shadow)
> +void __swap_cache_del_folio(struct folio *folio,
> +                           swp_entry_t entry, void *shadow)
>  {
>         struct address_space *address_space = swap_address_space(entry);
>         int i;
> @@ -186,27 +211,40 @@ void __delete_from_swap_cache(struct folio *folio,
>         __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
>  }
>
> -/*
> - * This must be called only on folios that have
> - * been verified to be in the swap cache and locked.
> - * It will never put the folio into the free list,
> - * the caller has a reference on the folio.
> +/**
> + * swap_cache_del_folio - Removes a folio from the swap cache.
> + * @folio: The folio.
> + *
> + * Same as __swap_cache_del_folio, but handles lock and refcount. The
> + * caller must ensure the folio is either clean or has a swap count
> + * equal to zero, or it may cause data loss.
> + *
> + * Context: Caller must ensure the folio is locked and in the swap cache.
>   */
> -void delete_from_swap_cache(struct folio *folio)
> +void swap_cache_del_folio(struct folio *folio)
>  {
>         swp_entry_t entry = folio->swap;
>         struct address_space *address_space = swap_address_space(entry);
>
>         xa_lock_irq(&address_space->i_pages);
> -       __delete_from_swap_cache(folio, entry, NULL);
> +       __swap_cache_del_folio(folio, entry, NULL);
>         xa_unlock_irq(&address_space->i_pages);
>
>         put_swap_folio(folio, entry);
>         folio_ref_sub(folio, folio_nr_pages(folio));
>  }
>
> -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> -                               unsigned long end)
> +/**
> + * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
> + * @type: Indicates the swap device.
> + * @begin: Beginning offset of the range.
> + * @end: Ending offset of the range.
> + *
> + * Context: Caller must ensure the range is valid and hold a reference to
> + * the swap device.
> + */
> +void swap_cache_clear_shadow(int type, unsigned long begin,
> +                            unsigned long end)
>  {
>         unsigned long curr = begin;
>         void *old;
> @@ -393,7 +431,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                         goto put_and_return;
>
>                 /*
> -                * We might race against __delete_from_swap_cache(), and
> +                * We might race against __swap_cache_del_folio(), and
>                  * stumble across a swap_map entry whose SWAP_HAS_CACHE
>                  * has not yet been cleared.  Or race against another
>                  * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> @@ -412,7 +450,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                 goto fail_unlock;
>
>         /* May fail (-ENOMEM) if XArray node allocation failed. */
> -       if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
> +       if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
>                 goto fail_unlock;
>
>         memcg1_swapin(entry, 1);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 367481d319cd..731b541b1d33 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -266,7 +266,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>         if (!need_reclaim)
>                 goto out_unlock;
>
> -       delete_from_swap_cache(folio);
> +       swap_cache_del_folio(folio);
>         folio_set_dirty(folio);
>         ret = nr_pages;
>  out_unlock:
> @@ -1123,7 +1123,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>                         swap_slot_free_notify(si->bdev, offset);
>                 offset++;
>         }
> -       clear_shadow_from_swap_cache(si->type, begin, end);
> +       swap_cache_clear_shadow(si->type, begin, end);
>
>         /*
>          * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
> @@ -1288,7 +1288,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>          * TODO: this could cause a theoretical memory reclaim
>          * deadlock in the swap out path.
>          */
> -       if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
> +       if (swap_cache_add_folio(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
>                 goto out_free;
>
>         return 0;
> @@ -1758,7 +1758,7 @@ bool folio_free_swap(struct folio *folio)
>         if (folio_swapped(folio))
>                 return false;
>
> -       delete_from_swap_cache(folio);
> +       swap_cache_del_folio(folio);
>         folio_set_dirty(folio);
>         return true;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ca9e1cd3cd68..c79c6806560b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -776,7 +776,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
>
>                 if (reclaimed && !mapping_exiting(mapping))
>                         shadow = workingset_eviction(folio, target_memcg);
> -               __delete_from_swap_cache(folio, swap, shadow);
> +               __swap_cache_del_folio(folio, swap, shadow);
>                 memcg1_swapout(folio, swap);
>                 xa_unlock_irq(&mapping->i_pages);
>                 put_swap_folio(folio, swap);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index c88ad61b232c..3dda4310099e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1069,7 +1069,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
>  out:
>         if (ret && ret != -EEXIST) {
> -               delete_from_swap_cache(folio);
> +               swap_cache_del_folio(folio);
>                 folio_unlock(folio);
>         }
>         folio_put(folio);
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim
  2025-09-06  1:51   ` Chris Li
@ 2025-09-06  6:28     ` Kairui Song
  2025-09-06 11:58       ` Chris Li
  0 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-06  6:28 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Sat, Sep 6, 2025 at 11:19 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> The patch looks obviously correct to me with some very minor nitpicks following.
>
> Acked-by: Chris Li <chrisl@kernel.org>
>
> On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > The allocator will reclaim cached slots while scanning. Currently, it
> > will try again if the reclaim found a folio that is already removed from
> > the swap cache due to a race. But the following lookup will be using the
> > wrong index. It won't cause any OOB issue since the swap cache index is
> > truncated upon lookup, but it may lead to reclaiming of an irrelevant
> > folio.
> >
> > This should not cause a measurable issue, but we should fix it.
> >
> > Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swapfile.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4b8ab2cb49ca..4c63fc62f4cb 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -240,13 +240,13 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> >          * Offset could point to the middle of a large folio, or folio
> >          * may no longer point to the expected offset before it's locked.
> >          */
> > -       entry = folio->swap;
> Nitpick:
> This and the following lines reuse the folio->swap dereference and
> the swp_offset() call many times.
> You can use some local variables to cache the values in registers and
> make fewer function calls. I haven't checked whether the compiler will
> do the same common subexpression elimination here, but a good compiler
> should. The following looks less busy and doesn't need the compiler to
> optimize it for you.
>
>            fe = folio->swap;
>            eoffset = swp_offset(fe);
>            if (offset < eoffset || offset >= eoffset + nr_pages) {
> ...
>            }
>            offset = eoffset;
>
> This might generate better code due to fewer function calls. If the
> compiler does a perfect job, the original code can generate the same
> optimized code as well.

Right, this part of the code will be gone soon, so I think it's better
to keep the change minimal, and it's not a hot path.

>
> > -       if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> > +       if (offset < swp_offset(folio->swap) ||
> > +           offset >= swp_offset(folio->swap) + nr_pages) {
> >                 folio_unlock(folio);
> >                 folio_put(folio);
> >                 goto again;
> >         }
> > -       offset = swp_offset(entry);
> > +       offset = swp_offset(folio->swap);
>
> So the first entry is only assigned once in the function and never changed?
>
> You can use const to declare it.

That's a very good point, thanks!

>
> Chris
>
> >
> >         need_reclaim = ((flags & TTRS_ANYWAY) ||
> >                         ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > --
> > 2.51.0
> >
> >
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim
  2025-09-05 22:40   ` Nhat Pham
@ 2025-09-06  6:30     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-06  6:30 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Kemeng Shi, Baolin Wang, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Sat, Sep 6, 2025 at 7:12 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > The allocator will reclaim cached slots while scanning. Currently, it
> > will try again if the reclaim found a folio that is already removed from
> > the swap cache due to a race. But the following lookup will be using the
> > wrong index. It won't cause any OOB issue since the swap cache index is
> > truncated upon lookup, but it may lead to reclaiming of an irrelevant
>
> I mean if there is a race, folio->swap could literally be anything
> right? Can the following happen: between the filemap_get_folio()
> lookup and the locking, the folio could have its swap slot freed up
> and then obtain a new swap slot, potentially from an entirely
> different swapfile (i.e. a different swp_type(folio->swap))?
>
> It is very unlikely, and in many setups you only have one swapfile. Still...

Yeah, but fortunately nothing under the `again:` label will touch the
address_space here, so a random value only causes a random lookup
offset in a valid address_space, which is completely fine.



>
> > folio.
> >
> > This should not cause a measurable issue, but we should fix it.
> >
> > Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Yeah that's pretty nuanced lol. It is unlikely to cause any issue
> indeed - we're just occasionally swap-cache-reclaim some rando folio
> haha.
>
> Anyway, FWIW:
>
> Acked-by: Nhat Pham <nphamcs@gmail.com>

Thanks.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use
  2025-09-06  2:12   ` Chris Li
@ 2025-09-06  6:32     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-06  6:32 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Sat, Sep 6, 2025 at 11:51 AM Chris Li <chrisl@kernel.org> wrote:
>
> Looks correct to me.
>
> Acked-by: Chris Li <chrisl@kernel.org>

Thanks.

>
> With some nitpick follows,
>
> On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Swap cache lookup only increases the reference count of the returned
> > folio. That's not enough to ensure a folio is stable in the swap
> > cache, so the folio could be removed from the swap cache at any
> > time. The caller should always lock and check the folio before using it.
> >
> > We have just documented this in kerneldoc, now introduce a helper for swap
> > cache folio verification with proper sanity checks.
> >
> > Also, sanitize a few current users to use this convention and the new
> > helper for easier debugging. They were not having observable problems
> > yet, only trivial issues like wasted CPU cycles on swapoff or
> > reclaiming. They would fail in some other way, but it is still better to
> > always follow this convention to make things robust and make later
> > commits easier to do.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/memory.c     |  3 +--
> >  mm/swap.h       | 24 ++++++++++++++++++++++++
> >  mm/swap_state.c |  7 +++++--
> >  mm/swapfile.c   | 10 +++++++---
> >  4 files changed, 37 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 94a5928e8ace..5808c4ef21b3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4748,8 +4748,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                  * swapcache, we need to check that the page's swap has not
> >                  * changed.
> >                  */
> > -               if (unlikely(!folio_test_swapcache(folio) ||
> > -                            page_swap_entry(page).val != entry.val))
> > +               if (unlikely(!folio_matches_swap_entry(folio, entry)))
> >                         goto out_page;
> >
> >                 if (unlikely(PageHWPoison(page))) {
> > diff --git a/mm/swap.h b/mm/swap.h
> > index efb6d7ff9f30..a69e18b12b45 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -52,6 +52,25 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> >         return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> >  }
> >
> > +/**
> > + * folio_matches_swap_entry - Check if a folio matches a given swap entry.
> > + * @folio: The folio.
> > + * @entry: The swap entry to check against.
> > + *
> > + * Context: The caller should have the folio locked to ensure it's stable
> > + * and nothing will move it in or out of the swap cache.
> > + * Return: true or false.
> > + */
> > +static inline bool folio_matches_swap_entry(const struct folio *folio,
> > +                                           swp_entry_t entry)
> > +{
> > +       VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> > +       if (!folio_test_swapcache(folio))
> > +               return false;
> > +       VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, folio_nr_pages(folio)), folio);
>
> You should cache folio->swap.val in a local variable. Because of the
> WARN_ON_ONCE, I think the compiler has no choice but to load it
> twice? I haven't verified it myself.
>
> There is no downside from the compiler's point of view in using more
> local variables; the compiler generates an internal equivalent of the
> local variable anyway.
>
> > +       return folio->swap.val == round_down(entry.val, folio_nr_pages(folio));
>
> Same for folio_nr_pages(folio), you should cache it. The function will
> look less busy.

That's a very good idea; that should also reduce the line length so it
is easier to read.
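
Maybe something like this (just a quick, untested sketch with both
values cached in locals; the local variable names are only for
illustration):

    static inline bool folio_matches_swap_entry(const struct folio *folio,
                                                swp_entry_t entry)
    {
            long nr_pages = folio_nr_pages(folio);
            unsigned long swap_val;

            VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
            if (!folio_test_swapcache(folio))
                    return false;

            /* Cache the value once so the checks below stay short */
            swap_val = folio->swap.val;
            VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(swap_val, nr_pages), folio);
            return swap_val == round_down(entry.val, nr_pages);
    }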


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-05 19:13 ` [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper Kairui Song
@ 2025-09-06  7:09   ` Chris Li
  2025-09-08  3:41   ` Baolin Wang
  2025-09-08 12:30   ` David Hildenbrand
  2 siblings, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06  7:09 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> There are currently three swap cache users that are trying to replace an
> existing folio with a new one: huge memory splitting, migration, and
> shmem replacement. What they are doing is quite similar.
>
> Introduce a common helper for this. In later commits, they can be easily
> switched to use the swap table by updating this helper.
>
> The newly added helper also makes the swap cache API better defined, and
> debugging is easier.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/huge_memory.c |  5 ++---
>  mm/migrate.c     | 11 +++--------
>  mm/shmem.c       | 10 ++--------
>  mm/swap.h        |  3 +++
>  mm/swap_state.c  | 32 ++++++++++++++++++++++++++++++++
>  5 files changed, 42 insertions(+), 19 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd7418..a4d192c8d794 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3798,9 +3798,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>                          * NOTE: shmem in swap cache is not supported yet.
>                          */
>                         if (swap_cache) {
> -                               __xa_store(&swap_cache->i_pages,
> -                                          swap_cache_index(new_folio->swap),
> -                                          new_folio, 0);
> +                               __swap_cache_replace_folio(swap_cache, new_folio->swap,
> +                                                          folio, new_folio);
>                                 continue;
>                         }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8e435a078fc3..7e1d01aa8c85 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -566,7 +566,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>         struct zone *oldzone, *newzone;
>         int dirty;
>         long nr = folio_nr_pages(folio);
> -       long entries, i;
>
>         if (!mapping) {
>                 /* Take off deferred split queue while frozen and memcg set */
> @@ -615,9 +614,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>         if (folio_test_swapcache(folio)) {
>                 folio_set_swapcache(newfolio);
>                 newfolio->private = folio_get_private(folio);
> -               entries = nr;
> -       } else {
> -               entries = 1;
>         }
>
>         /* Move dirty while folio refs frozen and newfolio not yet exposed */
> @@ -627,11 +623,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>                 folio_set_dirty(newfolio);
>         }
>
> -       /* Swap cache still stores N entries instead of a high-order entry */
> -       for (i = 0; i < entries; i++) {
> +       if (folio_test_swapcache(folio))
> +               __swap_cache_replace_folio(mapping, folio->swap, folio, newfolio);
> +       else
>                 xas_store(&xas, newfolio);
> -               xas_next(&xas);
> -       }
>
>         /*
>          * Drop cache reference from old folio by unfreezing
> diff --git a/mm/shmem.c b/mm/shmem.c
> index cc6a0007c7a6..823ceae9dff8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2123,10 +2123,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>         struct folio *new, *old = *foliop;
>         swp_entry_t entry = old->swap;
>         struct address_space *swap_mapping = swap_address_space(entry);
> -       pgoff_t swap_index = swap_cache_index(entry);
> -       XA_STATE(xas, &swap_mapping->i_pages, swap_index);
>         int nr_pages = folio_nr_pages(old);
> -       int error = 0, i;
> +       int error = 0;
>
>         /*
>          * We have arrived here because our zones are constrained, so don't
> @@ -2155,12 +2153,8 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>         new->swap = entry;
>         folio_set_swapcache(new);
>
> -       /* Swap cache still stores N entries instead of a high-order entry */
>         xa_lock_irq(&swap_mapping->i_pages);
> -       for (i = 0; i < nr_pages; i++) {
> -               WARN_ON_ONCE(xas_store(&xas, new));
> -               xas_next(&xas);
> -       }
> +       __swap_cache_replace_folio(swap_mapping, entry, old, new);
>         xa_unlock_irq(&swap_mapping->i_pages);
>
>         mem_cgroup_replace_folio(old, new);
> diff --git a/mm/swap.h b/mm/swap.h
> index 8b38577a4e04..a139c9131244 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -182,6 +182,9 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
>  void swap_cache_del_folio(struct folio *folio);
>  void __swap_cache_del_folio(struct folio *folio,
>                             swp_entry_t entry, void *shadow);
> +void __swap_cache_replace_folio(struct address_space *address_space,
> +                               swp_entry_t entry,
> +                               struct folio *old, struct folio *new);
>  void swap_cache_clear_shadow(int type, unsigned long begin,
>                              unsigned long end);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index f3a32a06a950..38f5f4cf565d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -234,6 +234,38 @@ void swap_cache_del_folio(struct folio *folio)
>         folio_ref_sub(folio, folio_nr_pages(folio));
>  }
>
> +/**
> + * __swap_cache_replace_folio - Replace a folio in the swap cache.
> + * @mapping: Swap mapping address space.
> + * @entry: The first swap entry that the new folio corresponds to.
> + * @old: The old folio to be replaced.
> + * @new: The new folio.
> + *
> + * Replace an existing folio in the swap cache with a new folio.
> + *
> + * Context: Caller must ensure both folios are locked, and lock the
> + * swap address_space that holds the entries to be replaced.
> + */
> +void __swap_cache_replace_folio(struct address_space *mapping,
> +                               swp_entry_t entry,
> +                               struct folio *old, struct folio *new)
> +{
> +       unsigned long nr_pages = folio_nr_pages(new);
> +       unsigned long offset = swap_cache_index(entry);
> +       unsigned long end = offset + nr_pages;
> +       XA_STATE(xas, &mapping->i_pages, offset);
> +
> +       VM_WARN_ON_ONCE(entry.val != new->swap.val);
> +       VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> +       VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> +
> +       /* Swap cache still stores N entries instead of a high-order entry */
> +       do {
> +               WARN_ON_ONCE(xas_store(&xas, new) != old);
> +               xas_next(&xas);
> +       } while (++offset < end);
> +}
> +
>  /**
>   * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
>   * @type: Indicates the swap device.
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 03/15] mm, swap: fix swap cache index error when retrying reclaim
  2025-09-06  6:28     ` Kairui Song
@ 2025-09-06 11:58       ` Chris Li
  0 siblings, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06 11:58 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Fri, Sep 5, 2025 at 11:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Sep 6, 2025 at 11:19 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Kairui,
> >
> > The patch looks obviously correct to me with some very minor nitpicks following.
> >
> > Acked-by: Chris Li <chrisl@kernel.org>
> >
> > On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > The allocator will reclaim cached slots while scanning. Currently, it
> > > will try again if the reclaim found a folio that is already removed from
> > > the swap cache due to a race. But the following lookup will be using the
> > > wrong index. It won't cause any OOB issue since the swap cache index is
> > > truncated upon lookup, but it may lead to reclaiming of an irrelevant
> > > folio.
> > >
> > > This should not cause a measurable issue, but we should fix it.
> > >
> > > Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/swapfile.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 4b8ab2cb49ca..4c63fc62f4cb 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -240,13 +240,13 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > >          * Offset could point to the middle of a large folio, or folio
> > >          * may no longer point to the expected offset before it's locked.
> > >          */
> > > -       entry = folio->swap;
> > Nitpick:
> > This and the following lines reuse the folio->swap dereference and
> > the swp_offset() call many times.
> > You can use some local variables to cache the values in registers and
> > make fewer function calls. I haven't checked whether the compiler will
> > do the same common subexpression elimination here, but a good compiler
> > should. The following looks less busy and doesn't need the compiler to
> > optimize it for you.
> >
> >            fe = folio->swap;
> >            eoffset = swp_offset(fe);
> >            if (offset < eoffset || offset >= eoffset + nr_pages) {
> > ...
> >            }
> >            offset = eoffset;
> >
> > This might generate better code due to fewer function calls. If the
> > compiler does a perfect job, the original code can generate the same
> > optimized code as well.
>
> Right, this part of the code will be gone soon, so I think it's better
> to keep the change minimal, and it's not a hot path.

Ack. It is a nitpick anyway; most likely it doesn't make a difference
with modern compilers.

Chris


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-05 23:58   ` Chris Li
@ 2025-09-06 13:31     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-06 13:31 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Sat, Sep 6, 2025 at 8:05 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Sep 5, 2025 at 12:14 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > From: Chris Li <chrisl@kernel.org>
> >
> > Swap table is the new swap cache.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
> >  MAINTAINERS                     |  1 +
> >  2 files changed, 73 insertions(+)
> >  create mode 100644 Documentation/mm/swap-table.rst
> >
> > diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
> > new file mode 100644
> > index 000000000000..929cd91aa984
> > --- /dev/null
> > +++ b/Documentation/mm/swap-table.rst
> > @@ -0,0 +1,72 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
> > +
> > +==========
> > +Swap Table
> > +==========
> > +
> > +Swap table implements swap cache as a per-cluster swap cache value array.
> > +
> > +Swap Entry
> > +----------
> > +
> > +A swap entry contains the information required to serve the anonymous page
> > +fault.
> > +
> > +Swap entry is encoded as two parts: swap type and swap offset.
> > +
> > +The swap type indicates which swap device to use.
> > +The swap offset is the offset of the swap file to read the page data from.
> > +
> > +Swap Cache
> > +----------
> > +
> > +Swap cache is a map to look up folios using swap entry as the key. The result
> > +value can have three possible types depending on which stage of this swap entry
> > +was in.
> > +
> > +1. NULL: This swap entry is not used.
> > +
> > +2. folio: A folio has been allocated and bound to this swap entry. This is
> > +   the transient state of swap out or swap in. The folio data can be in
> > +   the folio or swap file, or both.
> > +
> > +3. shadow: The shadow contains the working set information of the swap
>
> I just noticed a typo here, should be "swapped out page"
>
> > +   outed folio. This is the normal state for a swap outed page.
>
> Same here. "swap outed page" -> "swapped out page"

Thanks. I used some grammar-checking tools, and it seems they are not
perfect with kernel terminology.

>
> Chris
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-05 19:13 ` [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API Kairui Song
@ 2025-09-06 15:28   ` Chris Li
  2025-09-08 15:38     ` Kairui Song
  2025-09-07 12:55   ` Klara Modin
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 80+ messages in thread
From: Chris Li @ 2025-09-06 15:28 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Some nitpick follows.

Chris

On Fri, Sep 5, 2025 at 12:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Introduce basic swap table infrastructures, which are now just a
> fixed-sized flat array inside each swap cluster, with access wrappers.
>
> Each cluster contains a swap table of 512 entries. Each table entry is
> an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> a folio type (pointer), or NULL.
>
> In this first step, it only supports storing a folio or shadow, and it
> is a drop-in replacement for the current swap cache. Convert all swap
> cache users to use the new sets of APIs. Chris Li has been suggesting
> using a new infrastructure for swap cache for better performance, and
> that idea combined well with the swap table as the new backing
> structure. Now the lock contention range is reduced to 2M clusters,
> which is much smaller than the 64M address_space. And we can also drop
> the multiple address_space design.
>
> All the internal works are done with swap_cache_get_* helpers. Swap
> cache lookup is still lock-less like before, and the helper's contexts
> are same with original swap cache helpers. They still require a pin
> on the swap device to prevent the backing data from being freed.
>
> Swap cache updates are now protected by the swap cluster lock
> instead of the Xarray lock. This is mostly handled internally, but new
> __swap_cache_* helpers require the caller to lock the cluster. So, a
> few new cluster access and locking helpers are also introduced.
>
> A fully cluster-based unified swap table can be implemented on top
> of this to take care of all count tracking and synchronization work,
> with dynamic allocation. It should reduce the memory usage while
> making the performance even better.
>
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  MAINTAINERS          |   1 +
>  include/linux/swap.h |   2 -
>  mm/huge_memory.c     |  13 +-
>  mm/migrate.c         |  19 ++-
>  mm/shmem.c           |   8 +-
>  mm/swap.h            | 157 +++++++++++++++++------
>  mm/swap_state.c      | 289 +++++++++++++++++++------------------------
>  mm/swap_table.h      |  97 +++++++++++++++
>  mm/swapfile.c        | 100 +++++++++++----
>  mm/vmscan.c          |  20 ++-
>  10 files changed, 458 insertions(+), 248 deletions(-)
>  create mode 100644 mm/swap_table.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1c8292c0318d..de402ca91a80 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16226,6 +16226,7 @@ F:      include/linux/swapops.h
>  F:     mm/page_io.c
>  F:     mm/swap.c
>  F:     mm/swap.h
> +F:     mm/swap_table.h
>  F:     mm/swap_state.c
>  F:     mm/swapfile.c
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 6db105383782..2cb0458561ef 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -480,8 +480,6 @@ extern int __swap_count(swp_entry_t entry);
>  extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  struct backing_dev_info;
> -extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
> -extern void exit_swap_address_space(unsigned int type);
>  extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
>  sector_t swap_folio_sector(struct folio *folio);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a4d192c8d794..052e8fc7ee0c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3720,7 +3720,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>         /* Prevent deferred_split_scan() touching ->_refcount */
>         spin_lock(&ds_queue->split_queue_lock);
>         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> -               struct address_space *swap_cache = NULL;
> +               struct swap_cluster_info *ci = NULL;
>                 struct lruvec *lruvec;
>                 int expected_refs;
>
> @@ -3764,8 +3764,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>                                 goto fail;
>                         }
>
> -                       swap_cache = swap_address_space(folio->swap);
> -                       xa_lock(&swap_cache->i_pages);
> +                       ci = swap_cluster_lock_by_folio(folio);
>                 }
>
>                 /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> @@ -3797,8 +3796,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>                          * Anonymous folio with swap cache.
>                          * NOTE: shmem in swap cache is not supported yet.
>                          */
> -                       if (swap_cache) {
> -                               __swap_cache_replace_folio(swap_cache, new_folio->swap,
> +                       if (ci) {
> +                               __swap_cache_replace_folio(ci, new_folio->swap,
>                                                            folio, new_folio);
>                                 continue;
>                         }
> @@ -3834,8 +3833,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
>                 unlock_page_lruvec(lruvec);
>
> -               if (swap_cache)
> -                       xa_unlock(&swap_cache->i_pages);
> +               if (ci)
> +                       swap_cluster_unlock(ci);
>         } else {
>                 spin_unlock(&ds_queue->split_queue_lock);
>                 ret = -EAGAIN;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7e1d01aa8c85..ea177ef1fea9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -563,6 +563,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>                 struct folio *newfolio, struct folio *folio, int expected_count)
>  {
>         XA_STATE(xas, &mapping->i_pages, folio_index(folio));
> +       struct swap_cluster_info *ci = NULL;
>         struct zone *oldzone, *newzone;
>         int dirty;
>         long nr = folio_nr_pages(folio);
> @@ -591,9 +592,16 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>         oldzone = folio_zone(folio);
>         newzone = folio_zone(newfolio);
>
> -       xas_lock_irq(&xas);
> +       if (folio_test_swapcache(folio))
> +               ci = swap_cluster_lock_by_folio_irq(folio);
> +       else
> +               xas_lock_irq(&xas);
> +
>         if (!folio_ref_freeze(folio, expected_count)) {
> -               xas_unlock_irq(&xas);
> +               if (ci)
> +                       swap_cluster_unlock(ci);
> +               else
> +                       xas_unlock_irq(&xas);
>                 return -EAGAIN;
>         }
>
> @@ -624,7 +632,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>         }
>
>         if (folio_test_swapcache(folio))
> -               __swap_cache_replace_folio(mapping, folio->swap, folio, newfolio);
> +               __swap_cache_replace_folio(ci, folio->swap, folio, newfolio);
>         else
>                 xas_store(&xas, newfolio);
>
> @@ -635,8 +643,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,
>          */
>         folio_ref_unfreeze(folio, expected_count - nr);
>
> -       xas_unlock(&xas);
>         /* Leave irq disabled to prevent preemption while updating stats */
> +       if (ci)
> +               swap_cluster_unlock(ci);
> +       else
> +               xas_unlock(&xas);
>
>         /*
>          * If moved to a different zone then also account
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 823ceae9dff8..21e795f18e78 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2120,9 +2120,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>                                 struct shmem_inode_info *info, pgoff_t index,
>                                 struct vm_area_struct *vma)
>  {
> +       struct swap_cluster_info *ci;
>         struct folio *new, *old = *foliop;
>         swp_entry_t entry = old->swap;
> -       struct address_space *swap_mapping = swap_address_space(entry);
>         int nr_pages = folio_nr_pages(old);
>         int error = 0;
>
> @@ -2153,9 +2153,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>         new->swap = entry;
>         folio_set_swapcache(new);
>
> -       xa_lock_irq(&swap_mapping->i_pages);
> -       __swap_cache_replace_folio(swap_mapping, entry, old, new);
> -       xa_unlock_irq(&swap_mapping->i_pages);
> +       ci = swap_cluster_lock_by_folio_irq(old);
> +       __swap_cache_replace_folio(ci, entry, old, new);
> +       swap_cluster_unlock(ci);
>
>         mem_cgroup_replace_folio(old, new);
>         shmem_update_stats(new, nr_pages);
> diff --git a/mm/swap.h b/mm/swap.h
> index a139c9131244..bf4e54f1f6b6 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -2,6 +2,7 @@
>  #ifndef _MM_SWAP_H
>  #define _MM_SWAP_H
>
> +#include <linux/atomic.h> /* for atomic_long_t */
>  struct mempolicy;
>  struct swap_iocb;
>
> @@ -35,6 +36,7 @@ struct swap_cluster_info {
>         u16 count;
>         u8 flags;
>         u8 order;
> +       atomic_long_t *table;   /* Swap table entries, see mm/swap_table.h */
>         struct list_head list;
>  };
>
> @@ -55,6 +57,11 @@ enum swap_cluster_flags {
>  #include <linux/swapops.h> /* for swp_offset */
>  #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +static inline unsigned int swp_cluster_offset(swp_entry_t entry)
> +{
> +       return swp_offset(entry) % SWAPFILE_CLUSTER;
> +}
> +
>  /*
>   * Callers of all helpers below must ensure the entry, type, or offset is
>   * valid, and protect the swap device with reference count or locks.
> @@ -81,6 +88,25 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
>         return &si->cluster_info[offset / SWAPFILE_CLUSTER];
>  }
>
> +static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
> +{
> +       return __swap_offset_to_cluster(__swap_entry_to_info(entry),
> +                                       swp_offset(entry));
> +}
> +
> +static __always_inline struct swap_cluster_info *__swap_cluster_lock(
> +               struct swap_info_struct *si, unsigned long offset, bool irq)
> +{
> +       struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
> +
> +       VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> +       if (irq)
> +               spin_lock_irq(&ci->lock);
> +       else
> +               spin_lock(&ci->lock);
> +       return ci;
> +}
> +
>  /**
>   * swap_cluster_lock - Lock and return the swap cluster of given offset.
>   * @si: swap device the cluster belongs to.
> @@ -92,11 +118,48 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
>  static inline struct swap_cluster_info *swap_cluster_lock(
>                 struct swap_info_struct *si, unsigned long offset)
>  {
> -       struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
> +       return __swap_cluster_lock(si, offset, false);
> +}
>
> -       VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> -       spin_lock(&ci->lock);
> -       return ci;
> +static inline struct swap_cluster_info *__swap_cluster_lock_by_folio(
> +               const struct folio *folio, bool irq)
> +{
> +       VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +       VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
> +       return __swap_cluster_lock(__swap_entry_to_info(folio->swap),
> +                                  swp_offset(folio->swap), irq);
> +}
> +
> +/*
> + * swap_cluster_lock_by_folio - Locks the cluster that holds a folio's entries.
> + * @folio: The folio.
> + *
> + * This locks the swap cluster that contains a folio's swap entries. The
> + * swap entries of a folio are always in one single cluster, and a locked
> + * swap cache folio is enough to stabilize the entries and the swap device.

I was wondering if we have a better word than "stabilize"; we haven't
defined what "stabilize" means here. I assume it means protecting against
racing access to the swap cache entry. Describing what it protects or
what it prevents would carry more meaning than "stabilize" does.
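
For example, my reading of it (this is just my understanding, not wording
from the patch): while a swap cache folio is locked, it cannot be removed
from the swap cache, so folio->swap cannot be freed or reused, and swapoff
cannot finish tearing down the device or its cluster info. Roughly:

        struct swap_cluster_info *ci;

        folio_lock(folio);
        if (folio_test_swapcache(folio)) {
                /*
                 * The folio lock keeps folio->swap valid and blocks
                 * swapoff, so the cluster lookup below cannot race
                 * with the device going away.
                 */
                ci = swap_cluster_lock_by_folio(folio);
                /* ... operate on the folio's swap table entries ... */
                swap_cluster_unlock(ci);
        }
        folio_unlock(folio);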

> + *
> + * Context: Caller must ensure the folio is locked and in the swap cache.
> + * Return: Pointer to the swap cluster.
> + */
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> +               const struct folio *folio)
> +{
> +       return __swap_cluster_lock_by_folio(folio, false);
> +}
> +
> +/*
> + * swap_cluster_lock_by_folio_irq - Locks the cluster that holds a folio's entries.
> + * @folio: The folio.
> + *
> + * Same as swap_cluster_lock_by_folio but also disables IRQs.
> + *
> + * Context: Caller must ensure the folio is locked and in the swap cache.
> + * Return: Pointer to the swap cluster.
> + */
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> +               const struct folio *folio)
> +{
> +       return __swap_cluster_lock_by_folio(folio, true);
>  }
>
>  static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> @@ -104,6 +167,11 @@ static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
>         spin_unlock(&ci->lock);
>  }
>
> +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> +{
> +       spin_unlock_irq(&ci->lock);
> +}
> +
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> @@ -123,10 +191,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
>  #define SWAP_ADDRESS_SPACE_SHIFT       14
>  #define SWAP_ADDRESS_SPACE_PAGES       (1 << SWAP_ADDRESS_SPACE_SHIFT)
>  #define SWAP_ADDRESS_SPACE_MASK                (SWAP_ADDRESS_SPACE_PAGES - 1)
> -extern struct address_space *swapper_spaces[];
> -#define swap_address_space(entry)                          \
> -       (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> -               >> SWAP_ADDRESS_SPACE_SHIFT])
> +extern struct address_space swap_space;
> +static inline struct address_space *swap_address_space(swp_entry_t entry)
> +{
> +       return &swap_space;
> +}
>
>  /*
>   * Return the swap device position of the swap entry.
> @@ -136,15 +205,6 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
>         return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
>  }
>
> -/*
> - * Return the swap cache index of the swap entry.
> - */
> -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> -{
> -       BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
> -       return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> -}
> -
>  /**
>   * folio_matches_swap_entry - Check if a folio matches a given swap entry.
>   * @folio: The folio.
> @@ -177,16 +237,15 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
>   */
>  struct folio *swap_cache_get_folio(swp_entry_t entry);
>  void *swap_cache_get_shadow(swp_entry_t entry);
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> -                        gfp_t gfp, void **shadow);
> +void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
>  void swap_cache_del_folio(struct folio *folio);
> -void __swap_cache_del_folio(struct folio *folio,
> -                           swp_entry_t entry, void *shadow);
> -void __swap_cache_replace_folio(struct address_space *address_space,
> -                               swp_entry_t entry,
> -                               struct folio *old, struct folio *new);
> -void swap_cache_clear_shadow(int type, unsigned long begin,
> -                            unsigned long end);
> +/* Below helpers require the caller to lock and pass in the swap cluster. */
> +void __swap_cache_del_folio(struct swap_cluster_info *ci,
> +                           struct folio *folio, swp_entry_t entry, void *shadow);
> +void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> +                               swp_entry_t entry, struct folio *old,
> +                               struct folio *new);
> +void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
>
>  void show_swap_cache_info(void);
>  void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> @@ -254,6 +313,32 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>
>  #else /* CONFIG_SWAP */
>  struct swap_iocb;
> +static inline struct swap_cluster_info *swap_cluster_lock(
> +       struct swap_info_struct *si, pgoff_t offset, bool irq)
> +{
> +       return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> +               struct folio *folio)
> +{
> +       return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> +               struct folio *folio)
> +{
> +       return NULL;
> +}
> +
> +static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> +{
> +}
> +
> +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> +{
> +}
> +
>  static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
>  {
>         return NULL;
> @@ -271,11 +356,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
>         return NULL;
>  }
>
> -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> -{
> -       return 0;
> -}
> -
>  static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry_t entry)
>  {
>         return false;
> @@ -322,17 +402,22 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
>         return NULL;
>  }
>
> -static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
> -                                      gfp_t gfp, void **shadow)
> +static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
>  {
> -       return -EINVAL;
>  }
>
>  static inline void swap_cache_del_folio(struct folio *folio)
>  {
>  }
>
> -static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
> +static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
> +                           struct folio *folio, swp_entry_t entry, void *shadow)
> +{
> +}
> +
> +static inline void __swap_cache_replace_folio(
> +               struct swap_cluster_info *ci, swp_entry_t entry,
> +               struct folio *old, struct folio *new)
>  {
>  }
>
> @@ -367,7 +452,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>  static inline pgoff_t folio_index(struct folio *folio)
>  {
>         if (unlikely(folio_test_swapcache(folio)))
> -               return swap_cache_index(folio->swap);
> +               return swp_offset(folio->swap);
>         return folio->index;
>  }
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 38f5f4cf565d..7147b390745f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -23,6 +23,7 @@
>  #include <linux/huge_mm.h>
>  #include <linux/shmem_fs.h>
>  #include "internal.h"
> +#include "swap_table.h"
>  #include "swap.h"
>
>  /*
> @@ -36,8 +37,10 @@ static const struct address_space_operations swap_aops = {
>  #endif
>  };
>
> -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
> -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
> +struct address_space swap_space __read_mostly = {
> +       .a_ops = &swap_aops,
> +};
> +
>  static bool enable_vma_readahead __read_mostly = true;
>
>  #define SWAP_RA_ORDER_CEILING  5
> @@ -83,11 +86,21 @@ void show_swap_cache_info(void)
>   */
>  struct folio *swap_cache_get_folio(swp_entry_t entry)
>  {
> -       struct folio *folio = filemap_get_folio(swap_address_space(entry),
> -                                               swap_cache_index(entry));
> -       if (IS_ERR(folio))
> -               return NULL;
> -       return folio;
> +
> +       unsigned long swp_tb;
> +       struct folio *folio;
> +
> +       for (;;) {
> +               swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
> +                                         swp_cluster_offset(entry));
> +               if (!swp_tb_is_folio(swp_tb))
> +                       return NULL;
> +               folio = swp_tb_to_folio(swp_tb);
> +               if (likely(folio_try_get(folio)))
> +                       return folio;
> +       }
> +
> +       return NULL;
>  }
>
>  /**
> @@ -100,13 +113,13 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
>   */
>  void *swap_cache_get_shadow(swp_entry_t entry)
>  {
> -       struct address_space *address_space = swap_address_space(entry);
> -       pgoff_t idx = swap_cache_index(entry);
> -       void *shadow;
> +       unsigned long swp_tb;
> +
> +       swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
> +                                 swp_cluster_offset(entry));
> +       if (swp_tb_is_shadow(swp_tb))
> +               return swp_tb_to_shadow(swp_tb);
>
> -       shadow = xa_load(&address_space->i_pages, idx);
> -       if (xa_is_value(shadow))
> -               return shadow;
>         return NULL;
>  }
>
> @@ -123,57 +136,45 @@ void *swap_cache_get_shadow(swp_entry_t entry)
>   * SWAP_HAS_CACHE to avoid race or conflict.
>   * Return: Returns 0 on success, error code otherwise.
>   */
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> -                        gfp_t gfp, void **shadowp)
> +void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
>  {
> -       struct address_space *address_space = swap_address_space(entry);
> -       pgoff_t idx = swap_cache_index(entry);
> -       XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> -       unsigned long i, nr = folio_nr_pages(folio);
> -       void *old;
> -
> -       xas_set_update(&xas, workingset_update_node);
> -
> -       VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> -       VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> -       VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
> +       void *shadow = NULL;
> +       unsigned long swp_tb, exist;
> +       struct swap_cluster_info *ci;
> +       unsigned int ci_start, ci_off, ci_end;
> +       unsigned long nr_pages = folio_nr_pages(folio);
> +
> +       VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +       VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> +       VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> +
> +       swp_tb = folio_to_swp_tb(folio);
> +       ci_start = swp_cluster_offset(entry);
> +       ci_end = ci_start + nr_pages;
> +       ci_off = ci_start;
> +       ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
> +       do {
> +               exist = __swap_table_xchg(ci, ci_off, swp_tb);

Thanks for changing it to xchg. I understand that by "exist" you mean
the previously existing swap table entry. However, after it has been
taken out of the swap table, is it still an "existing" entry? I think
"old" or "prior" might be a better name. Just nitpicks anyway. If we use
"old", we can rename "swp_tb" to "new_tb" to make it obvious what we are
replacing it with.

Also, I saw this kind of for loop repeated in a few places.
Maybe consider a for-loop macro to do:

for_each_folio_offset(folio, ci, ci_off) {
      exist = __swap_table_xchg(ci, ci_off, swp_tb);
      ...
} end_for_each_folio_offset();

The kernel has a lot of similar for-loop macros.
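
One possible shape for such a helper, keyed on the first entry plus the
number of slots so that the add/del/replace paths can all share it. This
is just a rough sketch; the name and parameters are made up here, and it
assumes a folio's swap entries never cross a cluster boundary, which the
series already relies on:

#define for_each_swap_tb_offset(entry, nr, ci_off, ci_end)             \
        for (ci_off = swp_cluster_offset(entry),                        \
             ci_end = ci_off + (nr);                                    \
             ci_off < ci_end; ci_off++)

With that, the loop in swap_cache_add_folio() would read:

        for_each_swap_tb_offset(entry, nr_pages, ci_off, ci_end) {
                exist = __swap_table_xchg(ci, ci_off, swp_tb);
                WARN_ON_ONCE(swp_tb_is_folio(exist));
                if (swp_tb_is_shadow(exist))
                        shadow = swp_tb_to_shadow(exist);
        }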

> +               WARN_ON_ONCE(swp_tb_is_folio(exist));
> +               if (swp_tb_is_shadow(exist))
> +                       shadow = swp_tb_to_shadow(exist);
> +       } while (++ci_off < ci_end);
>
> -       folio_ref_add(folio, nr);
> +       folio_ref_add(folio, nr_pages);
>         folio_set_swapcache(folio);
>         folio->swap = entry;
> +       swap_cluster_unlock(ci);
>
> -       do {
> -               xas_lock_irq(&xas);
> -               xas_create_range(&xas);
> -               if (xas_error(&xas))
> -                       goto unlock;
> -               for (i = 0; i < nr; i++) {
> -                       VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
> -                       if (shadowp) {
> -                               old = xas_load(&xas);
> -                               if (xa_is_value(old))
> -                                       *shadowp = old;
> -                       }
> -                       xas_store(&xas, folio);
> -                       xas_next(&xas);
> -               }
> -               address_space->nrpages += nr;
> -               __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
> -               __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
> -unlock:
> -               xas_unlock_irq(&xas);
> -       } while (xas_nomem(&xas, gfp));
> -
> -       if (!xas_error(&xas))
> -               return 0;
> +       node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> +       lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
>
> -       folio_clear_swapcache(folio);
> -       folio_ref_sub(folio, nr);
> -       return xas_error(&xas);
> +       if (shadowp)
> +               *shadowp = shadow;
>  }
>
>  /**
>   * __swap_cache_del_folio - Removes a folio from the swap cache.
> + * @ci: The locked swap cluster.
>   * @folio: The folio.
>   * @entry: The first swap entry that the folio corresponds to.
>   * @shadow: shadow value to be filled in the swap cache.
> @@ -181,34 +182,36 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
>   * Removes a folio from the swap cache and fills a shadow in place.
>   * This won't put the folio's refcount. The caller has to do that.
>   *
> - * Context: Caller must hold the xa_lock, ensure the folio is
> - * locked and in the swap cache, using the index of @entry.
> + * Context: Caller must ensure the folio is locked and in the swap cache
> + * using the index of @entry, and lock the cluster that holds the entries.
>   */
> -void __swap_cache_del_folio(struct folio *folio,
> +void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
>                             swp_entry_t entry, void *shadow)
>  {
> -       struct address_space *address_space = swap_address_space(entry);
> -       int i;
> -       long nr = folio_nr_pages(folio);
> -       pgoff_t idx = swap_cache_index(entry);
> -       XA_STATE(xas, &address_space->i_pages, idx);
> -
> -       xas_set_update(&xas, workingset_update_node);
> -
> -       VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> -       VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
> -       VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
> -
> -       for (i = 0; i < nr; i++) {
> -               void *entry = xas_store(&xas, shadow);
> -               VM_BUG_ON_PAGE(entry != folio, entry);
> -               xas_next(&xas);
> -       }
> +       unsigned long exist, swp_tb;
> +       unsigned int ci_start, ci_off, ci_end;
> +       unsigned long nr_pages = folio_nr_pages(folio);
> +
> +       VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
> +       VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +       VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
> +       VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
> +
> +       swp_tb = shadow_swp_to_tb(shadow);
> +       ci_start = swp_cluster_offset(entry);
> +       ci_end = ci_start + nr_pages;
> +       ci_off = ci_start;
> +       do {
> +               /* If shadow is NULL, we set an empty shadow */
> +               exist = __swap_table_xchg(ci, ci_off, swp_tb);
> +               WARN_ON_ONCE(!swp_tb_is_folio(exist) ||
> +                            swp_tb_to_folio(exist) != folio);
> +       } while (++ci_off < ci_end);
> +
>         folio->swap.val = 0;
>         folio_clear_swapcache(folio);
> -       address_space->nrpages -= nr;
> -       __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
> -       __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
> +       node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
> +       lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
>  }
>
>  /**
> @@ -223,12 +226,12 @@ void __swap_cache_del_folio(struct folio *folio,
>   */
>  void swap_cache_del_folio(struct folio *folio)
>  {
> +       struct swap_cluster_info *ci;
>         swp_entry_t entry = folio->swap;
> -       struct address_space *address_space = swap_address_space(entry);
>
> -       xa_lock_irq(&address_space->i_pages);
> -       __swap_cache_del_folio(folio, entry, NULL);
> -       xa_unlock_irq(&address_space->i_pages);
> +       ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
> +       __swap_cache_del_folio(ci, folio, entry, NULL);
> +       swap_cluster_unlock(ci);
>
>         put_swap_folio(folio, entry);
>         folio_ref_sub(folio, folio_nr_pages(folio));
> @@ -236,7 +239,7 @@ void swap_cache_del_folio(struct folio *folio)
>
>  /**
>   * __swap_cache_replace_folio - Replace a folio in the swap cache.
> - * @mapping: Swap mapping address space.
> + * @ci: The locked swap cluster.
>   * @entry: The first swap entry that the new folio corresponds to.
>   * @old: The old folio to be replaced.
>   * @new: The new folio.
> @@ -244,64 +247,58 @@ void swap_cache_del_folio(struct folio *folio)
>   * Replace a existing folio in the swap cache with a new folio.
>   *
>   * Context: Caller must ensure both folios are locked, and lock the
> - * swap address_space that holds the entries to be replaced.
> + * cluster that holds the entries to be replaced.
>   */
> -void __swap_cache_replace_folio(struct address_space *mapping,
> -                               swp_entry_t entry,
> +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
>                                 struct folio *old, struct folio *new)
>  {
> +       unsigned int ci_off = swp_cluster_offset(entry);
>         unsigned long nr_pages = folio_nr_pages(new);
> -       unsigned long offset = swap_cache_index(entry);
> -       unsigned long end = offset + nr_pages;
> -       XA_STATE(xas, &mapping->i_pages, offset);
> +       unsigned int ci_end = ci_off + nr_pages;
> +       unsigned long exist, swp_tb;
>
>         VM_WARN_ON_ONCE(entry.val != new->swap.val);
>         VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> -
> -       /* Swap cache still stores N entries instead of a high-order entry */
> +       swp_tb = folio_to_swp_tb(new);
>         do {
> -               WARN_ON_ONCE(xas_store(&xas, new) != old);
> -               xas_next(&xas);
> -       } while (++offset < end);
> +               exist = __swap_table_xchg(ci, ci_off, swp_tb);
> +               WARN_ON_ONCE(!swp_tb_is_folio(exist) || swp_tb_to_folio(exist) != old);
> +       } while (++ci_off < ci_end);
> +
> +       /*
> +        * If the old folio is partially replaced (e.g., splitting a large
> +        * folio, the old folio is shrunk, and new split sub folios replace
> +        * the shrunk part), ensure the new folio doesn't overlap it.
> +        */
> +       if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> +           folio_order(old) != folio_order(new)) {
> +               ci_off = swp_cluster_offset(old->swap);
> +               ci_end = ci_off + folio_nr_pages(old);
> +               while (ci_off++ < ci_end)
> +                       WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> +       }
>  }
>
>  /**
>   * swap_cache_clear_shadow - Clears a set of shadows in the swap cache.
> - * @type: Indicates the swap device.
> - * @begin: Beginning offset of the range.
> - * @end: Ending offset of the range.
> + * @entry: The starting index entry.
> + * @nr_ents: How many slots need to be cleared.
>   *
> - * Context: Caller must ensure the range is valid and hold a reference to
> - * the swap device.
> + * Context: Caller must ensure the range is valid and not occupied by
> + * any folio, and protect the swap device with a reference count or locks.
>   */
> -void swap_cache_clear_shadow(int type, unsigned long begin,
> -                            unsigned long end)
> +void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
>  {
> -       unsigned long curr = begin;
> -       void *old;
> -
> -       for (;;) {
> -               swp_entry_t entry = swp_entry(type, curr);
> -               unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
> -               struct address_space *address_space = swap_address_space(entry);
> -               XA_STATE(xas, &address_space->i_pages, index);
> -
> -               xas_set_update(&xas, workingset_update_node);
> -
> -               xa_lock_irq(&address_space->i_pages);
> -               xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
> -                       if (!xa_is_value(old))
> -                               continue;
> -                       xas_store(&xas, NULL);
> -               }
> -               xa_unlock_irq(&address_space->i_pages);
> +       struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
> +       unsigned int ci_off = swp_cluster_offset(entry), ci_end;
> +       unsigned long old;
>
> -               /* search the next swapcache until we meet end */
> -               curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
> -               if (curr > end)
> -                       break;
> -       }
> +       ci_end = ci_off + nr_ents;
> +       do {
> +               old = __swap_table_xchg(ci, ci_off, null_to_swp_tb());
> +               WARN_ON_ONCE(swp_tb_is_folio(old));
> +       } while (++ci_off < ci_end);
>  }
>
>  /*
> @@ -481,10 +478,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>         if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
>                 goto fail_unlock;
>
> -       /* May fail (-ENOMEM) if XArray node allocation failed. */
> -       if (swap_cache_add_folio(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
> -               goto fail_unlock;
> -
> +       swap_cache_add_folio(new_folio, entry, &shadow);
>         memcg1_swapin(entry, 1);
>
>         if (shadow)
> @@ -676,41 +670,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         return folio;
>  }
>
> -int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> -{
> -       struct address_space *spaces, *space;
> -       unsigned int i, nr;
> -
> -       nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> -       spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
> -       if (!spaces)
> -               return -ENOMEM;
> -       for (i = 0; i < nr; i++) {
> -               space = spaces + i;
> -               xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> -               atomic_set(&space->i_mmap_writable, 0);
> -               space->a_ops = &swap_aops;
> -               /* swap cache doesn't use writeback related tags */
> -               mapping_set_no_writeback_tags(space);
> -       }
> -       nr_swapper_spaces[type] = nr;
> -       swapper_spaces[type] = spaces;
> -
> -       return 0;
> -}
> -
> -void exit_swap_address_space(unsigned int type)
> -{
> -       int i;
> -       struct address_space *spaces = swapper_spaces[type];
> -
> -       for (i = 0; i < nr_swapper_spaces[type]; i++)
> -               VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
> -       kvfree(spaces);
> -       nr_swapper_spaces[type] = 0;
> -       swapper_spaces[type] = NULL;
> -}
> -
>  static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
>                            unsigned long *end)
>  {
> @@ -883,7 +842,7 @@ static const struct attribute_group swap_attr_group = {
>         .attrs = swap_attrs,
>  };
>
> -static int __init swap_init_sysfs(void)
> +static int __init swap_init(void)
>  {
>         int err;
>         struct kobject *swap_kobj;
> @@ -898,11 +857,13 @@ static int __init swap_init_sysfs(void)
>                 pr_err("failed to register swap group\n");
>                 goto delete_obj;
>         }
> +       /* Swap cache writeback is LRU based, no tags for it */
> +       mapping_set_no_writeback_tags(&swap_space);
>         return 0;
>
>  delete_obj:
>         kobject_put(swap_kobj);
>         return err;
>  }
> -subsys_initcall(swap_init_sysfs);
> +subsys_initcall(swap_init);
>  #endif
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> new file mode 100644
> index 000000000000..e1f7cc009701
> --- /dev/null
> +++ b/mm/swap_table.h
> @@ -0,0 +1,97 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _MM_SWAP_TABLE_H
> +#define _MM_SWAP_TABLE_H
> +
> +#include "swap.h"
> +
> +/*
> + * A swap table entry represents the status of a swap slot on a swap
> + * (physical or virtual) device. The swap table in each cluster is a
> + * 1:1 map of the swap slots in this cluster.
> + *
> + * Each swap table entry could be a pointer (folio), a XA_VALUE
> + * (shadow), or NULL.
> + */
> +
> +/*
> + * Helpers for casting one type of info into a swap table entry.
> + */
> +static inline unsigned long null_to_swp_tb(void)
> +{
> +       BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
> +       return 0;
> +}
> +
> +static inline unsigned long folio_to_swp_tb(struct folio *folio)
> +{
> +       BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
> +       return (unsigned long)folio;
> +}
> +
> +static inline unsigned long shadow_swp_to_tb(void *shadow)
> +{
> +       BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
> +                    BITS_PER_BYTE * sizeof(unsigned long));
> +       VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
> +       return (unsigned long)shadow;
> +}
> +
> +/*
> + * Helpers for swap table entry type checking.
> + */
> +static inline bool swp_tb_is_null(unsigned long swp_tb)
> +{
> +       return !swp_tb;
> +}
> +
> +static inline bool swp_tb_is_folio(unsigned long swp_tb)
> +{
> +       return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
> +}
> +
> +static inline bool swp_tb_is_shadow(unsigned long swp_tb)
> +{
> +       return xa_is_value((void *)swp_tb);
> +}
> +
> +/*
> + * Helpers for retrieving info from swap table.
> + */
> +static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
> +{
> +       VM_WARN_ON(!swp_tb_is_folio(swp_tb));
> +       return (void *)swp_tb;
> +}
> +
> +static inline void *swp_tb_to_shadow(unsigned long swp_tb)
> +{
> +       VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
> +       return (void *)swp_tb;
> +}
> +
> +/*
> + * Helpers for accessing or modifying the swap table of a cluster,
> + * the swap cluster must be locked.
> + */
> +static inline void __swap_table_set(struct swap_cluster_info *ci,
> +                                   unsigned int off, unsigned long swp_tb)
> +{
> +       VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> +       atomic_long_set(&ci->table[off], swp_tb);
> +}
> +
> +static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
> +                                             unsigned int off, unsigned long swp_tb)
> +{
> +       VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> +       /* Ordering is guaranteed by cluster lock, relax */
> +       return atomic_long_xchg_relaxed(&ci->table[off], swp_tb);
> +}
> +
> +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> +                                            unsigned int off)
> +{
> +       VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> +       return atomic_long_read(&ci->table[off]);
> +}
> +#endif
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 731b541b1d33..cbb7d4c0773d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -46,6 +46,7 @@
>  #include <asm/tlbflush.h>
>  #include <linux/swapops.h>
>  #include <linux/swap_cgroup.h>
> +#include "swap_table.h"
>  #include "internal.h"
>  #include "swap.h"
>
> @@ -420,6 +421,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
>         return cluster_index(si, ci) * SWAPFILE_CLUSTER;
>  }
>
> +static int swap_table_alloc_table(struct swap_cluster_info *ci)
> +{
> +       WARN_ON(ci->table);
> +       ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> +       if (!ci->table)
> +               return -ENOMEM;
> +       return 0;
> +}
> +
> +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> +{
> +       unsigned int ci_off;
> +       unsigned long swp_tb;
> +
> +       if (!ci->table)
> +               return;
> +
> +       for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> +               swp_tb = __swap_table_get(ci, ci_off);
> +               if (!swp_tb_is_null(swp_tb))
> +                       pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> +                                   swp_tb);
> +       }
> +
> +       kfree(ci->table);
> +       ci->table = NULL;
> +}
> +
>  static void move_cluster(struct swap_info_struct *si,
>                          struct swap_cluster_info *ci, struct list_head *list,
>                          enum swap_cluster_flags new_flags)
> @@ -702,6 +731,26 @@ static bool cluster_scan_range(struct swap_info_struct *si,
>         return true;
>  }
>
> +/*
> + * Currently, the swap table is not used for count tracking, so it
> + * should be empty when slots are freed; do a sanity check here to
> + * ensure nothing was leaked.
> + */
> +static void cluster_table_check(struct swap_cluster_info *ci,
> +                               unsigned int start, unsigned int nr)
> +{
> +       unsigned int ci_off = start % SWAPFILE_CLUSTER;
> +       unsigned int ci_end = ci_off + nr;
> +       unsigned long swp_tb;
> +
> +       if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> +               do {
> +                       swp_tb = __swap_table_get(ci, ci_off);
> +                       VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
> +               } while (++ci_off < ci_end);
> +       }
> +}
> +
>  static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
>                                 unsigned int start, unsigned char usage,
>                                 unsigned int order)
> @@ -721,6 +770,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>                 ci->order = order;
>
>         memset(si->swap_map + start, usage, nr_pages);
> +       cluster_table_check(ci, start, nr_pages);
>         swap_range_alloc(si, nr_pages);
>         ci->count += nr_pages;
>
> @@ -1123,7 +1173,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>                         swap_slot_free_notify(si->bdev, offset);
>                 offset++;
>         }
> -       swap_cache_clear_shadow(si->type, begin, end);
> +       __swap_cache_clear_shadow(swp_entry(si->type, begin), nr_entries);
>
>         /*
>          * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
> @@ -1280,16 +1330,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
>         if (!entry.val)
>                 return -ENOMEM;
>
> -       /*
> -        * XArray node allocations from PF_MEMALLOC contexts could
> -        * completely exhaust the page allocator. __GFP_NOMEMALLOC
> -        * stops emergency reserves from being allocated.
> -        *
> -        * TODO: this could cause a theoretical memory reclaim
> -        * deadlock in the swap out path.
> -        */
> -       if (swap_cache_add_folio(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
> -               goto out_free;
> +       swap_cache_add_folio(folio, entry, NULL);
>
>         return 0;
>
> @@ -1555,6 +1596,7 @@ static void swap_entries_free(struct swap_info_struct *si,
>
>         mem_cgroup_uncharge_swap(entry, nr_pages);
>         swap_range_free(si, offset, nr_pages);
> +       cluster_table_check(ci, offset, nr_pages);
>
>         if (!ci->count)
>                 free_cluster(si, ci);
> @@ -2632,6 +2674,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
>         }
>  }
>
> +static void free_cluster_info(struct swap_cluster_info *cluster_info,
> +                             unsigned long maxpages)
> +{
> +       int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> +
> +       if (!cluster_info)
> +               return;
> +       for (i = 0; i < nr_clusters; i++)
> +               swap_cluster_free_table(&cluster_info[i]);
> +       kvfree(cluster_info);
> +}
> +
>  /*
>   * Called after swap device's reference count is dead, so
>   * neither scan nor allocation will use it.
> @@ -2766,12 +2820,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>
>         swap_file = p->swap_file;
>         p->swap_file = NULL;
> -       p->max = 0;
>         swap_map = p->swap_map;
>         p->swap_map = NULL;
>         zeromap = p->zeromap;
>         p->zeromap = NULL;
>         cluster_info = p->cluster_info;
> +       free_cluster_info(cluster_info, p->max);
> +       p->max = 0;
>         p->cluster_info = NULL;
>         spin_unlock(&p->lock);
>         spin_unlock(&swap_lock);
> @@ -2782,10 +2837,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         p->global_cluster = NULL;
>         vfree(swap_map);
>         kvfree(zeromap);
> -       kvfree(cluster_info);
>         /* Destroy swap account information */
>         swap_cgroup_swapoff(p->type);
> -       exit_swap_address_space(p->type);
>
>         inode = mapping->host;
>
> @@ -3169,8 +3222,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         if (!cluster_info)
>                 goto err;
>
> -       for (i = 0; i < nr_clusters; i++)
> +       for (i = 0; i < nr_clusters; i++) {
>                 spin_lock_init(&cluster_info[i].lock);
> +               if (swap_table_alloc_table(&cluster_info[i]))
> +                       goto err_free;
> +       }
>
>         if (!(si->flags & SWP_SOLIDSTATE)) {
>                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> @@ -3231,9 +3287,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         }
>
>         return cluster_info;
> -
>  err_free:
> -       kvfree(cluster_info);
> +       free_cluster_info(cluster_info, maxpages);
>  err:
>         return ERR_PTR(err);
>  }
> @@ -3427,13 +3482,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 }
>         }
>
> -       error = init_swap_address_space(si->type, maxpages);
> -       if (error)
> -               goto bad_swap_unlock_inode;
> -
>         error = zswap_swapon(si->type, maxpages);
>         if (error)
> -               goto free_swap_address_space;
> +               goto bad_swap_unlock_inode;
>
>         /*
>          * Flush any pending IO and dirty mappings before we start using this
> @@ -3468,8 +3519,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         goto out;
>  free_swap_zswap:
>         zswap_swapoff(si->type);
> -free_swap_address_space:
> -       exit_swap_address_space(si->type);
>  bad_swap_unlock_inode:
>         inode_unlock(inode);
>  bad_swap:
> @@ -3484,7 +3533,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         spin_unlock(&swap_lock);
>         vfree(swap_map);
>         kvfree(zeromap);
> -       kvfree(cluster_info);
> +       if (cluster_info)
> +               free_cluster_info(cluster_info, maxpages);
>         if (inced_nr_rotate_swap)
>                 atomic_dec(&nr_rotate_swap);
>         if (swap_file)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c79c6806560b..1d5335993313 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
>  {
>         int refcount;
>         void *shadow = NULL;
> +       struct swap_cluster_info *ci;
>
>         BUG_ON(!folio_test_locked(folio));
>         BUG_ON(mapping != folio_mapping(folio));
>
> -       if (!folio_test_swapcache(folio))
> +       if (folio_test_swapcache(folio)) {
> +               ci = swap_cluster_lock_by_folio_irq(folio);
> +       } else {
>                 spin_lock(&mapping->host->i_lock);
> -       xa_lock_irq(&mapping->i_pages);
> +               xa_lock_irq(&mapping->i_pages);
> +       }
> +
>         /*
>          * The non racy check for a busy folio.
>          *
> @@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
>
>                 if (reclaimed && !mapping_exiting(mapping))
>                         shadow = workingset_eviction(folio, target_memcg);
> -               __swap_cache_del_folio(folio, swap, shadow);
> +               __swap_cache_del_folio(ci, folio, swap, shadow);
>                 memcg1_swapout(folio, swap);
> -               xa_unlock_irq(&mapping->i_pages);
> +               swap_cluster_unlock_irq(ci);
>                 put_swap_folio(folio, swap);
>         } else {
>                 void (*free_folio)(struct folio *);
> @@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
>         return 1;
>
>  cannot_free:
> -       xa_unlock_irq(&mapping->i_pages);
> -       if (!folio_test_swapcache(folio))
> +       if (folio_test_swapcache(folio)) {
> +               swap_cluster_unlock_irq(ci);
> +       } else {
> +               xa_unlock_irq(&mapping->i_pages);
>                 spin_unlock(&mapping->host->i_lock);
> +       }
>         return 0;
>  }
>
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache
  2025-09-05 19:13 ` [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache Kairui Song
@ 2025-09-06 15:30   ` Chris Li
  2025-09-08 13:12   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06 15:30 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel, kernel test robot

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cluster setup will try to shuffle the clusters on initialization.
> This was helpful to avoid contention for the swap cache space. The
> cluster size (2M) was much smaller than each swap cache space (64M), so
> shuffling the clusters meant the allocator would try to allocate swap
> slots in different swap cache spaces for each CPU, reducing the chance
> of two CPUs using the same swap cache space, and hence reducing the
> contention.
>
> Now that the swap cache is managed by swap clusters, this shuffle is
> pointless. Just remove it, and clean up the related macros.
>
> This also improves the HDD swap performance as shuffling IO is a bad
> idea for HDD, and now the shuffling is gone. Tests have shown a ~40%
> performance gain for HDD [1]:
>
> Doing sequential swap in of 8G data using 8 processes with usemem,
> average of 3 test runs:
>
> Before: 1270.91 KB/s per process
> After:  1849.54 KB/s per process
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1]
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Barry Song <baohua@kernel.org>
> ---
>  mm/swap.h     |  4 ----
>  mm/swapfile.c | 32 ++++++++------------------------
>  mm/zswap.c    |  7 +++++--
>  3 files changed, 13 insertions(+), 30 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index e48431a26f89..c4fb28845d77 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -197,10 +197,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
>  void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
>
>  /* linux/mm/swap_state.c */
> -/* One swap address space for each 64M swap space */
> -#define SWAP_ADDRESS_SPACE_SHIFT       14
> -#define SWAP_ADDRESS_SPACE_PAGES       (1 << SWAP_ADDRESS_SPACE_SHIFT)
> -#define SWAP_ADDRESS_SPACE_MASK                (SWAP_ADDRESS_SPACE_PAGES - 1)
>  extern struct address_space swap_space __ro_after_init;
>  static inline struct address_space *swap_address_space(swp_entry_t entry)
>  {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cbb7d4c0773d..6b3b35a7ddd9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3202,21 +3202,14 @@ static int setup_swap_map(struct swap_info_struct *si,
>         return 0;
>  }
>
> -#define SWAP_CLUSTER_INFO_COLS                                         \
> -       DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
> -#define SWAP_CLUSTER_SPACE_COLS                                                \
> -       DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
> -#define SWAP_CLUSTER_COLS                                              \
> -       max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
> -
>  static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>                                                 union swap_header *swap_header,
>                                                 unsigned long maxpages)
>  {
>         unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>         struct swap_cluster_info *cluster_info;
> -       unsigned long i, j, idx;
>         int err = -ENOMEM;
> +       unsigned long i;
>
>         cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
>         if (!cluster_info)
> @@ -3265,22 +3258,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>                 INIT_LIST_HEAD(&si->frag_clusters[i]);
>         }
>
> -       /*
> -        * Reduce false cache line sharing between cluster_info and
> -        * sharing same address space.
> -        */
> -       for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
> -               for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> -                       struct swap_cluster_info *ci;
> -                       idx = i * SWAP_CLUSTER_COLS + j;
> -                       ci = cluster_info + idx;
> -                       if (idx >= nr_clusters)
> -                               continue;
> -                       if (ci->count) {
> -                               ci->flags = CLUSTER_FLAG_NONFULL;
> -                               list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> -                               continue;
> -                       }
> +       for (i = 0; i < nr_clusters; i++) {
> +               struct swap_cluster_info *ci = &cluster_info[i];
> +
> +               if (ci->count) {
> +                       ci->flags = CLUSTER_FLAG_NONFULL;
> +                       list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> +               } else {
>                         ci->flags = CLUSTER_FLAG_FREE;
>                         list_add_tail(&ci->list, &si->free_clusters);
>                 }
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 3dda4310099e..cba7077fda40 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -225,10 +225,13 @@ static bool zswap_has_pool;
>  * helpers and fwd declarations
>  **********************************/
>
> +/* One swap address space for each 64M swap space */
> +#define ZSWAP_ADDRESS_SPACE_SHIFT 14
> +#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
>  static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>  {
>         return &zswap_trees[swp_type(swp)][swp_offset(swp)
> -               >> SWAP_ADDRESS_SPACE_SHIFT];
> +               >> ZSWAP_ADDRESS_SPACE_SHIFT];
>  }
>
>  #define zswap_pool_debug(msg, p)                       \
> @@ -1674,7 +1677,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
>         struct xarray *trees, *tree;
>         unsigned int nr, i;
>
> -       nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> +       nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
>         trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
>         if (!trees) {
>                 pr_err("alloc failed, zswap disabled for swap type %d\n", type);
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check
  2025-09-05 19:13 ` [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check Kairui Song
@ 2025-09-06 15:35   ` Chris Li
  2025-09-08 13:10   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06 15:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

I think adding one more check is fine.

I don't think there are exceptions, but I am not 100% sure either. If
there are any violations, we can catch them now, which is a good thing.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cache is now backed by the swap table, and the address space no
> longer holds any mutable data. The swap cache is now protected by the
> swap cluster lock instead of the XArray lock. All access to the swap
> cache is wrapped by swap cache helpers. Locking is mostly handled
> internally by these helpers; only a few __swap_cache_* helpers require
> the caller to lock the cluster themselves.
>
> Worth noting that, unlike the XArray, the cluster lock is not IRQ safe.
> The swap cache was already very different from the filemap, and now it
> is completely separated from it. Nothing marks or changes anything in
> the swap cache, or performs a writeback callback, from IRQ context.
>
> So explicitly document this and add a debug check to avoid further
> potential misuse. And mark the swap cache space as read-only to avoid
> any user wrongly mixing unexpected filemap helpers with swap cache.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap.h       | 12 +++++++++++-
>  mm/swap_state.c |  3 ++-
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index bf4e54f1f6b6..e48431a26f89 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -99,6 +99,16 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
>  {
>         struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
>
> +       /*
> +        * Nothing modifies swap cache in an IRQ context. All access to
> +        * swap cache is wrapped by swap_cache_* helpers, and swap cache
> +        * writeback is handled outside of IRQs. Swapin or swapout never
> +        * occurs in IRQ, and neither does in-place split or replace.
> +        *
> +        * Besides, modifying swap cache requires synchronization with
> +        * swap_map, which was never IRQ safe.
> +        */
> +       VM_WARN_ON_ONCE(!in_task());
>         VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
>         if (irq)
>                 spin_lock_irq(&ci->lock);
> @@ -191,7 +201,7 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
>  #define SWAP_ADDRESS_SPACE_SHIFT       14
>  #define SWAP_ADDRESS_SPACE_PAGES       (1 << SWAP_ADDRESS_SPACE_SHIFT)
>  #define SWAP_ADDRESS_SPACE_MASK                (SWAP_ADDRESS_SPACE_PAGES - 1)
> -extern struct address_space swap_space;
> +extern struct address_space swap_space __ro_after_init;
>  static inline struct address_space *swap_address_space(swp_entry_t entry)
>  {
>         return &swap_space;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 7147b390745f..209d5e9e8d90 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -37,7 +37,8 @@ static const struct address_space_operations swap_aops = {
>  #endif
>  };
>
> -struct address_space swap_space __read_mostly = {
> +/* Set swap_space as read only as swap cache is handled by swap table */
> +struct address_space swap_space __ro_after_init = {
>         .a_ops = &swap_aops,
>  };
>
> --
> 2.51.0
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 14/15] mm, swap: implement dynamic allocation of swap table
  2025-09-05 19:13 ` [PATCH v2 14/15] mm, swap: implement dynamic allocation of swap table Kairui Song
@ 2025-09-06 15:45   ` Chris Li
  2025-09-08 14:58     ` Kairui Song
  0 siblings, 1 reply; 80+ messages in thread
From: Chris Li @ 2025-09-06 15:45 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

Hi Kairui,

Acked-by: Chris Li <chrisl@kernel.org>

BTW, if you made some changes after my last ack, please drop my ack
tag on the new version, or clarify that the ack was on the older
version, so I know this version has new changes.

Chris

On Fri, Sep 5, 2025 at 12:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The swap table is now cluster based, which means a free cluster can free
> its table, since no one should be modifying it.
>
> There could still be speculative readers, like swap cache lookup; protect
> them by making the table RCU protected. All swap tables should be filled
> with null entries before being freed, so such readers will either see a
> NULL pointer or a null-filled table being lazily freed.
>
> On allocation, allocate the table when a cluster is first used by an
> allocation of any order.
>
> This way, we can significantly reduce the memory usage of large swap
> devices.
>
> This idea to dynamically release unused swap cluster data was initially
> suggested by Chris Li while proposing the cluster swap allocator and
> it suits the swap table idea very well.
>
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> ---
>  mm/swap.h       |   2 +-
>  mm/swap_state.c |   9 +--
>  mm/swap_table.h |  37 ++++++++-
>  mm/swapfile.c   | 202 ++++++++++++++++++++++++++++++++++++++----------
>  4 files changed, 199 insertions(+), 51 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index c4fb28845d77..caff4fe30fc5 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -36,7 +36,7 @@ struct swap_cluster_info {
>         u16 count;
>         u8 flags;
>         u8 order;
> -       atomic_long_t *table;   /* Swap table entries, see mm/swap_table.h */
> +       atomic_long_t __rcu *table;     /* Swap table entries, see mm/swap_table.h */
>         struct list_head list;
>  };
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 209d5e9e8d90..dfe8f42fc309 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -92,8 +92,8 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
>         struct folio *folio;
>
>         for (;;) {
> -               swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
> -                                         swp_cluster_offset(entry));
> +               swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
> +                                       swp_cluster_offset(entry));
>                 if (!swp_tb_is_folio(swp_tb))
>                         return NULL;
>                 folio = swp_tb_to_folio(swp_tb);
> @@ -116,11 +116,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
>  {
>         unsigned long swp_tb;
>
> -       swp_tb = __swap_table_get(__swap_entry_to_cluster(entry),
> -                                 swp_cluster_offset(entry));
> +       swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
> +                               swp_cluster_offset(entry));
>         if (swp_tb_is_shadow(swp_tb))
>                 return swp_tb_to_shadow(swp_tb);
> -
>         return NULL;
>  }
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index e1f7cc009701..52254e455304 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -2,8 +2,15 @@
>  #ifndef _MM_SWAP_TABLE_H
>  #define _MM_SWAP_TABLE_H
>
> +#include <linux/rcupdate.h>
> +#include <linux/atomic.h>
>  #include "swap.h"
>
> +/* A typical flat array in each cluster as swap table */
> +struct swap_table {
> +       atomic_long_t entries[SWAPFILE_CLUSTER];
> +};
> +
>  /*
>   * A swap table entry represents the status of a swap slot on a swap
>   * (physical or virtual) device. The swap table in each cluster is a
> @@ -76,22 +83,46 @@ static inline void *swp_tb_to_shadow(unsigned long swp_tb)
>  static inline void __swap_table_set(struct swap_cluster_info *ci,
>                                     unsigned int off, unsigned long swp_tb)
>  {
> +       atomic_long_t *table = rcu_dereference_protected(ci->table, true);
> +
> +       lockdep_assert_held(&ci->lock);
>         VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> -       atomic_long_set(&ci->table[off], swp_tb);
> +       atomic_long_set(&table[off], swp_tb);
>  }
>
>  static inline unsigned long __swap_table_xchg(struct swap_cluster_info *ci,
>                                               unsigned int off, unsigned long swp_tb)
>  {
> +       atomic_long_t *table = rcu_dereference_protected(ci->table, true);
> +
> +       lockdep_assert_held(&ci->lock);
>         VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
>         /* Ordering is guaranteed by cluster lock, relax */
> -       return atomic_long_xchg_relaxed(&ci->table[off], swp_tb);
> +       return atomic_long_xchg_relaxed(&table[off], swp_tb);
>  }
>
>  static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
>                                              unsigned int off)
>  {
> +       atomic_long_t *table;
> +
>         VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> -       return atomic_long_read(&ci->table[off]);
> +       table = rcu_dereference_check(ci->table, lockdep_is_held(&ci->lock));
> +
> +       return atomic_long_read(&table[off]);
> +}
> +
> +static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
> +                                       unsigned int off)
> +{
> +       atomic_long_t *table;
> +       unsigned long swp_tb;
> +
> +       rcu_read_lock();
> +       table = rcu_dereference(ci->table);
> +       swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
> +       rcu_read_unlock();
> +
> +       return swp_tb;
>  }
>  #endif
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6b3b35a7ddd9..49f93069faef 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -105,6 +105,8 @@ static DEFINE_SPINLOCK(swap_avail_lock);
>
>  struct swap_info_struct *swap_info[MAX_SWAPFILES];
>
> +static struct kmem_cache *swap_table_cachep;
> +
>  static DEFINE_MUTEX(swapon_mutex);
>
>  static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
> @@ -400,10 +402,17 @@ static inline bool cluster_is_discard(struct swap_cluster_info *info)
>         return info->flags == CLUSTER_FLAG_DISCARD;
>  }
>
> +static inline bool cluster_table_is_alloced(struct swap_cluster_info *ci)
> +{
> +       return rcu_dereference_protected(ci->table, lockdep_is_held(&ci->lock));
> +}
> +
>  static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
>  {
>         if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
>                 return false;
> +       if (!cluster_table_is_alloced(ci))
> +               return false;
>         if (!order)
>                 return true;
>         return cluster_is_empty(ci) || order == ci->order;
> @@ -421,32 +430,98 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
>         return cluster_index(si, ci) * SWAPFILE_CLUSTER;
>  }
>
> -static int swap_table_alloc_table(struct swap_cluster_info *ci)
> +static void swap_cluster_free_table(struct swap_cluster_info *ci)
>  {
> -       WARN_ON(ci->table);
> -       ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> -       if (!ci->table)
> -               return -ENOMEM;
> -       return 0;
> +       unsigned int ci_off;
> +       struct swap_table *table;
> +
> +       /* Only an empty cluster's table is allowed to be freed */
> +       lockdep_assert_held(&ci->lock);
> +       VM_WARN_ON_ONCE(!cluster_is_empty(ci));
> +       for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
> +               VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
> +       table = (void *)rcu_dereference_protected(ci->table, true);
> +       rcu_assign_pointer(ci->table, NULL);
> +
> +       kmem_cache_free(swap_table_cachep, table);
>  }
>
> -static void swap_cluster_free_table(struct swap_cluster_info *ci)
> +/*
> + * Allocating a swap table may need to sleep, during which the task can
> + * migrate to another CPU, so attempt an atomic allocation first, then
> + * fall back and handle the potential race.
> + */
> +static struct swap_cluster_info *
> +swap_cluster_alloc_table(struct swap_info_struct *si,
> +                        struct swap_cluster_info *ci,
> +                        int order)
>  {
> -       unsigned int ci_off;
> -       unsigned long swp_tb;
> +       struct swap_cluster_info *pcp_ci;
> +       struct swap_table *table;
> +       unsigned long offset;
>
> -       if (!ci->table)
> -               return;
> +       /*
> +        * Only cluster isolation from the allocator does table allocation.
> +        * Swap allocator uses a percpu cluster and holds the local lock.
> +        */
> +       lockdep_assert_held(&ci->lock);
> +       lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
> +
> +       table = kmem_cache_zalloc(swap_table_cachep,
> +                                 __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +       if (table) {
> +               rcu_assign_pointer(ci->table, table);
> +               return ci;
> +       }
> +
> +       /*
> +        * Try a sleeping allocation. Each isolated free cluster may cause
> +        * one, but there is only a limited number of them, so the
> +        * potential recursive allocation should be limited.
> +        */
> +       spin_unlock(&ci->lock);
> +       if (!(si->flags & SWP_SOLIDSTATE))
> +               spin_unlock(&si->global_cluster_lock);
> +       local_unlock(&percpu_swap_cluster.lock);
> +       table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
>
> -       for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> -               swp_tb = __swap_table_get(ci, ci_off);
> -               if (!swp_tb_is_null(swp_tb))
> -                       pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> -                                   swp_tb);
> +       local_lock(&percpu_swap_cluster.lock);
> +       if (!(si->flags & SWP_SOLIDSTATE))
> +               spin_lock(&si->global_cluster_lock);
> +       /*
> +        * Back to atomic context. First, check if we migrated to a new
> +        * CPU with a usable percpu cluster. If so, try using that instead.
> +        * No need to check this for spinning devices, as swap allocation
> +        * is serialized by the global lock on them.
> +        *
> +        * The is_usable check is a bit rough, but ensures order 0 success.
> +        */
> +       offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> +       if ((si->flags & SWP_SOLIDSTATE) && offset) {
> +               pcp_ci = swap_cluster_lock(si, offset);
> +               if (cluster_is_usable(pcp_ci, order) &&
> +                   pcp_ci->count < SWAPFILE_CLUSTER) {
> +                       ci = pcp_ci;
> +                       goto free_table;
> +               }
> +               swap_cluster_unlock(pcp_ci);
>         }
>
> -       kfree(ci->table);
> -       ci->table = NULL;
> +       if (!table)
> +               return NULL;
> +
> +       spin_lock(&ci->lock);
> +       /* Nothing should have touched the dangling empty cluster. */
> +       if (WARN_ON_ONCE(cluster_table_is_alloced(ci)))
> +               goto free_table;
> +
> +       rcu_assign_pointer(ci->table, table);
> +       return ci;
> +
> +free_table:
> +       if (table)
> +               kmem_cache_free(swap_table_cachep, table);
> +       return ci;
>  }
>
>  static void move_cluster(struct swap_info_struct *si,
> @@ -478,7 +553,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> -       lockdep_assert_held(&ci->lock);
> +       swap_cluster_free_table(ci);
>         move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
>         ci->order = 0;
>  }
> @@ -493,15 +568,11 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
>   * this returns NULL for an non-empty list.
>   */
>  static struct swap_cluster_info *isolate_lock_cluster(
> -               struct swap_info_struct *si, struct list_head *list)
> +               struct swap_info_struct *si, struct list_head *list, int order)
>  {
> -       struct swap_cluster_info *ci, *ret = NULL;
> +       struct swap_cluster_info *ci, *found = NULL;
>
>         spin_lock(&si->lock);
> -
> -       if (unlikely(!(si->flags & SWP_WRITEOK)))
> -               goto out;
> -
>         list_for_each_entry(ci, list, list) {
>                 if (!spin_trylock(&ci->lock))
>                         continue;
> @@ -513,13 +584,19 @@ static struct swap_cluster_info *isolate_lock_cluster(
>
>                 list_del(&ci->list);
>                 ci->flags = CLUSTER_FLAG_NONE;
> -               ret = ci;
> +               found = ci;
>                 break;
>         }
> -out:
>         spin_unlock(&si->lock);
>
> -       return ret;
> +       if (found && !cluster_table_is_alloced(found)) {
> +               /* Only an empty free cluster's swap table can be freed. */
> +               VM_WARN_ON_ONCE(list != &si->free_clusters);
> +               VM_WARN_ON_ONCE(!cluster_is_empty(found));
> +               return swap_cluster_alloc_table(si, found, order);
> +       }
> +
> +       return found;
>  }
>
>  /*
> @@ -652,17 +729,27 @@ static void relocate_cluster(struct swap_info_struct *si,
>   * added to free cluster list and its usage counter will be increased by 1.
>   * Only used for initialization.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *si,
> +static int inc_cluster_info_page(struct swap_info_struct *si,
>         struct swap_cluster_info *cluster_info, unsigned long page_nr)
>  {
>         unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> +       struct swap_table *table;
>         struct swap_cluster_info *ci;
>
>         ci = cluster_info + idx;
> +       if (!ci->table) {
> +               table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
> +               if (!table)
> +                       return -ENOMEM;
> +               rcu_assign_pointer(ci->table, table);
> +       }
> +
>         ci->count++;
>
>         VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
>         VM_BUG_ON(ci->flags);
> +
> +       return 0;
>  }
>
>  static bool cluster_reclaim_range(struct swap_info_struct *si,
> @@ -844,7 +931,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
>         unsigned int found = SWAP_ENTRY_INVALID;
>
>         do {
> -               struct swap_cluster_info *ci = isolate_lock_cluster(si, list);
> +               struct swap_cluster_info *ci = isolate_lock_cluster(si, list, order);
>                 unsigned long offset;
>
>                 if (!ci)
> @@ -869,7 +956,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
>         if (force)
>                 to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
>
> -       while ((ci = isolate_lock_cluster(si, &si->full_clusters))) {
> +       while ((ci = isolate_lock_cluster(si, &si->full_clusters, 0))) {
>                 offset = cluster_offset(si, ci);
>                 end = min(si->max, offset + SWAPFILE_CLUSTER);
>                 to_scan--;
> @@ -1017,6 +1104,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  done:
>         if (!(si->flags & SWP_SOLIDSTATE))
>                 spin_unlock(&si->global_cluster_lock);
> +
>         return found;
>  }
>
> @@ -1884,7 +1972,13 @@ swp_entry_t get_swap_page_of_type(int type)
>         /* This is called for allocating swap entry, not cache */
>         if (get_swap_device_info(si)) {
>                 if (si->flags & SWP_WRITEOK) {
> +                       /*
> +                        * Grab the local lock to be compliant
> +                        * with swap table allocation.
> +                        */
> +                       local_lock(&percpu_swap_cluster.lock);
>                         offset = cluster_alloc_swap_entry(si, 0, 1);
> +                       local_unlock(&percpu_swap_cluster.lock);
>                         if (offset) {
>                                 entry = swp_entry(si->type, offset);
>                                 atomic_long_dec(&nr_swap_pages);
> @@ -2677,12 +2771,21 @@ static void wait_for_allocation(struct swap_info_struct *si)
>  static void free_cluster_info(struct swap_cluster_info *cluster_info,
>                               unsigned long maxpages)
>  {
> +       struct swap_cluster_info *ci;
>         int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>
>         if (!cluster_info)
>                 return;
> -       for (i = 0; i < nr_clusters; i++)
> -               swap_cluster_free_table(&cluster_info[i]);
> +       for (i = 0; i < nr_clusters; i++) {
> +               ci = cluster_info + i;
> +               /* A cluster counting bad pages will still have a remaining table */
> +               spin_lock(&ci->lock);
> +               if (rcu_dereference_protected(ci->table, true)) {
> +                       ci->count = 0;
> +                       swap_cluster_free_table(ci);
> +               }
> +               spin_unlock(&ci->lock);
> +       }
>         kvfree(cluster_info);
>  }
>
> @@ -2718,6 +2821,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         struct address_space *mapping;
>         struct inode *inode;
>         struct filename *pathname;
> +       unsigned int maxpages;
>         int err, found = 0;
>
>         if (!capable(CAP_SYS_ADMIN))
> @@ -2824,8 +2928,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         p->swap_map = NULL;
>         zeromap = p->zeromap;
>         p->zeromap = NULL;
> +       maxpages = p->max;
>         cluster_info = p->cluster_info;
> -       free_cluster_info(cluster_info, p->max);
>         p->max = 0;
>         p->cluster_info = NULL;
>         spin_unlock(&p->lock);
> @@ -2837,6 +2941,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         p->global_cluster = NULL;
>         vfree(swap_map);
>         kvfree(zeromap);
> +       free_cluster_info(cluster_info, maxpages);
>         /* Destroy swap account information */
>         swap_cgroup_swapoff(p->type);
>
> @@ -3215,11 +3320,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         if (!cluster_info)
>                 goto err;
>
> -       for (i = 0; i < nr_clusters; i++) {
> +       for (i = 0; i < nr_clusters; i++)
>                 spin_lock_init(&cluster_info[i].lock);
> -               if (swap_table_alloc_table(&cluster_info[i]))
> -                       goto err_free;
> -       }
>
>         if (!(si->flags & SWP_SOLIDSTATE)) {
>                 si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> @@ -3238,16 +3340,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>          * See setup_swap_map(): header page, bad pages,
>          * and the EOF part of the last cluster.
>          */
> -       inc_cluster_info_page(si, cluster_info, 0);
> +       err = inc_cluster_info_page(si, cluster_info, 0);
> +       if (err)
> +               goto err;
>         for (i = 0; i < swap_header->info.nr_badpages; i++) {
>                 unsigned int page_nr = swap_header->info.badpages[i];
>
>                 if (page_nr >= maxpages)
>                         continue;
> -               inc_cluster_info_page(si, cluster_info, page_nr);
> +               err = inc_cluster_info_page(si, cluster_info, page_nr);
> +               if (err)
> +                       goto err;
> +       }
> +       for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
> +               err = inc_cluster_info_page(si, cluster_info, i);
> +               if (err)
> +                       goto err;
>         }
> -       for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
> -               inc_cluster_info_page(si, cluster_info, i);
>
>         INIT_LIST_HEAD(&si->free_clusters);
>         INIT_LIST_HEAD(&si->full_clusters);
> @@ -3961,6 +4070,15 @@ static int __init swapfile_init(void)
>
>         swapfile_maximum_size = arch_max_swapfile_size();
>
> +       /*
> +        * Once a cluster is freed, its swap table content is read
> +        * only, and all swap cache readers (swap_cache_*) verify
> +        * the content before use. So it's safe to use an RCU slab here.
> +        */
> +       swap_table_cachep = kmem_cache_create("swap_table",
> +                           sizeof(struct swap_table),
> +                           0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
> +
>  #ifdef CONFIG_MIGRATION
>         if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
>                 swap_migration_ad_supported = true;
> --
> 2.51.0
>
>
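The SLAB_TYPESAFE_BY_RCU choice in swapfile_init() is the subtle part here: a
freed swap table can be reused for another cluster while a lockless reader
still holds a pointer into it, so every reader has to re-validate whatever it
finds. A minimal, generic sketch of that reader pattern (illustrative names
only; example_index_lookup() and obj_put() are hypothetical, and this is not
the actual swap cache lookup code):

struct obj {
	unsigned long key;
	refcount_t ref;
};

static struct obj *obj_lookup(unsigned long key)
{
	struct obj *o;

	rcu_read_lock();
	o = example_index_lookup(key);		/* hypothetical lockless index */
	if (o && !refcount_inc_not_zero(&o->ref))
		o = NULL;			/* object is being freed */
	rcu_read_unlock();

	/*
	 * The slab may have recycled the object for a new user, but it is
	 * still a valid struct obj, so the re-check below is safe.
	 */
	if (o && READ_ONCE(o->key) != key) {
		obj_put(o);			/* hypothetical ref drop */
		o = NULL;
	}
	return o;
}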


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 15/15] mm, swap: use a single page for swap table when the size fits
  2025-09-05 19:13 ` [PATCH v2 15/15] mm, swap: use a single page for swap table when the size fits Kairui Song
@ 2025-09-06 15:48   ` Chris Li
  0 siblings, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-06 15:48 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

I did not notice any new changes, anyway.

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Fri, Sep 5, 2025 at 12:15 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We have a cluster size of 512 slots. Each slot consumes 8 bytes in the swap
> table, so the swap table size of each cluster is exactly one page (4K).
>
> If that condition is true, allocate one page directly and disable the slab
> cache to reduce the memory usage of the swap table and avoid fragmentation.
>
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> ---
>  mm/swap_table.h |  2 ++
>  mm/swapfile.c   | 50 ++++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 43 insertions(+), 9 deletions(-)
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index 52254e455304..ea244a57a5b7 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -11,6 +11,8 @@ struct swap_table {
>         atomic_long_t entries[SWAPFILE_CLUSTER];
>  };
>
> +#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
> +
>  /*
>   * A swap table entry represents the status of a swap slot on a swap
>   * (physical or virtual) device. The swap table in each cluster is a
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 49f93069faef..ab6e877b0644 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -430,6 +430,38 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
>         return cluster_index(si, ci) * SWAPFILE_CLUSTER;
>  }
>
> +static struct swap_table *swap_table_alloc(gfp_t gfp)
> +{
> +       struct folio *folio;
> +
> +       if (!SWP_TABLE_USE_PAGE)
> +               return kmem_cache_zalloc(swap_table_cachep, gfp);
> +
> +       folio = folio_alloc(gfp | __GFP_ZERO, 0);
> +       if (folio)
> +               return folio_address(folio);
> +       return NULL;
> +}
> +
> +static void swap_table_free_folio_rcu_cb(struct rcu_head *head)
> +{
> +       struct folio *folio;
> +
> +       folio = page_folio(container_of(head, struct page, rcu_head));
> +       folio_put(folio);
> +}
> +
> +static void swap_table_free(struct swap_table *table)
> +{
> +       if (!SWP_TABLE_USE_PAGE) {
> +               kmem_cache_free(swap_table_cachep, table);
> +               return;
> +       }
> +
> +       call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head),
> +                swap_table_free_folio_rcu_cb);
> +}
> +
>  static void swap_cluster_free_table(struct swap_cluster_info *ci)
>  {
>         unsigned int ci_off;
> @@ -443,7 +475,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
>         table = (void *)rcu_dereference_protected(ci->table, true);
>         rcu_assign_pointer(ci->table, NULL);
>
> -       kmem_cache_free(swap_table_cachep, table);
> +       swap_table_free(table);
>  }
>
>  /*
> @@ -467,8 +499,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>         lockdep_assert_held(&ci->lock);
>         lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
>
> -       table = kmem_cache_zalloc(swap_table_cachep,
> -                                 __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +       table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
>         if (table) {
>                 rcu_assign_pointer(ci->table, table);
>                 return ci;
> @@ -483,7 +514,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>         if (!(si->flags & SWP_SOLIDSTATE))
>                 spin_unlock(&si->global_cluster_lock);
>         local_unlock(&percpu_swap_cluster.lock);
> -       table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
> +       table = swap_table_alloc(__GFP_HIGH | GFP_KERNEL);
>
>         local_lock(&percpu_swap_cluster.lock);
>         if (!(si->flags & SWP_SOLIDSTATE))
> @@ -520,7 +551,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>
>  free_table:
>         if (table)
> -               kmem_cache_free(swap_table_cachep, table);
> +               swap_table_free(table);
>         return ci;
>  }
>
> @@ -738,7 +769,7 @@ static int inc_cluster_info_page(struct swap_info_struct *si,
>
>         ci = cluster_info + idx;
>         if (!ci->table) {
> -               table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
> +               table = swap_table_alloc(GFP_KERNEL);
>                 if (!table)
>                         return -ENOMEM;
>                 rcu_assign_pointer(ci->table, table);
> @@ -4075,9 +4106,10 @@ static int __init swapfile_init(void)
>          * only, and all swap cache readers (swap_cache_*) verify
>          * the content before use. So it's safe to use an RCU slab here.
>          */
> -       swap_table_cachep = kmem_cache_create("swap_table",
> -                           sizeof(struct swap_table),
> -                           0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
> +       if (!SWP_TABLE_USE_PAGE)
> +               swap_table_cachep = kmem_cache_create("swap_table",
> +                                   sizeof(struct swap_table),
> +                                   0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
>
>  #ifdef CONFIG_MIGRATION
>         if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
> --
> 2.51.0
>
>
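As a quick sanity check of the size arithmetic in the commit message (a sketch
only; the constants assume the 64-bit, 4K-page, 512-slot configuration
described above):

/* SWAPFILE_CLUSTER * sizeof(atomic_long_t) = 512 * 8 = 4096 = PAGE_SIZE,
 * so SWP_TABLE_USE_PAGE is compile-time true here and swapfile_init()
 * never creates the slab cache. On configurations where the product
 * differs, these asserts would not hold and the slab path is kept. */
static_assert(SWAPFILE_CLUSTER * sizeof(atomic_long_t) == PAGE_SIZE);
static_assert(sizeof(struct swap_table) == PAGE_SIZE);	/* SWP_TABLE_USE_PAGE */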


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-05 19:13 ` [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API Kairui Song
  2025-09-06 15:28   ` Chris Li
@ 2025-09-07 12:55   ` Klara Modin
  2025-09-08 14:34     ` Kairui Song
  2025-09-08 13:45   ` David Hildenbrand
  2025-09-10  2:53   ` SeongJae Park
  3 siblings, 1 reply; 80+ messages in thread
From: Klara Modin @ 2025-09-07 12:55 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On 2025-09-06 03:13:53 +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Introduce the basic swap table infrastructure, which is for now just a
> fixed-size flat array inside each swap cluster, with access wrappers.
> 
> Each cluster contains a swap table of 512 entries. Each table entry is
> an opaque atomic long. It can hold one of 3 types of value: a shadow type
> (XA_VALUE), a folio type (pointer), or NULL.
> 
> In this first step, it only supports storing a folio or shadow, and it
> is a drop-in replacement for the current swap cache. Convert all swap
> cache users to use the new set of APIs. Chris Li has been suggesting
> using a new infrastructure for swap cache for better performance, and
> that idea combined well with the swap table as the new backing
> structure. Now the lock contention range is reduced to 2M clusters,
> which is much smaller than the 64M address_space. And we can also drop
> the multiple address_space design.
> 
> All the internal work is done with swap_cache_get_* helpers. Swap
> cache lookup is still lockless like before, and the helpers' contexts
> are the same as for the original swap cache helpers. They still require a pin
> on the swap device to prevent the backing data from being freed.
> 
> Swap cache updates are now protected by the swap cluster lock
> instead of the Xarray lock. This is mostly handled internally, but new
> __swap_cache_* helpers require the caller to lock the cluster. So, a
> few new cluster access and locking helpers are also introduced.
> 
> A fully cluster-based unified swap table can be implemented on top
> of this to take care of all count tracking and synchronization work,
> with dynamic allocation. It should reduce the memory usage while
> making the performance even better.
> 
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  MAINTAINERS          |   1 +
>  include/linux/swap.h |   2 -
>  mm/huge_memory.c     |  13 +-
>  mm/migrate.c         |  19 ++-
>  mm/shmem.c           |   8 +-
>  mm/swap.h            | 157 +++++++++++++++++------
>  mm/swap_state.c      | 289 +++++++++++++++++++------------------------
>  mm/swap_table.h      |  97 +++++++++++++++
>  mm/swapfile.c        | 100 +++++++++++----
>  mm/vmscan.c          |  20 ++-
>  10 files changed, 458 insertions(+), 248 deletions(-)
>  create mode 100644 mm/swap_table.h
> 
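Before the diff, a quick illustration of the three entry states described
above, using the accessors this patch adds (a sketch with the cluster locked,
not code from the patch; xa_is_value() is the generic XArray helper):

unsigned long swp_tb = __swap_table_get(ci, swp_cluster_offset(entry));

if (swp_tb_is_null(swp_tb))
	/* free slot: no cached folio and no workingset shadow */;
else if (xa_is_value((void *)swp_tb))
	/* shadow value left behind after the folio was reclaimed */;
else
	/* pointer to the folio currently in the swap cache */;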
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1c8292c0318d..de402ca91a80 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16226,6 +16226,7 @@ F:	include/linux/swapops.h
>  F:	mm/page_io.c
>  F:	mm/swap.c
>  F:	mm/swap.h
> +F:	mm/swap_table.h
>  F:	mm/swap_state.c
>  F:	mm/swapfile.c
>  

...

> diff --git a/mm/swap.h b/mm/swap.h
> index a139c9131244..bf4e54f1f6b6 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -2,6 +2,7 @@
>  #ifndef _MM_SWAP_H
>  #define _MM_SWAP_H
>  
> +#include <linux/atomic.h> /* for atomic_long_t */
>  struct mempolicy;
>  struct swap_iocb;
>  
> @@ -35,6 +36,7 @@ struct swap_cluster_info {
>  	u16 count;
>  	u8 flags;
>  	u8 order;
> +	atomic_long_t *table;	/* Swap table entries, see mm/swap_table.h */
>  	struct list_head list;
>  };
>  
> @@ -55,6 +57,11 @@ enum swap_cluster_flags {

>  #include <linux/swapops.h> /* for swp_offset */

Now that swp_offset() is used in folio_index(), should this perhaps also be
included for !CONFIG_SWAP?

>  #include <linux/blk_types.h> /* for bio_end_io_t */
>  
> +static inline unsigned int swp_cluster_offset(swp_entry_t entry)
> +{
> +	return swp_offset(entry) % SWAPFILE_CLUSTER;
> +}
> +
>  /*
>   * Callers of all helpers below must ensure the entry, type, or offset is
>   * valid, and protect the swap device with reference count or locks.
> @@ -81,6 +88,25 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
>  	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
>  }
>  
> +static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
> +{
> +	return __swap_offset_to_cluster(__swap_entry_to_info(entry),
> +					swp_offset(entry));
> +}
> +
> +static __always_inline struct swap_cluster_info *__swap_cluster_lock(
> +		struct swap_info_struct *si, unsigned long offset, bool irq)
> +{
> +	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
> +
> +	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> +	if (irq)
> +		spin_lock_irq(&ci->lock);
> +	else
> +		spin_lock(&ci->lock);
> +	return ci;
> +}
> +
>  /**
>   * swap_cluster_lock - Lock and return the swap cluster of given offset.
>   * @si: swap device the cluster belongs to.
> @@ -92,11 +118,48 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
>  static inline struct swap_cluster_info *swap_cluster_lock(
>  		struct swap_info_struct *si, unsigned long offset)
>  {
> -	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
> +	return __swap_cluster_lock(si, offset, false);
> +}
>  
> -	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> -	spin_lock(&ci->lock);
> -	return ci;
> +static inline struct swap_cluster_info *__swap_cluster_lock_by_folio(
> +		const struct folio *folio, bool irq)
> +{
> +	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
> +	return __swap_cluster_lock(__swap_entry_to_info(folio->swap),
> +				   swp_offset(folio->swap), irq);
> +}
> +
> +/*
> + * swap_cluster_lock_by_folio - Locks the cluster that holds a folio's entries.
> + * @folio: The folio.
> + *
> + * This locks the swap cluster that contains a folio's swap entries. The
> + * swap entries of a folio are always in one single cluster, and a locked
> + * swap cache folio is enough to stabilize the entries and the swap device.
> + *
> + * Context: Caller must ensure the folio is locked and in the swap cache.
> + * Return: Pointer to the swap cluster.
> + */
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> +		const struct folio *folio)
> +{
> +	return __swap_cluster_lock_by_folio(folio, false);
> +}
> +
> +/*
> + * swap_cluster_lock_by_folio_irq - Locks the cluster that holds a folio's entries.
> + * @folio: The folio.
> + *
> + * Same as swap_cluster_lock_by_folio but also disable IRQ.
> + *
> + * Context: Caller must ensure the folio is locked and in the swap cache.
> + * Return: Pointer to the swap cluster.
> + */
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> +		const struct folio *folio)
> +{
> +	return __swap_cluster_lock_by_folio(folio, true);
>  }
>  
>  static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> @@ -104,6 +167,11 @@ static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
>  	spin_unlock(&ci->lock);
>  }
>  
> +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> +{
> +	spin_unlock_irq(&ci->lock);
> +}
> +
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> @@ -123,10 +191,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
>  #define SWAP_ADDRESS_SPACE_SHIFT	14
>  #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
>  #define SWAP_ADDRESS_SPACE_MASK		(SWAP_ADDRESS_SPACE_PAGES - 1)
> -extern struct address_space *swapper_spaces[];
> -#define swap_address_space(entry)			    \
> -	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> -		>> SWAP_ADDRESS_SPACE_SHIFT])
> +extern struct address_space swap_space;
> +static inline struct address_space *swap_address_space(swp_entry_t entry)
> +{
> +	return &swap_space;
> +}
>  
>  /*
>   * Return the swap device position of the swap entry.
> @@ -136,15 +205,6 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
>  	return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
>  }
>  
> -/*
> - * Return the swap cache index of the swap entry.
> - */
> -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> -{
> -	BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
> -	return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> -}
> -
>  /**
>   * folio_matches_swap_entry - Check if a folio matches a given swap entry.
>   * @folio: The folio.
> @@ -177,16 +237,15 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
>   */
>  struct folio *swap_cache_get_folio(swp_entry_t entry);
>  void *swap_cache_get_shadow(swp_entry_t entry);
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> -			 gfp_t gfp, void **shadow);
> +void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
>  void swap_cache_del_folio(struct folio *folio);
> -void __swap_cache_del_folio(struct folio *folio,
> -			    swp_entry_t entry, void *shadow);
> -void __swap_cache_replace_folio(struct address_space *address_space,
> -				swp_entry_t entry,
> -				struct folio *old, struct folio *new);
> -void swap_cache_clear_shadow(int type, unsigned long begin,
> -			     unsigned long end);
> +/* Below helpers require the caller to lock and pass in the swap cluster. */
> +void __swap_cache_del_folio(struct swap_cluster_info *ci,
> +			    struct folio *folio, swp_entry_t entry, void *shadow);
> +void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> +				swp_entry_t entry, struct folio *old,
> +				struct folio *new);
> +void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
>  
>  void show_swap_cache_info(void);
>  void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> @@ -254,6 +313,32 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>  
>  #else /* CONFIG_SWAP */
>  struct swap_iocb;
> +static inline struct swap_cluster_info *swap_cluster_lock(
> +	struct swap_info_struct *si, pgoff_t offset, bool irq)
> +{
> +	return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> +		struct folio *folio)
> +{
> +	return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> +		struct folio *folio)
> +{
> +	return NULL;
> +}
> +
> +static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> +{
> +}
> +
> +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> +{
> +}
> +
>  static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
>  {
>  	return NULL;
> @@ -271,11 +356,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
>  	return NULL;
>  }
>  
> -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> -{
> -	return 0;
> -}
> -
>  static inline bool folio_matches_swap_entry(const struct folio *folio, swp_entry_t entry)
>  {
>  	return false;
> @@ -322,17 +402,22 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
>  	return NULL;
>  }
>  
> -static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
> -				       gfp_t gfp, void **shadow)
> +static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
>  {
> -	return -EINVAL;
>  }
>  
>  static inline void swap_cache_del_folio(struct folio *folio)
>  {
>  }
>  
> -static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
> +static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
> +			    struct folio *folio, swp_entry_t entry, void *shadow)
> +{
> +}
> +
> +static inline void __swap_cache_replace_folio(
> +		struct swap_cluster_info *ci, swp_entry_t entry,
> +		struct folio *old, struct folio *new)
>  {
>  }
>  
> @@ -367,7 +452,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>  static inline pgoff_t folio_index(struct folio *folio)
>  {
>  	if (unlikely(folio_test_swapcache(folio)))

> -		return swap_cache_index(folio->swap);
> +		return swp_offset(folio->swap);

This is outside CONFIG_SWAP.

>  	return folio->index;
>  }
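A minimal sketch of one way to handle that, assuming the <linux/swapops.h>
include currently sits inside the CONFIG_SWAP section of mm/swap.h (not
necessarily the fix that will be picked):

/* mm/swap.h (sketch): hoist the include next to the existing atomic.h
 * include at the top, so swp_offset() is visible for !CONFIG_SWAP too. */
#include <linux/atomic.h>	/* for atomic_long_t */
#include <linux/swapops.h>	/* for swp_offset(), also used by folio_index() */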

...

Regards,
Klara Modin


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc
  2025-09-05 19:13 ` [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc Kairui Song
  2025-09-06  5:45   ` Chris Li
@ 2025-09-08  0:11   ` Barry Song
  2025-09-08  3:23   ` Baolin Wang
  2025-09-08 12:23   ` David Hildenbrand
  3 siblings, 0 replies; 80+ messages in thread
From: Barry Song @ 2025-09-08  0:11 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Sat, Sep 6, 2025 at 3:15 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> In preparation for replacing the swap cache backend with the swap table,
> clean up and add proper kernel doc for all swap cache APIs. Now all swap
> cache APIs are well-defined with consistent names.
>
> No feature change, only renaming and documenting.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Reviewed-by: Barry Song <baohua@kernel.org>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers
  2025-09-05 19:13 ` [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
  2025-09-06  2:13   ` Chris Li
@ 2025-09-08  3:03   ` Baolin Wang
  1 sibling, 0 replies; 80+ messages in thread
From: Baolin Wang @ 2025-09-08  3:03 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel



On 2025/9/6 03:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> No feature change, move cluster related definitions and helpers to
> mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock
> helpers, so they can be used outside of swap files. And while at it, add
> kerneldoc.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 03/15] mm, swap: fix swap cahe index error when retrying reclaim
  2025-09-05 19:13 ` [PATCH v2 03/15] mm, swap: fix swap cahe index error when retrying reclaim Kairui Song
  2025-09-05 22:40   ` Nhat Pham
  2025-09-06  1:51   ` Chris Li
@ 2025-09-08  3:08   ` Baolin Wang
  2025-09-08 11:45   ` David Hildenbrand
  3 siblings, 0 replies; 80+ messages in thread
From: Baolin Wang @ 2025-09-08  3:08 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel



On 2025/9/6 03:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The allocator will reclaim cached slots while scanning. Currently, it
> will try again if the reclaim found a folio that is already removed from
> the swap cache due to a race. But the following lookup will be using the
> wrong index. It won't cause any OOB issue since the swap cache index is
> truncated upon lookup, but it may lead to reclaiming of an irrelevant
> folio.
> 
> This should not cause a measurable issue, but we should fix it.
> 
> Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Good catch. LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

>   mm/swapfile.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4b8ab2cb49ca..4c63fc62f4cb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -240,13 +240,13 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>   	 * Offset could point to the middle of a large folio, or folio
>   	 * may no longer point to the expected offset before it's locked.
>   	 */
> -	entry = folio->swap;
> -	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> +	if (offset < swp_offset(folio->swap) ||
> +	    offset >= swp_offset(folio->swap) + nr_pages) {
>   		folio_unlock(folio);
>   		folio_put(folio);
>   		goto again;
>   	}
> -	offset = swp_offset(entry);
> +	offset = swp_offset(folio->swap);
>   
>   	need_reclaim = ((flags & TTRS_ANYWAY) ||
>   			((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 08/15] mm/shmem, swap: remove redundant error handling for replacing folio
  2025-09-05 19:13 ` [PATCH v2 08/15] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
@ 2025-09-08  3:17   ` Baolin Wang
  2025-09-08  9:28     ` Kairui Song
  0 siblings, 1 reply; 80+ messages in thread
From: Baolin Wang @ 2025-09-08  3:17 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel



On 2025/9/6 03:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Shmem may replace a folio in the swap cache if the cached one doesn't
> fit the swapin's GFP zone. When doing so, shmem has already double
> checked that the swap cache folio is locked, still has the swap cache
> flag set, and contains the wanted swap entry. So it is impossible to
> fail due to an Xarray mismatch. There is even a comment for that.
> 
> Delete the defensive error handling path, and add a WARN_ON instead:
> if that happened, something has broken the basic principle of how the
> swap cache works, we should catch and fix that.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/shmem.c | 42 ++++++++++++------------------------------
>   1 file changed, 12 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4e27e8e5da3b..cc6a0007c7a6 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1698,13 +1698,13 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   		}
>   
>   		/*
> -		 * The delete_from_swap_cache() below could be left for
> +		 * The swap_cache_del_folio() below could be left for
>   		 * shrink_folio_list()'s folio_free_swap() to dispose of;
>   		 * but I'm a little nervous about letting this folio out of
>   		 * shmem_writeout() in a hybrid half-tmpfs-half-swap state
>   		 * e.g. folio_mapping(folio) might give an unexpected answer.
>   		 */
> -		delete_from_swap_cache(folio);
> +		swap_cache_del_folio(folio);
>   		goto redirty;
>   	}

You should reorganize your patch set, as the swap_cache_del_folio() 
function is introduced in patch 9.

>   	if (nr_pages > 1)
> @@ -2082,7 +2082,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
>   	new->swap = entry;
>   
>   	memcg1_swapin(entry, nr_pages);
> -	shadow = get_shadow_from_swap_cache(entry);
> +	shadow = swap_cache_get_shadow(entry);

Ditto.

>   	if (shadow)
>   		workingset_refault(new, shadow);
>   	folio_add_lru(new);
> @@ -2158,35 +2158,17 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>   	/* Swap cache still stores N entries instead of a high-order entry */
>   	xa_lock_irq(&swap_mapping->i_pages);
>   	for (i = 0; i < nr_pages; i++) {
> -		void *item = xas_load(&xas);
> -
> -		if (item != old) {
> -			error = -ENOENT;
> -			break;
> -		}
> -
> -		xas_store(&xas, new);
> +		WARN_ON_ONCE(xas_store(&xas, new));
>   		xas_next(&xas);
>   	}
> -	if (!error) {
> -		mem_cgroup_replace_folio(old, new);
> -		shmem_update_stats(new, nr_pages);
> -		shmem_update_stats(old, -nr_pages);
> -	}
>   	xa_unlock_irq(&swap_mapping->i_pages);
>   
> -	if (unlikely(error)) {
> -		/*
> -		 * Is this possible?  I think not, now that our callers
> -		 * check both the swapcache flag and folio->private
> -		 * after getting the folio lock; but be defensive.
> -		 * Reverse old to newpage for clear and free.
> -		 */
> -		old = new;
> -	} else {
> -		folio_add_lru(new);
> -		*foliop = new;
> -	}
> +	mem_cgroup_replace_folio(old, new);
> +	shmem_update_stats(new, nr_pages);
> +	shmem_update_stats(old, -nr_pages);
> +
> +	folio_add_lru(new);
> +	*foliop = new;
>   
>   	folio_clear_swapcache(old);
>   	old->private = NULL;
> @@ -2220,7 +2202,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>   	nr_pages = folio_nr_pages(folio);
>   	folio_wait_writeback(folio);
>   	if (!skip_swapcache)
> -		delete_from_swap_cache(folio);
> +		swap_cache_del_folio(folio);
>   	/*
>   	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
>   	 * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
> @@ -2459,7 +2441,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>   		folio->swap.val = 0;
>   		swapcache_clear(si, swap, nr_pages);
>   	} else {
> -		delete_from_swap_cache(folio);
> +		swap_cache_del_folio(folio);
>   	}
>   	folio_mark_dirty(folio);
>   	swap_free_nr(swap, nr_pages);



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc
  2025-09-05 19:13 ` [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc Kairui Song
  2025-09-06  5:45   ` Chris Li
  2025-09-08  0:11   ` Barry Song
@ 2025-09-08  3:23   ` Baolin Wang
  2025-09-08 12:23   ` David Hildenbrand
  3 siblings, 0 replies; 80+ messages in thread
From: Baolin Wang @ 2025-09-08  3:23 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel



On 2025/9/6 03:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> In preparation for replacing the swap cache backend with the swap table,
> clean up and add proper kernel doc for all swap cache APIs. Now all swap
> cache APIs are well-defined with consistent names.
> 
> No feature change, only renaming and documenting.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

With the shmem parts cleanup moved into this patch:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-05 19:13 ` [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper Kairui Song
  2025-09-06  7:09   ` Chris Li
@ 2025-09-08  3:41   ` Baolin Wang
  2025-09-08 10:44     ` Kairui Song
  2025-09-08 12:30   ` David Hildenbrand
  2 siblings, 1 reply; 80+ messages in thread
From: Baolin Wang @ 2025-09-08  3:41 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel



On 2025/9/6 03:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> There are currently three swap cache users that are trying to replace an
> existing folio with a new one: huge memory splitting, migration, and
> shmem replacement. What they are doing is quite similar.
> 
> Introduce a common helper for this. In later commits, they can be easily
> switched to use the swap table by updating this helper.
> 
> The newly added helper also makes the swap cache API better defined, and
> debugging is easier.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>   mm/huge_memory.c |  5 ++---
>   mm/migrate.c     | 11 +++--------
>   mm/shmem.c       | 10 ++--------
>   mm/swap.h        |  3 +++
>   mm/swap_state.c  | 32 ++++++++++++++++++++++++++++++++
>   5 files changed, 42 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd7418..a4d192c8d794 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3798,9 +3798,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   			 * NOTE: shmem in swap cache is not supported yet.
>   			 */
>   			if (swap_cache) {
> -				__xa_store(&swap_cache->i_pages,
> -					   swap_cache_index(new_folio->swap),
> -					   new_folio, 0);
> +				__swap_cache_replace_folio(swap_cache, new_folio->swap,
> +							   folio, new_folio);
>   				continue;
>   			}

IIUC, it doesn't seem like a simple function replacement here. It 
appears that the original code has a bug: if the 'new_folio' is a large 
folio after split, we need to iterate over each swap entry of the large 
swapcache folio and then restore the new 'new_folio'.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 08/15] mm/shmem, swap: remove redundant error handling for replacing folio
  2025-09-08  3:17   ` Baolin Wang
@ 2025-09-08  9:28     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-08  9:28 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 2:04 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/9/6 03:13, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Shmem may replace a folio in the swap cache if the cached one doesn't
> > fit the swapin's GFP zone. When doing so, shmem has already double
> > checked that the swap cache folio is locked, still has the swap cache
> > flag set, and contains the wanted swap entry. So it is impossible to
> > fail due to an Xarray mismatch. There is even a comment for that.
> >
> > Delete the defensive error handling path, and add a WARN_ON instead:
> > if that happened, something has broken the basic principle of how the
> > swap cache works, we should catch and fix that.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> > ---
> >   mm/shmem.c | 42 ++++++++++++------------------------------
> >   1 file changed, 12 insertions(+), 30 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 4e27e8e5da3b..cc6a0007c7a6 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1698,13 +1698,13 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >               }
> >
> >               /*
> > -              * The delete_from_swap_cache() below could be left for
> > +              * The swap_cache_del_folio() below could be left for
> >                * shrink_folio_list()'s folio_free_swap() to dispose of;
> >                * but I'm a little nervous about letting this folio out of
> >                * shmem_writeout() in a hybrid half-tmpfs-half-swap state
> >                * e.g. folio_mapping(folio) might give an unexpected answer.
> >                */
> > -             delete_from_swap_cache(folio);
> > +             swap_cache_del_folio(folio);
> >               goto redirty;
> >       }
>
> You should reorganize your patch set, as the swap_cache_del_folio()
> function is introduced in patch 9.
>
> >       if (nr_pages > 1)
> > @@ -2082,7 +2082,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> >       new->swap = entry;
> >
> >       memcg1_swapin(entry, nr_pages);
> > -     shadow = get_shadow_from_swap_cache(entry);
> > +     shadow = swap_cache_get_shadow(entry);
>
> Ditto.
>
> >       if (shadow)
> >               workingset_refault(new, shadow);
> >       folio_add_lru(new);
> > @@ -2158,35 +2158,17 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> >       /* Swap cache still stores N entries instead of a high-order entry */
> >       xa_lock_irq(&swap_mapping->i_pages);
> >       for (i = 0; i < nr_pages; i++) {
> > -             void *item = xas_load(&xas);
> > -
> > -             if (item != old) {
> > -                     error = -ENOENT;
> > -                     break;
> > -             }
> > -
> > -             xas_store(&xas, new);
> > +             WARN_ON_ONCE(xas_store(&xas, new));
> >               xas_next(&xas);
> >       }
> > -     if (!error) {
> > -             mem_cgroup_replace_folio(old, new);
> > -             shmem_update_stats(new, nr_pages);
> > -             shmem_update_stats(old, -nr_pages);
> > -     }
> >       xa_unlock_irq(&swap_mapping->i_pages);
> >
> > -     if (unlikely(error)) {
> > -             /*
> > -              * Is this possible?  I think not, now that our callers
> > -              * check both the swapcache flag and folio->private
> > -              * after getting the folio lock; but be defensive.
> > -              * Reverse old to newpage for clear and free.
> > -              */
> > -             old = new;
> > -     } else {
> > -             folio_add_lru(new);
> > -             *foliop = new;
> > -     }
> > +     mem_cgroup_replace_folio(old, new);
> > +     shmem_update_stats(new, nr_pages);
> > +     shmem_update_stats(old, -nr_pages);
> > +
> > +     folio_add_lru(new);
> > +     *foliop = new;
> >
> >       folio_clear_swapcache(old);
> >       old->private = NULL;
> > @@ -2220,7 +2202,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> >       nr_pages = folio_nr_pages(folio);
> >       folio_wait_writeback(folio);
> >       if (!skip_swapcache)
> > -             delete_from_swap_cache(folio);
> > +             swap_cache_del_folio(folio);
> >       /*
> >        * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
> >        * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
> > @@ -2459,7 +2441,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >               folio->swap.val = 0;
> >               swapcache_clear(si, swap, nr_pages);
> >       } else {
> > -             delete_from_swap_cache(folio);
> > +             swap_cache_del_folio(folio);

Oh, you are right. Either that, or I should keep the delete_from_swap_cache()
call here.

Let me just rebase and move this patch to later in the series then. Thanks!

> >       }
> >       folio_mark_dirty(folio);
> >       swap_free_nr(swap, nr_pages);
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-08  3:41   ` Baolin Wang
@ 2025-09-08 10:44     ` Kairui Song
  2025-09-09  1:18       ` Baolin Wang
  0 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-08 10:44 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 11:52 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/9/6 03:13, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > There are currently three swap cache users that are trying to replace an
> > existing folio with a new one: huge memory splitting, migration, and
> > shmem replacement. What they are doing is quite similar.
> >
> > Introduce a common helper for this. In later commits, they can be easily
> > switched to use the swap table by updating this helper.
> >
> > The newly added helper also makes the swap cache API better defined, and
> > debugging is easier.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >   mm/huge_memory.c |  5 ++---
> >   mm/migrate.c     | 11 +++--------
> >   mm/shmem.c       | 10 ++--------
> >   mm/swap.h        |  3 +++
> >   mm/swap_state.c  | 32 ++++++++++++++++++++++++++++++++
> >   5 files changed, 42 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 26cedfcd7418..a4d192c8d794 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3798,9 +3798,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >                        * NOTE: shmem in swap cache is not supported yet.
> >                        */
> >                       if (swap_cache) {
> > -                             __xa_store(&swap_cache->i_pages,
> > -                                        swap_cache_index(new_folio->swap),
> > -                                        new_folio, 0);
> > +                             __swap_cache_replace_folio(swap_cache, new_folio->swap,
> > +                                                        folio, new_folio);
> >                               continue;
> >                       }
>
> IIUC, it doesn't seem like a simple function replacement here. It
> appears that the original code has a bug: if the 'new_folio' is a large
> folio after split, we need to iterate over each swap entry of the large
> swapcache folio and then restore the new 'new_folio'.
>

That should be OK. We have a check in uniform_split_supported and
non_uniform_split_supported that a swapcache folio can only be split
into order 0. And it seems there is no support for splitting a pure
swapcache folio right now.

Maybe we can try to enable and make use of higher-order splits for
the swap cache after this series. I just tried some hackish code to
split random folios in the swap cache to a larger order, and it seems
fine after this series.
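
Just to illustrate the point: if splits above order 0 were ever enabled for
swap cache folios, the replace helper would need a per-entry walk, roughly
like the sketch below (purely hypothetical; folio_to_swp_tb() is an assumed
encoding helper, and the current series only needs the single-entry case):

/* Hypothetical sketch, mirroring the shmem_replace_folio() loop quoted
 * earlier in the thread: every slot covered by 'new' gets repointed. */
for (i = 0; i < folio_nr_pages(new); i++) {
	unsigned int ci_off = swp_cluster_offset(new->swap) + i;

	__swap_table_xchg(ci, ci_off, folio_to_swp_tb(new));
}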


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up
  2025-09-05 19:13 ` [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up Kairui Song
  2025-09-05 23:59   ` Chris Li
@ 2025-09-08 11:43   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 11:43 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 05.09.25 21:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The swap cache lookup helper swap_cache_get_folio currently does
> readahead updates as well, so callers that are not doing swapin from any
> VMA or mapping are forced to reuse filemap helpers instead, and have to
> access the swap cache space directly.
> 
> So decouple readahead update with swap cache lookup. Move the readahead
> update part into a standalone helper. Let the caller call the readahead
> update helper if they do readahead. And convert all swap cache lookups
> to use swap_cache_get_folio.
> 
> After this commit, there are only three special cases for accessing swap
> cache space now: huge memory splitting, migration, and shmem replacing,
> because they need to lock the XArray. The following commits will wrap
> their accesses to the swap cache too, with special helpers.
> 
> And worth noting, currently dropbehind is not supported for anon folio,
> and we will never see a dropbehind folio in swap cache. The unified
> helper can be updated later to handle that.
> 
> While at it, add proper kernedoc for touched helpers.
> 
> No functional change.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Barry Song <baohua@kernel.org>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 03/15] mm, swap: fix swap cahe index error when retrying reclaim
  2025-09-05 19:13 ` [PATCH v2 03/15] mm, swap: fix swap cahe index error when retrying reclaim Kairui Song
                     ` (2 preceding siblings ...)
  2025-09-08  3:08   ` Baolin Wang
@ 2025-09-08 11:45   ` David Hildenbrand
  3 siblings, 0 replies; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 11:45 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 05.09.25 21:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>

Subject: s/cahe/cache/

> 
> The allocator will reclaim cached slots while scanning. Currently, it
> will try again if the reclaim found a folio that is already removed from

s/the reclaim/reclaim/

> the swap cache due to a race. But the following lookup will be using the
> wrong index. It won't cause any OOB issue since the swap cache index is
> truncated upon lookup, but it may lead to reclaiming of an irrelevant
> folio.
> 
> This should not cause a measurable issue, but we should fix it.
> 
> Fixes: fae8595505313 ("mm, swap: avoid reclaiming irrelevant swap cache")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>   mm/swapfile.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4b8ab2cb49ca..4c63fc62f4cb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -240,13 +240,13 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>   	 * Offset could point to the middle of a large folio, or folio
>   	 * may no longer point to the expected offset before it's locked.
>   	 */
> -	entry = folio->swap;
> -	if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> +	if (offset < swp_offset(folio->swap) ||
> +	    offset >= swp_offset(folio->swap) + nr_pages) {
>   		folio_unlock(folio);
>   		folio_put(folio);
>   		goto again;
>   	}
> -	offset = swp_offset(entry);
> +	offset = swp_offset(folio->swap);
>   
>   	need_reclaim = ((flags & TTRS_ANYWAY) ||
>   			((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 04/15] mm, swap: check page poison flag after locking it
  2025-09-05 19:13 ` [PATCH v2 04/15] mm, swap: check page poison flag after locking it Kairui Song
  2025-09-06  2:00   ` Chris Li
@ 2025-09-08 12:11   ` David Hildenbrand
  2025-09-09 14:54     ` Kairui Song
  1 sibling, 1 reply; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 12:11 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 05.09.25 21:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Instead of checking the poison flag only in the fast swap cache lookup
> path, always check the poison flag after locking a swap cache folio.
> 
> There are two reasons to do so.
> 
> The folio is unstable and could be removed from the swap cache anytime,
> so it's totally possible that the folio is no longer the backing folio
> of a swap entry, and could be an irrelevant poisoned folio. We might
> mistakenly kill a faulting process.
> 
> And it's totally possible or even common for the slow swap-in path
> (swapin_readahead) to bring in a cached folio. The cached folio could be
> poisoned, too. Only checking the poison flag in the fast path will miss
> such folios.
> 
> The race window is tiny, so it's very unlikely to happen, though.
> While at it, also add an unlikely prefix.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>   mm/memory.c | 22 +++++++++++-----------
>   1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 10ef528a5f44..94a5928e8ace 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4661,10 +4661,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   		goto out;
>   
>   	folio = swap_cache_get_folio(entry);
> -	if (folio) {
> +	if (folio)
>   		swap_update_readahead(folio, vma, vmf->address);
> -		page = folio_file_page(folio, swp_offset(entry));
> -	}
>   	swapcache = folio;
>   
>   	if (!folio) {
> @@ -4735,20 +4733,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   		ret = VM_FAULT_MAJOR;
>   		count_vm_event(PGMAJFAULT);
>   		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> -		page = folio_file_page(folio, swp_offset(entry));
> -	} else if (PageHWPoison(page)) {
> -		/*
> -		 * hwpoisoned dirty swapcache pages are kept for killing
> -		 * owner processes (which may be unknown at hwpoison time)
> -		 */
> -		ret = VM_FAULT_HWPOISON;
> -		goto out_release;
>   	}
>   
>   	ret |= folio_lock_or_retry(folio, vmf);
>   	if (ret & VM_FAULT_RETRY)
>   		goto out_release;
>   
> +	page = folio_file_page(folio, swp_offset(entry));
>   	if (swapcache) {
>   		/*
>   		 * Make sure folio_free_swap() or swapoff did not release the
> @@ -4761,6 +4752,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   			     page_swap_entry(page).val != entry.val))
>   			goto out_page;
>   
> +		if (unlikely(PageHWPoison(page))) {
> +			/*
> +			 * hwpoisoned dirty swapcache pages are kept for killing
> +			 * owner processes (which may be unknown at hwpoison time)
> +			 */
> +			ret = VM_FAULT_HWPOISON;
> +			goto out_page;
> +		}
> +
>   		/*
>   		 * KSM sometimes has to copy on read faults, for example, if
>   		 * folio->index of non-ksm folios would be nonlinear inside the

LGTM, but I was wondering whether we want to do that check even when
we just allocated a fresh folio, for simplicity. The check is cheap ...


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use
  2025-09-05 19:13 ` [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use Kairui Song
  2025-09-06  2:12   ` Chris Li
@ 2025-09-08 12:18   ` David Hildenbrand
  2025-09-09 14:58     ` Kairui Song
  1 sibling, 1 reply; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 12:18 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel


>   
>   		folio_lock(folio);
> +		if (!folio_matches_swap_entry(folio, entry)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			continue;
> +		}
> +

I wonder if we should put that into unuse_pte() instead. It checks for 
other types of races (like the page table entry getting modified) already.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers
  2025-09-05 19:13 ` [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers Kairui Song
  2025-09-06  2:14   ` Chris Li
@ 2025-09-08 12:21   ` David Hildenbrand
  2025-09-08 15:01     ` Kairui Song
  1 sibling, 1 reply; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 12:21 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 05.09.25 21:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> swp_swap_info is the most commonly used helper for retrieving swap info.
> It has an internal check that may lead to a NULL return value, but
> almost none of its callers check the return value, making the internal
> check pointless. In fact, most of these callers already ensured the
> entry is valid and never expect a NULL value.
> 
> Tidy this up and shorten the name. If the caller can make sure the

"Tidy this up and improve the function names." ?

> swap entry/type is valid and the device is pinned, use the newly introduced
> __swap_entry_to_info/__swap_type_to_info instead. They have more debug
> sanity checks and lower overhead as they are inlined.
> 
> Callers that may expect a NULL value should use
> swap_entry_to_info/swap_type_to_info instead.
> 
> No feature change. The rearranged code should have no effect; otherwise,
> those paths would already have been hitting NULL de-ref bugs. Only some new
> sanity checks are added, so potential issues may show up in debug builds.
> 
> The new helpers will be frequently used with the swap table later when working
> with swap cache folios. A locked swap cache folio ensures the entries are
> valid and stable so these helpers are very helpful.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Barry Song <baohua@kernel.org>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread
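
A rough sketch of the helper split described in the commit message above;
only the helper names come from the patch, the bodies below are illustrative
guesses:

	/* Fast variant: the caller guarantees the type is valid and the
	 * device is pinned, so only debug sanity checks remain
	 * (sketch, not the real code). */
	static inline struct swap_info_struct *__swap_type_to_info(int type)
	{
		VM_WARN_ON_ONCE(type >= MAX_SWAPFILES);
		return READ_ONCE(swap_info[type]);
	}

	/* Checked variant: for callers that may legitimately see an invalid
	 * type and want a NULL back instead (sketch). */
	static inline struct swap_info_struct *swap_type_to_info(int type)
	{
		if (type >= MAX_SWAPFILES)
			return NULL;
		return READ_ONCE(swap_info[type]);	/* may still be NULL */
	}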

* Re: [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc
  2025-09-05 19:13 ` [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc Kairui Song
                     ` (2 preceding siblings ...)
  2025-09-08  3:23   ` Baolin Wang
@ 2025-09-08 12:23   ` David Hildenbrand
  3 siblings, 0 replies; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 12:23 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 05.09.25 21:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> In preparation for replacing the swap cache backend with the swap table,
> clean up and add proper kernel doc for all swap cache APIs. Now all swap
> cache APIs are well-defined with consistent names.
> 
> No feature change, only renaming and documenting.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-05 19:13 ` [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper Kairui Song
  2025-09-06  7:09   ` Chris Li
  2025-09-08  3:41   ` Baolin Wang
@ 2025-09-08 12:30   ` David Hildenbrand
  2025-09-08 14:20     ` Kairui Song
  2 siblings, 1 reply; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 12:30 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel


>   
> +/**
> + * __swap_cache_replace_folio - Replace a folio in the swap cache.
> + * @mapping: Swap mapping address space.
> + * @entry: The first swap entry that the new folio corresponds to.
> + * @old: The old folio to be replaced.
> + * @new: The new folio.
> + *
> + * Replace an existing folio in the swap cache with a new folio.
> + *
> + * Context: Caller must ensure both folios are locked, and lock the
> + * swap address_space that holds the entries to be replaced.
> + */
> +void __swap_cache_replace_folio(struct address_space *mapping,
> +				swp_entry_t entry,
> +				struct folio *old, struct folio *new)

Can't we just use "new->swap.val" directly and avoid passing in the 
entry, documenting that new->swap.val must be setup properly in advance?

Similarly, can't we obtain "mapping" from new?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread
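
The two shapes being weighed here, side by side; the second prototype is
hypothetical and only spells out the suggestion:

	/* As posted in the patch: the caller passes the target entry
	 * explicitly. */
	void __swap_cache_replace_folio(struct address_space *mapping,
					swp_entry_t entry,
					struct folio *old, struct folio *new);

	/* Suggested alternative: derive both from the new folio, documenting
	 * that new->swap must already be set up before the call
	 * (hypothetical). */
	void __swap_cache_replace_folio(struct folio *old, struct folio *new);
	/* where, inside the helper:
	 *	swp_entry_t entry = new->swap;
	 *	struct address_space *mapping = swap_address_space(entry);
	 */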

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-05 19:13 ` [PATCH v2 01/15] docs/mm: add document for swap table Kairui Song
  2025-09-05 23:58   ` Chris Li
@ 2025-09-08 12:35   ` Baoquan He
  2025-09-08 14:27     ` Kairui Song
  2025-09-08 15:01     ` Chris Li
  1 sibling, 2 replies; 80+ messages in thread
From: Baoquan He @ 2025-09-08 12:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 09/06/25 at 03:13am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> From: Chris Li <chrisl@kernel.org>

'From author <authorkernel.org>' can only be one person, and the co-author
should be specified by "Co-developed-by:" and "Signed-off-by:"?

> 
> Swap table is the new swap cache.
> 
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  Documentation/mm/swap-table.rst | 72 +++++++++++++++++++++++++++++++++
>  MAINTAINERS                     |  1 +
>  2 files changed, 73 insertions(+)
>  create mode 100644 Documentation/mm/swap-table.rst
> 
> diff --git a/Documentation/mm/swap-table.rst b/Documentation/mm/swap-table.rst
> new file mode 100644
> index 000000000000..929cd91aa984
> --- /dev/null
> +++ b/Documentation/mm/swap-table.rst
> @@ -0,0 +1,72 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
> +
> +==========
> +Swap Table
> +==========
> +
> +Swap table implements swap cache as a per-cluster swap cache value array.
> +
> +Swap Entry
> +----------
> +
> +A swap entry contains the information required to serve the anonymous page
> +fault.
> +
> +Swap entry is encoded as two parts: swap type and swap offset.
> +
> +The swap type indicates which swap device to use.
> +The swap offset is the offset of the swap file to read the page data from.
> +
> +Swap Cache
> +----------
> +
> +Swap cache is a map to look up folios using a swap entry as the key. The
> +result value can have three possible types depending on which stage this
> +swap entry is in.
> +
> +1. NULL: This swap entry is not used.
> +
> +2. folio: A folio has been allocated and bound to this swap entry. This is
> +   the transient state of swap out or swap in. The data can be in the
> +   folio, the swap file, or both.
> +
> +3. shadow: The shadow contains the working set information of the
> +   swapped-out folio. This is the normal state for a swapped-out page.
> +
> +Swap Table
> +----------
> +
> +The previous swap cache was implemented with an XArray. The XArray is a
> +tree structure, so each lookup will go through multiple nodes. Can we do
> +better?
> +
> +Notice that most of the time when we look up the swap cache, we are either
> +in a swap-in or swap-out path, so we should already have the swap cluster
> +which contains the swap entry.
> +
> +If we have a per-cluster array to store the swap cache values of that
> +cluster, swap cache lookup within the cluster becomes a very simple array
> +lookup.
> +
> +We give such a per-cluster swap cache value array a name: the swap table.
> +
> +Each swap cluster contains 512 entries, so a swap table stores one
> +cluster's worth of swap cache values, which is exactly one page. This is
> +not coincidental, because the cluster size is determined by the huge page
> +size. The swap table holds an array of pointers, and each pointer has the
> +same size as a PTE. The size of the swap table should therefore match the
> +second to last level page table page: exactly one page.
> +
> +With the swap table, swap cache lookup achieves better locality and
> +becomes simpler and faster.
> +
> +Locking
> +-------
> +
> +Swap table modification requires taking the cluster lock. If a folio
> +is being added to or removed from the swap table, the folio must be
> +locked prior to the cluster lock. After adding or removing is done, the
> +folio shall be unlocked.
> +
> +Swap table lookup is protected by RCU and atomic read. If the lookup
> +returns a folio, the user must lock the folio before use.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ec19be6c9917..1c8292c0318d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16219,6 +16219,7 @@ R:	Barry Song <baohua@kernel.org>
>  R:	Chris Li <chrisl@kernel.org>
>  L:	linux-mm@kvack.org
>  S:	Maintained
> +F:	Documentation/mm/swap-table.rst
>  F:	include/linux/swap.h
>  F:	include/linux/swapfile.h
>  F:	include/linux/swapops.h
> -- 
> 2.51.0
> 



^ permalink raw reply	[flat|nested] 80+ messages in thread
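
For readers skimming the document above, here is a minimal userspace sketch
of the per-cluster idea: one flat array per cluster whose slots are either
NULL, a folio pointer, or a tagged shadow value. All names and the tagging
scheme below are simplified stand-ins, not the kernel implementation:

	#include <assert.h>
	#include <stddef.h>

	#define DEMO_CLUSTER_SIZE 512	/* one swap table covers one cluster */

	struct demo_folio { int dummy; };

	struct demo_cluster {
		/* each slot: 0 (NULL), a folio pointer, or a shadow (tagged) */
		unsigned long table[DEMO_CLUSTER_SIZE];
	};

	static int demo_is_shadow(unsigned long slot)
	{
		return slot & 1UL;	/* shadows are odd, like XA_VALUE entries */
	}

	static struct demo_folio *demo_lookup(struct demo_cluster *ci,
					      unsigned int off)
	{
		unsigned long slot;

		assert(off < DEMO_CLUSTER_SIZE);
		slot = ci->table[off];	/* the real code uses an atomic/RCU read */
		if (!slot || demo_is_shadow(slot))
			return NULL;	/* empty slot, or only a workingset shadow */
		return (struct demo_folio *)slot;
	}

The point of the layout is that, once the cluster is known, a lookup is a
single bounded array access instead of an XArray tree walk.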

* Re: [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check
  2025-09-05 19:13 ` [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check Kairui Song
  2025-09-06 15:35   ` Chris Li
@ 2025-09-08 13:10   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 13:10 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 05.09.25 21:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Swap cache is now backed by swap table, and the address space is not
> holding any mutable data anymore. And swap cache is now protected by
> the swap cluster lock, instead of the XArray lock. All accesses to the swap
> cache are wrapped by swap cache helpers. Locking is mostly handled
> internally by swap cache helpers, only a few __swap_cache_* helpers
> require the caller to lock the cluster by themselves.
> 
> Worth noting that, unlike XArray, the cluster lock is not IRQ safe.
> The swap cache was already very different from filemap, and now it's
> completely separated from it. Nothing wants to mark or change
> anything or do a writeback callback in IRQ.
> 
> So explicitly document this and add a debug check to avoid further
> potential misuse. And mark the swap cache space as read-only to avoid
> any user wrongly mixing unexpected filemap helpers with swap cache.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache
  2025-09-05 19:13 ` [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache Kairui Song
  2025-09-06 15:30   ` Chris Li
@ 2025-09-08 13:12   ` David Hildenbrand
  1 sibling, 0 replies; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 13:12 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel, kernel test robot

On 05.09.25 21:13, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Swap cluster setup will try to shuffle the clusters on initialization.
> It was helpful to avoid contention for the swap cache space. The cluster
> size (2M) was much smaller than each swap cache space (64M), so
> shuffling the cluster means the allocator will try to allocate swap
> slots that are in different swap cache spaces for each CPU, reducing the
> chance of two CPUs using the same swap cache space, and hence reducing
> the contention.
> 
> Now that the swap cache is managed by swap clusters, this shuffle is
> pointless. Just remove it, and clean up the related macros.
> 
> This also improves HDD swap performance, as shuffling IO is a bad
> idea for HDD, and now the shuffling is gone. Tests have shown a ~40%
> performance gain for HDD [1]:
> 
> Doing sequential swap in of 8G data using 8 processes with usemem,
> average of 3 test runs:
> 
> Before: 1270.91 KB/s per process
> After:  1849.54 KB/s per process
> 
> Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1]
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Reviewed-by: Barry Song <baohua@kernel.org>

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-05 19:13 ` [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API Kairui Song
  2025-09-06 15:28   ` Chris Li
  2025-09-07 12:55   ` Klara Modin
@ 2025-09-08 13:45   ` David Hildenbrand
  2025-09-08 15:14     ` Kairui Song
  2025-09-10  2:53   ` SeongJae Park
  3 siblings, 1 reply; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 13:45 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel


> +static inline struct swap_cluster_info *swap_cluster_lock(
> +	struct swap_info_struct *si, pgoff_t offset, bool irq)
> +{
> +	return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> +		struct folio *folio)

I would probably call that "swap_cluster_get_and_lock" or sth like that ...

> +{
> +	return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> +		struct folio *folio)
> +{

... then this would become "swap_cluster_get_and_lock_irq"


Alterantively, separate the lookup from the locking of the cluster.

swap_cluster_from_folio() / folio_get_swap_cluster()
swap_cluster_lock()
swap_cluster_lock_irq()

Which might look cleaner in the end.

[...]

> -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
> -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
> +struct address_space swap_space __read_mostly = {
> +	.a_ops = &swap_aops,
> +};
> +
>   static bool enable_vma_readahead __read_mostly = true;
>   
>   #define SWAP_RA_ORDER_CEILING	5
> @@ -83,11 +86,21 @@ void show_swap_cache_info(void)
>    */
>   struct folio *swap_cache_get_folio(swp_entry_t entry)
>   {
> -	struct folio *folio = filemap_get_folio(swap_address_space(entry),
> -						swap_cache_index(entry));
> -	if (IS_ERR(folio))
> -		return NULL;
> -	return folio;
> +

^ superfluous empty line.

[...]

>   
> @@ -420,6 +421,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
>   	return cluster_index(si, ci) * SWAPFILE_CLUSTER;
>   }
>   
> +static int swap_table_alloc_table(struct swap_cluster_info *ci)

swap_cluster_alloc_table ?

> +{
> +	WARN_ON(ci->table);
> +	ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> +	if (!ci->table)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> +{
> +	unsigned int ci_off;
> +	unsigned long swp_tb;
> +
> +	if (!ci->table)
> +		return;
> +
> +	for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> +		swp_tb = __swap_table_get(ci, ci_off);
> +		if (!swp_tb_is_null(swp_tb))
> +			pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> +				    swp_tb);
> +	}
> +
> +	kfree(ci->table);
> +	ci->table = NULL;
> +}
> +
>   static void move_cluster(struct swap_info_struct *si,
>   			 struct swap_cluster_info *ci, struct list_head *list,
>   			 enum swap_cluster_flags new_flags)
> @@ -702,6 +731,26 @@ static bool cluster_scan_range(struct swap_info_struct *si,
>   	return true;
>   }
>   
> +/*
> + * Currently, the swap table is not used for count tracking, just
> + * do a sanity check here to ensure nothing leaked, so the swap
> + * table should be empty upon freeing.
> + */
> +static void cluster_table_check(struct swap_cluster_info *ci,
> +				unsigned int start, unsigned int nr)

"swap_cluster_assert_table_empty()"

or sth like that that makes it clearer what you are checking for.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread
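
To make the alternative naming above concrete, a call site would read
something like the following; every helper name here is speculative and only
follows the shape sketched in the review:

	struct swap_cluster_info *ci;

	/* step 1: resolve the cluster backing this swap cache folio
	 * (lookup only) */
	ci = swap_cluster_from_folio(folio);

	/* step 2: take the cluster lock as a separate, explicit step */
	swap_cluster_lock(ci);		/* or swap_cluster_lock_irq(ci) */

	/* ... operate on this cluster's swap table ... */

	swap_cluster_unlock(ci);	/* matching unlock, also speculative */

as opposed to the combined swap_cluster_lock_by_folio(folio) in the posted
patch.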

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-08 12:30   ` David Hildenbrand
@ 2025-09-08 14:20     ` Kairui Song
  2025-09-08 14:39       ` David Hildenbrand
  0 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-08 14:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 8:35 PM David Hildenbrand <david@redhat.com> wrote:
>
>
> >
> > +/**
> > + * __swap_cache_replace_folio - Replace a folio in the swap cache.
> > + * @mapping: Swap mapping address space.
> > + * @entry: The first swap entry that the new folio corresponds to.
> > + * @old: The old folio to be replaced.
> > + * @new: The new folio.
> > + *
> > + * Replace a existing folio in the swap cache with a new folio.
> > + *
> > + * Context: Caller must ensure both folios are locked, and lock the
> > + * swap address_space that holds the entries to be replaced.
> > + */
> > +void __swap_cache_replace_folio(struct address_space *mapping,
> > +                             swp_entry_t entry,
> > +                             struct folio *old, struct folio *new)
>
> Can't we just use "new->swap.val" directly and avoid passing in the
> entry, documenting that new->swap.val must be setup properly in advance?

Thanks for the suggestion.

I was thinking about the opposite. I think maybe it's better that the
caller never sets the new folio's entry value, so folio->swap is always
modified in mm/swap_state.c, and let __swap_cache_replace_folio set
new->swap, to make it easier to track the folio->swap
usage.

This can be done easily for migration and shmem parts, the huge split
code will need a bit more cleanup.

It's a trivial change I think. But letting __swap_cache_replace_folio
setup new's swap and flags may deduplicate some code. So I thought
maybe this can be better cleaned up later. So for now I just add a
debug check here that `entry == new->swap`.

And the debug check does imply that we can just drop the entry params
in this patch, there will be no feature change.

> Similarly, can't we obtain "mapping" from new?

This is doable. But this patch is only an intermediate patch; the next
commit will pass in ci instead. Of course the `ci` can be
retrieved from `entry` directly too, but it's the caller's
responsibility to lock the `ci`, so passing in a locked ci explicitly
might be more intuitive? It also might save a tiny bit of CPU time by not
recalculating and reloading the `ci`.


>
> --
> Cheers
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-08 12:35   ` Baoquan He
@ 2025-09-08 14:27     ` Kairui Song
  2025-09-08 15:06       ` Baoquan He
  2025-09-08 15:01     ` Chris Li
  1 sibling, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-08 14:27 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 8:54 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 09/06/25 at 03:13am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > From: Chris Li <chrisl@kernel.org>
>
> 'From author <authorkernel.org>' can only be one person, and the co-author
> should be specified by "Co-developed-by:" and "Signed-off-by:"?
>

Hmm, that's interesting. I'm using git send-email with the below setup:

[sendemail]
from = Kairui Song <ryncsn@gmail.com>
confirm = auto
smtpServer = smtp.gmail.com
smtpServerPort = 587
smtpEncryption = tls
smtpUser = ryncsn@gmail.com

So it will add a "From:" automatically when I'm using gmail's SMTP but
the patch author doesn't match the sender. It seems git somehow got
confused by this commit, maybe I used some sending parameters wrongly.

The author of the doc really should be Chris.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-07 12:55   ` Klara Modin
@ 2025-09-08 14:34     ` Kairui Song
  2025-09-08 15:00       ` Klara Modin
  0 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-08 14:34 UTC (permalink / raw)
  To: Klara Modin
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Sun, Sep 7, 2025 at 8:59 PM Klara Modin <klarasmodin@gmail.com> wrote:
>
> On 2025-09-06 03:13:53 +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Introduce basic swap table infrastructures, which are now just a
> > fixed-sized flat array inside each swap cluster, with access wrappers.
> >
> > Each cluster contains a swap table of 512 entries. Each table entry is
> > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > a folio type (pointer), or NULL.
> >
> > In this first step, it only supports storing a folio or shadow, and it
> > is a drop-in replacement for the current swap cache. Convert all swap
> > cache users to use the new sets of APIs. Chris Li has been suggesting
> > using a new infrastructure for swap cache for better performance, and
> > that idea combined well with the swap table as the new backing
> > structure. Now the lock contention range is reduced to 2M clusters,
> > which is much smaller than the 64M address_space. And we can also drop
> > the multiple address_space design.
> >
> > All the internal works are done with swap_cache_get_* helpers. Swap
> > cache lookup is still lock-less like before, and the helper's contexts
> > are same with original swap cache helpers. They still require a pin
> > on the swap device to prevent the backing data from being freed.
> >
> > Swap cache updates are now protected by the swap cluster lock
> > instead of the Xarray lock. This is mostly handled internally, but new
> > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > few new cluster access and locking helpers are also introduced.
> >
> > A fully cluster-based unified swap table can be implemented on top
> > of this to take care of all count tracking and synchronization work,
> > with dynamic allocation. It should reduce the memory usage while
> > making the performance even better.
> >
> > Co-developed-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  MAINTAINERS          |   1 +
> >  include/linux/swap.h |   2 -
> >  mm/huge_memory.c     |  13 +-
> >  mm/migrate.c         |  19 ++-
> >  mm/shmem.c           |   8 +-
> >  mm/swap.h            | 157 +++++++++++++++++------
> >  mm/swap_state.c      | 289 +++++++++++++++++++------------------------
> >  mm/swap_table.h      |  97 +++++++++++++++
> >  mm/swapfile.c        | 100 +++++++++++----
> >  mm/vmscan.c          |  20 ++-
> >  10 files changed, 458 insertions(+), 248 deletions(-)
> >  create mode 100644 mm/swap_table.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 1c8292c0318d..de402ca91a80 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -16226,6 +16226,7 @@ F:    include/linux/swapops.h
> >  F:   mm/page_io.c
> >  F:   mm/swap.c
> >  F:   mm/swap.h
> > +F:   mm/swap_table.h
> >  F:   mm/swap_state.c
> >  F:   mm/swapfile.c
> >
>
> ...
>
> >  #include <linux/swapops.h> /* for swp_offset */
>
> Now that swp_offset() is used in folio_index(), should this perhaps also be
> included for !CONFIG_SWAP?

Hi, Thanks for looking at this series.

>
> >  #include <linux/blk_types.h> /* for bio_end_io_t */
> >
...

> >       if (unlikely(folio_test_swapcache(folio)))
>
> > -             return swap_cache_index(folio->swap);
> > +             return swp_offset(folio->swap);
>
> This is outside CONFIG_SWAP.

Right, but there are users of folio_index that are outside of
CONFIG_SWAP (mm/migrate.c), and swp_offset is also defined outside of
CONFIG_SWAP, so that's OK.

If we wrap it, the CONFIG_SWAP build will fail. I've tested the
!CONFIG_SWAP build on this patch and after the whole series; it works fine.

We should drop the usage of folio_index in migrate.c, that's not
really related to this series though.

>
> >       return folio->index;
> >  }
>
> ...
>
> Regards,
> Klara Modin
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-08 14:20     ` Kairui Song
@ 2025-09-08 14:39       ` David Hildenbrand
  2025-09-08 14:49         ` Kairui Song
  0 siblings, 1 reply; 80+ messages in thread
From: David Hildenbrand @ 2025-09-08 14:39 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 08.09.25 16:20, Kairui Song wrote:
> On Mon, Sep 8, 2025 at 8:35 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>
>>>
>>> +/**
>>> + * __swap_cache_replace_folio - Replace a folio in the swap cache.
>>> + * @mapping: Swap mapping address space.
>>> + * @entry: The first swap entry that the new folio corresponds to.
>>> + * @old: The old folio to be replaced.
>>> + * @new: The new folio.
>>> + *
>>> + * Replace a existing folio in the swap cache with a new folio.
>>> + *
>>> + * Context: Caller must ensure both folios are locked, and lock the
>>> + * swap address_space that holds the entries to be replaced.
>>> + */
>>> +void __swap_cache_replace_folio(struct address_space *mapping,
>>> +                             swp_entry_t entry,
>>> +                             struct folio *old, struct folio *new)
>>
>> Can't we just use "new->swap.val" directly and avoid passing in the
>> entry, documenting that new->swap.val must be setup properly in advance?
> 
> Thanks for the suggestion.
> 
> I was thinking about the opposite. I think maybe it's better that the
> caller never sets the new folio's entry value, so folio->swap is always
> modified in mm/swap_state.c, and let __swap_cache_replace_folio set
> new->swap, to make it easier to track the folio->swap
> usage.
> 
> This can be done easily for migration and shmem parts, the huge split
> code will need a bit more cleanup.

Right, but it's probably worth it.

> 
> It's a trivial change I think. But letting __swap_cache_replace_folio
> setup new's swap and flags may deduplicate some code. So I thought
> maybe this can be better cleaned up later. So for now I just add a
> debug check here that `entry == new->swap`.
> 
> And the debug check does imply that we can just drop the entry params
> in this patch, there will be no feature change.

Well, the current API as you introduce it here is confusing, as it's not 
clear who is supposed to initialize what.

So better to do it cleanly right from the start.

> 
>> Similarly, can't we obtain "mapping" from new?
> 
> This is doable. But this patch is only an intermediate patch, next
> commit will let the pass in ci instead. Of course the `ci` can be
> retrieved from `entry` directly too, but it's the caller's
> responsibility to lock the `ci`, so passing in a locked ci explicitly
> might be more intuitive? Also might save a tiny bit of CPU time from
> recalculating and load the `ci`.

Well, no other swap_cache_* functions consumes an address space, right?

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-08 14:39       ` David Hildenbrand
@ 2025-09-08 14:49         ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-08 14:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 10:39 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.09.25 16:20, Kairui Song wrote:
> > On Mon, Sep 8, 2025 at 8:35 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>
> >>>
> >>> +/**
> >>> + * __swap_cache_replace_folio - Replace a folio in the swap cache.
> >>> + * @mapping: Swap mapping address space.
> >>> + * @entry: The first swap entry that the new folio corresponds to.
> >>> + * @old: The old folio to be replaced.
> >>> + * @new: The new folio.
> >>> + *
> >>> + * Replace a existing folio in the swap cache with a new folio.
> >>> + *
> >>> + * Context: Caller must ensure both folios are locked, and lock the
> >>> + * swap address_space that holds the entries to be replaced.
> >>> + */
> >>> +void __swap_cache_replace_folio(struct address_space *mapping,
> >>> +                             swp_entry_t entry,
> >>> +                             struct folio *old, struct folio *new)
> >>
> >> Can't we just use "new->swap.val" directly and avoid passing in the
> >> entry, documenting that new->swap.val must be setup properly in advance?
> >
> > Thanks for the suggestion.
> >
> > I was thinking about the opposite. I think maybe it's better that the
> > caller never sets the new folio's entry value, so folio->swap is always
> > modified in mm/swap_state.c, and let __swap_cache_replace_folio set
> > new->swap, to make it easier to track the folio->swap
> > usage.
> >
> > This can be done easily for migration and shmem parts, the huge split
> > code will need a bit more cleanup.
>
> Right, but it's probably worth it.
>
> >
> > It's a trivial change I think. But letting __swap_cache_replace_folio
> > setup new's swap and flags may deduplicate some code. So I thought
> > maybe this can be better cleaned up later. So for now I just add a
> > debug check here that `entry == new->swap`.
> >
> > And the debug check does imply that we can just drop the entry params
> > in this patch, there will be no feature change.
>
> Well, the current API as you introduce it here is confusing, as it's not
> clear who is supposed to initialize what.
>
> So better to do it cleanly right from the start.
>
> >
> >> Similarly, can't we obtain "mapping" from new?
> >
> > This is doable. But this patch is only an intermediate patch, next
> > commit will let the pass in ci instead. Of course the `ci` can be
> > retrieved from `entry` directly too, but it's the caller's
> > responsibility to lock the `ci`, so passing in a locked ci explicitly
> > might be more intuitive? Also might save a tiny bit of CPU time from
> > recalculating and load the `ci`.
>
> Well, no other swap_cache_* functions consumes an address space, right?

Right. I can drop it in this patch.

>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 14/15] mm, swap: implement dynamic allocation of swap table
  2025-09-06 15:45   ` Chris Li
@ 2025-09-08 14:58     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-08 14:58 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Sat, Sep 6, 2025 at 11:59 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> Acked-by: Chris Li <chrisl@kernel.org>

Hi Chris, thanks for the review.

>
> BTW, if you made some changes after my last ack, please drop my ack
> tag on the new version, or clarify that the ack was on the older version,
> so I know this version has new changes.

Most of the patches are basically the same as before except the xchg
change and naming change. I'll mention and drop Ack's if any patch is
updated non-trivially. Thanks for the info.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-08 14:34     ` Kairui Song
@ 2025-09-08 15:00       ` Klara Modin
  2025-09-08 15:10         ` Kairui Song
  0 siblings, 1 reply; 80+ messages in thread
From: Klara Modin @ 2025-09-08 15:00 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5385 bytes --]

On 2025-09-08 22:34:04 +0800, Kairui Song wrote:
> On Sun, Sep 7, 2025 at 8:59 PM Klara Modin <klarasmodin@gmail.com> wrote:
> >
> > On 2025-09-06 03:13:53 +0800, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Introduce basic swap table infrastructures, which are now just a
> > > fixed-sized flat array inside each swap cluster, with access wrappers.
> > >
> > > Each cluster contains a swap table of 512 entries. Each table entry is
> > > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > > a folio type (pointer), or NULL.
> > >
> > > In this first step, it only supports storing a folio or shadow, and it
> > > is a drop-in replacement for the current swap cache. Convert all swap
> > > cache users to use the new sets of APIs. Chris Li has been suggesting
> > > using a new infrastructure for swap cache for better performance, and
> > > that idea combined well with the swap table as the new backing
> > > structure. Now the lock contention range is reduced to 2M clusters,
> > > which is much smaller than the 64M address_space. And we can also drop
> > > the multiple address_space design.
> > >
> > > All the internal works are done with swap_cache_get_* helpers. Swap
> > > cache lookup is still lock-less like before, and the helper's contexts
> > > are same with original swap cache helpers. They still require a pin
> > > on the swap device to prevent the backing data from being freed.
> > >
> > > Swap cache updates are now protected by the swap cluster lock
> > > instead of the Xarray lock. This is mostly handled internally, but new
> > > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > > few new cluster access and locking helpers are also introduced.
> > >
> > > A fully cluster-based unified swap table can be implemented on top
> > > of this to take care of all count tracking and synchronization work,
> > > with dynamic allocation. It should reduce the memory usage while
> > > making the performance even better.
> > >
> > > Co-developed-by: Chris Li <chrisl@kernel.org>
> > > Signed-off-by: Chris Li <chrisl@kernel.org>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  MAINTAINERS          |   1 +
> > >  include/linux/swap.h |   2 -
> > >  mm/huge_memory.c     |  13 +-
> > >  mm/migrate.c         |  19 ++-
> > >  mm/shmem.c           |   8 +-
> > >  mm/swap.h            | 157 +++++++++++++++++------
> > >  mm/swap_state.c      | 289 +++++++++++++++++++------------------------
> > >  mm/swap_table.h      |  97 +++++++++++++++
> > >  mm/swapfile.c        | 100 +++++++++++----
> > >  mm/vmscan.c          |  20 ++-
> > >  10 files changed, 458 insertions(+), 248 deletions(-)
> > >  create mode 100644 mm/swap_table.h
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index 1c8292c0318d..de402ca91a80 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -16226,6 +16226,7 @@ F:    include/linux/swapops.h
> > >  F:   mm/page_io.c
> > >  F:   mm/swap.c
> > >  F:   mm/swap.h
> > > +F:   mm/swap_table.h
> > >  F:   mm/swap_state.c
> > >  F:   mm/swapfile.c
> > >
> >
> > ...
> >
> > >  #include <linux/swapops.h> /* for swp_offset */
> >
> > Now that swp_offset() is used in folio_index(), should this perhaps also be
> > included for !CONFIG_SWAP?
> 
> Hi, Thanks for looking at this series.
> 
> >
> > >  #include <linux/blk_types.h> /* for bio_end_io_t */
> > >
> ...
> 
> > >       if (unlikely(folio_test_swapcache(folio)))
> >
> > > -             return swap_cache_index(folio->swap);
> > > +             return swp_offset(folio->swap);
> >
> > This is outside CONFIG_SWAP.
> 
> Right, but there are users of folio_index that are outside of
> CONFIG_SWAP (mm/migrate.c), and swp_offset is also outside of SWAP so
> that's OK.
> 
> If we wrap it, the CONFIG_SWAP build will fail. I've test !CONFIG_SWAP
> build on this patch and after the whole series, it works fine.
> 
> We should drop the usage of folio_index in migrate.c, that's not
> really related to this series though.

Interesting that it works for you. I have a config with !CONFIG_SWAP which
fails with:

 In file included from mm/shmem.c:44:
 mm/swap.h: In function ‘folio_index’:
 mm/swap.h:461:24: error: implicit declaration of function ‘swp_offset’; did you mean ‘pmd_offset’? [-Wimplicit-function-declaration]
   461 |                 return swp_offset(folio->swap);
       |                        ^~~~~~~~~~
       |                        pmd_offset
 
(though it's possible I have misapplied the series somehow).
If I just move the linux/swapops.h include outside the CONFIG_SWAP ifdef:

diff --git a/mm/swap.h b/mm/swap.h
index caff4fe30fc5..12dd7d6478ff 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -3,6 +3,7 @@
 #define _MM_SWAP_H
 
 #include <linux/atomic.h> /* for atomic_long_t */
+#include <linux/swapops.h> /* for swp_offset */
 struct mempolicy;
 struct swap_iocb;
 
@@ -54,7 +55,6 @@ enum swap_cluster_flags {
 };
 
 #ifdef CONFIG_SWAP
-#include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
 static inline unsigned int swp_cluster_offset(swp_entry_t entry)

it fixes that issue for me, and my other CONFIG_SWAP builds do not seem
to be impacted. I attached the config in case it's useful.

> 
> >
> > >       return folio->index;
> > >  }
> >
> > ...
> >
> > Regards,
> > Klara Modin
> >

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 10332 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-08 12:35   ` Baoquan He
  2025-09-08 14:27     ` Kairui Song
@ 2025-09-08 15:01     ` Chris Li
  2025-09-08 15:09       ` Baoquan He
  1 sibling, 1 reply; 80+ messages in thread
From: Chris Li @ 2025-09-08 15:01 UTC (permalink / raw)
  To: Baoquan He
  Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
	Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 5:36 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 09/06/25 at 03:13am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > From: Chris Li <chrisl@kernel.org>
>
> 'From author <authorkernel.org>' can only be one person, and the co-author
> should be specified by "Co-developed-by:" and "Signed-off-by:"?

That is an artifact of sending another person's patch in a series.
The first "From" is the email header sender. The second "From" is
the real author of the patch. It is just like an IP tunnel packet: there is
an inner IP packet wrapped inside the outer IP packet.

I think that is all normal and does not violate the kernel rules. When
I included Kairui's patches in my swap allocator series, the same thing
happened there on Kairui's patches. In the end git will know
who the real author is, because those patches are output by git
anyway.

Chris


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers
  2025-09-08 12:21   ` David Hildenbrand
@ 2025-09-08 15:01     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-08 15:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 8:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.09.25 21:13, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > swp_swap_info is the most commonly used helper for retrieving swap info.
> > It has an internal check that may lead to a NULL return value, but
> > almost none of its caller checks the return value, making the internal
> > check pointless. In fact, most of these callers already ensured the
> > entry is valid and never expect a NULL value.
> >
> > Tidy this up and shorten the name. If the caller can make sure the
>
> "Tidy this up and improve the function names." ?

Yeah you are right. Most names actually got longer :)

>
> > swap entry/type is valid and the device is pinned, use the new introduced
> > __swap_entry_to_info/__swap_type_to_info instead. They have more debug
> > sanity checks and lower overhead as they are inlined.
> >
> > Callers that may expect a NULL value should use
> > swap_entry_to_info/swap_type_to_info instead.
> >
> > No feature change. The rearranged codes should have had no effect, or
> > they should have been hitting NULL de-ref bugs already. Only some new
> > sanity checks are added so potential issues may show up in debug build.
> >
> > The new helpers will be frequently used with swap table later when working
> > with swap cache folios. A locked swap cache folio ensures the entries are
> > valid and stable so these helpers are very helpful.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Acked-by: Chris Li <chrisl@kernel.org>
> > Reviewed-by: Barry Song <baohua@kernel.org>
> > ---
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

>
> --
> Cheers
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-08 14:27     ` Kairui Song
@ 2025-09-08 15:06       ` Baoquan He
  0 siblings, 0 replies; 80+ messages in thread
From: Baoquan He @ 2025-09-08 15:06 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 09/08/25 at 10:27pm, Kairui Song wrote:
> On Mon, Sep 8, 2025 at 8:54 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 09/06/25 at 03:13am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > From: Chris Li <chrisl@kernel.org>
> >
> > 'From author <authorkernel.org>' can only be one person, and the co-author
> > should be specified by "Co-developed-by:" and "Signed-off-by:"?
> >
> 
> Hmm, that's interesting, I'm using git send mail with below setup:
> 
> [sendemail]
> from = Kairui Song <ryncsn@gmail.com>
> confirm = auto
> smtpServer = smtp.gmail.com
> smtpServerPort = 587
> smtpEncryption = tls
> smtpUser = ryncsn@gmail.com
> 
> So it will add a "From:" automatically when I'm using gmail's SMTP but
> the patch author doesn't match the sender. It seems git somehow got
> confused by this commit, maybe I used some sending parameters wrongly.

Then you may need to remove the 'from' field of your git [sendemail]
section. If I git am your patch, then your first 'From' will become the
patch author.

> 
> The author of the doc really should be Chris.
> 



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-08 15:01     ` Chris Li
@ 2025-09-08 15:09       ` Baoquan He
  2025-09-08 15:52         ` Chris Li
  0 siblings, 1 reply; 80+ messages in thread
From: Baoquan He @ 2025-09-08 15:09 UTC (permalink / raw)
  To: Chris Li
  Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
	Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On 09/08/25 at 08:01am, Chris Li wrote:
> On Mon, Sep 8, 2025 at 5:36 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 09/06/25 at 03:13am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > From: Chris Li <chrisl@kernel.org>
> >
> > 'From author <authorkernel.org>' can only be one person, and the co-author
> > should be specified by "Co-developed-by:" and "Signed-off-by:"?
> 
> That is the artifact of sending another person's patch in a series.
> The first "From" is from the email header sender. The second "from" is
> the real author of the patch. Just like an IP tunnel packet there is
> another inner IP packet wrapped in the outer IP packet.
> 
> I think that is all normal and did not violate the kernel rules. When
> I include Kairui's patch in my swap allocator series. The same thing
> happened there on Kairui's patch. In the end the git will know enough
> who is the real author, because those patches are  outputted by git
> anyway.

Hmm, maybe git doesn't work like that. I applied this patch via git am,
and I got this on my local branch. The 2nd 'From' became part of the commit log.

commit 337b3cd6c0ffad355df8851414e8aa5be052f4cb (HEAD -> kasan-v3)
Author: Kairui Song <kasong@tencent.com>
Date:   Sat Sep 6 03:13:43 2025 +0800

    docs/mm: add document for swap table
    
    From: Chris Li <chrisl@kernel.org>
    
    Swap table is the new swap cache.
    
    Signed-off-by: Chris Li <chrisl@kernel.org>
    Signed-off-by: Kairui Song <kasong@tencent.com>



^ permalink raw reply	[flat|nested] 80+ messages in thread
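
For reference, the layout git mailinfo/git am expects when the sender and
the author differ is a single in-body "From:" as the very first body line.
Roughly, using the names from this thread as an example:

	From: Kairui Song <ryncsn@gmail.com>		(mail header: the sender)
	Subject: [PATCH v2 01/15] docs/mm: add document for swap table

	From: Chris Li <chrisl@kernel.org>		(first body line: the author)

	Swap table is the new swap cache.

	Signed-off-by: Chris Li <chrisl@kernel.org>
	Signed-off-by: Kairui Song <kasong@tencent.com>

In the posted patch there were two in-body "From:" lines (one for
kasong@tencent.com was added ahead of Chris's because the commit author did
not match the sending address), so git am consumed the first one as the
author and left the second in the commit log, which matches the result
quoted above.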

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-08 15:00       ` Klara Modin
@ 2025-09-08 15:10         ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-08 15:10 UTC (permalink / raw)
  To: Klara Modin
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 11:01 PM Klara Modin <klarasmodin@gmail.com> wrote:
>
> On 2025-09-08 22:34:04 +0800, Kairui Song wrote:
> > On Sun, Sep 7, 2025 at 8:59 PM Klara Modin <klarasmodin@gmail.com> wrote:
> > >
> > > On 2025-09-06 03:13:53 +0800, Kairui Song wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > Introduce basic swap table infrastructures, which are now just a
> > > > fixed-sized flat array inside each swap cluster, with access wrappers.
> > > >
> > > > Each cluster contains a swap table of 512 entries. Each table entry is
> > > > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > > > a folio type (pointer), or NULL.
> > > >
> > > > In this first step, it only supports storing a folio or shadow, and it
> > > > is a drop-in replacement for the current swap cache. Convert all swap
> > > > cache users to use the new sets of APIs. Chris Li has been suggesting
> > > > using a new infrastructure for swap cache for better performance, and
> > > > that idea combined well with the swap table as the new backing
> > > > structure. Now the lock contention range is reduced to 2M clusters,
> > > > which is much smaller than the 64M address_space. And we can also drop
> > > > the multiple address_space design.
> > > >
> > > > All the internal works are done with swap_cache_get_* helpers. Swap
> > > > cache lookup is still lock-less like before, and the helper's contexts
> > > > are same with original swap cache helpers. They still require a pin
> > > > on the swap device to prevent the backing data from being freed.
> > > >
> > > > Swap cache updates are now protected by the swap cluster lock
> > > > instead of the Xarray lock. This is mostly handled internally, but new
> > > > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > > > few new cluster access and locking helpers are also introduced.
> > > >
> > > > A fully cluster-based unified swap table can be implemented on top
> > > > of this to take care of all count tracking and synchronization work,
> > > > with dynamic allocation. It should reduce the memory usage while
> > > > making the performance even better.
> > > >
> > > > Co-developed-by: Chris Li <chrisl@kernel.org>
> > > > Signed-off-by: Chris Li <chrisl@kernel.org>
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > ---
> > > >  MAINTAINERS          |   1 +
> > > >  include/linux/swap.h |   2 -
> > > >  mm/huge_memory.c     |  13 +-
> > > >  mm/migrate.c         |  19 ++-
> > > >  mm/shmem.c           |   8 +-
> > > >  mm/swap.h            | 157 +++++++++++++++++------
> > > >  mm/swap_state.c      | 289 +++++++++++++++++++------------------------
> > > >  mm/swap_table.h      |  97 +++++++++++++++
> > > >  mm/swapfile.c        | 100 +++++++++++----
> > > >  mm/vmscan.c          |  20 ++-
> > > >  10 files changed, 458 insertions(+), 248 deletions(-)
> > > >  create mode 100644 mm/swap_table.h
> > > >
> > > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > > index 1c8292c0318d..de402ca91a80 100644
> > > > --- a/MAINTAINERS
> > > > +++ b/MAINTAINERS
> > > > @@ -16226,6 +16226,7 @@ F:    include/linux/swapops.h
> > > >  F:   mm/page_io.c
> > > >  F:   mm/swap.c
> > > >  F:   mm/swap.h
> > > > +F:   mm/swap_table.h
> > > >  F:   mm/swap_state.c
> > > >  F:   mm/swapfile.c
> > > >
> > >
> > > ...
> > >
> > > >  #include <linux/swapops.h> /* for swp_offset */
> > >
> > > Now that swp_offset() is used in folio_index(), should this perhaps also be
> > > included for !CONFIG_SWAP?
> >
> > Hi, Thanks for looking at this series.
> >
> > >
> > > >  #include <linux/blk_types.h> /* for bio_end_io_t */
> > > >
> > ...
> >
> > > >       if (unlikely(folio_test_swapcache(folio)))
> > >
> > > > -             return swap_cache_index(folio->swap);
> > > > +             return swp_offset(folio->swap);
> > >
> > > This is outside CONFIG_SWAP.
> >
> > Right, but there are users of folio_index that are outside of
> > CONFIG_SWAP (mm/migrate.c), and swp_offset is also outside of SWAP so
> > that's OK.
> >
> > If we wrap it, the !CONFIG_SWAP build will fail. I've tested the
> > !CONFIG_SWAP build on this patch and after the whole series; it works fine.
> >
> > We should drop the usage of folio_index in migrate.c, though that's not
> > really related to this series.
>
> Interesting that it works for you. I have a config with !CONFIG_SWAP which
> fails with:
>
>  In file included from mm/shmem.c:44:
>  mm/swap.h: In function ‘folio_index’:
>  mm/swap.h:461:24: error: implicit declaration of function ‘swp_offset’; did you mean ‘pmd_offset’? [-Wimplicit-function-declaration]
>    461 |                 return swp_offset(folio->swap);
>        |                        ^~~~~~~~~~
>        |                        pmd_offset
>
> (though it's possible I have misapplied the series somehow).
> If I just move the linux/swapops.h include outside the CONFIG_SWAP ifdef:
>
> diff --git a/mm/swap.h b/mm/swap.h
> index caff4fe30fc5..12dd7d6478ff 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -3,6 +3,7 @@
>  #define _MM_SWAP_H
>
>  #include <linux/atomic.h> /* for atomic_long_t */
> +#include <linux/swapops.h> /* for swp_offset */
>  struct mempolicy;
>  struct swap_iocb;
>
> @@ -54,7 +55,6 @@ enum swap_cluster_flags {
>  };
>
>  #ifdef CONFIG_SWAP
> -#include <linux/swapops.h> /* for swp_offset */

Oh, I think I know what the problem is here. You disabled SHMEM too.
Most users of swap.h already include linux/swapops.h, but shmem.c
doesn't include linux/swapops.h when !CONFIG_SHMEM, so swp_offset
is undefined.

It's true that the problem is in swap.h: it should include swapops.h
for !SWAP too to avoid build errors like this. Thanks for the report!


>  #include <linux/blk_types.h> /* for bio_end_io_t */
>
>  static inline unsigned int swp_cluster_offset(swp_entry_t entry)
>
> it fixes that issue for me, and my other CONFIG_SWAP builds do not seem
> to be impacted. I attached the config in case it's useful.
>
> >
> > >
> > > >       return folio->index;
> > > >  }
> > >
> > > ...
> > >
> > > Regards,
> > > Klara Modin
> > >


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-08 13:45   ` David Hildenbrand
@ 2025-09-08 15:14     ` Kairui Song
  2025-09-08 15:32       ` Kairui Song
  0 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-08 15:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
> > +static inline struct swap_cluster_info *swap_cluster_lock(
> > +     struct swap_info_struct *si, pgoff_t offset, bool irq)
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> > +             struct folio *folio)
>
> I would probably call that "swap_cluster_get_and_lock" or sth like that ...
>
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> > +             struct folio *folio)
> > +{
>
> ... then this would become "swap_cluster_get_and_lock_irq"
>
>
> Alternatively, separate the lookup from the locking of the cluster.
>
> swap_cluster_from_folio() / folio_get_swap_cluster()
> swap_cluster_lock()
> swap_cluster_lock_irq()
>
> Which might look cleaner in the end.

That's a very good suggestion.

>
> [...]
>
> > -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
> > -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
> > +struct address_space swap_space __read_mostly = {
> > +     .a_ops = &swap_aops,
> > +};
> > +
> >   static bool enable_vma_readahead __read_mostly = true;
> >
> >   #define SWAP_RA_ORDER_CEILING       5
> > @@ -83,11 +86,21 @@ void show_swap_cache_info(void)
> >    */
> >   struct folio *swap_cache_get_folio(swp_entry_t entry)
> >   {
> > -     struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > -                                             swap_cache_index(entry));
> > -     if (IS_ERR(folio))
> > -             return NULL;
> > -     return folio;
> > +
>
> ^ superfluous empty line.
>
> [...]
>
> >
> > @@ -420,6 +421,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
> >       return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >   }
> >
> > +static int swap_table_alloc_table(struct swap_cluster_info *ci)
>
> swap_cluster_alloc_table ?

Good idea.

>
> > +{
> > +     WARN_ON(ci->table);
> > +     ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> > +     if (!ci->table)
> > +             return -ENOMEM;
> > +     return 0;
> > +}
> > +
> > +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> > +{
> > +     unsigned int ci_off;
> > +     unsigned long swp_tb;
> > +
> > +     if (!ci->table)
> > +             return;
> > +
> > +     for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> > +             swp_tb = __swap_table_get(ci, ci_off);
> > +             if (!swp_tb_is_null(swp_tb))
> > +                     pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> > +                                 swp_tb);
> > +     }
> > +
> > +     kfree(ci->table);
> > +     ci->table = NULL;
> > +}
> > +
> >   static void move_cluster(struct swap_info_struct *si,
> >                        struct swap_cluster_info *ci, struct list_head *list,
> >                        enum swap_cluster_flags new_flags)
> > @@ -702,6 +731,26 @@ static bool cluster_scan_range(struct swap_info_struct *si,
> >       return true;
> >   }
> >
> > +/*
> > + * Currently, the swap table is not used for count tracking, just
> > + * do a sanity check here to ensure nothing leaked, so the swap
> > + * table should be empty upon freeing.
> > + */
> > +static void cluster_table_check(struct swap_cluster_info *ci,
> > +                             unsigned int start, unsigned int nr)
>
> "swap_cluster_assert_table_empty()"
>
> or sth like that that makes it clearer what you are checking for.

Agree.

>
> --
> Cheers
>

Thanks!


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-08 15:14     ` Kairui Song
@ 2025-09-08 15:32       ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-08 15:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 11:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
> > > +static inline struct swap_cluster_info *swap_cluster_lock(
> > > +     struct swap_info_struct *si, pgoff_t offset, bool irq)
> > > +{
> > > +     return NULL;
> > > +}
> > > +
> > > +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> > > +             struct folio *folio)
> >
> > I would probably call that "swap_cluster_get_and_lock" or sth like that ...
> >
> > > +{
> > > +     return NULL;
> > > +}
> > > +
> > > +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> > > +             struct folio *folio)
> > > +{
> >
> > ... then this would become "swap_cluster_get_and_lock_irq"
> >
> >
> > Alternatively, separate the lookup from the locking of the cluster.
> >
> > swap_cluster_from_folio() / folio_get_swap_cluster()
> > swap_cluster_lock()
> > swap_cluster_lock_irq()
> >
> > Which might look cleaner in the end.
>
> That's a very good suggestion.

Hmm, one problem here is that swap_cluster_lock takes `si` and
`offset` as arguments, while swap_cluster_from_folio returns a pointer
to the cluster. Re-deriving the offset / si pointer after we already
have the cluster pointer seems redundant and troublesome.

I think maybe swap_cluster_get_and_lock is better. (Not sure if it
looks strange to take a folio as argument.)
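
To make the difference concrete, a rough sketch (the exact signatures
here are my assumption, based only on the names discussed in this
thread, not on the series' code):

    /*
     * Split form: the lookup gives us the cluster, but locking by
     * (si, offset) means deriving both from the folio again:
     */
    ci = swap_cluster_from_folio(folio);
    swap_cluster_lock(__swap_entry_to_info(folio->swap),
                      swp_offset(folio->swap));

    /* Combined form: one call, nothing re-derived: */
    ci = swap_cluster_get_and_lock(folio);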

>
> >
> > [...]
> >
> > > -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
> > > -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
> > > +struct address_space swap_space __read_mostly = {
> > > +     .a_ops = &swap_aops,
> > > +};
> > > +
> > >   static bool enable_vma_readahead __read_mostly = true;
> > >
> > >   #define SWAP_RA_ORDER_CEILING       5
> > > @@ -83,11 +86,21 @@ void show_swap_cache_info(void)
> > >    */
> > >   struct folio *swap_cache_get_folio(swp_entry_t entry)
> > >   {
> > > -     struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > > -                                             swap_cache_index(entry));
> > > -     if (IS_ERR(folio))
> > > -             return NULL;
> > > -     return folio;
> > > +
> >
> > ^ superfluous empty line.
> >
> > [...]
> >
> > >
> > > @@ -420,6 +421,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
> > >       return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> > >   }
> > >
> > > +static int swap_table_alloc_table(struct swap_cluster_info *ci)
> >
> > swap_cluster_alloc_table ?
>
> Good idea.
>
> >
> > > +{
> > > +     WARN_ON(ci->table);
> > > +     ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> > > +     if (!ci->table)
> > > +             return -ENOMEM;
> > > +     return 0;
> > > +}
> > > +
> > > +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> > > +{
> > > +     unsigned int ci_off;
> > > +     unsigned long swp_tb;
> > > +
> > > +     if (!ci->table)
> > > +             return;
> > > +
> > > +     for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> > > +             swp_tb = __swap_table_get(ci, ci_off);
> > > +             if (!swp_tb_is_null(swp_tb))
> > > +                     pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> > > +                                 swp_tb);
> > > +     }
> > > +
> > > +     kfree(ci->table);
> > > +     ci->table = NULL;
> > > +}
> > > +
> > >   static void move_cluster(struct swap_info_struct *si,
> > >                        struct swap_cluster_info *ci, struct list_head *list,
> > >                        enum swap_cluster_flags new_flags)
> > > @@ -702,6 +731,26 @@ static bool cluster_scan_range(struct swap_info_struct *si,
> > >       return true;
> > >   }
> > >
> > > +/*
> > > + * Currently, the swap table is not used for count tracking, just
> > > + * do a sanity check here to ensure nothing leaked, so the swap
> > > + * table should be empty upon freeing.
> > > + */
> > > +static void cluster_table_check(struct swap_cluster_info *ci,
> > > +                             unsigned int start, unsigned int nr)
> >
> > "swap_cluster_assert_table_empty()"
> >
> > or sth like that that makes it clearer what you are checking for.
>
> Agree.
>
> >
> > --
> > Cheers
> >
>
> Thanks!


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-06 15:28   ` Chris Li
@ 2025-09-08 15:38     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-08 15:38 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Sat, Sep 6, 2025 at 11:35 PM Chris Li <chrisl@kernel.org> wrote:
>
> Acked-by: Chris Li <chrisl@kernel.org>
>
> Some nitpick follows.
>
> Chris

Thanks!

> > +
> > +/*
> > + * swap_cluster_lock_by_folio - Locks the cluster that holds a folio's entries.
> > + * @folio: The folio.
> > + *
> > + * This locks the swap cluster that contains a folio's swap entries. The
> > + * swap entries of a folio are always in one single cluster, and a locked
> > + * swap cache folio is enough to stabilize the entries and the swap device.
>
> I was wondering if we have a better word than stabilize; we haven't
> defined what stabilize means. I assume it means protecting from
> racing access to the swap cache entry. If we describe what it protects
> or what it prevents, that would give more detailed meaning than
> stabilize.

Right, I used to use the word "pin". What it means here is: locking
the folio ensures folio->swap won't change, so the folio has a stable
binding with the swap cluster that folio->swap points to. Also, the
swap device can't be swapped off, so there is no risk of UAF.

How about:

 * This locks the swap cluster that contains a folio's swap entries. The
 * swap entries of a folio are always in one single cluster. The folio has
 * to be locked so its swap entries won't change and the cluster is bound
 * to the folio.
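
For illustration, the usage pattern this documents would look roughly
like the following (a sketch only; the unlock helper's exact name is
my assumption):

    folio_lock(folio);                      /* folio->swap can't change now  */
    ci = swap_cluster_lock_by_folio(folio); /* cluster and device stay valid */
    /* ... update the swap table entries backing this folio ... */
    swap_cluster_unlock(ci);
    folio_unlock(folio);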

...

> > @@ -123,57 +136,45 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> >   * SWAP_HAS_CACHE to avoid race or conflict.
> >   * Return: Returns 0 on success, error code otherwise.
> >   */
> > -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > -                        gfp_t gfp, void **shadowp)
> > +void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
> >  {
> > -       struct address_space *address_space = swap_address_space(entry);
> > -       pgoff_t idx = swap_cache_index(entry);
> > -       XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> > -       unsigned long i, nr = folio_nr_pages(folio);
> > -       void *old;
> > -
> > -       xas_set_update(&xas, workingset_update_node);
> > -
> > -       VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> > -       VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> > -       VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
> > +       void *shadow = NULL;
> > +       unsigned long swp_tb, exist;
> > +       struct swap_cluster_info *ci;
> > +       unsigned int ci_start, ci_off, ci_end;
> > +       unsigned long nr_pages = folio_nr_pages(folio);
> > +
> > +       VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> > +       VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> > +       VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> > +
> > +       swp_tb = folio_to_swp_tb(folio);
> > +       ci_start = swp_cluster_offset(entry);
> > +       ci_end = ci_start + nr_pages;
> > +       ci_off = ci_start;
> > +       ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
> > +       do {
> > +               exist = __swap_table_xchg(ci, ci_off, swp_tb);
>
> Thanks for changing it to xchg. I understand that by "exist" you mean
> the previously existing swap table entry. However, after it was taken
> out of the swap table, is it still considered an "existing" entry? I
> think "old" or "prior" might be a better name. Just nitpicks anyway.
> If we use "old", we can rename "swp_tb" to "new_tb" to make it obvious
> what we are replacing it with.

Good suggestion.

>
> Also I saw this kind of for loop repeated in a few places.
> Maybe consider a for loop macro to do:
>
> for_each_folio_offset(folio, ci, ci_off) {
>       exist = __swap_table_xchg(ci, ci_off, swp_tb);
>       ...
> } end_for_each_folio_offset();
>
> The kernel has a lot of similar for loop macros.
>

There seem to be only a few users like this, but I can give it a try.
I will use this style if it helps reduce LOC or makes the code easier to follow.
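
Something like this, perhaps, as a rough sketch (hypothetical name and
shape, just mirroring the open-coded loop in the patch):

    #define for_each_folio_ci_off(folio, entry, ci_off, ci_end)     \
            for (ci_off = swp_cluster_offset(entry),                \
                 ci_end = ci_off + folio_nr_pages(folio);           \
                 ci_off < ci_end; ci_off++)
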
Thanks!


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 01/15] docs/mm: add document for swap table
  2025-09-08 15:09       ` Baoquan He
@ 2025-09-08 15:52         ` Chris Li
  0 siblings, 0 replies; 80+ messages in thread
From: Chris Li @ 2025-09-08 15:52 UTC (permalink / raw)
  To: Baoquan He
  Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
	Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 8:09 AM Baoquan He <bhe@redhat.com> wrote:
> > I think that is all normal and does not violate the kernel rules. When
> > I included Kairui's patch in my swap allocator series, the same thing
> > happened there on Kairui's patch. In the end, git will know who the
> > real author is, because those patches are output by git anyway.
>
> Hmm, maybe git doesn't work like that. I applied this patch via git am,
> and I got this on my local branch. The 2nd 'From' became part of the commit log.
>

In that case, Kairui needs to fix his send-email config.

Maybe, as you suggested, remove his own From: line in this case. I
don't recall needing such a special git send-email config. Maybe
Kairui's SMTP server is different.

BTW, I definitely know that Google's SMTP server does not work well
with "b4 send --reflect". Google SMTP is like: "Oh, I see you might
want to CC these people, because you included them in your inner
envelope CC list. Let me do a CC for you on the outer envelope as
well." It defeats the purpose of "b4 send --reflect", which is a dry
run. I still recall the horror on my poor colleague's face when I
convinced him to try out "b4 send --reflect", which should be safe,
but the email actually went out to the full list. I should file a bug
for that.

Chris

> commit 337b3cd6c0ffad355df8851414e8aa5be052f4cb (HEAD -> kasan-v3)
> Author: Kairui Song <kasong@tencent.com>
> Date:   Sat Sep 6 03:13:43 2025 +0800
>
>     docs/mm: add document for swap table
>
>     From: Chris Li <chrisl@kernel.org>
>
>     Swap table is the new swap cache.
>
>     Signed-off-by: Chris Li <chrisl@kernel.org>
>     Signed-off-by: Kairui Song <kasong@tencent.com>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper
  2025-09-08 10:44     ` Kairui Song
@ 2025-09-09  1:18       ` Baolin Wang
  0 siblings, 0 replies; 80+ messages in thread
From: Baolin Wang @ 2025-09-09  1:18 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel



On 2025/9/8 18:44, Kairui Song wrote:
> On Mon, Sep 8, 2025 at 11:52 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/9/6 03:13, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> There are currently three swap cache users that are trying to replace an
>>> existing folio with a new one: huge memory splitting, migration, and
>>> shmem replacement. What they are doing is quite similar.
>>>
>>> Introduce a common helper for this. In later commits, they can be easily
>>> switched to use the swap table by updating this helper.
>>>
>>> The newly added helper also makes the swap cache API better defined, and
>>> debugging is easier.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>>    mm/huge_memory.c |  5 ++---
>>>    mm/migrate.c     | 11 +++--------
>>>    mm/shmem.c       | 10 ++--------
>>>    mm/swap.h        |  3 +++
>>>    mm/swap_state.c  | 32 ++++++++++++++++++++++++++++++++
>>>    5 files changed, 42 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 26cedfcd7418..a4d192c8d794 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3798,9 +3798,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>                         * NOTE: shmem in swap cache is not supported yet.
>>>                         */
>>>                        if (swap_cache) {
>>> -                             __xa_store(&swap_cache->i_pages,
>>> -                                        swap_cache_index(new_folio->swap),
>>> -                                        new_folio, 0);
>>> +                             __swap_cache_replace_folio(swap_cache, new_folio->swap,
>>> +                                                        folio, new_folio);
>>>                                continue;
>>>                        }
>>
>> IIUC, it doesn't seem like a simple function replacement here. It
>> appears that the original code has a bug: if the 'new_folio' is a large
>> folio after split, we need to iterate over each swap entry of the large
>> swapcache folio and then restore the new 'new_folio'.
>>
> 
> That should be OK. We have a check in uniform_split_supported and
> non_uniform_split_supported that a swapcache folio can only be split
> into order 0. And it seems there is no support for splitting a pure
> swapcache folio now.

Ah, yes. Better to mention that in the commit message; otherwise, it
will make people (me, at least) doubt whether this is a
non-functional change.

With David's comments addressed, feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> Maybe we can try to enable and make use of higher-order splits for
> swapcache after this series. I just tried some hackish code to split
> random folios in the swap cache to a larger order, and it seems fine
> after this series.



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 04/15] mm, swap: check page poison flag after locking it
  2025-09-08 12:11   ` David Hildenbrand
@ 2025-09-09 14:54     ` Kairui Song
  2025-09-09 15:18       ` David Hildenbrand
  0 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-09 14:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 8:40 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.09.25 21:13, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Instead of checking the poison flag only in the fast swap cache lookup
> > path, always check the poison flags after locking a swap cache folio.
> >
> > There are two reasons to do so.
> >
> > The folio is unstable and could be removed from the swap cache anytime,
> > so it's totally possible that the folio is no longer the backing folio
> > of a swap entry, and could be an irrelevant poisoned folio. We might
> > mistakenly kill a faulting process.
> >
> > And it's totally possible or even common for the slow swap in path
> > (swapin_readahead) to bring in a cached folio. The cache folio could be
> > poisoned, too. Only checking the poison flag in the fast path will miss
> > such folios.
> >
> > The race window is tiny, so it's very unlikely to happen, though.
> > While at it, also add a unlikely prefix.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >   mm/memory.c | 22 +++++++++++-----------
> >   1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 10ef528a5f44..94a5928e8ace 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4661,10 +4661,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               goto out;
> >
> >       folio = swap_cache_get_folio(entry);
> > -     if (folio) {
> > +     if (folio)
> >               swap_update_readahead(folio, vma, vmf->address);
> > -             page = folio_file_page(folio, swp_offset(entry));
> > -     }
> >       swapcache = folio;
> >
> >       if (!folio) {
> > @@ -4735,20 +4733,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               ret = VM_FAULT_MAJOR;
> >               count_vm_event(PGMAJFAULT);
> >               count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > -             page = folio_file_page(folio, swp_offset(entry));
> > -     } else if (PageHWPoison(page)) {
> > -             /*
> > -              * hwpoisoned dirty swapcache pages are kept for killing
> > -              * owner processes (which may be unknown at hwpoison time)
> > -              */
> > -             ret = VM_FAULT_HWPOISON;
> > -             goto out_release;
> >       }
> >
> >       ret |= folio_lock_or_retry(folio, vmf);
> >       if (ret & VM_FAULT_RETRY)
> >               goto out_release;
> >
> > +     page = folio_file_page(folio, swp_offset(entry));
> >       if (swapcache) {
> >               /*
> >                * Make sure folio_free_swap() or swapoff did not release the
> > @@ -4761,6 +4752,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                            page_swap_entry(page).val != entry.val))
> >                       goto out_page;
> >
> > +             if (unlikely(PageHWPoison(page))) {
> > +                     /*
> > +                      * hwpoisoned dirty swapcache pages are kept for killing
> > +                      * owner processes (which may be unknown at hwpoison time)
> > +                      */
> > +                     ret = VM_FAULT_HWPOISON;
> > +                     goto out_page;
> > +             }
> > +
> >               /*
> >                * KSM sometimes has to copy on read faults, for example, if
> >                * folio->index of non-ksm folios would be nonlinear inside the
>
> LGTM, but I was wondering whether we just want to check that even when

Thanks for checking the patch.

> we just allocated a fresh folio for simplicity. The check is cheap ...
>

Maybe not for now? This patch expects folio_test_swapcache to filter
out potentially irrelevant folios, so moving the check before that is
in theory not correct. And the folio_test_swapcache check won't work
for a freshly allocated folio here...

I'm planning to remove the whole `if (swapcache)` check in phase 2, as
all swapins will go through the swap cache. By then, all checks will
always be applied, and the simplification can be done in a cleaner way.

> --
> Cheers
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use
  2025-09-08 12:18   ` David Hildenbrand
@ 2025-09-09 14:58     ` Kairui Song
  2025-09-09 15:19       ` David Hildenbrand
  0 siblings, 1 reply; 80+ messages in thread
From: Kairui Song @ 2025-09-09 14:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Mon, Sep 8, 2025 at 10:08 PM David Hildenbrand <david@redhat.com> wrote:
>
>
> >
> >               folio_lock(folio);
> > +             if (!folio_matches_swap_entry(folio, entry)) {
> > +                     folio_unlock(folio);
> > +                     folio_put(folio);
> > +                     continue;
> > +             }
> > +
>
> I wonder if we should put that into unuse_pte() instead. It checks for
> other types of races (like the page table entry getting modified) already.

Doing this earlier here might help to avoid the folio_wait_writeback
below? And checking the folio right after locking seems to follow the
convention more strictly.

I'm fine either way though as there should be almost no difference.

> --
> Cheers
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 04/15] mm, swap: check page poison flag after locking it
  2025-09-09 14:54     ` Kairui Song
@ 2025-09-09 15:18       ` David Hildenbrand
  0 siblings, 0 replies; 80+ messages in thread
From: David Hildenbrand @ 2025-09-09 15:18 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 09.09.25 16:54, Kairui Song wrote:
> On Mon, Sep 8, 2025 at 8:40 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 05.09.25 21:13, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Instead of checking the poison flag only in the fast swap cache lookup
>>> path, always check the poison flags after locking a swap cache folio.
>>>
>>> There are two reasons to do so.
>>>
>>> The folio is unstable and could be removed from the swap cache anytime,
>>> so it's totally possible that the folio is no longer the backing folio
>>> of a swap entry, and could be an irrelevant poisoned folio. We might
>>> mistakenly kill a faulting process.
>>>
>>> And it's totally possible or even common for the slow swap in path
>>> (swapin_readahead) to bring in a cached folio. The cache folio could be
>>> poisoned, too. Only checking the poison flag in the fast path will miss
>>> such folios.
>>>
>>> The race window is tiny, so it's very unlikely to happen, though.
>>> While at it, also add a unlikely prefix.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>>    mm/memory.c | 22 +++++++++++-----------
>>>    1 file changed, 11 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 10ef528a5f44..94a5928e8ace 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4661,10 +4661,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>                goto out;
>>>
>>>        folio = swap_cache_get_folio(entry);
>>> -     if (folio) {
>>> +     if (folio)
>>>                swap_update_readahead(folio, vma, vmf->address);
>>> -             page = folio_file_page(folio, swp_offset(entry));
>>> -     }
>>>        swapcache = folio;
>>>
>>>        if (!folio) {
>>> @@ -4735,20 +4733,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>                ret = VM_FAULT_MAJOR;
>>>                count_vm_event(PGMAJFAULT);
>>>                count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
>>> -             page = folio_file_page(folio, swp_offset(entry));
>>> -     } else if (PageHWPoison(page)) {
>>> -             /*
>>> -              * hwpoisoned dirty swapcache pages are kept for killing
>>> -              * owner processes (which may be unknown at hwpoison time)
>>> -              */
>>> -             ret = VM_FAULT_HWPOISON;
>>> -             goto out_release;
>>>        }
>>>
>>>        ret |= folio_lock_or_retry(folio, vmf);
>>>        if (ret & VM_FAULT_RETRY)
>>>                goto out_release;
>>>
>>> +     page = folio_file_page(folio, swp_offset(entry));
>>>        if (swapcache) {
>>>                /*
>>>                 * Make sure folio_free_swap() or swapoff did not release the
>>> @@ -4761,6 +4752,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>                             page_swap_entry(page).val != entry.val))
>>>                        goto out_page;
>>>
>>> +             if (unlikely(PageHWPoison(page))) {
>>> +                     /*
>>> +                      * hwpoisoned dirty swapcache pages are kept for killing
>>> +                      * owner processes (which may be unknown at hwpoison time)
>>> +                      */
>>> +                     ret = VM_FAULT_HWPOISON;
>>> +                     goto out_page;
>>> +             }
>>> +
>>>                /*
>>>                 * KSM sometimes has to copy on read faults, for example, if
>>>                 * folio->index of non-ksm folios would be nonlinear inside the
>>
>> LGTM, but I was wondering whether we just want to check that even when
> 
> Thanks for checking the patch.
> 
>> we just allocated a fresh folio for simplicity. The check is cheap ...
>>
> 
> Maybe not for now? This patch expects folio_test_swapcache to filter
> out potentially irrelevant folios, so moving the check before that is
> in theory not correct. And the folio_test_swapcache check won't work
> for a freshly allocated folio here...
> 
> I'm planning to remove the whole `if (swapcache)` check in phase 2, as
> all swapins will go through the swap cache. By then, all checks will
> always be applied, and the simplification can be done in a cleaner way.
> 

Fair enough :)

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use
  2025-09-09 14:58     ` Kairui Song
@ 2025-09-09 15:19       ` David Hildenbrand
  2025-09-10 12:56         ` Kairui Song
  0 siblings, 1 reply; 80+ messages in thread
From: David Hildenbrand @ 2025-09-09 15:19 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On 09.09.25 16:58, Kairui Song wrote:
> On Mon, Sep 8, 2025 at 10:08 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>
>>>
>>>                folio_lock(folio);
>>> +             if (!folio_matches_swap_entry(folio, entry)) {
>>> +                     folio_unlock(folio);
>>> +                     folio_put(folio);
>>> +                     continue;
>>> +             }
>>> +
>>
>> I wonder if we should put that into unuse_pte() instead. It checks for
>> other types of races (like the page table entry getting modified) already.
> 
> Doing this earlier here might help to avoid the folio_wait_writeback
> below? 

Why would we care about optimizing that out in that corner case?

> And checking the folio right after locking seems to follow the
> convention more strictly.

I'd just slap it into unuse_pte() where you can return immediately and 
we don't need another duplicated

	folio_unlock(folio);
	folio_put(folio);
	continue;
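
Roughly like this, as a sketch (unuse_pte()'s real parameter list and
body are of course more involved; treat the signature here as an
assumption):

	static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
			     unsigned long addr, swp_entry_t entry,
			     struct folio *folio)
	{
		/* Bail out right away if the folio no longer backs this entry. */
		if (!folio_matches_swap_entry(folio, entry))
			return 0;
		/* ... existing pte re-check and restore logic ... */
		return 0;
	}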

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-05 19:13 ` [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API Kairui Song
                     ` (2 preceding siblings ...)
  2025-09-08 13:45   ` David Hildenbrand
@ 2025-09-10  2:53   ` SeongJae Park
  2025-09-10  2:56     ` Kairui Song
  3 siblings, 1 reply; 80+ messages in thread
From: SeongJae Park @ 2025-09-10  2:53 UTC (permalink / raw)
  To: Kairui Song
  Cc: SeongJae Park, linux-mm, Andrew Morton, Matthew Wilcox,
	Hugh Dickins, Chris Li, Barry Song, Baoquan He, Nhat Pham,
	Kemeng Shi, Baolin Wang, Ying Huang, Johannes Weiner,
	David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
	linux-kernel, Kairui Song

Hi Kairui,

On Sat,  6 Sep 2025 03:13:53 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> From: Kairui Song <kasong@tencent.com>
> 
> Introduce basic swap table infrastructures, which are now just a
> fixed-sized flat array inside each swap cluster, with access wrappers.
> 
> Each cluster contains a swap table of 512 entries. Each table entry is
> an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> a folio type (pointer), or NULL.
> 
> In this first step, it only supports storing a folio or shadow, and it
> is a drop-in replacement for the current swap cache. Convert all swap
> cache users to use the new sets of APIs. Chris Li has been suggesting
> using a new infrastructure for swap cache for better performance, and
> that idea combined well with the swap table as the new backing
> structure. Now the lock contention range is reduced to 2M clusters,
> which is much smaller than the 64M address_space. And we can also drop
> the multiple address_space design.
> 
> All the internal works are done with swap_cache_get_* helpers. Swap
> cache lookup is still lock-less like before, and the helper's contexts
> are same with original swap cache helpers. They still require a pin
> on the swap device to prevent the backing data from being freed.
> 
> Swap cache updates are now protected by the swap cluster lock
> instead of the Xarray lock. This is mostly handled internally, but new
> __swap_cache_* helpers require the caller to lock the cluster. So, a
> few new cluster access and locking helpers are also introduced.
> 
> A fully cluster-based unified swap table can be implemented on top
> of this to take care of all count tracking and synchronization work,
> with dynamic allocation. It should reduce the memory usage while
> making the performance even better.

Thank you for continuing this nice work.  I was unfortunately unable to get
time to review this thoroughly, but I found the issue below.

> 
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
[...]
> --- a/mm/swap.h
> +++ b/mm/swap.h
[...]
> @@ -367,7 +452,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>  static inline pgoff_t folio_index(struct folio *folio)
>  {
>  	if (unlikely(folio_test_swapcache(folio)))
> -		return swap_cache_index(folio->swap);
> +		return swp_offset(folio->swap);
>  	return folio->index;
>  }

This makes the i386 build on my setup fail, like below:

    In file included from /mm/shmem.c:44:
    /mm/swap.h: In function ‘folio_index’:
    /mm/swap.h:462:24: error: implicit declaration of function ‘swp_offset’; did you mean ‘pmd_offset’? [-Werror=implicit-function-declaration]
      462 |                 return swp_offset(folio->swap);
          |                        ^~~~~~~~~~
          |                        pmd_offset
    In file included from /mm/shmem.c:69:
    /include/linux/swapops.h: At top level:
    /include/linux/swapops.h:107:23: error: conflicting types for ‘swp_offset’; have ‘long unsigned int(swp_entry_t)’
      107 | static inline pgoff_t swp_offset(swp_entry_t entry)
          |                       ^~~~~~~~~~
    /mm/swap.h:462:24: note: previous implicit declaration of ‘swp_offset’ with type ‘int()’
      462 |                 return swp_offset(folio->swap);
          |                        ^~~~~~~~~~
    cc1: some warnings being treated as errors

You may be able to reproduce this using my script [1].

I also found that including swapops.h as below fixes this on my setup.  I didn't
read this code thoroughly, so I'm not really sure if it is the right approach, though.

    --- a/mm/swap.h
    +++ b/mm/swap.h
    @@ -3,6 +3,7 @@
     #define _MM_SWAP_H
    
     #include <linux/atomic.h> /* for atomic_long_t */
    +#include <linux/swapops.h>
     struct mempolicy;
     struct swap_iocb;

[1] https://github.com/damonitor/damon-tests/blob/master/corr/tests/build_i386.sh


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API
  2025-09-10  2:53   ` SeongJae Park
@ 2025-09-10  2:56     ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-10  2:56 UTC (permalink / raw)
  To: SeongJae Park
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
	Lorenzo Stoakes, Zi Yan, linux-kernel

On Wed, Sep 10, 2025 at 10:53 AM SeongJae Park <sj@kernel.org> wrote:
>
> Hi Kairui,

Hi SeongJae,

>
> On Sat,  6 Sep 2025 03:13:53 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Introduce basic swap table infrastructures, which are now just a
> > fixed-sized flat array inside each swap cluster, with access wrappers.
> >
> > Each cluster contains a swap table of 512 entries. Each table entry is
> > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > a folio type (pointer), or NULL.
> >
> > In this first step, it only supports storing a folio or shadow, and it
> > is a drop-in replacement for the current swap cache. Convert all swap
> > cache users to use the new sets of APIs. Chris Li has been suggesting
> > using a new infrastructure for swap cache for better performance, and
> > that idea combined well with the swap table as the new backing
> > structure. Now the lock contention range is reduced to 2M clusters,
> > which is much smaller than the 64M address_space. And we can also drop
> > the multiple address_space design.
> >
> > All the internal works are done with swap_cache_get_* helpers. Swap
> > cache lookup is still lock-less like before, and the helper's contexts
> > are same with original swap cache helpers. They still require a pin
> > on the swap device to prevent the backing data from being freed.
> >
> > Swap cache updates are now protected by the swap cluster lock
> > instead of the Xarray lock. This is mostly handled internally, but new
> > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > few new cluster access and locking helpers are also introduced.
> >
> > A fully cluster-based unified swap table can be implemented on top
> > of this to take care of all count tracking and synchronization work,
> > with dynamic allocation. It should reduce the memory usage while
> > making the performance even better.
>
> Thank you for continuing this nice work.  I was unfortunately unable to get
> time to review this thoroughly, but I found the issue below.
>
> >
> > Co-developed-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> [...]
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> [...]
> > @@ -367,7 +452,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> >  static inline pgoff_t folio_index(struct folio *folio)
> >  {
> >       if (unlikely(folio_test_swapcache(folio)))
> > -             return swap_cache_index(folio->swap);
> > +             return swp_offset(folio->swap);
> >       return folio->index;
> >  }
>
> This makes the i386 build on my setup fail, like below:
>
>     In file included from /mm/shmem.c:44:
>     /mm/swap.h: In function ‘folio_index’:
>     /mm/swap.h:462:24: error: implicit declaration of function ‘swp_offset’; did you mean ‘pmd_offset’? [-Werror=implicit-function-declaration]
>       462 |                 return swp_offset(folio->swap);
>           |                        ^~~~~~~~~~
>           |                        pmd_offset
>     In file included from /mm/shmem.c:69:
>     /include/linux/swapops.h: At top level:
>     /include/linux/swapops.h:107:23: error: conflicting types for ‘swp_offset’; have ‘long unsigned int(swp_entry_t)’
>       107 | static inline pgoff_t swp_offset(swp_entry_t entry)
>           |                       ^~~~~~~~~~
>     /mm/swap.h:462:24: note: previous implicit declaration of ‘swp_offset’ with type ‘int()’
>       462 |                 return swp_offset(folio->swap);
>           |                        ^~~~~~~~~~
>     cc1: some warnings being treated as errors
>
> You may be able to reproduce this using my script [1].
>
> I also found that including swapops.h as below fixes this on my setup.  I didn't
> read this code thoroughly, so I'm not really sure if it is the right approach, though.
>
>     --- a/mm/swap.h
>     +++ b/mm/swap.h
>     @@ -3,6 +3,7 @@
>      #define _MM_SWAP_H
>
>      #include <linux/atomic.h> /* for atomic_long_t */
>     +#include <linux/swapops.h>
>      struct mempolicy;
>      struct swap_iocb;
>
> [1] https://github.com/damonitor/damon-tests/blob/master/corr/tests/build_i386.sh

Yes, I also saw a report from the build bot. I tested the !SWAP build
but didn't test the !SWAP !SHMEM build.

Adjusting the include header makes the problem go away. I'll send a V3
soon with this fix included.

Thanks!


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use
  2025-09-09 15:19       ` David Hildenbrand
@ 2025-09-10 12:56         ` Kairui Song
  0 siblings, 0 replies; 80+ messages in thread
From: Kairui Song @ 2025-09-10 12:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
	Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
	Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel

On Tue, Sep 9, 2025 at 11:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 09.09.25 16:58, Kairui Song wrote:
> > On Mon, Sep 8, 2025 at 10:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>
> >>>
> >>>                folio_lock(folio);
> >>> +             if (!folio_matches_swap_entry(folio, entry)) {
> >>> +                     folio_unlock(folio);
> >>> +                     folio_put(folio);
> >>> +                     continue;
> >>> +             }
> >>> +
> >>
> >> I wonder if we should put that into unuse_pte() instead. It checks for
> >> other types of races (like the page table entry getting modified) already.
> >
> > Doing this earlier here might help to avoid the folio_wait_writeback
> > below?
>
> Why would we care about optimizing that out in that corner case?
>
> > And checking the folio right after locking seems to follow the
> > convention more strictly.
>
> I'd just slap it into unuse_pte() where you can return immediately and
> we don't need another duplicated
>
>         folio_unlock(folio);
>         folio_put(folio);
>         continue;

Yeah, removing the duplication is a very good point.

>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2025-09-10 12:57 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-05 19:13 [PATCH v2 00/15] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
2025-09-05 19:13 ` [PATCH v2 01/15] docs/mm: add document for swap table Kairui Song
2025-09-05 23:58   ` Chris Li
2025-09-06 13:31     ` Kairui Song
2025-09-08 12:35   ` Baoquan He
2025-09-08 14:27     ` Kairui Song
2025-09-08 15:06       ` Baoquan He
2025-09-08 15:01     ` Chris Li
2025-09-08 15:09       ` Baoquan He
2025-09-08 15:52         ` Chris Li
2025-09-05 19:13 ` [PATCH v2 02/15] mm, swap: use unified helper for swap cache look up Kairui Song
2025-09-05 23:59   ` Chris Li
2025-09-08 11:43   ` David Hildenbrand
2025-09-05 19:13 ` [PATCH v2 03/15] mm, swap: fix swap cahe index error when retrying reclaim Kairui Song
2025-09-05 22:40   ` Nhat Pham
2025-09-06  6:30     ` Kairui Song
2025-09-06  1:51   ` Chris Li
2025-09-06  6:28     ` Kairui Song
2025-09-06 11:58       ` Chris Li
2025-09-08  3:08   ` Baolin Wang
2025-09-08 11:45   ` David Hildenbrand
2025-09-05 19:13 ` [PATCH v2 04/15] mm, swap: check page poison flag after locking it Kairui Song
2025-09-06  2:00   ` Chris Li
2025-09-08 12:11   ` David Hildenbrand
2025-09-09 14:54     ` Kairui Song
2025-09-09 15:18       ` David Hildenbrand
2025-09-05 19:13 ` [PATCH v2 05/15] mm, swap: always lock and check the swap cache folio before use Kairui Song
2025-09-06  2:12   ` Chris Li
2025-09-06  6:32     ` Kairui Song
2025-09-08 12:18   ` David Hildenbrand
2025-09-09 14:58     ` Kairui Song
2025-09-09 15:19       ` David Hildenbrand
2025-09-10 12:56         ` Kairui Song
2025-09-05 19:13 ` [PATCH v2 06/15] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
2025-09-06  2:13   ` Chris Li
2025-09-08  3:03   ` Baolin Wang
2025-09-05 19:13 ` [PATCH v2 07/15] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-09-06  2:14   ` Chris Li
2025-09-08 12:21   ` David Hildenbrand
2025-09-08 15:01     ` Kairui Song
2025-09-05 19:13 ` [PATCH v2 08/15] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
2025-09-08  3:17   ` Baolin Wang
2025-09-08  9:28     ` Kairui Song
2025-09-05 19:13 ` [PATCH v2 09/15] mm, swap: cleanup swap cache API and add kerneldoc Kairui Song
2025-09-06  5:45   ` Chris Li
2025-09-08  0:11   ` Barry Song
2025-09-08  3:23   ` Baolin Wang
2025-09-08 12:23   ` David Hildenbrand
2025-09-05 19:13 ` [PATCH v2 10/15] mm, swap: wrap swap cache replacement with a helper Kairui Song
2025-09-06  7:09   ` Chris Li
2025-09-08  3:41   ` Baolin Wang
2025-09-08 10:44     ` Kairui Song
2025-09-09  1:18       ` Baolin Wang
2025-09-08 12:30   ` David Hildenbrand
2025-09-08 14:20     ` Kairui Song
2025-09-08 14:39       ` David Hildenbrand
2025-09-08 14:49         ` Kairui Song
2025-09-05 19:13 ` [PATCH v2 11/15] mm, swap: use the swap table for the swap cache and switch API Kairui Song
2025-09-06 15:28   ` Chris Li
2025-09-08 15:38     ` Kairui Song
2025-09-07 12:55   ` Klara Modin
2025-09-08 14:34     ` Kairui Song
2025-09-08 15:00       ` Klara Modin
2025-09-08 15:10         ` Kairui Song
2025-09-08 13:45   ` David Hildenbrand
2025-09-08 15:14     ` Kairui Song
2025-09-08 15:32       ` Kairui Song
2025-09-10  2:53   ` SeongJae Park
2025-09-10  2:56     ` Kairui Song
2025-09-05 19:13 ` [PATCH v2 12/15] mm, swap: mark swap address space ro and add context debug check Kairui Song
2025-09-06 15:35   ` Chris Li
2025-09-08 13:10   ` David Hildenbrand
2025-09-05 19:13 ` [PATCH v2 13/15] mm, swap: remove contention workaround for swap cache Kairui Song
2025-09-06 15:30   ` Chris Li
2025-09-08 13:12   ` David Hildenbrand
2025-09-05 19:13 ` [PATCH v2 14/15] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-09-06 15:45   ` Chris Li
2025-09-08 14:58     ` Kairui Song
2025-09-05 19:13 ` [PATCH v2 15/15] mm, swap: use a single page for swap table when the size fits Kairui Song
2025-09-06 15:48   ` Chris Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox