* [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile
@ 2026-02-19 23:42 Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 01/15] mm: move thp_limit_gfp_mask to header Kairui Song via B4 Relay
` (15 more replies)
0 siblings, 16 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
NOTE for an RFC quality series: Swap table P4 is patches 1 - 12, and the
dynamic ghost file is patches 13 - 15. I'm putting them together as an RFC
for easier review and discussion. Swap table P4 is stable and good to merge
if we are OK with a few memcg reparenting behavior changes (there is also a
solution if we are not); the dynamic ghost swap is still a minimal proof of
concept. See patch 15 for more details, and see below for the swap table
phase IV cover letter (nice performance gain and memory savings).
This is based on the latest mm-unstable, swap table P3 [1], and patches
[2], [3] and [4]. I'm sending this out early, as it might help us get a
clearer picture of the ongoing efforts and make the discussions easier.
Summary: With this approach, we can have an infinitely or dynamically
large ghost swapfile, which could be identical to "virtual swap", and
support every feature we need while being *runtime configurable* with
*zero overhead* for plain swap, keeping the infrastructure unified. It is
also highly compatible with YoungJun's swap tiering [5], and with other
ideas like swap table compaction and swapops, as it aligns with a few
proposals [6] [7] [8] [9] [10].
In the past two years, most efforts have focused on the swap
infrastructure, and we have made tremendous gains in performance, kept
the memory usage reasonable or lower, and greatly cleaned up and
simplified the API and conventions.
Now the infrastructure is almost ready. After P4, implementing an
infinitely or dynamically large swapfile can be done in a flexible and
easy-to-maintain way: the code change is minimal and progressive for
review, and it makes future optimizations like swap table compaction
doable too, since the infrastructure is the same for all swaps.
The dynamic swap file now uses an XArray for the cluster info, and
inside the cluster, it's all the same swap allocator, swap table, and
existing infrastructure. A virtual table is available for any extra
data or usage. See below for the benefits and what we can achieve.
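To make the layering concrete, here is a rough conceptual sketch of the
cluster lookup split. It is purely illustrative: the SWP_GHOST flag, the
cluster_xa field and the helper name are made up for this example and
not taken from the patches; only the idea of an XArray-backed cluster
lookup for ghost / virtual devices is.

	/*
	 * Conceptual sketch only: plain devices keep the flat cluster array,
	 * ghost / virtual devices look clusters up in an XArray, and
	 * everything below the cluster (allocator, swap table) stays shared.
	 */
	static struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
							   unsigned long offset)
	{
		unsigned long idx = offset / SWAPFILE_CLUSTER;

		if (si->flags & SWP_GHOST)			/* hypothetical flag */
			return xa_load(&si->cluster_xa, idx);	/* hypothetical field */
		return &si->cluster_info[idx];
	}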
Huge thanks to Chris Li for the layered swap table and ghost swapfile
idea, without whom the work here couldn't be achieved. Also, thanks to
Nhat for pushing and suggesting using an XArray for the swapfile [11]
for dynamic sizing. I was originally planning to use a dynamic cluster
array, which requires a bit more adaptation, cleanup, and convention
changes. But during the discussion there, I got the inspiration that an
XArray can be used as the intermediate step, making this approach
doable with minimal changes. Keeping it in the future might not hurt
either, as the XArray is limited to ghost / virtual files, so plain
swaps won't have any extra lookup overhead or a higher risk of swapout
allocation failure.
I'm fully open to suggestions on naming or API strategy, and others are
highly welcome to keep the work going using this flexible approach.
Following this approach, we will get all of the following progressively
(some are already or almost there):
- 8 bytes per slot memory usage, when using only plain swap.
- And the memory usage can be reduced to 3 or only 1 byte.
- 16 bytes per slot memory usage, when using ghost / virtual zswap.
- Zswap can just use ci_dyn->virtual_table to free up its content
completely.
- And the memory usage can be reduced to 11 or 8 bytes using the same
code above.
- 24 bytes per slot, only if reverse mapping is in use.
- Minimal code review and maintenance burden. All layers use the exact
same infrastructure for metadata / allocation / synchronization, making
all APIs and conventions consistent and easy to maintain.
- Writeback, migration and compaction are easily supportable since both
reverse mapping and reallocation are prepared. We just need a
folio_realloc_swap to allocate new entries for the existing entry, and
fill the swap table with a reserve map entry.
- Fast swapoff: Just read into ghost / virtual swap cache.
- Zero static data (mostly due to swap table P4); even the clusters are
dynamic (if using an XArray, only for the ghost / virtual swap file).
- So we can have an infinitely sized swap space with no static data
overhead.
- Everything is runtime configurable and high-performance. An
incompressible workload or an offline batch workload can directly use a
plain or remote swap for the lowest interference, lowest memory usage, or
best performance.
- Highly compatible with YoungJun's swap tiering; even the ghost / virtual
file can be just a tier. For example, if you have a huge NBD that doesn't
care about fragmentation and compression, or the workload is
incompressible, setting the workload to use the NBD's tier will give you
only 8 bytes of overhead per slot and peak performance, bypassing
everything. Meanwhile, other workloads or cgroups can still use the ghost
layer with compression or defragmentation at 16 bytes (zswap only) or
24 bytes (ghost swap with physical writeback) of overhead.
- No force or breaking change to any existing allocation, priority, swap
setup, or reclaim strategy. Ghost / virtual swap can be enabled or
disabled using swapon / swapoff.
And if these options are considered too complex to set up and maintain,
we can allow only one ghost / virtual file, make it infinitely large, and
make it the default and top tier. That achieves the same thing as a
virtual swap space, but with far fewer LOC changed, while remaining
runtime optional.
Currently, the dynamic ghost files are just reported as ordinary swap
files in /proc/swaps, and we can have multiple ones, so users will have a
full view of what's going on. This is a design decision that is very easy
to change, and I'm open to ideas about how we should present this to
users. E.g., hiding it would make it feel more "virtual", but I don't
think that's a good idea.
The size of the swapfile (si->max) is now just a number, which could be
made changeable at runtime if we come up with a proper way to expose
that; it might also need an audit of a few remaining users. But right
now, we can already easily have a huge swap device with no overhead, for
example:
free -m
total used free shared buff/cache available
Mem: 1465 250 927 1 356 1215
Swap: 15269887 0 15269887
And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
/dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
users, including ZRAM, won't observe any change.
===
Original cover letter for swap table phase IV:
This series unifies the allocation and charging process of anon and shmem,
provides better synchronization, and consolidates cgroup tracking, hence
dropping the cgroup array and improving mTHP performance by about 15%.
The test is still a kernel build under heavy pressure: mTHP 256kB
enabled, on an EPYC 7K62 using 16G ZRAM, make -j48 with a 1G memory
limit, 12 test runs:
Before: 2215.55s system, 2:53.03 elapsed
After: 1852.14s system, 2:41.44 elapsed (16.4% faster system time)
In some workloads, the speed gain is more than that, since this reduces
memory thrashing, so even IO-bound work could benefit a lot. I also no
longer see any "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying
PF" messages, which showed up from time to time before this series.
Now, the swap cache layer ensures a folio will be the exclusive owner of
the swap slot before charging it, which leads to much less thrashing
under pressure.
And besides, the swap cgroup static array is gone, so for example, mounting
a 1TB swap device saves about 512MB of memory:
Before:
total used free shared buff/cache available
Mem: 1465 854 331 1 347 610
Swap: 1048575 0 1048575
After:
total used free shared buff/cache available
Mem: 1465 332 838 1 363 1133
Swap: 1048575 0 1048575
That's ~512M of memory saved; we now have close to zero static overhead.
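As a rough sanity check on that number (assuming 4 KiB pages and the old
swap cgroup array's 2 bytes per slot):

	1 TiB / 4 KiB           =  268,435,456 swap slots
	268,435,456 * 2 bytes  ~=  512 MiB of static swap cgroup array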
Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-0-f4e34be021a7@tencent.com/ [1]
Link: https://lore.kernel.org/linux-mm/20260213-memcg-privid-v1-1-d8cb7afcf831@tencent.com/ [2]
Link: https://lore.kernel.org/linux-mm/20260211-shmem-swap-gfp-v1-1-e9781099a861@tencent.com/ [3]
Link: https://lore.kernel.org/linux-mm/20260216-hibernate-perf-v4-0-1ba9f0bf1ec9@tencent.com/ [4]
Link: https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/ [5]
Link: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/ [6]
Link: https://lwn.net/Articles/974587/ [7]
Link: https://lwn.net/Articles/932077/ [8]
Link: https://lwn.net/Articles/1016136/ [9]
Link: https://lore.kernel.org/linux-mm/20260208215839.87595-1-nphamcs@gmail.com/ [10]
Link: https://lore.kernel.org/linux-mm/CAKEwX=OUni7PuUqGQUhbMDtErurFN_i=1RgzyQsNXy4LABhXoA@mail.gmail.com/ [11]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Chris Li (1):
mm: ghost swapfile support for zswap
Kairui Song (14):
mm: move thp_limit_gfp_mask to header
mm, swap: simplify swap_cache_alloc_folio
mm, swap: move conflict checking logic out of swap cache adding
mm, swap: add support for large order folios in swap cache directly
mm, swap: unify large folio allocation
memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead
memcg, swap: defer the recording of memcg info and reparent flexibly
mm, swap: store and check memcg info in the swap table
mm, swap: support flexible batch freeing of slots in different memcg
mm, swap: always retrieve memcg id from swap table
mm/swap, memcg: remove swap cgroup array
mm, swap: merge zeromap into swap table
mm, swap: add a special device for ghost swap setup
mm, swap: allocate cluster dynamically for ghost swapfile
MAINTAINERS | 1 -
drivers/char/mem.c | 39 ++++
include/linux/huge_mm.h | 24 +++
include/linux/memcontrol.h | 12 +-
include/linux/swap.h | 30 ++-
include/linux/swap_cgroup.h | 47 -----
mm/Makefile | 3 -
mm/internal.h | 25 ++-
mm/memcontrol-v1.c | 78 ++++----
mm/memcontrol.c | 119 ++++++++++--
mm/memory.c | 89 ++-------
mm/page_io.c | 46 +++--
mm/shmem.c | 122 +++---------
mm/swap.h | 122 +++++-------
mm/swap_cgroup.c | 172 ----------------
mm/swap_state.c | 464 ++++++++++++++++++++++++--------------------
mm/swap_table.h | 105 ++++++++--
mm/swapfile.c | 278 ++++++++++++++++++++------
mm/vmscan.c | 7 +-
mm/workingset.c | 16 +-
mm/zswap.c | 29 +--
21 files changed, 977 insertions(+), 851 deletions(-)
---
base-commit: 4750368e2cd365ac1e02c6919013c8871f35d8f9
change-id: 20260111-swap-table-p4-98ee92baa7c4
Best regards,
--
Kairui Song <kasong@tencent.com>
* [PATCH RFC 01/15] mm: move thp_limit_gfp_mask to header
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 02/15] mm, swap: simplify swap_cache_alloc_folio Kairui Song via B4 Relay
` (14 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
No feature change; this will be used later.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/huge_mm.h | 24 ++++++++++++++++++++++++
mm/shmem.c | 30 +++---------------------------
2 files changed, 27 insertions(+), 27 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfde..d522e798822d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -237,6 +237,30 @@ static inline bool thp_vma_suitable_order(struct vm_area_struct *vma,
return true;
}
+/*
+ * Make sure huge_gfp is always more limited than limit_gfp.
+ * Some of the flags set permissions, while others set limitations.
+ */
+static inline gfp_t thp_limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
+{
+ gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
+ gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
+ gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
+ gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
+
+ /* Allow allocations only from the originally specified zones. */
+ result |= zoneflags;
+
+ /*
+ * Minimize the result gfp by taking the union with the deny flags,
+ * and the intersection of the allow flags.
+ */
+ result |= (limit_gfp & denyflags);
+ result |= (huge_gfp & limit_gfp) & allowflags;
+
+ return result;
+}
+
/*
* Filter the bitfield of input orders to the ones suitable for use in the vma.
* See thp_vma_suitable_order().
diff --git a/mm/shmem.c b/mm/shmem.c
index b976b40fd442..9f054b5aae8e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1788,30 +1788,6 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
return folio;
}
-/*
- * Make sure huge_gfp is always more limited than limit_gfp.
- * Some of the flags set permissions, while others set limitations.
- */
-static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
-{
- gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
- gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
- gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
- gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
-
- /* Allow allocations only from the originally specified zones. */
- result |= zoneflags;
-
- /*
- * Minimize the result gfp by taking the union with the deny flags,
- * and the intersection of the allow flags.
- */
- result |= (limit_gfp & denyflags);
- result |= (huge_gfp & limit_gfp) & allowflags;
-
- return result;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
bool shmem_hpage_pmd_enabled(void)
{
@@ -2062,7 +2038,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
non_swapcache_batch(entry, nr_pages) != nr_pages)
goto fallback;
- alloc_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
+ alloc_gfp = thp_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
}
retry:
new = shmem_alloc_folio(alloc_gfp, order, info, index);
@@ -2138,7 +2114,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
if (nr_pages > 1) {
gfp_t huge_gfp = vma_thp_gfp_mask(vma);
- gfp = limit_gfp_mask(huge_gfp, gfp);
+ gfp = thp_limit_gfp_mask(huge_gfp, gfp);
}
#endif
@@ -2545,7 +2521,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
gfp_t huge_gfp;
huge_gfp = vma_thp_gfp_mask(vma);
- huge_gfp = limit_gfp_mask(huge_gfp, gfp);
+ huge_gfp = thp_limit_gfp_mask(huge_gfp, gfp);
folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
inode, index, fault_mm, orders);
if (!IS_ERR(folio)) {
--
2.53.0
* [PATCH RFC 02/15] mm, swap: simplify swap_cache_alloc_folio
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 01/15] mm: move thp_limit_gfp_mask to header Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 03/15] mm, swap: move conflict checking logic out of swap cache adding Kairui Song via B4 Relay
` (13 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
Instead of trying to return the existing folio if the entry is already
cached, just return an error code if the allocation failed, and
introduce proper wrappers that handle the allocation failure in
different ways.
For async swapin and readahead, the caller only wants to ensure that a
swap-in read is issued if the allocation succeeded; for zswap writeback,
the caller will just abort if the allocation failed because the entry is
gone or already cached.
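Condensed, the new caller pattern for the cached swapin path looks like
this (lifted from the swap_cache_read_folio() helper added below, with
error handling trimmed):

	do {
		folio = swap_cache_get_folio(entry);
		if (folio)
			return folio;
		folio = swap_cache_alloc_folio(entry, gfp, mpol, ilx);
	} while (PTR_ERR(folio) == -EEXIST);

	if (IS_ERR_OR_NULL(folio))
		return NULL;
	/* New allocation won: issue the read on the locked folio. */
	swap_read_folio(folio, plug);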
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 3 +-
mm/swap_state.c | 177 +++++++++++++++++++++++++++++---------------------------
mm/zswap.c | 15 ++---
3 files changed, 98 insertions(+), 97 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index a77016f2423b..ad8b17a93758 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -281,8 +281,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
- struct mempolicy *mpol, pgoff_t ilx,
- bool *alloced);
+ struct mempolicy *mpol, pgoff_t ilx);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_add_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 32d9d877bda8..53fa95059012 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -459,41 +459,24 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
* All swap slots covered by the folio must have a non-zero swap count.
*
* Context: Caller must protect the swap device with reference count or locks.
- * Return: Returns the folio being added on success. Returns the existing folio
- * if @entry is already cached. Returns NULL if raced with swapin or swapoff.
+ * Return: 0 if success, error code if failed.
*/
-static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
- struct folio *folio,
- gfp_t gfp, bool charged)
+static int __swap_cache_prepare_and_add(swp_entry_t entry,
+ struct folio *folio,
+ gfp_t gfp, bool charged)
{
- struct folio *swapcache = NULL;
void *shadow;
int ret;
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
- for (;;) {
- ret = swap_cache_add_folio(folio, entry, &shadow);
- if (!ret)
- break;
-
- /*
- * Large order allocation needs special handling on
- * race: if a smaller folio exists in cache, swapin needs
- * to fallback to order 0, and doing a swap cache lookup
- * might return a folio that is irrelevant to the faulting
- * entry because @entry is aligned down. Just return NULL.
- */
- if (ret != -EEXIST || folio_test_large(folio))
- goto failed;
-
- swapcache = swap_cache_get_folio(entry);
- if (swapcache)
- goto failed;
- }
+ ret = swap_cache_add_folio(folio, entry, &shadow);
+ if (ret)
+ goto failed;
if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
swap_cache_del_folio(folio);
+ ret = -ENOMEM;
goto failed;
}
@@ -503,11 +486,11 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
/* Caller will initiate read into locked folio */
folio_add_lru(folio);
- return folio;
+ return 0;
failed:
folio_unlock(folio);
- return swapcache;
+ return ret;
}
/**
@@ -516,7 +499,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
* @gfp_mask: memory allocation flags
* @mpol: NUMA memory allocation policy to be applied
* @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
- * @new_page_allocated: sets true if allocation happened, false otherwise
*
* Allocate a folio in the swap cache for one swap slot, typically before
* doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
@@ -524,18 +506,40 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
* Currently only supports order 0.
*
* Context: Caller must protect the swap device with reference count or locks.
- * Return: Returns the existing folio if @entry is cached already. Returns
- * NULL if failed due to -ENOMEM or @entry have a swap count < 1.
+ * Return: Returns the folio if allocation succeeded and folio is added to
+ * swap cache. Returns error code if allocation failed due to race.
*/
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx,
- bool *new_page_allocated)
+ struct mempolicy *mpol, pgoff_t ilx)
+{
+ int ret;
+ struct folio *folio;
+
+ /* Allocate a new folio to be added into the swap cache. */
+ folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+ if (!folio)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * Try adding the new folio; it returns -EEXIST if already cached,
+ * since folio is order 0.
+ */
+ ret = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
+ if (ret) {
+ folio_put(folio);
+ return ERR_PTR(ret);
+ }
+
+ return folio;
+}
+
+static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
+ struct mempolicy *mpol, pgoff_t ilx,
+ struct swap_iocb **plug, bool readahead)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
- struct folio *result = NULL;
- *new_page_allocated = false;
/* Check the swap cache again for readahead path. */
folio = swap_cache_get_folio(entry);
if (folio)
@@ -545,17 +549,24 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
if (!swap_entry_swapped(si, entry))
return NULL;
- /* Allocate a new folio to be added into the swap cache. */
- folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
- if (!folio)
+ do {
+ folio = swap_cache_get_folio(entry);
+ if (folio)
+ return folio;
+
+ folio = swap_cache_alloc_folio(entry, gfp, mpol, ilx);
+ } while (PTR_ERR(folio) == -EEXIST);
+
+ if (IS_ERR_OR_NULL(folio))
return NULL;
- /* Try add the new folio, returns existing folio or NULL on failure. */
- result = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
- if (result == folio)
- *new_page_allocated = true;
- else
- folio_put(folio);
- return result;
+
+ swap_read_folio(folio, plug);
+ if (readahead) {
+ folio_set_readahead(folio);
+ count_vm_event(SWAP_RA);
+ }
+
+ return folio;
}
/**
@@ -574,15 +585,35 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
*/
struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
{
+ int ret;
struct folio *swapcache;
pgoff_t offset = swp_offset(entry);
unsigned long nr_pages = folio_nr_pages(folio);
entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
- swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true);
- if (swapcache == folio)
- swap_read_folio(folio, NULL);
- return swapcache;
+ for (;;) {
+ ret = __swap_cache_prepare_and_add(entry, folio, 0, true);
+ if (!ret) {
+ swap_read_folio(folio, NULL);
+ break;
+ }
+
+ /*
+ * Large order allocation needs special handling on
+ * race: if a smaller folio exists in cache, swapin needs
+ * to fallback to order 0, and doing a swap cache lookup
+ * might return a folio that is irrelevant to the faulting
+ * entry because @entry is aligned down. Just return NULL.
+ */
+ if (ret != -EEXIST || nr_pages > 1)
+ return NULL;
+
+ swapcache = swap_cache_get_folio(entry);
+ if (swapcache)
+ return swapcache;
+ }
+
+ return folio;
}
/*
@@ -596,7 +627,6 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct swap_iocb **plug)
{
struct swap_info_struct *si;
- bool page_allocated;
struct mempolicy *mpol;
pgoff_t ilx;
struct folio *folio;
@@ -606,13 +636,9 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
return NULL;
mpol = get_vma_policy(vma, addr, 0, &ilx);
- folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated);
+ folio = swap_cache_read_folio(entry, gfp_mask, mpol, ilx, plug, false);
mpol_cond_put(mpol);
- if (page_allocated)
- swap_read_folio(folio, plug);
-
put_swap_device(si);
return folio;
}
@@ -697,7 +723,7 @@ static unsigned long swapin_nr_pages(unsigned long offset)
* are fairly likely to have been swapped out from the same node.
*/
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx)
+ struct mempolicy *mpol, pgoff_t ilx)
{
struct folio *folio;
unsigned long entry_offset = swp_offset(entry);
@@ -707,7 +733,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
- bool page_allocated;
+ swp_entry_t ra_entry;
mask = swapin_nr_pages(offset) - 1;
if (!mask)
@@ -724,18 +750,11 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
- folio = swap_cache_alloc_folio(
- swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
- &page_allocated);
+ ra_entry = swp_entry(swp_type(entry), offset);
+ folio = swap_cache_read_folio(ra_entry, gfp_mask, mpol, ilx,
+ &splug, offset != entry_offset);
if (!folio)
continue;
- if (page_allocated) {
- swap_read_folio(folio, &splug);
- if (offset != entry_offset) {
- folio_set_readahead(folio);
- count_vm_event(SWAP_RA);
- }
- }
folio_put(folio);
}
blk_finish_plug(&plug);
@@ -743,11 +762,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
- folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated);
- if (unlikely(page_allocated))
- swap_read_folio(folio, NULL);
- return folio;
+ return swap_cache_read_folio(entry, gfp_mask, mpol, ilx, NULL, false);
}
static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
@@ -813,8 +828,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pte_t *pte = NULL, pentry;
int win;
unsigned long start, end, addr;
- pgoff_t ilx;
- bool page_allocated;
+ pgoff_t ilx = targ_ilx;
win = swap_vma_ra_win(vmf, &start, &end);
if (win == 1)
@@ -848,19 +862,12 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
if (!si)
continue;
}
- folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated);
+ folio = swap_cache_read_folio(entry, gfp_mask, mpol, ilx,
+ &splug, addr != vmf->address);
if (si)
put_swap_device(si);
if (!folio)
continue;
- if (page_allocated) {
- swap_read_folio(folio, &splug);
- if (addr != vmf->address) {
- folio_set_readahead(folio);
- count_vm_event(SWAP_RA);
- }
- }
folio_put(folio);
}
if (pte)
@@ -870,10 +877,8 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
lru_add_drain();
skip:
/* The folio was likely read above, so no need for plugging here */
- folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
- &page_allocated);
- if (unlikely(page_allocated))
- swap_read_folio(folio, NULL);
+ folio = swap_cache_read_folio(targ_entry, gfp_mask, mpol, targ_ilx,
+ NULL, false);
return folio;
}
diff --git a/mm/zswap.c b/mm/zswap.c
index af3f0fbb0558..f3aa83a99636 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -992,7 +992,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
pgoff_t offset = swp_offset(swpentry);
struct folio *folio;
struct mempolicy *mpol;
- bool folio_was_allocated;
struct swap_info_struct *si;
int ret = 0;
@@ -1003,22 +1002,20 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
mpol = get_task_policy(current);
folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated);
+ NO_INTERLEAVE_INDEX);
put_swap_device(si);
- if (!folio)
- return -ENOMEM;
/*
+ * Swap cache allocation might fail due to OOM, a raced free, or an
+ * existing folio added by a concurrent swapin.
* Found an existing folio, we raced with swapin or concurrent
* shrinker. We generally writeback cold folios from zswap, and
* swapin means the folio just became hot, so skip this folio.
* For unlikely concurrent shrinker case, it will be unlinked
* and freed when invalidated by the concurrent shrinker anyway.
*/
- if (!folio_was_allocated) {
- ret = -EEXIST;
- goto out;
- }
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
/*
* folio is locked, and the swapcache is now secured against
@@ -1058,7 +1055,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
__swap_writepage(folio, NULL);
out:
- if (ret && ret != -EEXIST) {
+ if (ret) {
swap_cache_del_folio(folio);
folio_unlock(folio);
}
--
2.53.0
* [PATCH RFC 03/15] mm, swap: move conflict checking logic out of swap cache adding
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 01/15] mm: move thp_limit_gfp_mask to header Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 02/15] mm, swap: simplify swap_cache_alloc_folio Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 04/15] mm, swap: add support for large order folios in swap cache directly Kairui Song via B4 Relay
` (12 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
No feature change; this makes later commits easier to review.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap_state.c | 55 ++++++++++++++++++++++++++++++-------------------------
1 file changed, 30 insertions(+), 25 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 53fa95059012..1e340faea9ac 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -137,6 +137,28 @@ void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
+static int __swap_cache_add_check(struct swap_cluster_info *ci,
+ unsigned int ci_off, unsigned int nr,
+ void **shadow)
+{
+ unsigned int ci_end = ci_off + nr;
+ unsigned long old_tb;
+
+ if (unlikely(!ci->table))
+ return -ENOENT;
+ do {
+ old_tb = __swap_table_get(ci, ci_off);
+ if (unlikely(swp_tb_is_folio(old_tb)))
+ return -EEXIST;
+ if (unlikely(!__swp_tb_get_count(old_tb)))
+ return -ENOENT;
+ if (swp_tb_is_shadow(old_tb))
+ *shadow = swp_tb_to_shadow(old_tb);
+ } while (++ci_off < ci_end);
+
+ return 0;
+}
+
void __swap_cache_add_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry)
{
@@ -179,43 +201,26 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
{
int err;
void *shadow = NULL;
- unsigned long old_tb;
+ unsigned int ci_off;
struct swap_info_struct *si;
struct swap_cluster_info *ci;
- unsigned int ci_start, ci_off, ci_end;
unsigned long nr_pages = folio_nr_pages(folio);
si = __swap_entry_to_info(entry);
- ci_start = swp_cluster_offset(entry);
- ci_end = ci_start + nr_pages;
- ci_off = ci_start;
ci = swap_cluster_lock(si, swp_offset(entry));
- if (unlikely(!ci->table)) {
- err = -ENOENT;
- goto failed;
+ ci_off = swp_cluster_offset(entry);
+ err = __swap_cache_add_check(ci, ci_off, nr_pages, &shadow);
+ if (err) {
+ swap_cluster_unlock(ci);
+ return err;
}
- do {
- old_tb = __swap_table_get(ci, ci_off);
- if (unlikely(swp_tb_is_folio(old_tb))) {
- err = -EEXIST;
- goto failed;
- }
- if (unlikely(!__swp_tb_get_count(old_tb))) {
- err = -ENOENT;
- goto failed;
- }
- if (swp_tb_is_shadow(old_tb))
- shadow = swp_tb_to_shadow(old_tb);
- } while (++ci_off < ci_end);
+
__swap_cache_add_folio(ci, folio, entry);
swap_cluster_unlock(ci);
if (shadowp)
*shadowp = shadow;
- return 0;
-failed:
- swap_cluster_unlock(ci);
- return err;
+ return 0;
}
/**
--
2.53.0
* [PATCH RFC 04/15] mm, swap: add support for large order folios in swap cache directly
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (2 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 03/15] mm, swap: move conflict checking logic out of swap cache adding Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 05/15] mm, swap: unify large folio allocation Kairui Song via B4 Relay
` (11 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
To make it possible to allocate large folios directly in swap cache, let
swap_cache_alloc_folio handle larger orders too.
This slightly changes how allocation is synchronized. Now, whoever first
successfully allocates a folio in the swap cache will be the one who
charges it and performs the swap-in. A raced swapin now avoids a
redundant charge and just waits for the swap-in to finish.
Large order fallback is also moved to the swap cache layer. This should
make the fallback process less racy, too.
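Condensed, the new allocation path added below works like this (taken
from __swap_cache_alloc(), with the vmf-based allocation branch, the
charge step and some error handling trimmed):

	/* 1. Check the aligned range; -EBUSY means only the batch conflicts. */
	spin_lock(&ci->lock);
	err = __swap_cache_check_batch(ci, ci_off, ci_targ, nr_pages, &shadow);
	spin_unlock(&ci->lock);
	if (unlikely(err))
		return ERR_PTR(err);

	/* 2. Allocate the folio outside the cluster lock. */
	folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id());

	/* 3. Re-check under the lock; only install the folio if nothing changed. */
	spin_lock(&ci->lock);
	err = __swap_cache_check_batch(ci, ci_off, ci_targ, nr_pages, &shadow_check);
	if (unlikely(err) || shadow_check != shadow) {
		spin_unlock(&ci->lock);
		folio_put(folio);
		return ERR_PTR(err ? err : -EAGAIN);	/* shadow changed: retry */
	}
	__swap_cache_add_folio(ci, folio, entry);
	spin_unlock(&ci->lock);

	/* 4. The winner then charges the folio and starts the read; the
	 *    swap_cache_alloc_folio() wrapper falls back to a lower order
	 *    only on -EBUSY. */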
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 3 +-
mm/swap_state.c | 193 +++++++++++++++++++++++++++++++++++++++++---------------
mm/zswap.c | 2 +-
3 files changed, 145 insertions(+), 53 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index ad8b17a93758..6774af10a943 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -280,7 +280,8 @@ bool swap_cache_has_folio(swp_entry_t entry);
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_del_folio(struct folio *folio);
-struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
+struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
+ unsigned long orders, struct vm_fault *vmf,
struct mempolicy *mpol, pgoff_t ilx);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_add_folio(struct swap_cluster_info *ci,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1e340faea9ac..e32b06a1f229 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -137,26 +137,39 @@ void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
-static int __swap_cache_add_check(struct swap_cluster_info *ci,
- unsigned int ci_off, unsigned int nr,
- void **shadow)
+static int __swap_cache_check_batch(struct swap_cluster_info *ci,
+ unsigned int ci_off, unsigned int ci_targ,
+ unsigned int nr, void **shadowp)
{
unsigned int ci_end = ci_off + nr;
unsigned long old_tb;
if (unlikely(!ci->table))
return -ENOENT;
+
do {
old_tb = __swap_table_get(ci, ci_off);
- if (unlikely(swp_tb_is_folio(old_tb)))
- return -EEXIST;
- if (unlikely(!__swp_tb_get_count(old_tb)))
- return -ENOENT;
+ if (unlikely(swp_tb_is_folio(old_tb)) ||
+ unlikely(!__swp_tb_get_count(old_tb)))
+ break;
if (swp_tb_is_shadow(old_tb))
- *shadow = swp_tb_to_shadow(old_tb);
+ *shadowp = swp_tb_to_shadow(old_tb);
} while (++ci_off < ci_end);
- return 0;
+ if (likely(ci_off == ci_end))
+ return 0;
+
+ /*
+ * If the target slot itself is not suitable for adding swap cache,
+ * return -EEXIST or -ENOENT. If only the batch is unsuitable, it could
+ * be a race with a concurrent free or cache add, so return -EBUSY.
+ */
+ old_tb = __swap_table_get(ci, ci_targ);
+ if (swp_tb_is_folio(old_tb))
+ return -EEXIST;
+ if (!__swp_tb_get_count(old_tb))
+ return -ENOENT;
+ return -EBUSY;
}
void __swap_cache_add_folio(struct swap_cluster_info *ci,
@@ -209,7 +222,7 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
si = __swap_entry_to_info(entry);
ci = swap_cluster_lock(si, swp_offset(entry));
ci_off = swp_cluster_offset(entry);
- err = __swap_cache_add_check(ci, ci_off, nr_pages, &shadow);
+ err = __swap_cache_check_batch(ci, ci_off, ci_off, nr_pages, &shadow);
if (err) {
swap_cluster_unlock(ci);
return err;
@@ -223,6 +236,124 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
return 0;
}
+static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
+ swp_entry_t targ_entry, gfp_t gfp,
+ unsigned int order, struct vm_fault *vmf,
+ struct mempolicy *mpol, pgoff_t ilx)
+{
+ int err;
+ swp_entry_t entry;
+ struct folio *folio;
+ void *shadow = NULL, *shadow_check = NULL;
+ unsigned long address, nr_pages = 1 << order;
+ unsigned int ci_off, ci_targ = swp_cluster_offset(targ_entry);
+
+ entry.val = round_down(targ_entry.val, nr_pages);
+ ci_off = round_down(ci_targ, nr_pages);
+
+ /* First check if the range is available */
+ spin_lock(&ci->lock);
+ err = __swap_cache_check_batch(ci, ci_off, ci_targ, nr_pages, &shadow);
+ spin_unlock(&ci->lock);
+ if (unlikely(err))
+ return ERR_PTR(err);
+
+ if (vmf) {
+ if (order)
+ gfp = thp_limit_gfp_mask(vma_thp_gfp_mask(vmf->vma), gfp);
+ address = round_down(vmf->address, PAGE_SIZE << order);
+ folio = vma_alloc_folio(gfp, order, vmf->vma, address);
+ } else {
+ folio = folio_alloc_mpol(gfp, order, mpol, ilx, numa_node_id());
+ }
+ if (unlikely(!folio))
+ return ERR_PTR(-ENOMEM);
+
+ /* Double check the range is still not in conflict */
+ spin_lock(&ci->lock);
+ err = __swap_cache_check_batch(ci, ci_off, ci_targ, nr_pages, &shadow_check);
+ if (unlikely(err) || shadow_check != shadow) {
+ spin_unlock(&ci->lock);
+ folio_put(folio);
+
+ /* If shadow changed, just try again */
+ return ERR_PTR(err ? err : -EAGAIN);
+ }
+
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
+ __swap_cache_add_folio(ci, folio, entry);
+ spin_unlock(&ci->lock);
+
+ if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL,
+ gfp, entry)) {
+ spin_lock(&ci->lock);
+ __swap_cache_del_folio(ci, folio, shadow);
+ spin_unlock(&ci->lock);
+ folio_unlock(folio);
+ folio_put(folio);
+ count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* For memsw accounting, swap is uncharged when the folio is added to swap cache */
+ memcg1_swapin(entry, 1 << order);
+ if (shadow)
+ workingset_refault(folio, shadow);
+
+ /* Caller will initiate read into the locked folio */
+ folio_add_lru(folio);
+
+ return folio;
+}
+
+/**
+ * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
+ * @targ_entry: swap entry indicating the target slot
+ * @orders: allocation orders
+ * @vmf: fault information
+ * @gfp_mask: memory allocation flags
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ *
+ * Allocate a folio in the swap cache for one swap slot, typically before
+ * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
+ * @targ_entry must have a non-zero swap count (swapped out).
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the folio if allocation succeeded and folio is added to
+ * swap cache. Returns error code if allocation failed due to race.
+ */
+struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp_mask,
+ unsigned long orders, struct vm_fault *vmf,
+ struct mempolicy *mpol, pgoff_t ilx)
+{
+ int order;
+ struct folio *folio;
+ struct swap_cluster_info *ci;
+
+ ci = __swap_entry_to_cluster(targ_entry);
+ order = orders ? highest_order(orders) : 0;
+ for (;;) {
+ folio = __swap_cache_alloc(ci, targ_entry, gfp_mask, order,
+ vmf, mpol, ilx);
+ if (!IS_ERR(folio))
+ return folio;
+ if (PTR_ERR(folio) == -EAGAIN)
+ continue;
+ /* Only -EBUSY means we should fallback and retry. */
+ if (PTR_ERR(folio) != -EBUSY)
+ return folio;
+ count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
+ order = next_order(&orders, order);
+ if (!orders)
+ break;
+ }
+ /* Should never reach here, order 0 should not fail with -EBUSY. */
+ WARN_ON_ONCE(1);
+ return ERR_PTR(-EINVAL);
+}
+
/**
* __swap_cache_del_folio - Removes a folio from the swap cache.
* @ci: The locked swap cluster.
@@ -498,46 +629,6 @@ static int __swap_cache_prepare_and_add(swp_entry_t entry,
return ret;
}
-/**
- * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
- * @entry: the swapped out swap entry to be binded to the folio.
- * @gfp_mask: memory allocation flags
- * @mpol: NUMA memory allocation policy to be applied
- * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
- *
- * Allocate a folio in the swap cache for one swap slot, typically before
- * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
- * @entry must have a non-zero swap count (swapped out).
- * Currently only supports order 0.
- *
- * Context: Caller must protect the swap device with reference count or locks.
- * Return: Returns the folio if allocation succeeded and folio is added to
- * swap cache. Returns error code if allocation failed due to race.
- */
-struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx)
-{
- int ret;
- struct folio *folio;
-
- /* Allocate a new folio to be added into the swap cache. */
- folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
- if (!folio)
- return ERR_PTR(-ENOMEM);
-
- /*
- * Try adding the new folio; it returns -EEXIST if already cached,
- * since folio is order 0.
- */
- ret = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
- if (ret) {
- folio_put(folio);
- return ERR_PTR(ret);
- }
-
- return folio;
-}
-
static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
struct mempolicy *mpol, pgoff_t ilx,
struct swap_iocb **plug, bool readahead)
@@ -559,7 +650,7 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
if (folio)
return folio;
- folio = swap_cache_alloc_folio(entry, gfp, mpol, ilx);
+ folio = swap_cache_alloc_folio(entry, gfp, 0, NULL, mpol, ilx);
} while (PTR_ERR(folio) == -EEXIST);
if (IS_ERR_OR_NULL(folio))
diff --git a/mm/zswap.c b/mm/zswap.c
index f3aa83a99636..5d83539a8bba 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1001,7 +1001,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
return -EEXIST;
mpol = get_task_policy(current);
- folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+ folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, 0, NULL, mpol,
NO_INTERLEAVE_INDEX);
put_swap_device(si);
--
2.53.0
* [PATCH RFC 05/15] mm, swap: unify large folio allocation
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (3 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 04/15] mm, swap: add support for large order folios in swap cache directly Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 06/15] memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead Kairui Song via B4 Relay
` (10 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that large order allocation is supported in the swap cache, make
both anon and shmem use it instead of implementing their own different
methods for doing so.
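After this patch, both call sites funnel into the same helper (condensed
from the hunks below):

	/* do_swap_page(), SWP_SYNCHRONOUS_IO path: */
	folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
			     thp_swapin_suitable_orders(vmf), vmf, NULL, 0);

	/* shmem_swap_alloc_folio(): */
	mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
	folio = swapin_entry(entry, gfp, orders, vmf, mpol, ilx);
	mpol_cond_put(mpol);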
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 77 +++++---------------------
mm/shmem.c | 94 ++++++++------------------------
mm/swap.h | 30 ++---------
mm/swap_state.c | 163 ++++++++++++--------------------------------------------
mm/swapfile.c | 3 +-
5 files changed, 76 insertions(+), 291 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 21bf2517fbce..e58f976508b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4520,26 +4520,6 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}
-static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
- struct folio *folio;
- softleaf_t entry;
-
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address);
- if (!folio)
- return NULL;
-
- entry = softleaf_from_pte(vmf->orig_pte);
- if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
- GFP_KERNEL, entry)) {
- folio_put(folio);
- return NULL;
- }
-
- return folio;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
* Check if the PTEs within a range are contiguous swap entries
@@ -4569,8 +4549,6 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
*/
if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
return false;
- if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
- return false;
return true;
}
@@ -4598,16 +4576,14 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
return orders;
}
-static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
unsigned long orders;
- struct folio *folio;
unsigned long addr;
softleaf_t entry;
spinlock_t *ptl;
pte_t *pte;
- gfp_t gfp;
int order;
/*
@@ -4615,7 +4591,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* maintain the uffd semantics.
*/
if (unlikely(userfaultfd_armed(vma)))
- goto fallback;
+ return 0;
/*
* A large swapped out folio could be partially or fully in zswap. We
@@ -4623,7 +4599,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* folio.
*/
if (!zswap_never_enabled())
- goto fallback;
+ return 0;
entry = softleaf_from_pte(vmf->orig_pte);
/*
@@ -4637,12 +4613,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
vmf->address, orders);
if (!orders)
- goto fallback;
+ return 0;
pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
vmf->address & PMD_MASK, &ptl);
if (unlikely(!pte))
- goto fallback;
+ return 0;
/*
* For do_swap_page, find the highest order where the aligned range is
@@ -4658,29 +4634,12 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
pte_unmap_unlock(pte, ptl);
- /* Try allocating the highest of the remaining orders. */
- gfp = vma_thp_gfp_mask(vma);
- while (orders) {
- addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- folio = vma_alloc_folio(gfp, order, vma, addr);
- if (folio) {
- if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
- gfp, entry))
- return folio;
- count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE);
- folio_put(folio);
- }
- count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
- order = next_order(&orders, order);
- }
-
-fallback:
- return __alloc_swap_folio(vmf);
+ return orders;
}
#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
-static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
{
- return __alloc_swap_folio(vmf);
+ return 0;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -4785,21 +4744,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (folio)
swap_update_readahead(folio, vma, vmf->address);
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
- folio = alloc_swap_folio(vmf);
- if (folio) {
- /*
- * folio is charged, so swapin can only fail due
- * to raced swapin and return NULL.
- */
- swapcache = swapin_folio(entry, folio);
- if (swapcache != folio)
- folio_put(folio);
- folio = swapcache;
- }
- } else {
+ /* Swapin bypass readahead for SWP_SYNCHRONOUS_IO devices */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+ folio = swapin_entry(entry, GFP_HIGHUSER_MOVABLE,
+ thp_swapin_suitable_orders(vmf),
+ vmf, NULL, 0);
+ else
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
- }
if (!folio) {
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f054b5aae8e..0a19ac82ec77 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -159,7 +159,7 @@ static unsigned long shmem_default_max_inodes(void)
static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio **foliop, enum sgp_type sgp, gfp_t gfp,
- struct vm_area_struct *vma, vm_fault_t *fault_type);
+ struct vm_fault *vmf, vm_fault_t *fault_type);
static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
{
@@ -2014,68 +2014,24 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
}
static struct folio *shmem_swap_alloc_folio(struct inode *inode,
- struct vm_area_struct *vma, pgoff_t index,
+ struct vm_fault *vmf, pgoff_t index,
swp_entry_t entry, int order, gfp_t gfp)
{
+ pgoff_t ilx;
+ struct folio *folio;
+ struct mempolicy *mpol;
+ unsigned long orders = BIT(order);
struct shmem_inode_info *info = SHMEM_I(inode);
- struct folio *new, *swapcache;
- int nr_pages = 1 << order;
- gfp_t alloc_gfp = gfp;
-
- if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
- if (WARN_ON_ONCE(order))
- return ERR_PTR(-EINVAL);
- } else if (order) {
- /*
- * If uffd is active for the vma, we need per-page fault
- * fidelity to maintain the uffd semantics, then fallback
- * to swapin order-0 folio, as well as for zswap case.
- * Any existing sub folio in the swap cache also blocks
- * mTHP swapin.
- */
- if ((vma && unlikely(userfaultfd_armed(vma))) ||
- !zswap_never_enabled() ||
- non_swapcache_batch(entry, nr_pages) != nr_pages)
- goto fallback;
- alloc_gfp = thp_limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
- }
-retry:
- new = shmem_alloc_folio(alloc_gfp, order, info, index);
- if (!new) {
- new = ERR_PTR(-ENOMEM);
- goto fallback;
- }
+ if ((vmf && unlikely(userfaultfd_armed(vmf->vma))) ||
+ !zswap_never_enabled())
+ orders = 0;
- if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
- alloc_gfp, entry)) {
- folio_put(new);
- new = ERR_PTR(-ENOMEM);
- goto fallback;
- }
+ mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
+ folio = swapin_entry(entry, gfp, orders, vmf, mpol, ilx);
+ mpol_cond_put(mpol);
- swapcache = swapin_folio(entry, new);
- if (swapcache != new) {
- folio_put(new);
- if (!swapcache) {
- /*
- * The new folio is charged already, swapin can
- * only fail due to another raced swapin.
- */
- new = ERR_PTR(-EEXIST);
- goto fallback;
- }
- }
- return swapcache;
-fallback:
- /* Order 0 swapin failed, nothing to fallback to, abort */
- if (!order)
- return new;
- entry.val += index - round_down(index, nr_pages);
- alloc_gfp = gfp;
- nr_pages = 1;
- order = 0;
- goto retry;
+ return folio;
}
/*
@@ -2262,11 +2218,12 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
*/
static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio **foliop, enum sgp_type sgp,
- gfp_t gfp, struct vm_area_struct *vma,
+ gfp_t gfp, struct vm_fault *vmf,
vm_fault_t *fault_type)
{
struct address_space *mapping = inode->i_mapping;
- struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
+ struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+ struct mm_struct *fault_mm = vmf ? vmf->vma->vm_mm : NULL;
struct shmem_inode_info *info = SHMEM_I(inode);
swp_entry_t swap;
softleaf_t index_entry;
@@ -2307,20 +2264,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
/* Direct swapin skipping swap cache & readahead */
- folio = shmem_swap_alloc_folio(inode, vma, index,
- index_entry, order, gfp);
- if (IS_ERR(folio)) {
- error = PTR_ERR(folio);
- folio = NULL;
- goto failed;
- }
+ folio = shmem_swap_alloc_folio(inode, vmf, index,
+ swap, order, gfp);
} else {
/* Cached swapin only supports order 0 folio */
folio = shmem_swapin_cluster(swap, gfp, info, index);
- if (!folio) {
- error = -ENOMEM;
- goto failed;
- }
+ }
+ if (!folio) {
+ error = -ENOMEM;
+ goto failed;
}
if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
@@ -2468,7 +2420,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
if (xa_is_value(folio)) {
error = shmem_swapin_folio(inode, index, &folio,
- sgp, gfp, vma, fault_type);
+ sgp, gfp, vmf, fault_type);
if (error == -EEXIST)
goto repeat;
diff --git a/mm/swap.h b/mm/swap.h
index 6774af10a943..80c2f1bf7a57 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -300,7 +300,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
-struct folio *swapin_folio(swp_entry_t entry, struct folio *folio);
+struct folio *swapin_entry(swp_entry_t entry, gfp_t flag, unsigned long orders,
+ struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx);
void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr);
@@ -334,24 +335,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
return find_next_bit(sis->zeromap, end, start) - start;
}
-static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
-{
- int i;
-
- /*
- * While allocating a large folio and doing mTHP swapin, we need to
- * ensure all entries are not cached, otherwise, the mTHP folio will
- * be in conflict with the folio in swap cache.
- */
- for (i = 0; i < max_nr; i++) {
- if (swap_cache_has_folio(entry))
- return i;
- entry.val++;
- }
-
- return i;
-}
-
#else /* CONFIG_SWAP */
struct swap_iocb;
static inline struct swap_cluster_info *swap_cluster_lock(
@@ -433,7 +416,9 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}
-static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+static inline struct folio *swapin_entry(
+ swp_entry_t entry, gfp_t flag, unsigned long orders,
+ struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
{
return NULL;
}
@@ -493,10 +478,5 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
{
return 0;
}
-
-static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
-{
- return 0;
-}
#endif /* CONFIG_SWAP */
#endif /* _MM_SWAP_H */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e32b06a1f229..0a2a4e084cf2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -199,43 +199,6 @@ void __swap_cache_add_folio(struct swap_cluster_info *ci,
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
}
-/**
- * swap_cache_add_folio - Add a folio into the swap cache.
- * @folio: The folio to be added.
- * @entry: The swap entry corresponding to the folio.
- * @gfp: gfp_mask for XArray node allocation.
- * @shadowp: If a shadow is found, return the shadow.
- *
- * Context: Caller must ensure @entry is valid and protect the swap device
- * with reference count or locks.
- */
-static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
- void **shadowp)
-{
- int err;
- void *shadow = NULL;
- unsigned int ci_off;
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long nr_pages = folio_nr_pages(folio);
-
- si = __swap_entry_to_info(entry);
- ci = swap_cluster_lock(si, swp_offset(entry));
- ci_off = swp_cluster_offset(entry);
- err = __swap_cache_check_batch(ci, ci_off, ci_off, nr_pages, &shadow);
- if (err) {
- swap_cluster_unlock(ci);
- return err;
- }
-
- __swap_cache_add_folio(ci, folio, entry);
- swap_cluster_unlock(ci);
- if (shadowp)
- *shadowp = shadow;
-
- return 0;
-}
-
static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
swp_entry_t targ_entry, gfp_t gfp,
unsigned int order, struct vm_fault *vmf,
@@ -328,30 +291,28 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp_mask,
unsigned long orders, struct vm_fault *vmf,
struct mempolicy *mpol, pgoff_t ilx)
{
- int order;
+ int order, err;
struct folio *folio;
struct swap_cluster_info *ci;
+ /* Always allow order 0 so swap won't fail under pressure. */
+ order = orders ? highest_order(orders |= BIT(0)) : 0;
ci = __swap_entry_to_cluster(targ_entry);
- order = orders ? highest_order(orders) : 0;
for (;;) {
folio = __swap_cache_alloc(ci, targ_entry, gfp_mask, order,
vmf, mpol, ilx);
if (!IS_ERR(folio))
return folio;
- if (PTR_ERR(folio) == -EAGAIN)
+ err = PTR_ERR(folio);
+ if (err == -EAGAIN)
continue;
- /* Only -EBUSY means we should fallback and retry. */
- if (PTR_ERR(folio) != -EBUSY)
- return folio;
+ if (!order || (err != -EBUSY && err != -ENOMEM))
+ break;
count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
order = next_order(&orders, order);
- if (!orders)
- break;
}
- /* Should never reach here, order 0 should not fail with -EBUSY. */
- WARN_ON_ONCE(1);
- return ERR_PTR(-EINVAL);
+
+ return ERR_PTR(err);
}
/**
@@ -584,51 +545,6 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
}
}
-/**
- * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cache.
- * @entry: swap entry to be bound to the folio.
- * @folio: folio to be added.
- * @gfp: memory allocation flags for charge, can be 0 if @charged if true.
- * @charged: if the folio is already charged.
- *
- * Update the swap_map and add folio as swap cache, typically before swapin.
- * All swap slots covered by the folio must have a non-zero swap count.
- *
- * Context: Caller must protect the swap device with reference count or locks.
- * Return: 0 if success, error code if failed.
- */
-static int __swap_cache_prepare_and_add(swp_entry_t entry,
- struct folio *folio,
- gfp_t gfp, bool charged)
-{
- void *shadow;
- int ret;
-
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
- ret = swap_cache_add_folio(folio, entry, &shadow);
- if (ret)
- goto failed;
-
- if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
- swap_cache_del_folio(folio);
- ret = -ENOMEM;
- goto failed;
- }
-
- memcg1_swapin(entry, folio_nr_pages(folio));
- if (shadow)
- workingset_refault(folio, shadow);
-
- /* Caller will initiate read into locked folio */
- folio_add_lru(folio);
- return 0;
-
-failed:
- folio_unlock(folio);
- return ret;
-}
-
static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
struct mempolicy *mpol, pgoff_t ilx,
struct swap_iocb **plug, bool readahead)
@@ -649,7 +565,6 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
folio = swap_cache_get_folio(entry);
if (folio)
return folio;
-
folio = swap_cache_alloc_folio(entry, gfp, 0, NULL, mpol, ilx);
} while (PTR_ERR(folio) == -EEXIST);
@@ -666,49 +581,37 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
}
/**
- * swapin_folio - swap-in one or multiple entries skipping readahead.
- * @entry: starting swap entry to swap in
- * @folio: a new allocated and charged folio
+ * swapin_entry - swap-in one or multiple entries skipping readahead.
+ * @entry: swap entry indicating the target slot
+ * @gfp_mask: memory allocation flags
+ * @orders: allocation orders
+ * @vmf: fault information
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
*
- * Reads @entry into @folio, @folio will be added to the swap cache.
- * If @folio is a large folio, the @entry will be rounded down to align
- * with the folio size.
+ * This would allocate a folio suit given @orders, or return the existing
+ * folio in the swap cache for @entry. This initiates the IO, too, if needed.
+ * @entry could be rounded down if @orders allows large allocation.
*
- * Return: returns pointer to @folio on success. If folio is a large folio
- * and this raced with another swapin, NULL will be returned to allow fallback
- * to order 0. Else, if another folio was already added to the swap cache,
- * return that swap cache folio instead.
+ * Context: Caller must ensure @entry is valid and pin the swap device with refcount.
+ * Return: Returns the folio on success, returns error code if failed.
*/
-struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+struct folio *swapin_entry(swp_entry_t entry, gfp_t gfp, unsigned long orders,
+ struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
{
- int ret;
- struct folio *swapcache;
- pgoff_t offset = swp_offset(entry);
- unsigned long nr_pages = folio_nr_pages(folio);
-
- entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
- for (;;) {
- ret = __swap_cache_prepare_and_add(entry, folio, 0, true);
- if (!ret) {
- swap_read_folio(folio, NULL);
- break;
- }
+ struct folio *folio;
- /*
- * Large order allocation needs special handling on
- * race: if a smaller folio exists in cache, swapin needs
- * to fallback to order 0, and doing a swap cache lookup
- * might return a folio that is irrelevant to the faulting
- * entry because @entry is aligned down. Just return NULL.
- */
- if (ret != -EEXIST || nr_pages > 1)
- return NULL;
+ do {
+ folio = swap_cache_get_folio(entry);
+ if (folio)
+ return folio;
+ folio = swap_cache_alloc_folio(entry, gfp, orders, vmf, mpol, ilx);
+ } while (PTR_ERR(folio) == -EEXIST);
- swapcache = swap_cache_get_folio(entry);
- if (swapcache)
- return swapcache;
- }
+ if (IS_ERR(folio))
+ return NULL;
+ swap_read_folio(folio, NULL);
return folio;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 06b37efad2bd..7e7614a5181a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1833,8 +1833,7 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
* do_swap_page()
* ... swapoff+swapon
* swap_cache_alloc_folio()
- * swap_cache_add_folio()
- * // check swap_map
+ * // check swap_map
* // verify PTE not changed
*
* In __swap_duplicate(), the swap_map need to be checked before
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 06/15] memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (4 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 05/15] mm, swap: unify large folio allocation Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 07/15] memcg, swap: defer the recording of memcg info and reparent flexibly Kairui Song via B4 Relay
` (9 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
As a result, this will always charge the swapin folio to the dead
cgroup's parent cgroup, and ensures folio->swap belongs to folio_memcg.
This only affects some uncommon behavior when a process is moved
between memcgs.
When a process that previously swapped some memory is moved to another
cgroup, and the cgroup where the swap occurred is dead, folios swapped
in from the old swap entries will be charged to the new cgroup.
Combined with the lazy freeing of swap cache, this leads to a strange
situation where the folio->swap entry belongs to a cgroup that is not
folio->memcg.
Swapin from a dead zombie memcg might be rare in practice: cgroups are
offlined only after the workload in them is gone, which requires
zapping the page tables first and releases all swap entries. Shmem is
a bit different, but shmem always has swap count == 1 and force
releases the swap cache. So, for shmem, charging into the new memcg
and releasing the entry does look more sensible.
However, to make things easier to understand for an RFC, let's just
always charge to the parent cgroup if the leaf cgroup is dead. This may
not be the best design, but it makes the following work much easier to
demonstrate.
For a better solution, we can later:
- Dynamically allocate a swap cluster trampoline cgroup table
(ci->memcg_table) and use that for zombie swapin only. This is
actually OK and should not cause a mess at the code level, since the
incoming swap table compaction will require table expansion on swap-in
as well.
- Just tolerate a 2-byte per slot overhead all the time, which is also
acceptable.
- Limit the charge-to-parent behavior to only one situation: when the
swap count > 2 and the process is migrated to another cgroup after
swapout. This is even rarer to see in practice, I think.
For reference, the memory ownership model of cgroup v2:
"""
A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released. Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.
A memory area may be used by processes belonging to different cgroups.
To which cgroup the area will be charged is in-deterministic; however,
over time, the memory area is likely to end up in a cgroup which has
enough memory allowance to avoid high reclaim pressure.
If a cgroup sweeps a considerable amount of memory which is expected
to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.
"""
So I think all of the solutions mentioned above, including this commit,
are not wrong.
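For clarity, here is a minimal user-space sketch of the reparenting
walk this patch adds: if the cgroup recorded at swapout is offline,
walk up and charge the nearest online ancestor instead. The struct,
the online flag, and nearest_online() are illustrative assumptions of
the sketch, not the kernel's css refcounting API.

#include <stdbool.h>
#include <stdio.h>

struct cgroup {
        const char *name;
        bool online;
        struct cgroup *parent;
};

static struct cgroup *nearest_online(struct cgroup *memcg)
{
        /* The root is always online, so the walk terminates. */
        while (!memcg->online)
                memcg = memcg->parent;
        return memcg;
}

int main(void)
{
        struct cgroup root = { "root", true,  NULL  };
        struct cgroup mid  = { "mid",  false, &root };
        struct cgroup leaf = { "leaf", false, &mid  };

        /* A swapin whose swapout cgroup died gets charged to "root" here. */
        printf("charge to: %s\n", nearest_online(&leaf)->name);
        return 0;
}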
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memcontrol.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 49 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 73f622f7a72b..b2898719e935 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4803,22 +4803,67 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry)
{
- struct mem_cgroup *memcg;
- unsigned short id;
+ struct mem_cgroup *memcg, *swap_memcg;
+ unsigned short id, parent_id;
+ unsigned int nr_pages;
int ret;
if (mem_cgroup_disabled())
return 0;
id = lookup_swap_cgroup_id(entry);
+ nr_pages = folio_nr_pages(folio);
+
rcu_read_lock();
- memcg = mem_cgroup_from_private_id(id);
- if (!memcg || !css_tryget_online(&memcg->css))
+ swap_memcg = mem_cgroup_from_private_id(id);
+ if (!swap_memcg) {
+ WARN_ON_ONCE(id);
memcg = get_mem_cgroup_from_mm(mm);
+ } else {
+ memcg = swap_memcg;
+ /* Find the nearest online ancestor if dead, for reparent */
+ while (!css_tryget_online(&memcg->css))
+ memcg = parent_mem_cgroup(memcg);
+ }
rcu_read_unlock();
ret = charge_memcg(folio, memcg, gfp);
+ if (ret)
+ goto out;
+
+ /*
+ * If the swap entry's memcg is dead, reparent the swap charge
+ * from swap_memcg to memcg.
+ *
+ * If memcg is also being offlined, the charge will be moved to
+ * its parent again.
+ */
+ if (swap_memcg && memcg != swap_memcg) {
+ struct mem_cgroup *parent_memcg;
+ parent_memcg = mem_cgroup_private_id_get_online(memcg, nr_pages);
+ parent_id = mem_cgroup_private_id(parent_memcg);
+
+ WARN_ON(id != swap_cgroup_clear(entry, nr_pages));
+ swap_cgroup_record(folio, parent_id, entry);
+
+ if (do_memsw_account()) {
+ if (!mem_cgroup_is_root(parent_memcg))
+ page_counter_charge(&parent_memcg->memsw, nr_pages);
+ page_counter_uncharge(&swap_memcg->memsw, nr_pages);
+ } else {
+ if (!mem_cgroup_is_root(parent_memcg))
+ page_counter_charge(&parent_memcg->swap, nr_pages);
+ page_counter_uncharge(&swap_memcg->swap, nr_pages);
+ }
+
+ mod_memcg_state(parent_memcg, MEMCG_SWAP, nr_pages);
+ mod_memcg_state(swap_memcg, MEMCG_SWAP, -nr_pages);
+
+ /* Release the dead cgroup after reparent */
+ mem_cgroup_private_id_put(swap_memcg, nr_pages);
+ }
+out:
css_put(&memcg->css);
return ret;
}
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 07/15] memcg, swap: defer the recording of memcg info and reparent flexibly
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (5 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 06/15] memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 08/15] mm, swap: store and check memcg info in the swap table Kairui Song via B4 Relay
` (8 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
To make sure folio->swap always belongs to folio->memcg, charge swap
against folio->memcg directly. Defer the recording of the swap cgroup
info: do the reparenting and record the nearest online ancestor only
on swap cache removal.
While a folio is in the swap cache, the folio itself is owned by the
memcg, and hence, through the folio, the memcg also owns folio->swap.
The extra pinning of the swap cgroup info record is not needed and can
be released.
This should be fine for both cgroup v2 and v1. There should be no
userspace observable behavior change.
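A minimal user-space model of the deferral, purely for illustration:
while a folio sits in the swap cache the slot just references the
folio and ownership follows folio->memcg; only when the folio leaves
the swap cache is the cgroup id written into the slot for long-term
tracking. All types and names below are assumptions of this sketch,
not kernel APIs.

#include <assert.h>
#include <stddef.h>

struct folio { unsigned short memcg_id; };

struct slot {
        struct folio *folio;      /* non-NULL: ownership follows the folio */
        unsigned short record_id; /* meaningful only after the folio is gone */
};

static void add_to_swap_cache(struct slot *s, struct folio *f)
{
        s->folio = f;             /* no cgroup record is written here anymore */
        s->record_id = 0;
}

static void del_from_swap_cache(struct slot *s)
{
        /* Record (and, in the kernel, pin) the owner only on removal. */
        s->record_id = s->folio->memcg_id;
        s->folio = NULL;
}

int main(void)
{
        struct folio f = { .memcg_id = 42 };
        struct slot s = { NULL, 0 };

        add_to_swap_cache(&s, &f);
        assert(s.record_id == 0 && s.folio == &f);
        del_from_swap_cache(&s);
        assert(s.record_id == 42 && s.folio == NULL);
        return 0;
}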
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/memcontrol.h | 8 ++--
include/linux/swap.h | 24 +++++++++--
mm/memcontrol-v1.c | 77 ++++++++++++++++-----------------
mm/memcontrol.c | 104 ++++++++++++++++++++++++++++++++-------------
mm/swap.h | 6 ++-
mm/swap_cgroup.c | 5 +--
mm/swap_state.c | 26 +++++++++---
mm/swapfile.c | 15 +++++--
mm/vmscan.c | 3 +-
9 files changed, 173 insertions(+), 95 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70b685a85bf4..0b37d4faf785 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1896,8 +1896,8 @@ static inline void mem_cgroup_exit_user_fault(void)
current->in_user_fault = 0;
}
-void memcg1_swapout(struct folio *folio, swp_entry_t entry);
-void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages);
+void memcg1_swapout(struct folio *folio, struct mem_cgroup *swap_memcg);
+void memcg1_swapin(struct folio *folio);
#else /* CONFIG_MEMCG_V1 */
static inline
@@ -1926,11 +1926,11 @@ static inline void mem_cgroup_exit_user_fault(void)
{
}
-static inline void memcg1_swapout(struct folio *folio, swp_entry_t entry)
+static inline void memcg1_swapout(struct folio *folio, struct mem_cgroup *_memcg)
{
}
-static inline void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages)
+static inline void memcg1_swapin(struct folio *folio)
{
}
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0effe3cc50f5..66cf657a1f35 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -580,12 +580,22 @@ static inline int mem_cgroup_try_charge_swap(struct folio *folio,
return __mem_cgroup_try_charge_swap(folio, entry);
}
-extern void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
+extern void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages);
+static inline void mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages)
{
if (mem_cgroup_disabled())
return;
- __mem_cgroup_uncharge_swap(entry, nr_pages);
+ __mem_cgroup_uncharge_swap(id, nr_pages);
+}
+
+struct mem_cgroup *__mem_cgroup_swap_free_folio(struct folio *folio,
+ bool reclaim);
+static inline struct mem_cgroup *mem_cgroup_swap_free_folio(struct folio *folio,
+ bool reclaim)
+{
+ if (mem_cgroup_disabled())
+ return NULL;
+ return __mem_cgroup_swap_free_folio(folio, reclaim);
}
extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
@@ -597,11 +607,17 @@ static inline int mem_cgroup_try_charge_swap(struct folio *folio,
return 0;
}
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
+static inline void mem_cgroup_uncharge_swap(unsigned short id,
unsigned int nr_pages)
{
}
+static inline struct mem_cgroup *mem_cgroup_swap_free_folio(struct folio *folio,
+ bool reclaim)
+{
+ return NULL;
+}
+
static inline long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
return get_nr_swap_pages();
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index a7c78b0987df..038e630dc7e1 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -606,29 +606,21 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
/**
* memcg1_swapout - transfer a memsw charge to swap
* @folio: folio whose memsw charge to transfer
- * @entry: swap entry to move the charge to
- *
- * Transfer the memsw charge of @folio to @entry.
+ * @swap_memcg: cgroup that will be charged, must be online ancestor
+ * of folio's memcg.
*/
-void memcg1_swapout(struct folio *folio, swp_entry_t entry)
+void memcg1_swapout(struct folio *folio, struct mem_cgroup *swap_memcg)
{
- struct mem_cgroup *memcg, *swap_memcg;
+ struct mem_cgroup *memcg;
unsigned int nr_entries;
+ unsigned long flags;
- VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
-
- if (mem_cgroup_disabled())
- return;
-
- if (!do_memsw_account())
- return;
+ /* The folio must be getting reclaimed. */
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
memcg = folio_memcg(folio);
VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
- return;
/*
* In case the memcg owning these pages has been offlined and doesn't
@@ -636,14 +628,15 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
* ancestor for the swap instead and transfer the memory+swap charge.
*/
nr_entries = folio_nr_pages(folio);
- swap_memcg = mem_cgroup_private_id_get_online(memcg, nr_entries);
mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
- swap_cgroup_record(folio, mem_cgroup_private_id(swap_memcg), entry);
-
folio_unqueue_deferred_split(folio);
- folio->memcg_data = 0;
+ /*
+ * Free the folio charge now so memsw won't be double uncharged:
+ * memsw is now charged by the swap record.
+ */
+ folio->memcg_data = 0;
if (!mem_cgroup_is_root(memcg))
page_counter_uncharge(&memcg->memory, nr_entries);
@@ -653,33 +646,34 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
page_counter_uncharge(&memcg->memsw, nr_entries);
}
- /*
- * Interrupts should be disabled here because the caller holds the
- * i_pages lock which is taken with interrupts-off. It is
- * important here to have the interrupts disabled because it is the
- * only synchronisation we have for updating the per-CPU variables.
- */
+ local_irq_save(flags);
preempt_disable_nested();
- VM_WARN_ON_IRQS_ENABLED();
memcg1_charge_statistics(memcg, -folio_nr_pages(folio));
preempt_enable_nested();
+ local_irq_restore(flags);
memcg1_check_events(memcg, folio_nid(folio));
css_put(&memcg->css);
}
/*
- * memcg1_swapin - uncharge swap slot
- * @entry: the first swap entry for which the pages are charged
- * @nr_pages: number of pages which will be uncharged
+ * memcg1_swapin - uncharge memsw for the swap slot on swapin
+ * @folio: the folio being swapped in, already charged to memory
*
* Call this function after successfully adding the charged page to swapcache.
- *
- * Note: This function assumes the page for which swap slot is being uncharged
- * is order 0 page.
+ * The swap cgroup tracking has already been released by
+ * mem_cgroup_swapin_charge_folio(), so we only need to drop the duplicate
+ * memsw charge that was placed on the swap entry during swapout.
*/
-void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages)
+void memcg1_swapin(struct folio *folio)
{
+ struct mem_cgroup *memcg;
+ unsigned int nr_pages;
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_memcg_charged(folio), folio);
+
/*
* Cgroup1's unified memory+swap counter has been charged with the
* new swapcache page, finish the transfer by uncharging the swap
@@ -692,14 +686,15 @@ void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages)
* correspond 1:1 to page and swap slot lifetimes: we charge the
* page to memory here, and uncharge swap when the slot is freed.
*/
- if (do_memsw_account()) {
- /*
- * The swap entry might not get freed for a long time,
- * let's not wait for it. The page already received a
- * memory+swap charge, drop the swap entry duplicate.
- */
- mem_cgroup_uncharge_swap(entry, nr_pages);
- }
+ if (!do_memsw_account())
+ return;
+
+ memcg = folio_memcg(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ if (!mem_cgroup_is_root(memcg))
+ page_counter_uncharge(&memcg->memsw, nr_pages);
+ mod_memcg_state(memcg, MEMCG_SWAP, -nr_pages);
}
void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2898719e935..d9ff44b77409 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4804,8 +4804,8 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry)
{
struct mem_cgroup *memcg, *swap_memcg;
- unsigned short id, parent_id;
unsigned int nr_pages;
+ unsigned short id;
int ret;
if (mem_cgroup_disabled())
@@ -4831,37 +4831,31 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
if (ret)
goto out;
+ /*
+ * On successful charge, the folio itself now belongs to the memcg,
+ * so is folio->swap. So we can release the swap cgroup table's
+ * pinning of the private id.
+ */
+ swap_cgroup_clear(folio->swap, nr_pages);
+ mem_cgroup_private_id_put(swap_memcg, nr_pages);
+
/*
* If the swap entry's memcg is dead, reparent the swap charge
* from swap_memcg to memcg.
- *
- * If memcg is also being offlined, the charge will be moved to
- * its parent again.
*/
if (swap_memcg && memcg != swap_memcg) {
- struct mem_cgroup *parent_memcg;
-
- parent_memcg = mem_cgroup_private_id_get_online(memcg, nr_pages);
- parent_id = mem_cgroup_private_id(parent_memcg);
-
- WARN_ON(id != swap_cgroup_clear(entry, nr_pages));
- swap_cgroup_record(folio, parent_id, entry);
-
if (do_memsw_account()) {
- if (!mem_cgroup_is_root(parent_memcg))
- page_counter_charge(&parent_memcg->memsw, nr_pages);
+ if (!mem_cgroup_is_root(memcg))
+ page_counter_charge(&memcg->memsw, nr_pages);
page_counter_uncharge(&swap_memcg->memsw, nr_pages);
} else {
- if (!mem_cgroup_is_root(parent_memcg))
- page_counter_charge(&parent_memcg->swap, nr_pages);
+ if (!mem_cgroup_is_root(memcg))
+ page_counter_charge(&memcg->swap, nr_pages);
page_counter_uncharge(&swap_memcg->swap, nr_pages);
}
- mod_memcg_state(parent_memcg, MEMCG_SWAP, nr_pages);
+ mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
mod_memcg_state(swap_memcg, MEMCG_SWAP, -nr_pages);
-
- /* Release the dead cgroup after reparent */
- mem_cgroup_private_id_put(swap_memcg, nr_pages);
}
out:
css_put(&memcg->css);
@@ -5260,33 +5254,32 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
return 0;
}
- memcg = mem_cgroup_private_id_get_online(memcg, nr_pages);
-
+ /*
+ * Charge the swap counter against the folio's memcg directly.
+ * The private id pinning and swap cgroup recording are deferred
+ * to __mem_cgroup_swap_free_folio() when the folio leaves the
+ * swap cache. No _id_get_online here means no _id_put on error.
+ */
if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
memcg_memory_event(memcg, MEMCG_SWAP_MAX);
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
- mem_cgroup_private_id_put(memcg, nr_pages);
return -ENOMEM;
}
mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
- swap_cgroup_record(folio, mem_cgroup_private_id(memcg), entry);
-
return 0;
}
/**
* __mem_cgroup_uncharge_swap - uncharge swap space
- * @entry: swap entry to uncharge
+ * @id: private id of the mem_cgroup to uncharge
* @nr_pages: the amount of swap space to uncharge
*/
-void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
+void __mem_cgroup_uncharge_swap(unsigned short id, unsigned int nr_pages)
{
struct mem_cgroup *memcg;
- unsigned short id;
- id = swap_cgroup_clear(entry, nr_pages);
rcu_read_lock();
memcg = mem_cgroup_from_private_id(id);
if (memcg) {
@@ -5302,6 +5295,59 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
rcu_read_unlock();
}
+/**
+ * __mem_cgroup_swap_free_folio - Folio is being freed from swap cache.
+ * @folio: folio being freed.
+ * @reclaim: true if the folio is being reclaimed.
+ *
+ * For cgroup V2, swap entries are charged to folio's memcg by the time
+ * swap allocator adds it into the swap cache by mem_cgroup_try_charge_swap.
+ * The ownership of folio->swap to folio->memcg is constrained by the folio
+ * in swap cache. If the folio is being removed from swap cache, the
+ * constraint will be gone so need to grab the memcg's private id for long
+ * term tracking.
+ *
+ * For cgroup V1, the memory-to-swap charge transfer is also performed on
+ * the folio reclaim path.
+ *
+ * It's unlikely but possible that the folio's memcg is dead, in that case
+ * we reparent and recharge the parent. Recorded cgroup is changed to
+ * parent too.
+ *
+ * Return: Pointer to the mem cgroup being pinned by the charge.
+ */
+struct mem_cgroup *__mem_cgroup_swap_free_folio(struct folio *folio,
+ bool reclaim)
+{
+ unsigned int nr_pages = folio_nr_pages(folio);
+ struct mem_cgroup *memcg, *swap_memcg;
+ swp_entry_t entry = folio->swap;
+ unsigned short id;
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_memcg_charged(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+
+ /*
+ * Pin the nearest online ancestor's private id for long term
+ * swap cgroup tracking. If memcg is still alive, swap_memcg
+ * will be the same as memcg. Else, it's reparented.
+ */
+ memcg = folio_memcg(folio);
+ swap_memcg = mem_cgroup_private_id_get_online(memcg, nr_pages);
+ id = mem_cgroup_private_id(swap_memcg);
+ swap_cgroup_record(folio, id, entry);
+
+ if (reclaim && do_memsw_account()) {
+ memcg1_swapout(folio, swap_memcg);
+ } else if (memcg != swap_memcg) {
+ if (!mem_cgroup_is_root(swap_memcg))
+ page_counter_charge(&swap_memcg->swap, nr_pages);
+ page_counter_uncharge(&memcg->swap, nr_pages);
+ }
+
+ return swap_memcg;
+}
+
long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
long nr_swap_pages = get_nr_swap_pages();
diff --git a/mm/swap.h b/mm/swap.h
index 80c2f1bf7a57..da41e9cea46d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -287,7 +287,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
void __swap_cache_add_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry);
void __swap_cache_del_folio(struct swap_cluster_info *ci,
- struct folio *folio, swp_entry_t entry, void *shadow);
+ struct folio *folio, void *shadow,
+ bool charged, bool reclaim);
void __swap_cache_replace_folio(struct swap_cluster_info *ci,
struct folio *old, struct folio *new);
@@ -459,7 +460,8 @@ static inline void swap_cache_del_folio(struct folio *folio)
}
static inline void __swap_cache_del_folio(struct swap_cluster_info *ci,
- struct folio *folio, swp_entry_t entry, void *shadow)
+ struct folio *folio, void *shadow,
+ bool charged, bool reclaim)
{
}
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index de779fed8c21..b5a7f21c3afe 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -54,8 +54,7 @@ static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
/**
* swap_cgroup_record - record mem_cgroup for a set of swap entries.
* These entries must belong to one single folio, and that folio
- * must be being charged for swap space (swap out), and these
- * entries must not have been charged
+ * must be being charged for swap space (swap out).
*
* @folio: the folio that the swap entry belongs to
* @id: mem_cgroup ID to be recorded
@@ -75,7 +74,7 @@ void swap_cgroup_record(struct folio *folio, unsigned short id,
do {
old = __swap_cgroup_id_xchg(map, offset, id);
- VM_BUG_ON(old);
+ VM_WARN_ON_ONCE(old);
} while (++offset != end);
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0a2a4e084cf2..40f037576c5f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -251,7 +251,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL,
gfp, entry)) {
spin_lock(&ci->lock);
- __swap_cache_del_folio(ci, folio, shadow);
+ __swap_cache_del_folio(ci, folio, shadow, false, false);
spin_unlock(&ci->lock);
folio_unlock(folio);
folio_put(folio);
@@ -260,7 +260,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
}
/* For memsw accouting, swap is uncharged when folio is added to swap cache */
- memcg1_swapin(entry, 1 << order);
+ memcg1_swapin(folio);
if (shadow)
workingset_refault(folio, shadow);
@@ -319,21 +319,24 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp_mask,
* __swap_cache_del_folio - Removes a folio from the swap cache.
* @ci: The locked swap cluster.
* @folio: The folio.
- * @entry: The first swap entry that the folio corresponds to.
* @shadow: shadow value to be filled in the swap cache.
+ * @charged: If folio->swap is charged to folio->memcg.
+ * @reclaim: If the folio is being reclaimed. When true on cgroup v1,
+ * the memory charge is transferred from memory to swap.
*
* Removes a folio from the swap cache and fills a shadow in place.
* This won't put the folio's refcount. The caller has to do that.
*
- * Context: Caller must ensure the folio is locked and in the swap cache
- * using the index of @entry, and lock the cluster that holds the entries.
+ * Context: Caller must ensure the folio is locked and in the swap cache,
+ * and lock the cluster that holds the entries.
*/
void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
- swp_entry_t entry, void *shadow)
+ void *shadow, bool charged, bool reclaim)
{
int count;
unsigned long old_tb;
struct swap_info_struct *si;
+ swp_entry_t entry = folio->swap;
unsigned int ci_start, ci_off, ci_end;
bool folio_swapped = false, need_free = false;
unsigned long nr_pages = folio_nr_pages(folio);
@@ -343,6 +346,15 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+ /*
+ * If the folio's swap entry is charged to its memcg, record the
+ * swap cgroup for long-term tracking before the folio leaves the
+ * swap cache. Not charged when the folio never completed memcg
+ * charging (e.g. swapin charge failure, or swap alloc charge failure).
+ */
+ if (charged)
+ mem_cgroup_swap_free_folio(folio, reclaim);
+
si = __swap_entry_to_info(entry);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
@@ -392,7 +404,7 @@ void swap_cache_del_folio(struct folio *folio)
swp_entry_t entry = folio->swap;
ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
- __swap_cache_del_folio(ci, folio, entry, NULL);
+ __swap_cache_del_folio(ci, folio, NULL, true, false);
swap_cluster_unlock(ci);
folio_ref_sub(folio, folio_nr_pages(folio));
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7e7614a5181a..c0169bce46c9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1703,6 +1703,7 @@ int folio_alloc_swap(struct folio *folio)
{
unsigned int order = folio_order(folio);
unsigned int size = 1 << order;
+ struct swap_cluster_info *ci;
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -1737,8 +1738,12 @@ int folio_alloc_swap(struct folio *folio)
}
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
- if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap)))
- swap_cache_del_folio(folio);
+ if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap))) {
+ ci = swap_cluster_lock(__swap_entry_to_info(folio->swap),
+ swp_offset(folio->swap));
+ __swap_cache_del_folio(ci, folio, NULL, false, false);
+ swap_cluster_unlock(ci);
+ }
if (unlikely(!folio_test_swapcache(folio)))
return -ENOMEM;
@@ -1879,6 +1884,7 @@ void __swap_cluster_free_entries(struct swap_info_struct *si,
unsigned int ci_start, unsigned int nr_pages)
{
unsigned long old_tb;
+ unsigned short id;
unsigned int ci_off = ci_start, ci_end = ci_start + nr_pages;
unsigned long offset = cluster_offset(si, ci) + ci_start;
@@ -1892,7 +1898,10 @@ void __swap_cluster_free_entries(struct swap_info_struct *si,
__swap_table_set(ci, ci_off, null_to_swp_tb());
} while (++ci_off < ci_end);
- mem_cgroup_uncharge_swap(swp_entry(si->type, offset), nr_pages);
+ id = swap_cgroup_clear(swp_entry(si->type, offset), nr_pages);
+ if (id)
+ mem_cgroup_uncharge_swap(id, nr_pages);
+
swap_range_free(si, offset, nr_pages);
swap_cluster_assert_empty(ci, ci_start, nr_pages, false);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44e4fcd6463c..5112f81cf875 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -759,8 +759,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
if (reclaimed && !mapping_exiting(mapping))
shadow = workingset_eviction(folio, target_memcg);
- memcg1_swapout(folio, swap);
- __swap_cache_del_folio(ci, folio, swap, shadow);
+ __swap_cache_del_folio(ci, folio, shadow, true, true);
swap_cluster_unlock_irq(ci);
} else {
void (*free_folio)(struct folio *);
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 08/15] mm, swap: store and check memcg info in the swap table
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (6 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 07/15] memcg, swap: defer the recording of memcg info and reparent flexibly Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 09/15] mm, swap: support flexible batch freeing of slots in different memcg Kairui Song via B4 Relay
` (7 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
To prepare for merging the swap_cgroup_ctrl into the swap table, store
the memcg info in the swap table on swapout.
This is done by using the existing shadow format.
Note this also changes refault counting to happen at the nearest
online memcg level:
Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
and each cgroup is likely to have different characteristics.
When commit b910718a948a ("mm: vmscan: detect file thrashing at the
reclaim root") moved the refault accounting to the reclaim root level,
anon shadows did not even exist, and it was explicitly for file pages.
Later commit aae466b0052e ("mm/swap: implement workingset detection for
anonymous LRU") added anon shadows following a similar design. And in
shrink_lruvec, shrinking of the active LRU is done regardless when it
is low.
For MGLRU, it's a bit different, but with the PID refault control, it's
more accurate to let the nearest online memcg take the refault feedback
too.
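To make the shadow reuse concrete, here is a standalone user-space
sketch of the bit layout that carries the memcg id (the xa_mk_value()
tagging is omitted). WORKINGSET_SHIFT = 1 comes from this patch;
MEM_CGROUP_ID_SHIFT = 16 and NODES_SHIFT = 6 are assumed example
values, as NODES_SHIFT is config dependent in a real build.

#include <assert.h>
#include <stdio.h>

#define WORKINGSET_SHIFT        1
#define NODES_SHIFT             6
#define MEM_CGROUP_ID_SHIFT     16

static unsigned long pack_shadow(unsigned short memcgid, int node,
                                 unsigned long eviction, int workingset)
{
        unsigned long val = eviction;

        val = (val << MEM_CGROUP_ID_SHIFT) | memcgid;
        val = (val << NODES_SHIFT) | node;
        val = (val << WORKINGSET_SHIFT) | workingset;
        return val;
}

static unsigned short shadow_to_memcgid(unsigned long shadow)
{
        shadow >>= (WORKINGSET_SHIFT + NODES_SHIFT);
        return shadow & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
}

int main(void)
{
        unsigned long shadow = pack_shadow(1234, 3, 0xabcd, 1);

        /* The memcg id survives regardless of the eviction counter bits. */
        assert(shadow_to_memcgid(shadow) == 1234);
        printf("memcgid from shadow: %u\n", shadow_to_memcgid(shadow));
        return 0;
}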
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/internal.h | 20 ++++++++++++++++++++
mm/swap.h | 7 ++++---
mm/swap_state.c | 50 +++++++++++++++++++++++++++++++++-----------------
mm/swapfile.c | 4 +++-
mm/vmscan.c | 6 +-----
mm/workingset.c | 16 +++++++++++-----
6 files changed, 72 insertions(+), 31 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d7d9..5bbe081c9048 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1714,6 +1714,7 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
#endif /* CONFIG_SHRINKER_DEBUG */
/* Only track the nodes of mappings with shadow entries */
+#define WORKINGSET_SHIFT 1
void workingset_update_node(struct xa_node *node);
extern struct list_lru shadow_nodes;
#define mapping_set_update(xas, mapping) do { \
@@ -1722,6 +1723,25 @@ extern struct list_lru shadow_nodes;
xas_set_lru(xas, &shadow_nodes); \
} \
} while (0)
+static inline unsigned short shadow_to_memcgid(void *shadow)
+{
+ unsigned long entry = xa_to_value(shadow);
+ unsigned short memcgid;
+
+ entry >>= (WORKINGSET_SHIFT + NODES_SHIFT);
+ memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
+
+ return memcgid;
+}
+static inline void *memcgid_to_shadow(unsigned short memcgid)
+{
+ unsigned long val;
+
+ val = memcgid;
+ val <<= (NODES_SHIFT + WORKINGSET_SHIFT);
+
+ return xa_mk_value(val);
+}
/* mremap.c */
unsigned long move_page_tables(struct pagetable_move_control *pmc);
diff --git a/mm/swap.h b/mm/swap.h
index da41e9cea46d..c95f5fafea42 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -265,6 +265,8 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
return folio_entry.val == round_down(entry.val, nr_pages);
}
+bool folio_maybe_swapped(struct folio *folio);
+
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stabilize the device by any of the following ways:
@@ -286,9 +288,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_add_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry);
-void __swap_cache_del_folio(struct swap_cluster_info *ci,
- struct folio *folio, void *shadow,
- bool charged, bool reclaim);
+void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
+ void *shadow, bool charged, bool reclaim);
void __swap_cache_replace_folio(struct swap_cluster_info *ci,
struct folio *old, struct folio *new);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 40f037576c5f..cc4bf40320ef 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -143,22 +143,11 @@ static int __swap_cache_check_batch(struct swap_cluster_info *ci,
{
unsigned int ci_end = ci_off + nr;
unsigned long old_tb;
+ unsigned int memcgid;
if (unlikely(!ci->table))
return -ENOENT;
- do {
- old_tb = __swap_table_get(ci, ci_off);
- if (unlikely(swp_tb_is_folio(old_tb)) ||
- unlikely(!__swp_tb_get_count(old_tb)))
- break;
- if (swp_tb_is_shadow(old_tb))
- *shadowp = swp_tb_to_shadow(old_tb);
- } while (++ci_off < ci_end);
-
- if (likely(ci_off == ci_end))
- return 0;
-
/*
* If the target slot is not suitable for adding swap cache, return
* -EEXIST or -ENOENT. If the batch is not suitable, could be a
@@ -169,7 +158,21 @@ static int __swap_cache_check_batch(struct swap_cluster_info *ci,
return -EEXIST;
if (!__swp_tb_get_count(old_tb))
return -ENOENT;
- return -EBUSY;
+ if (WARN_ON_ONCE(!swp_tb_is_shadow(old_tb)))
+ return -ENOENT;
+ *shadowp = swp_tb_to_shadow(old_tb);
+ memcgid = shadow_to_memcgid(*shadowp);
+
+ WARN_ON_ONCE(!mem_cgroup_disabled() && !memcgid);
+ do {
+ old_tb = __swap_table_get(ci, ci_off);
+ if (unlikely(swp_tb_is_folio(old_tb)) ||
+ unlikely(!__swp_tb_get_count(old_tb)) ||
+ memcgid != shadow_to_memcgid(swp_tb_to_shadow(old_tb)))
+ return -EBUSY;
+ } while (++ci_off < ci_end);
+
+ return 0;
}
void __swap_cache_add_folio(struct swap_cluster_info *ci,
@@ -261,8 +264,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
/* For memsw accouting, swap is uncharged when folio is added to swap cache */
memcg1_swapin(folio);
- if (shadow)
- workingset_refault(folio, shadow);
+ workingset_refault(folio, shadow);
/* Caller will initiate read into locked new_folio */
folio_add_lru(folio);
@@ -319,7 +321,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp_mask,
* __swap_cache_del_folio - Removes a folio from the swap cache.
* @ci: The locked swap cluster.
* @folio: The folio.
- * @shadow: shadow value to be filled in the swap cache.
+ * @shadow: Shadow to restore when the folio is not charged. Ignored when
+ * @charged is true, as the shadow is computed internally.
* @charged: If folio->swap is charged to folio->memcg.
* @reclaim: If the folio is being reclaimed. When true on cgroup v1,
* the memory charge is transferred from memory to swap.
@@ -336,6 +339,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
int count;
unsigned long old_tb;
struct swap_info_struct *si;
+ struct mem_cgroup *memcg = NULL;
swp_entry_t entry = folio->swap;
unsigned int ci_start, ci_off, ci_end;
bool folio_swapped = false, need_free = false;
@@ -353,7 +357,13 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
* charging (e.g. swapin charge failure, or swap alloc charge failure).
*/
if (charged)
- mem_cgroup_swap_free_folio(folio, reclaim);
+ memcg = mem_cgroup_swap_free_folio(folio, reclaim);
+ if (reclaim) {
+ WARN_ON(!charged);
+ shadow = workingset_eviction(folio, memcg);
+ } else if (memcg) {
+ shadow = memcgid_to_shadow(mem_cgroup_private_id(memcg));
+ }
si = __swap_entry_to_info(entry);
ci_start = swp_cluster_offset(entry);
@@ -392,6 +402,11 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
* swap_cache_del_folio - Removes a folio from the swap cache.
* @folio: The folio.
*
+ * Force delete a folio from the swap cache. This is only safe to use for
+ * folios that are not swapped out (swap count == 0) to release the swap
+ * space from being pinned by swap cache, or remove a clean and charged
+ * folio that no one modified or is still using.
+ *
* Same as __swap_cache_del_folio, but handles lock and refcount. The
* caller must ensure the folio is either clean or has a swap count
* equal to zero, or it may cause data loss.
@@ -404,6 +419,7 @@ void swap_cache_del_folio(struct folio *folio)
swp_entry_t entry = folio->swap;
ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+ VM_WARN_ON_ONCE(folio_test_dirty(folio) && folio_maybe_swapped(folio));
__swap_cache_del_folio(ci, folio, NULL, true, false);
swap_cluster_unlock(ci);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c0169bce46c9..2cd3e260f1bf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1972,9 +1972,11 @@ int swp_swapcount(swp_entry_t entry)
* decrease of swap count is possible through swap_put_entries_direct, so this
* may return a false positive.
*
+ * Caller can hold the ci lock to get a stable result.
+ *
* Context: Caller must ensure the folio is locked and in the swap cache.
*/
-static bool folio_maybe_swapped(struct folio *folio)
+bool folio_maybe_swapped(struct folio *folio)
{
swp_entry_t entry = folio->swap;
struct swap_cluster_info *ci;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5112f81cf875..4565c9c3ac60 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -755,11 +755,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
}
if (folio_test_swapcache(folio)) {
- swp_entry_t swap = folio->swap;
-
- if (reclaimed && !mapping_exiting(mapping))
- shadow = workingset_eviction(folio, target_memcg);
- __swap_cache_del_folio(ci, folio, shadow, true, true);
+ __swap_cache_del_folio(ci, folio, NULL, true, true);
swap_cluster_unlock_irq(ci);
} else {
void (*free_folio)(struct folio *);
diff --git a/mm/workingset.c b/mm/workingset.c
index 37a94979900f..765a954baefa 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -202,12 +202,18 @@ static unsigned int bucket_order[ANON_AND_FILE] __read_mostly;
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
bool workingset, bool file)
{
+ void *shadow;
+
eviction &= file ? EVICTION_MASK : EVICTION_MASK_ANON;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
eviction = (eviction << WORKINGSET_SHIFT) | workingset;
- return xa_mk_value(eviction);
+ shadow = xa_mk_value(eviction);
+ /* Sanity check for retrieving memcgid from anon shadow. */
+ VM_WARN_ON_ONCE(shadow_to_memcgid(shadow) != memcgid);
+
+ return shadow;
}
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
@@ -232,7 +238,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
#ifdef CONFIG_LRU_GEN
-static void *lru_gen_eviction(struct folio *folio)
+static void *lru_gen_eviction(struct folio *folio, struct mem_cgroup *memcg)
{
int hist;
unsigned long token;
@@ -244,7 +250,6 @@ static void *lru_gen_eviction(struct folio *folio)
int refs = folio_lru_refs(folio);
bool workingset = folio_test_workingset(folio);
int tier = lru_tier_from_refs(refs, workingset);
- struct mem_cgroup *memcg = folio_memcg(folio);
struct pglist_data *pgdat = folio_pgdat(folio);
BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH >
@@ -252,6 +257,7 @@ static void *lru_gen_eviction(struct folio *folio)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
+ memcg = lruvec_memcg(lruvec);
min_seq = READ_ONCE(lrugen->min_seq[type]);
token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
@@ -329,7 +335,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
#else /* !CONFIG_LRU_GEN */
-static void *lru_gen_eviction(struct folio *folio)
+static void *lru_gen_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
{
return NULL;
}
@@ -396,7 +402,7 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
if (lru_gen_enabled())
- return lru_gen_eviction(folio);
+ return lru_gen_eviction(folio, target_memcg);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
/* XXX: target_memcg can be NULL, go through lruvec */
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 09/15] mm, swap: support flexible batch freeing of slots in different memcg
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (7 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 08/15] mm, swap: store and check memcg info in the swap table Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 10/15] mm, swap: always retrieve memcg id from swap table Kairui Song via B4 Relay
` (6 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
Instead of letting the caller ensure all slots are in the same memcg,
make __swap_cluster_free_entries() able to handle slots from different
memcgs at once.
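The grouping idea, as a minimal user-space sketch: walk the slots
being freed and, whenever the owning cgroup id changes, flush one
uncharge for the contiguous run seen so far. The id array and flush()
below are illustrative stand-ins, not kernel code.

#include <stdio.h>

static void flush(unsigned short id, unsigned int start, unsigned int nr)
{
        printf("uncharge %u slot(s) [%u..%u) from memcg id %u\n",
               nr, start, start + nr, id);
}

int main(void)
{
        /* cgroup id recorded for each slot being freed (0 == none) */
        unsigned short ids[] = { 5, 5, 5, 7, 7, 0, 0, 5 };
        unsigned int nr = sizeof(ids) / sizeof(ids[0]);
        unsigned short id = 0;
        unsigned int i, batch = 0;

        for (i = 0; i < nr; i++) {
                if (ids[i] != id) {
                        if (id)
                                flush(id, batch, i - batch);
                        id = ids[i];
                        batch = i;
                }
        }
        if (id)
                flush(id, batch, nr - batch);
        return 0;
}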
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2cd3e260f1bf..cd2d3b2ca6f0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1883,10 +1883,13 @@ void __swap_cluster_free_entries(struct swap_info_struct *si,
struct swap_cluster_info *ci,
unsigned int ci_start, unsigned int nr_pages)
{
+ void *shadow;
unsigned long old_tb;
- unsigned short id;
+ unsigned int type = si->type;
+ unsigned int id = 0, id_iter, id_check;
unsigned int ci_off = ci_start, ci_end = ci_start + nr_pages;
- unsigned long offset = cluster_offset(si, ci) + ci_start;
+ unsigned long offset = cluster_offset(si, ci);
+ unsigned int ci_batch = ci_off;
VM_WARN_ON(ci->count < nr_pages);
@@ -1896,13 +1899,29 @@ void __swap_cluster_free_entries(struct swap_info_struct *si,
/* Release the last ref, or after swap cache is dropped */
VM_WARN_ON(!swp_tb_is_shadow(old_tb) || __swp_tb_get_count(old_tb) > 1);
__swap_table_set(ci, ci_off, null_to_swp_tb());
+
+ shadow = swp_tb_to_shadow(old_tb);
+ id_iter = shadow_to_memcgid(shadow);
+ if (id != id_iter) {
+ if (id) {
+ id_check = swap_cgroup_clear(swp_entry(type, offset + ci_batch),
+ ci_off - ci_batch);
+ WARN_ON(id != id_check);
+ mem_cgroup_uncharge_swap(id, ci_off - ci_batch);
+ }
+ id = id_iter;
+ ci_batch = ci_off;
+ }
} while (++ci_off < ci_end);
- id = swap_cgroup_clear(swp_entry(si->type, offset), nr_pages);
- if (id)
- mem_cgroup_uncharge_swap(id, nr_pages);
+ if (id) {
+ id_check = swap_cgroup_clear(swp_entry(type, offset + ci_batch),
+ ci_off - ci_batch);
+ WARN_ON(id != id_check);
+ mem_cgroup_uncharge_swap(id, ci_off - ci_batch);
+ }
- swap_range_free(si, offset, nr_pages);
+ swap_range_free(si, offset + ci_start, nr_pages);
swap_cluster_assert_empty(ci, ci_start, nr_pages, false);
if (!ci->count)
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 10/15] mm, swap: always retrieve memcg id from swap table
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (8 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 09/15] mm, swap: support flexible batch freeing of slots in different memcg Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 11/15] mm/swap, memcg: remove swap cgroup array Kairui Song via B4 Relay
` (5 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
Transition mem_cgroup_swapin_charge_folio() to receive the memcg id
from the caller via the swap table shadow entry, demoting the old
swap cgroup array lookup to a sanity check. Also remove the per-PTE
cgroup id batching break from swap_pte_batch(), since swap can now
free slots across mem cgroups.
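A toy model of what the swap_pte_batch() change means: batching now
only cares that the swap offsets stay consecutive, and the per-PTE
cgroup id comparison is gone. Representing PTEs as a plain offset
array is an illustrative simplification, not the real PTE walk.

#include <assert.h>

/* Model: a "swap PTE" is just its swap offset here. */
static int swap_pte_batch(const unsigned long *offsets, int max_nr)
{
        int nr = 1;

        /*
         * Batch while offsets remain consecutive; no per-PTE cgroup id
         * check is needed any more, since freeing can span memcgs.
         */
        while (nr < max_nr && offsets[nr] == offsets[nr - 1] + 1)
                nr++;
        return nr;
}

int main(void)
{
        unsigned long offsets[] = { 100, 101, 102, 200, 201 };

        assert(swap_pte_batch(offsets, 5) == 3);
        return 0;
}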
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/memcontrol.h | 6 ++++--
mm/internal.h | 4 ----
mm/memcontrol.c | 9 ++++++---
mm/swap_state.c | 5 ++++-
4 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0b37d4faf785..8fc794baf736 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -667,7 +667,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
- gfp_t gfp, swp_entry_t entry);
+ gfp_t gfp, swp_entry_t entry,
+ unsigned short id);
void __mem_cgroup_uncharge(struct folio *folio);
@@ -1145,7 +1146,8 @@ static inline int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp)
}
static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
- struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
+ struct mm_struct *mm, gfp_t gfp, swp_entry_t entry,
+ unsigned short id)
{
return 0;
}
diff --git a/mm/internal.h b/mm/internal.h
index 5bbe081c9048..416d3401aa17 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -452,12 +452,10 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
const pte_t *end_ptep = start_ptep + max_nr;
const softleaf_t entry = softleaf_from_pte(pte);
pte_t *ptep = start_ptep + 1;
- unsigned short cgroup_id;
VM_WARN_ON(max_nr < 1);
VM_WARN_ON(!softleaf_is_swap(entry));
- cgroup_id = lookup_swap_cgroup_id(entry);
while (ptep < end_ptep) {
softleaf_t entry;
@@ -466,8 +464,6 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
if (!pte_same(pte, expected_pte))
break;
entry = softleaf_from_pte(pte);
- if (lookup_swap_cgroup_id(entry) != cgroup_id)
- break;
expected_pte = pte_next_swp_offset(expected_pte);
ptep++;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d9ff44b77409..d0f50019d733 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4794,6 +4794,7 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
* @mm: mm context of the victim
* @gfp: reclaim mode
* @entry: swap entry for which the folio is allocated
+ * @id: the mem cgroup id
*
* This function charges a folio allocated for swapin. Please call this before
* adding the folio to the swapcache.
@@ -4801,19 +4802,21 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
* Returns 0 on success. Otherwise, an error code is returned.
*/
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
- gfp_t gfp, swp_entry_t entry)
+ gfp_t gfp, swp_entry_t entry, unsigned short id)
{
struct mem_cgroup *memcg, *swap_memcg;
+ unsigned short memcg_id;
unsigned int nr_pages;
- unsigned short id;
int ret;
if (mem_cgroup_disabled())
return 0;
- id = lookup_swap_cgroup_id(entry);
+ memcg_id = lookup_swap_cgroup_id(entry);
nr_pages = folio_nr_pages(folio);
+ WARN_ON_ONCE(id != memcg_id);
+
rcu_read_lock();
swap_memcg = mem_cgroup_from_private_id(id);
if (!swap_memcg) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index cc4bf40320ef..5ab3a41fe42c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -251,8 +251,11 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
__swap_cache_add_folio(ci, folio, entry);
spin_unlock(&ci->lock);
+ /* With swap table, we must have a shadow, for memcg tracking */
+ WARN_ON(!shadow);
+
if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL,
- gfp, entry)) {
+ gfp, entry, shadow_to_memcgid(shadow))) {
spin_lock(&ci->lock);
__swap_cache_del_folio(ci, folio, shadow, false, false);
spin_unlock(&ci->lock);
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 11/15] mm/swap, memcg: remove swap cgroup array
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (9 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 10/15] mm, swap: always retrieve memcg id from swap table Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 12/15] mm, swap: merge zeromap into swap table Kairui Song via B4 Relay
` (4 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the swap table contains the swap cgroup info all the time,
the swap cgroup array can be dropped.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
MAINTAINERS | 1 -
include/linux/memcontrol.h | 6 +-
include/linux/swap_cgroup.h | 47 ------------
mm/Makefile | 3 -
mm/internal.h | 1 -
mm/memcontrol-v1.c | 1 -
mm/memcontrol.c | 19 ++---
mm/swap_cgroup.c | 171 --------------------------------------------
mm/swap_state.c | 3 +-
mm/swapfile.c | 23 +-----
10 files changed, 11 insertions(+), 264 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index aa1734a12887..05e633611e0b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6571,7 +6571,6 @@ F: mm/memcontrol.c
F: mm/memcontrol-v1.c
F: mm/memcontrol-v1.h
F: mm/page_counter.c
-F: mm/swap_cgroup.c
F: samples/cgroup/*
F: tools/testing/selftests/cgroup/memcg_protection.m
F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8fc794baf736..4bfe905bffb0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -667,8 +667,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
- gfp_t gfp, swp_entry_t entry,
- unsigned short id);
+ gfp_t gfp, unsigned short id);
void __mem_cgroup_uncharge(struct folio *folio);
@@ -1146,8 +1145,7 @@ static inline int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp)
}
static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
- struct mm_struct *mm, gfp_t gfp, swp_entry_t entry,
- unsigned short id)
+ struct mm_struct *mm, gfp_t gfp, unsigned short id)
{
return 0;
}
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
deleted file mode 100644
index 91cdf12190a0..000000000000
--- a/include/linux/swap_cgroup.h
+++ /dev/null
@@ -1,47 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __LINUX_SWAP_CGROUP_H
-#define __LINUX_SWAP_CGROUP_H
-
-#include <linux/swap.h>
-
-#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
-
-extern void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_t ent);
-extern unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents);
-extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
-extern int swap_cgroup_swapon(int type, unsigned long max_pages);
-extern void swap_cgroup_swapoff(int type);
-
-#else
-
-static inline
-void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_t ent)
-{
-}
-
-static inline
-unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
-{
- return 0;
-}
-
-static inline
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
-{
- return 0;
-}
-
-static inline int
-swap_cgroup_swapon(int type, unsigned long max_pages)
-{
- return 0;
-}
-
-static inline void swap_cgroup_swapoff(int type)
-{
- return;
-}
-
-#endif
-
-#endif /* __LINUX_SWAP_CGROUP_H */
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..eff9f9e7e061 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -103,9 +103,6 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_LIVEUPDATE_MEMFD) += memfd_luo.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
-ifdef CONFIG_SWAP
-obj-$(CONFIG_MEMCG) += swap_cgroup.o
-endif
ifdef CONFIG_BPF_SYSCALL
obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
endif
diff --git a/mm/internal.h b/mm/internal.h
index 416d3401aa17..26691885d75f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -16,7 +16,6 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/leafops.h>
-#include <linux/swap_cgroup.h>
#include <linux/tracepoint-defs.h>
/* Internal core VMA manipulation functions. */
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 038e630dc7e1..eff18eda0707 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -5,7 +5,6 @@
#include <linux/mm_inline.h>
#include <linux/pagewalk.h>
#include <linux/backing-dev.h>
-#include <linux/swap_cgroup.h>
#include <linux/eventfd.h>
#include <linux/poll.h>
#include <linux/sort.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d0f50019d733..8d0c9f3a011e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,7 +54,6 @@
#include <linux/vmpressure.h>
#include <linux/memremap.h>
#include <linux/mm_inline.h>
-#include <linux/swap_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
#include <linux/lockdep.h>
@@ -4793,7 +4792,6 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
* @folio: folio to charge.
* @mm: mm context of the victim
* @gfp: reclaim mode
- * @entry: swap entry for which the folio is allocated
* @id: the mem cgroup id
*
* This function charges a folio allocated for swapin. Please call this before
@@ -4802,21 +4800,17 @@ int mem_cgroup_charge_hugetlb(struct folio *folio, gfp_t gfp)
* Returns 0 on success. Otherwise, an error code is returned.
*/
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
- gfp_t gfp, swp_entry_t entry, unsigned short id)
+ gfp_t gfp, unsigned short id)
{
struct mem_cgroup *memcg, *swap_memcg;
- unsigned short memcg_id;
unsigned int nr_pages;
int ret;
if (mem_cgroup_disabled())
return 0;
- memcg_id = lookup_swap_cgroup_id(entry);
nr_pages = folio_nr_pages(folio);
- WARN_ON_ONCE(id != memcg_id);
-
rcu_read_lock();
swap_memcg = mem_cgroup_from_private_id(id);
if (!swap_memcg) {
@@ -4836,10 +4830,11 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
/*
* On successful charge, the folio itself now belongs to the memcg,
- * so is folio->swap. So we can release the swap cgroup table's
- * pinning of the private id.
+ * so is folio->swap. And the folio takes place of the shadow in
+ * the swap table so we can release the shadow's pinning of the
+ * private id.
*/
- swap_cgroup_clear(folio->swap, nr_pages);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
mem_cgroup_private_id_put(swap_memcg, nr_pages);
/*
@@ -5324,8 +5319,6 @@ struct mem_cgroup *__mem_cgroup_swap_free_folio(struct folio *folio,
{
unsigned int nr_pages = folio_nr_pages(folio);
struct mem_cgroup *memcg, *swap_memcg;
- swp_entry_t entry = folio->swap;
- unsigned short id;
VM_WARN_ON_ONCE_FOLIO(!folio_memcg_charged(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
@@ -5337,8 +5330,6 @@ struct mem_cgroup *__mem_cgroup_swap_free_folio(struct folio *folio,
*/
memcg = folio_memcg(folio);
swap_memcg = mem_cgroup_private_id_get_online(memcg, nr_pages);
- id = mem_cgroup_private_id(swap_memcg);
- swap_cgroup_record(folio, id, entry);
if (reclaim && do_memsw_account()) {
memcg1_swapout(folio, swap_memcg);
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
deleted file mode 100644
index b5a7f21c3afe..000000000000
--- a/mm/swap_cgroup.c
+++ /dev/null
@@ -1,171 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/swap_cgroup.h>
-#include <linux/vmalloc.h>
-#include <linux/mm.h>
-
-#include <linux/swapops.h> /* depends on mm.h include */
-
-static DEFINE_MUTEX(swap_cgroup_mutex);
-
-/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
-#define ID_PER_SC (sizeof(struct swap_cgroup) / sizeof(unsigned short))
-#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
-#define ID_MASK (BIT(ID_SHIFT) - 1)
-struct swap_cgroup {
- atomic_t ids;
-};
-
-struct swap_cgroup_ctrl {
- struct swap_cgroup *map;
-};
-
-static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
-
-static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
- pgoff_t offset)
-{
- unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
- unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
-
- BUILD_BUG_ON(!is_power_of_2(ID_PER_SC));
- BUILD_BUG_ON(sizeof(struct swap_cgroup) != sizeof(atomic_t));
-
- return (old_ids >> shift) & ID_MASK;
-}
-
-static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
- pgoff_t offset,
- unsigned short new_id)
-{
- unsigned short old_id;
- struct swap_cgroup *sc = &map[offset / ID_PER_SC];
- unsigned int shift = (offset % ID_PER_SC) * ID_SHIFT;
- unsigned int new_ids, old_ids = atomic_read(&sc->ids);
-
- do {
- old_id = (old_ids >> shift) & ID_MASK;
- new_ids = (old_ids & ~(ID_MASK << shift));
- new_ids |= ((unsigned int)new_id) << shift;
- } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
-
- return old_id;
-}
-
-/**
- * swap_cgroup_record - record mem_cgroup for a set of swap entries.
- * These entries must belong to one single folio, and that folio
- * must be being charged for swap space (swap out).
- *
- * @folio: the folio that the swap entry belongs to
- * @id: mem_cgroup ID to be recorded
- * @ent: the first swap entry to be recorded
- */
-void swap_cgroup_record(struct folio *folio, unsigned short id,
- swp_entry_t ent)
-{
- unsigned int nr_ents = folio_nr_pages(folio);
- struct swap_cgroup *map;
- pgoff_t offset, end;
- unsigned short old;
-
- offset = swp_offset(ent);
- end = offset + nr_ents;
- map = swap_cgroup_ctrl[swp_type(ent)].map;
-
- do {
- old = __swap_cgroup_id_xchg(map, offset, id);
- VM_WARN_ON_ONCE(old);
- } while (++offset != end);
-}
-
-/**
- * swap_cgroup_clear - clear mem_cgroup for a set of swap entries.
- * These entries must be being uncharged from swap. They either
- * belongs to one single folio in the swap cache (swap in for
- * cgroup v1), or no longer have any users (slot freeing).
- *
- * @ent: the first swap entry to be recorded into
- * @nr_ents: number of swap entries to be recorded
- *
- * Returns the existing old value.
- */
-unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
-{
- pgoff_t offset, end;
- struct swap_cgroup *map;
- unsigned short old, iter = 0;
-
- offset = swp_offset(ent);
- end = offset + nr_ents;
- map = swap_cgroup_ctrl[swp_type(ent)].map;
-
- do {
- old = __swap_cgroup_id_xchg(map, offset, 0);
- if (!iter)
- iter = old;
- VM_BUG_ON(iter != old);
- } while (++offset != end);
-
- return old;
-}
-
-/**
- * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
- * @ent: swap entry to be looked up.
- *
- * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
- */
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
-{
- struct swap_cgroup_ctrl *ctrl;
-
- if (mem_cgroup_disabled())
- return 0;
-
- ctrl = &swap_cgroup_ctrl[swp_type(ent)];
- return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
-}
-
-int swap_cgroup_swapon(int type, unsigned long max_pages)
-{
- struct swap_cgroup *map;
- struct swap_cgroup_ctrl *ctrl;
-
- if (mem_cgroup_disabled())
- return 0;
-
- BUILD_BUG_ON(sizeof(unsigned short) * ID_PER_SC !=
- sizeof(struct swap_cgroup));
- map = vzalloc(DIV_ROUND_UP(max_pages, ID_PER_SC) *
- sizeof(struct swap_cgroup));
- if (!map)
- goto nomem;
-
- ctrl = &swap_cgroup_ctrl[type];
- mutex_lock(&swap_cgroup_mutex);
- ctrl->map = map;
- mutex_unlock(&swap_cgroup_mutex);
-
- return 0;
-nomem:
- pr_info("couldn't allocate enough memory for swap_cgroup\n");
- pr_info("swap_cgroup can be disabled by swapaccount=0 boot option\n");
- return -ENOMEM;
-}
-
-void swap_cgroup_swapoff(int type)
-{
- struct swap_cgroup *map;
- struct swap_cgroup_ctrl *ctrl;
-
- if (mem_cgroup_disabled())
- return;
-
- mutex_lock(&swap_cgroup_mutex);
- ctrl = &swap_cgroup_ctrl[type];
- map = ctrl->map;
- ctrl->map = NULL;
- mutex_unlock(&swap_cgroup_mutex);
-
- vfree(map);
-}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5ab3a41fe42c..c6ba15de4094 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -163,7 +163,6 @@ static int __swap_cache_check_batch(struct swap_cluster_info *ci,
*shadowp = swp_tb_to_shadow(old_tb);
memcgid = shadow_to_memcgid(*shadowp);
- WARN_ON_ONCE(!mem_cgroup_disabled() && !memcgid);
do {
old_tb = __swap_table_get(ci, ci_off);
if (unlikely(swp_tb_is_folio(old_tb)) ||
@@ -255,7 +254,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
WARN_ON(!shadow);
if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL,
- gfp, entry, shadow_to_memcgid(shadow))) {
+ gfp, shadow_to_memcgid(shadow))) {
spin_lock(&ci->lock);
__swap_cache_del_folio(ci, folio, shadow, false, false);
spin_unlock(&ci->lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cd2d3b2ca6f0..de34f1990209 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -45,7 +45,6 @@
#include <asm/tlbflush.h>
#include <linux/leafops.h>
-#include <linux/swap_cgroup.h>
#include "swap_table.h"
#include "internal.h"
#include "swap_table.h"
@@ -1885,8 +1884,7 @@ void __swap_cluster_free_entries(struct swap_info_struct *si,
{
void *shadow;
unsigned long old_tb;
- unsigned int type = si->type;
- unsigned int id = 0, id_iter, id_check;
+ unsigned int id = 0, id_iter;
unsigned int ci_off = ci_start, ci_end = ci_start + nr_pages;
unsigned long offset = cluster_offset(si, ci);
unsigned int ci_batch = ci_off;
@@ -1903,23 +1901,15 @@ void __swap_cluster_free_entries(struct swap_info_struct *si,
shadow = swp_tb_to_shadow(old_tb);
id_iter = shadow_to_memcgid(shadow);
if (id != id_iter) {
- if (id) {
- id_check = swap_cgroup_clear(swp_entry(type, offset + ci_batch),
- ci_off - ci_batch);
- WARN_ON(id != id_check);
+ if (id)
mem_cgroup_uncharge_swap(id, ci_off - ci_batch);
- }
id = id_iter;
ci_batch = ci_off;
}
} while (++ci_off < ci_end);
- if (id) {
- id_check = swap_cgroup_clear(swp_entry(type, offset + ci_batch),
- ci_off - ci_batch);
- WARN_ON(id != id_check);
+ if (id)
mem_cgroup_uncharge_swap(id, ci_off - ci_batch);
- }
swap_range_free(si, offset + ci_start, nr_pages);
swap_cluster_assert_empty(ci, ci_start, nr_pages, false);
@@ -3034,8 +3024,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->global_cluster = NULL;
kvfree(zeromap);
free_swap_cluster_info(cluster_info, maxpages);
- /* Destroy swap account information */
- swap_cgroup_swapoff(p->type);
inode = mapping->host;
@@ -3567,10 +3555,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap_unlock_inode;
- error = swap_cgroup_swapon(si->type, maxpages);
- if (error)
- goto bad_swap_unlock_inode;
-
/*
* Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
* be above MAX_PAGE_ORDER incase of a large swap file.
@@ -3681,7 +3665,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
si->global_cluster = NULL;
inode = NULL;
destroy_swap_extents(si, swap_file);
- swap_cgroup_swapoff(si->type);
free_swap_cluster_info(si->cluster_info, si->max);
si->cluster_info = NULL;
kvfree(si->zeromap);
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 12/15] mm, swap: merge zeromap into swap table
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (10 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 11/15] mm/swap, memcg: remove swap cgroup array Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 13/15] mm: ghost swapfile support for zswap Kairui Song via B4 Relay
` (3 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
By reserving one extra bit next to the counting part, we can easily
merge the zeromap into the swap table.
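As a rough illustration of the layout change (a simplified, standalone
sketch with assumed widths; the real macros and helpers are in the
mm/swap_table.h hunks below), the zero mark is simply one more flag bit
packed next to the count, above the shadow/PFN payload:

#include <linux/bits.h>
#include <linux/types.h>

/*
 * Simplified illustration only: pack a "zero-filled" flag next to the
 * swap count in the top bits of a table entry, so no separate zeromap
 * bitmap is needed. Widths here are assumptions; see SWP_TB_ZERO_MARK
 * and __swap_table_set_zero() in the diff below for the real thing.
 */
#define TB_COUNT_BITS	4
#define TB_COUNT_SHIFT	(BITS_PER_LONG - TB_COUNT_BITS)
#define TB_ZERO_MARK	(1UL << (TB_COUNT_SHIFT - 1))	/* bit just below the count */

static inline unsigned long tb_set_zero(unsigned long tb)   { return tb | TB_ZERO_MARK; }
static inline unsigned long tb_clear_zero(unsigned long tb) { return tb & ~TB_ZERO_MARK; }
static inline bool tb_is_zero(unsigned long tb)             { return tb & TB_ZERO_MARK; }
static inline unsigned int tb_count(unsigned long tb)       { return tb >> TB_COUNT_SHIFT; }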
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 -
mm/memory.c | 12 ++----
mm/page_io.c | 28 ++++++++++----
mm/swap.h | 31 ----------------
mm/swap_state.c | 23 ++++++++----
mm/swap_table.h | 103 +++++++++++++++++++++++++++++++++++++++++++--------
mm/swapfile.c | 27 +-------------
7 files changed, 127 insertions(+), 98 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 66cf657a1f35..bc871d8a1e99 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,7 +254,6 @@ struct swap_info_struct {
struct plist_node list; /* entry in swap_active_head */
signed char type; /* strange name for an index */
unsigned int max; /* size of this swap device */
- unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
struct list_head full_clusters; /* full clusters list */
diff --git a/mm/memory.c b/mm/memory.c
index e58f976508b3..8df169fced0d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -88,6 +88,7 @@
#include "pgalloc-track.h"
#include "internal.h"
+#include "swap_table.h"
#include "swap.h"
#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
@@ -4522,13 +4523,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
- * Check if the PTEs within a range are contiguous swap entries
- * and have consistent swapcache, zeromap.
+ * Check if the PTEs within a range are contiguous swap entries.
*/
static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
{
unsigned long addr;
- softleaf_t entry;
int idx;
pte_t pte;
@@ -4538,18 +4537,13 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
return false;
- entry = softleaf_from_pte(pte);
- if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
- return false;
-
/*
* swap_read_folio() can't handle the case a large folio is hybridly
* from different backends. And they are likely corner cases. Similar
* things might be added once zswap support large folios.
*/
- if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
+ if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
return false;
-
return true;
}
diff --git a/mm/page_io.c b/mm/page_io.c
index a2c034660c80..5a0b5034489b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -26,6 +26,7 @@
#include <linux/delayacct.h>
#include <linux/zswap.h>
#include "swap.h"
+#include "swap_table.h"
static void __end_swap_bio_write(struct bio *bio)
{
@@ -204,15 +205,20 @@ static bool is_folio_zero_filled(struct folio *folio)
static void swap_zeromap_folio_set(struct folio *folio)
{
struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
int nr_pages = folio_nr_pages(folio);
+ struct swap_cluster_info *ci;
swp_entry_t entry;
unsigned int i;
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+
+ ci = swap_cluster_get_and_lock(folio);
for (i = 0; i < folio_nr_pages(folio); i++) {
entry = page_swap_entry(folio_page(folio, i));
- set_bit(swp_offset(entry), sis->zeromap);
+ __swap_table_set_zero(ci, swp_cluster_offset(entry));
}
+ swap_cluster_unlock(ci);
count_vm_events(SWPOUT_ZERO, nr_pages);
if (objcg) {
@@ -223,14 +229,19 @@ static void swap_zeromap_folio_set(struct folio *folio)
static void swap_zeromap_folio_clear(struct folio *folio)
{
- struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
+ struct swap_cluster_info *ci;
swp_entry_t entry;
unsigned int i;
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+
+ ci = swap_cluster_get_and_lock(folio);
for (i = 0; i < folio_nr_pages(folio); i++) {
entry = page_swap_entry(folio_page(folio, i));
- clear_bit(swp_offset(entry), sis->zeromap);
+ __swap_table_clear_zero(ci, swp_cluster_offset(entry));
}
+ swap_cluster_unlock(ci);
}
/*
@@ -255,10 +266,9 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
}
/*
- * Use a bitmap (zeromap) to avoid doing IO for zero-filled pages.
- * The bits in zeromap are protected by the locked swapcache folio
- * and atomic updates are used to protect against read-modify-write
- * corruption due to other zero swap entries seeing concurrent updates.
+ * Use the swap table zero mark to avoid doing IO for zero-filled
+ * pages. The zero mark is protected by the cluster lock, which is
+ * acquired internally by swap_zeromap_folio_set/clear.
*/
if (is_folio_zero_filled(folio)) {
swap_zeromap_folio_set(folio);
@@ -511,6 +521,8 @@ static bool swap_read_folio_zeromap(struct folio *folio)
struct obj_cgroup *objcg;
bool is_zeromap;
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+
/*
* Swapping in a large folio that is partially in the zeromap is not
* currently handled. Return true without marking the folio uptodate so
diff --git a/mm/swap.h b/mm/swap.h
index c95f5fafea42..cb1ab20d83d5 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -312,31 +312,6 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
return __swap_entry_to_info(folio->swap)->flags;
}
-/*
- * Return the count of contiguous swap entries that share the same
- * zeromap status as the starting entry. If is_zeromap is not NULL,
- * it will return the zeromap status of the starting entry.
- */
-static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
- bool *is_zeromap)
-{
- struct swap_info_struct *sis = __swap_entry_to_info(entry);
- unsigned long start = swp_offset(entry);
- unsigned long end = start + max_nr;
- bool first_bit;
-
- first_bit = test_bit(start, sis->zeromap);
- if (is_zeromap)
- *is_zeromap = first_bit;
-
- if (max_nr <= 1)
- return max_nr;
- if (first_bit)
- return find_next_zero_bit(sis->zeromap, end, start) - start;
- else
- return find_next_bit(sis->zeromap, end, start) - start;
-}
-
#else /* CONFIG_SWAP */
struct swap_iocb;
static inline struct swap_cluster_info *swap_cluster_lock(
@@ -475,11 +450,5 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
return 0;
}
-
-static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
- bool *has_zeromap)
-{
- return 0;
-}
#endif /* CONFIG_SWAP */
#endif /* _MM_SWAP_H */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c6ba15de4094..419419e18a47 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -138,6 +138,7 @@ void *swap_cache_get_shadow(swp_entry_t entry)
}
static int __swap_cache_check_batch(struct swap_cluster_info *ci,
+ swp_entry_t entry,
unsigned int ci_off, unsigned int ci_targ,
unsigned int nr, void **shadowp)
{
@@ -148,6 +149,13 @@ static int __swap_cache_check_batch(struct swap_cluster_info *ci,
if (unlikely(!ci->table))
return -ENOENT;
+ /*
+ * TODO: Swap of large folio that is partially in the zeromap
+ * is not supported.
+ */
+ if (nr > 1 && swap_zeromap_batch(entry, nr, NULL) != nr)
+ return -EBUSY;
+
/*
* If the target slot is not suitable for adding swap cache, return
* -EEXIST or -ENOENT. If the batch is not suitable, could be a
@@ -190,7 +198,7 @@ void __swap_cache_add_folio(struct swap_cluster_info *ci,
do {
old_tb = __swap_table_get(ci, ci_off);
VM_WARN_ON_ONCE(swp_tb_is_folio(old_tb));
- __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb)));
+ __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_flags(old_tb)));
} while (++ci_off < ci_end);
folio_ref_add(folio, nr_pages);
@@ -218,7 +226,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
/* First check if the range is available */
spin_lock(&ci->lock);
- err = __swap_cache_check_batch(ci, ci_off, ci_targ, nr_pages, &shadow);
+ err = __swap_cache_check_batch(ci, entry, ci_off, ci_targ, nr_pages, &shadow);
spin_unlock(&ci->lock);
if (unlikely(err))
return ERR_PTR(err);
@@ -236,7 +244,7 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
/* Double check the range is still not in conflict */
spin_lock(&ci->lock);
- err = __swap_cache_check_batch(ci, ci_off, ci_targ, nr_pages, &shadow_check);
+ err = __swap_cache_check_batch(ci, entry, ci_off, ci_targ, nr_pages, &shadow_check);
if (unlikely(err) || shadow_check != shadow) {
spin_unlock(&ci->lock);
folio_put(folio);
@@ -338,7 +346,6 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp_mask,
void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
void *shadow, bool charged, bool reclaim)
{
- int count;
unsigned long old_tb;
struct swap_info_struct *si;
struct mem_cgroup *memcg = NULL;
@@ -375,13 +382,13 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
old_tb = __swap_table_get(ci, ci_off);
WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
swp_tb_to_folio(old_tb) != folio);
- count = __swp_tb_get_count(old_tb);
- if (count)
+ if (__swp_tb_get_count(old_tb))
folio_swapped = true;
else
need_free = true;
/* If shadow is NULL, we sets an empty shadow. */
- __swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow, count));
+ __swap_table_set(ci, ci_off, shadow_to_swp_tb(shadow,
+ __swp_tb_get_flags(old_tb)));
} while (++ci_off < ci_end);
folio->swap.val = 0;
@@ -460,7 +467,7 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
do {
old_tb = __swap_table_get(ci, ci_off);
WARN_ON_ONCE(!swp_tb_is_folio(old_tb) || swp_tb_to_folio(old_tb) != old);
- __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_count(old_tb)));
+ __swap_table_set(ci, ci_off, pfn_to_swp_tb(pfn, __swp_tb_get_flags(old_tb)));
} while (++ci_off < ci_end);
/*
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 8415ffbe2b9c..6d3d773e1908 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -21,12 +21,14 @@ struct swap_table {
* Swap table entry type and bits layouts:
*
* NULL: |---------------- 0 ---------------| - Free slot
- * Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1| - Swapped out slot
- * PFN: | SWAP_COUNT |------ PFN -------|10| - Cached slot
+ * Shadow: |SWAP_COUNT|Z|---- SHADOW_VAL ---|1| - Swapped out slot
+ * PFN: |SWAP_COUNT|Z|------ PFN -------|10| - Cached slot
* Pointer: |----------- Pointer ----------|100| - (Unused)
* Bad: |------------- 1 -------------|1000| - Bad slot
*
- * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long.
+ * COUNT is `SWP_TB_COUNT_BITS` long, Z is the `SWP_TB_ZERO_MARK` bit,
+ * and together they form the `SWP_TB_FLAGS_BITS` wide flags field.
+ * Each entry is an atomic long.
*
* Usages:
*
@@ -70,16 +72,21 @@ struct swap_table {
#define SWP_TB_PFN_MARK_MASK (BIT(SWP_TB_PFN_MARK_BITS) - 1)
/* SWAP_COUNT part for PFN or shadow, the width can be shrunk or extended */
-#define SWP_TB_COUNT_BITS min(4, BITS_PER_LONG - SWP_TB_PFN_BITS)
+#define SWP_TB_FLAGS_BITS min(5, BITS_PER_LONG - SWP_TB_PFN_BITS)
+#define SWP_TB_COUNT_BITS (SWP_TB_FLAGS_BITS - 1)
+#define SWP_TB_FLAGS_MASK (~((~0UL) >> SWP_TB_FLAGS_BITS))
#define SWP_TB_COUNT_MASK (~((~0UL) >> SWP_TB_COUNT_BITS))
+#define SWP_TB_FLAGS_SHIFT (BITS_PER_LONG - SWP_TB_FLAGS_BITS)
#define SWP_TB_COUNT_SHIFT (BITS_PER_LONG - SWP_TB_COUNT_BITS)
#define SWP_TB_COUNT_MAX ((1 << SWP_TB_COUNT_BITS) - 1)
+#define SWP_TB_ZERO_MARK BIT(BITS_PER_LONG - SWP_TB_COUNT_BITS - 1)
+
/* Bad slot: ends with 0b1000 and rests of bits are all 1 */
#define SWP_TB_BAD ((~0UL) << 3)
/* Macro for shadow offset calculation */
-#define SWAP_COUNT_SHIFT SWP_TB_COUNT_BITS
+#define SWAP_COUNT_SHIFT SWP_TB_FLAGS_BITS
/*
* Helpers for casting one type of info into a swap table entry.
@@ -102,35 +109,43 @@ static inline unsigned long __count_to_swp_tb(unsigned char count)
return ((unsigned long)count) << SWP_TB_COUNT_SHIFT;
}
-static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned int count)
+static inline unsigned long __flags_to_swp_tb(unsigned char flags)
+{
+ BUILD_BUG_ON(SWP_TB_FLAGS_BITS > BITS_PER_BYTE);
+ VM_WARN_ON((flags >> 1) > SWP_TB_COUNT_MAX);
+ return ((unsigned long)flags) << SWP_TB_FLAGS_SHIFT;
+}
+
+
+static inline unsigned long pfn_to_swp_tb(unsigned long pfn, unsigned char flags)
{
unsigned long swp_tb;
BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
BUILD_BUG_ON(SWAP_CACHE_PFN_BITS >
- (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_COUNT_BITS));
+ (BITS_PER_LONG - SWP_TB_PFN_MARK_BITS - SWP_TB_FLAGS_BITS));
swp_tb = (pfn << SWP_TB_PFN_MARK_BITS) | SWP_TB_PFN_MARK;
- VM_WARN_ON_ONCE(swp_tb & SWP_TB_COUNT_MASK);
+ VM_WARN_ON_ONCE(swp_tb & SWP_TB_FLAGS_MASK);
- return swp_tb | __count_to_swp_tb(count);
+ return swp_tb | __flags_to_swp_tb(flags);
}
-static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned int count)
+static inline unsigned long folio_to_swp_tb(struct folio *folio, unsigned char flags)
{
- return pfn_to_swp_tb(folio_pfn(folio), count);
+ return pfn_to_swp_tb(folio_pfn(folio), flags);
}
-static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned int count)
+static inline unsigned long shadow_to_swp_tb(void *shadow, unsigned char flags)
{
BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
BITS_PER_BYTE * sizeof(unsigned long));
BUILD_BUG_ON((unsigned long)xa_mk_value(0) != SWP_TB_SHADOW_MARK);
VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
- VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_COUNT_MASK));
+ VM_WARN_ON_ONCE(shadow && ((unsigned long)shadow & SWP_TB_FLAGS_MASK));
- return (unsigned long)shadow | __count_to_swp_tb(count) | SWP_TB_SHADOW_MARK;
+ return (unsigned long)shadow | SWP_TB_SHADOW_MARK | __flags_to_swp_tb(flags);
}
/*
@@ -168,14 +183,14 @@ static inline bool swp_tb_is_countable(unsigned long swp_tb)
static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
{
VM_WARN_ON(!swp_tb_is_folio(swp_tb));
- return pfn_folio((swp_tb & ~SWP_TB_COUNT_MASK) >> SWP_TB_PFN_MARK_BITS);
+ return pfn_folio((swp_tb & ~SWP_TB_FLAGS_MASK) >> SWP_TB_PFN_MARK_BITS);
}
static inline void *swp_tb_to_shadow(unsigned long swp_tb)
{
VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
/* No shift needed, xa_value is stored as it is in the lower bits. */
- return (void *)(swp_tb & ~SWP_TB_COUNT_MASK);
+ return (void *)(swp_tb & ~SWP_TB_FLAGS_MASK);
}
static inline unsigned char __swp_tb_get_count(unsigned long swp_tb)
@@ -184,6 +199,12 @@ static inline unsigned char __swp_tb_get_count(unsigned long swp_tb)
return ((swp_tb & SWP_TB_COUNT_MASK) >> SWP_TB_COUNT_SHIFT);
}
+static inline unsigned char __swp_tb_get_flags(unsigned long swp_tb)
+{
+ VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+ return ((swp_tb & SWP_TB_FLAGS_MASK) >> SWP_TB_FLAGS_SHIFT);
+}
+
static inline int swp_tb_get_count(unsigned long swp_tb)
{
if (swp_tb_is_countable(swp_tb))
@@ -247,4 +268,54 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
return swp_tb;
}
+
+static inline void __swap_table_set_zero(struct swap_cluster_info *ci,
+ unsigned int ci_off)
+{
+ unsigned long swp_tb = __swap_table_get(ci, ci_off);
+
+ VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+ swp_tb |= SWP_TB_ZERO_MARK;
+ __swap_table_set(ci, ci_off, swp_tb);
+}
+
+static inline void __swap_table_clear_zero(struct swap_cluster_info *ci,
+ unsigned int ci_off)
+{
+ unsigned long swp_tb = __swap_table_get(ci, ci_off);
+
+ VM_WARN_ON(!swp_tb_is_countable(swp_tb));
+ swp_tb &= ~SWP_TB_ZERO_MARK;
+ __swap_table_set(ci, ci_off, swp_tb);
+}
+
+/**
+ * Return the count of contiguous swap entries that share the same
+ * zeromap status as the starting entry. If is_zerop is not NULL,
+ * it will return the zeromap status of the starting entry.
+ *
+ * Context: Caller must ensure the cluster containing the entries
+ * that will be checked won't be freed.
+ */
+static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
+ bool *is_zerop)
+{
+ bool is_zero;
+ unsigned long swp_tb;
+ struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
+ unsigned int ci_start = swp_cluster_offset(entry), ci_off, ci_end;
+
+ ci_off = ci_start;
+ ci_end = ci_off + max_nr;
+ swp_tb = swap_table_get(ci, ci_off);
+ is_zero = !!(swp_tb & SWP_TB_ZERO_MARK);
+ if (is_zerop)
+ *is_zerop = is_zero;
+ while (++ci_off < ci_end) {
+ swp_tb = swap_table_get(ci, ci_off);
+ if (is_zero != !!(swp_tb & SWP_TB_ZERO_MARK))
+ break;
+ }
+ return ci_off - ci_start;
+}
#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index de34f1990209..4018e8694b72 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -918,7 +918,7 @@ static bool __swap_cluster_alloc_entries(struct swap_info_struct *si,
nr_pages = 1;
swap_cluster_assert_empty(ci, ci_off, 1, false);
/* Sets a fake shadow as placeholder */
- __swap_table_set(ci, ci_off, shadow_to_swp_tb(NULL, 1));
+ __swap_table_set(ci, ci_off, __swp_tb_mk_count(shadow_to_swp_tb(NULL, 0), 1));
} else {
/* Allocation without folio is only possible with hibernation */
WARN_ON_ONCE(1);
@@ -1308,14 +1308,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
unsigned int i;
- /*
- * Use atomic clear_bit operations only on zeromap instead of non-atomic
- * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
- */
- for (i = 0; i < nr_entries; i++) {
- clear_bit(offset + i, si->zeromap);
+ for (i = 0; i < nr_entries; i++)
zswap_invalidate(swp_entry(si->type, offset + i));
- }
if (si->flags & SWP_BLKDEV)
swap_slot_free_notify =
@@ -2921,7 +2915,6 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
{
struct swap_info_struct *p = NULL;
- unsigned long *zeromap;
struct swap_cluster_info *cluster_info;
struct file *swap_file, *victim;
struct address_space *mapping;
@@ -3009,8 +3002,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
swap_file = p->swap_file;
p->swap_file = NULL;
- zeromap = p->zeromap;
- p->zeromap = NULL;
maxpages = p->max;
cluster_info = p->cluster_info;
p->max = 0;
@@ -3022,7 +3013,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
mutex_unlock(&swapon_mutex);
kfree(p->global_cluster);
p->global_cluster = NULL;
- kvfree(zeromap);
free_swap_cluster_info(cluster_info, maxpages);
inode = mapping->host;
@@ -3555,17 +3545,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap_unlock_inode;
- /*
- * Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
- * be above MAX_PAGE_ORDER incase of a large swap file.
- */
- si->zeromap = kvmalloc_array(BITS_TO_LONGS(maxpages), sizeof(long),
- GFP_KERNEL | __GFP_ZERO);
- if (!si->zeromap) {
- error = -ENOMEM;
- goto bad_swap_unlock_inode;
- }
-
if (si->bdev && bdev_stable_writes(si->bdev))
si->flags |= SWP_STABLE_WRITES;
@@ -3667,8 +3646,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
destroy_swap_extents(si, swap_file);
free_swap_cluster_info(si->cluster_info, si->max);
si->cluster_info = NULL;
- kvfree(si->zeromap);
- si->zeromap = NULL;
/*
* Clear the SWP_USED flag after all resources are freed so
* alloc_swap_info can reuse this si safely.
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 13/15] mm: ghost swapfile support for zswap
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (11 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 12/15] mm, swap: merge zeromap into swap table Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 14/15] mm, swap: add a special device for ghost swap setup Kairui Song via B4 Relay
` (2 subsequent siblings)
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Chris Li <chrisl@kernel.org>
The current zswap requires a backing swapfile. A swap slot used by
zswap cannot be used by the swapfile at the same time, which wastes
swapfile space.
The ghost swapfile is a swapfile that only contains the swapfile header,
for use by zswap. The swapfile header indicates the size of the
swapfile. There is no swap data section in the ghost swapfile, so no
swapfile space is wasted. As a consequence, any write to a ghost
swapfile will fail. To prevent accidental reads or writes of a ghost
swapfile, the bdev of swap_info_struct is set to NULL. A ghost swapfile
also sets the SSD flag because there is no rotating disk access when
using zswap.
Zswap writeback is disabled if all swapfiles in the system are ghost
swapfiles.
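For illustration: since the detection below keys off a file size of
exactly one page, a header-only ghost swapfile could be produced from
userspace roughly as in the sketch below. This is a hedged example that
assumes the standard SWAPSPACE2 header layout (version at byte 1024,
last_page at byte 1028, magic in the last 10 bytes of the page); the
next patch adds /dev/ghostswap so that no such tool is needed.

/*
 * Illustrative userspace sketch only: build a one-page, header-only
 * file that read_swap_header() below would treat as a ghost swapfile
 * (it checks for i_size == PAGE_SIZE). Header offsets are assumptions
 * based on the standard swap header layout.
 */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

static int make_ghost_swapfile(const char *path, uint32_t last_page)
{
	unsigned char page[PAGE_SIZE] = { 0 };
	uint32_t version = 1;
	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
		return -1;
	memcpy(page + 1024, &version, sizeof(version));		/* info.version */
	memcpy(page + 1028, &last_page, sizeof(last_page));	/* info.last_page */
	memcpy(page + PAGE_SIZE - 10, "SWAPSPACE2", 10);	/* magic */
	if (write(fd, page, PAGE_SIZE) != PAGE_SIZE) {
		close(fd);
		return -1;
	}
	return close(fd);
}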
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 2 ++
mm/page_io.c | 18 +++++++++++++++---
mm/swap.h | 2 +-
mm/swapfile.c | 42 +++++++++++++++++++++++++++++++++++++-----
mm/zswap.c | 12 +++++++++---
5 files changed, 64 insertions(+), 12 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc871d8a1e99..3b2efd319f44 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ enum {
SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */
SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */
SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
+ SWP_GHOST = (1 << 13), /* not backed by anything */
/* add others here before... */
};
@@ -419,6 +420,7 @@ void free_folio_and_swap_cache(struct folio *folio);
void free_pages_and_swap_cache(struct encoded_page **, int);
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
+extern atomic_t nr_real_swapfiles;
extern long total_swap_pages;
extern atomic_t nr_rotate_swap;
diff --git a/mm/page_io.c b/mm/page_io.c
index 5a0b5034489b..f4a5fc0863f5 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -291,8 +291,7 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
return AOP_WRITEPAGE_ACTIVATE;
}
- __swap_writepage(folio, swap_plug);
- return 0;
+ return __swap_writepage(folio, swap_plug);
out_unlock:
folio_unlock(folio);
return ret;
@@ -454,11 +453,18 @@ static void swap_writepage_bdev_async(struct folio *folio,
submit_bio(bio);
}
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
{
struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (sis->flags & SWP_GHOST) {
+ /* Prevent the page from getting reclaimed. */
+ folio_set_dirty(folio);
+ return AOP_WRITEPAGE_ACTIVATE;
+ }
+
/*
* ->flags can be updated non-atomically (scan_swap_map_slots),
* but that will never affect SWP_FS_OPS, so the data_race
@@ -475,6 +481,7 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
swap_writepage_bdev_sync(folio, sis);
else
swap_writepage_bdev_async(folio, sis);
+ return 0;
}
void swap_write_unplug(struct swap_iocb *sio)
@@ -649,6 +656,11 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
if (zswap_load(folio) != -ENOENT)
goto finish;
+ if (unlikely(sis->flags & SWP_GHOST)) {
+ folio_unlock(folio);
+ goto finish;
+ }
+
/* We have to read from slower devices. Increase zswap protection. */
zswap_folio_swapin(folio);
diff --git a/mm/swap.h b/mm/swap.h
index cb1ab20d83d5..55aa6d904afd 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -226,7 +226,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
}
void swap_write_unplug(struct swap_iocb *sio);
int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
/* linux/mm/swap_state.c */
extern struct address_space swap_space __read_mostly;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4018e8694b72..65666c43cbd5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -67,6 +67,7 @@ static void move_cluster(struct swap_info_struct *si,
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
atomic_long_t nr_swap_pages;
+atomic_t nr_real_swapfiles;
/*
* Some modules use swappable objects and may try to swap them out under
* memory pressure (via the shrinker). Before doing so, they may wish to
@@ -1211,6 +1212,8 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
goto skip;
}
+ if (!(si->flags & SWP_GHOST))
+ atomic_sub(1, &nr_real_swapfiles);
plist_del(&si->avail_list, &swap_avail_head);
skip:
@@ -1253,6 +1256,8 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
}
plist_add(&si->avail_list, &swap_avail_head);
+ if (!(si->flags & SWP_GHOST))
+ atomic_add(1, &nr_real_swapfiles);
skip:
spin_unlock(&swap_avail_lock);
@@ -2793,6 +2798,11 @@ static int setup_swap_extents(struct swap_info_struct *sis,
struct inode *inode = mapping->host;
int ret;
+ if (sis->flags & SWP_GHOST) {
+ *span = 0;
+ return 0;
+ }
+
if (S_ISBLK(inode->i_mode)) {
ret = add_swap_extent(sis, 0, sis->max, 0);
*span = sis->pages;
@@ -2992,7 +3002,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
destroy_swap_extents(p, p->swap_file);
- if (!(p->flags & SWP_SOLIDSTATE))
+ if (!(p->flags & SWP_GHOST) &&
+ !(p->flags & SWP_SOLIDSTATE))
atomic_dec(&nr_rotate_swap);
mutex_lock(&swapon_mutex);
@@ -3102,6 +3113,19 @@ static void swap_stop(struct seq_file *swap, void *v)
mutex_unlock(&swapon_mutex);
}
+static const char *swap_type_str(struct swap_info_struct *si)
+{
+ struct file *file = si->swap_file;
+
+ if (si->flags & SWP_GHOST)
+ return "ghost\t";
+
+ if (S_ISBLK(file_inode(file)->i_mode))
+ return "partition";
+
+ return "file\t";
+}
+
static int swap_show(struct seq_file *swap, void *v)
{
struct swap_info_struct *si = v;
@@ -3121,8 +3145,7 @@ static int swap_show(struct seq_file *swap, void *v)
len = seq_file_path(swap, file, " \t\n\\");
seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\n",
len < 40 ? 40 - len : 1, " ",
- S_ISBLK(file_inode(file)->i_mode) ?
- "partition" : "file\t",
+ swap_type_str(si),
bytes, bytes < 10000000 ? "\t" : "",
inuse, inuse < 10000000 ? "\t" : "",
si->prio);
@@ -3254,7 +3277,6 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
return 0;
}
-
/*
* Find out how many pages are allowed for a single swap device. There
* are two limiting factors:
@@ -3300,6 +3322,7 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
unsigned long maxpages;
unsigned long swapfilepages;
unsigned long last_page;
+ loff_t size;
if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) {
pr_err("Unable to find swap-space signature\n");
@@ -3342,7 +3365,16 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
if (!maxpages)
return 0;
- swapfilepages = i_size_read(inode) >> PAGE_SHIFT;
+
+ size = i_size_read(inode);
+ if (size == PAGE_SIZE) {
+ /* Ghost swapfile */
+ si->bdev = NULL;
+ si->flags |= SWP_GHOST | SWP_SOLIDSTATE;
+ return maxpages;
+ }
+
+ swapfilepages = size >> PAGE_SHIFT;
if (swapfilepages && maxpages > swapfilepages) {
pr_warn("Swap area shorter than signature indicates\n");
return 0;
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d83539a8bba..e470f697e770 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -995,11 +995,16 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
struct swap_info_struct *si;
int ret = 0;
- /* try to allocate swap cache folio */
si = get_swap_device(swpentry);
if (!si)
return -EEXIST;
+ if (si->flags & SWP_GHOST) {
+ put_swap_device(si);
+ return -EINVAL;
+ }
+
+ /* try to allocate swap cache folio */
mpol = get_task_policy(current);
folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, 0, NULL, mpol,
NO_INTERLEAVE_INDEX);
@@ -1052,7 +1057,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
folio_set_reclaim(folio);
/* start writeback */
- __swap_writepage(folio, NULL);
+ ret = __swap_writepage(folio, NULL);
+ WARN_ON_ONCE(ret);
out:
if (ret) {
@@ -1536,7 +1542,7 @@ bool zswap_store(struct folio *folio)
zswap_pool_put(pool);
put_objcg:
obj_cgroup_put(objcg);
- if (!ret && zswap_pool_reached_full)
+ if (!ret && zswap_pool_reached_full && atomic_read(&nr_real_swapfiles))
queue_work(shrink_wq, &zswap_shrink_work);
check_old:
/*
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 14/15] mm, swap: add a special device for ghost swap setup
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (12 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 13/15] mm: ghost swapfile support for zswap Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-19 23:42 ` [PATCH RFC 15/15] mm, swap: allocate cluster dynamically for ghost swapfile Kairui Song via B4 Relay
2026-02-21 8:15 ` [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic " Barry Song
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
Use /dev/ghostswap as a special device so userspace can set up ghost
swap easily without any extra tools.
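With this in place, activating ghost swap from userspace can be as
simple as the sketch below (illustrative only; it assumes the device
node created by this patch and needs CAP_SYS_ADMIN). The util-linux
command "swapon /dev/ghostswap" should work as well, since reading the
device yields a valid swap header.

/*
 * Illustrative only: activate ghost swap with swapon(2) on the char
 * device added by this patch. No mkswap step is needed because
 * /dev/ghostswap synthesizes its own swap header on read.
 */
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
	if (swapon("/dev/ghostswap", 0)) {
		perror("swapon(/dev/ghostswap)");
		return 1;
	}
	return 0;
}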
Signed-off-by: Kairui Song <kasong@tencent.com>
---
drivers/char/mem.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/swap.h | 2 ++
mm/swapfile.c | 22 +++++++++++++++++++---
3 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index cca4529431f8..8d0eb3f7d191 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -30,6 +30,7 @@
#include <linux/uio.h>
#include <linux/uaccess.h>
#include <linux/security.h>
+#include <linux/swap.h>
#define DEVMEM_MINOR 1
#define DEVPORT_MINOR 4
@@ -667,6 +668,41 @@ static const struct file_operations null_fops = {
.uring_cmd = uring_cmd_null,
};
+#ifdef CONFIG_SWAP
+static ssize_t read_ghostswap(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ union swap_header *hdr;
+ size_t to_copy;
+
+ if (*ppos >= PAGE_SIZE)
+ return 0;
+
+ hdr = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!hdr)
+ return -ENOMEM;
+
+ hdr->info.version = 1;
+ hdr->info.last_page = totalram_pages() - 1;
+ memcpy(hdr->magic.magic, "SWAPSPACE2", 10);
+ to_copy = min_t(size_t, count, PAGE_SIZE - *ppos);
+ if (copy_to_user(buf, (char *)hdr + *ppos, to_copy)) {
+ kfree(hdr);
+ return -EFAULT;
+ }
+
+ kfree(hdr);
+ *ppos += to_copy;
+ return to_copy;
+}
+
+static const struct file_operations ghostswap_fops = {
+ .llseek = null_lseek,
+ .read = read_ghostswap,
+ .write = write_null,
+};
+#endif
+
#ifdef CONFIG_DEVPORT
static const struct file_operations port_fops = {
.llseek = memory_lseek,
@@ -718,6 +754,9 @@ static const struct memdev {
#ifdef CONFIG_PRINTK
[11] = { "kmsg", &kmsg_fops, 0, 0644 },
#endif
+#ifdef CONFIG_SWAP
+ [DEVGHOST_MINOR] = { "ghostswap", &ghostswap_fops, 0, 0660 },
+#endif
};
static int memory_open(struct inode *inode, struct file *filp)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3b2efd319f44..b57a4a40f4fe 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -421,6 +421,8 @@ void free_pages_and_swap_cache(struct encoded_page **, int);
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern atomic_t nr_real_swapfiles;
+
+#define DEVGHOST_MINOR 13 /* /dev/ghostswap char device minor */
extern long total_swap_pages;
extern atomic_t nr_rotate_swap;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 65666c43cbd5..d054f40ec75f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
#include <linux/suspend.h>
#include <linux/zswap.h>
#include <linux/plist.h>
+#include <linux/major.h>
#include <asm/tlbflush.h>
#include <linux/leafops.h>
@@ -1703,6 +1704,7 @@ int folio_alloc_swap(struct folio *folio)
unsigned int size = 1 << order;
struct swap_cluster_info *ci;
+ VM_WARN_ON_FOLIO(folio_test_swapcache(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -3421,6 +3423,10 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
err = swap_cluster_setup_bad_slot(si, cluster_info, 0, false);
if (err)
goto err;
+
+ if (!swap_header)
+ goto setup_cluster_info;
+
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
@@ -3440,6 +3446,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
goto err;
}
+setup_cluster_info:
INIT_LIST_HEAD(&si->free_clusters);
INIT_LIST_HEAD(&si->full_clusters);
INIT_LIST_HEAD(&si->discard_clusters);
@@ -3476,7 +3483,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
struct dentry *dentry;
int prio;
int error;
- union swap_header *swap_header;
+ union swap_header *swap_header = NULL;
int nr_extents;
sector_t span;
unsigned long maxpages;
@@ -3528,6 +3535,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
+ /* /dev/ghostswap: synthesize a ghost swap device. */
+ if (S_ISCHR(inode->i_mode) &&
+ imajor(inode) == MEM_MAJOR && iminor(inode) == DEVGHOST_MINOR) {
+ maxpages = round_up(totalram_pages(), SWAPFILE_CLUSTER);
+ si->flags |= SWP_GHOST | SWP_SOLIDSTATE;
+ si->bdev = NULL;
+ goto setup;
+ }
+
/*
* The swap subsystem needs a major overhaul to support this.
* It doesn't work yet so just disable it for now.
@@ -3550,13 +3566,13 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
swap_header = kmap_local_folio(folio, 0);
-
maxpages = read_swap_header(si, swap_header, inode);
if (unlikely(!maxpages)) {
error = -EINVAL;
goto bad_swap_unlock_inode;
}
+setup:
si->max = maxpages;
si->pages = maxpages - 1;
nr_extents = setup_swap_extents(si, swap_file, &span);
@@ -3585,7 +3601,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (si->bdev && bdev_nonrot(si->bdev)) {
si->flags |= SWP_SOLIDSTATE;
- } else {
+ } else if (!(si->flags & SWP_SOLIDSTATE)) {
atomic_inc(&nr_rotate_swap);
inced_nr_rotate_swap = true;
}
--
2.53.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH RFC 15/15] mm, swap: allocate cluster dynamically for ghost swapfile
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (13 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 14/15] mm, swap: add a special device for ghost swap setup Kairui Song via B4 Relay
@ 2026-02-19 23:42 ` Kairui Song via B4 Relay
2026-02-21 8:15 ` [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic " Barry Song
15 siblings, 0 replies; 19+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-19 23:42 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Barry Song, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now, the ghost swapfile is completely dynamic. For easier testing, this
commit makes /dev/ghostswap eight times the size of total RAM by
default.
NOTE: This commit is still a minimal proof of concept, so many parts of
the implementation can be improved.
And we have a ci_dyn->virtual_table that is ready to be used (not used
yet), for example for storing zswap's metadata. In theory the folio lock
can be used to stabilize its virtual table data.
e.g., swap entry writeback can also be done easily using a
folio_realloc_swap that skips the folio->swap's device and uses the
underlying devices. It will be easier to do if we remove the global
percpu cluster cache as suggested by [1], and it should just work with
tiering and priority. Just put the folio->swap as a reverse entry in the
lower layer's swap table, collect the lower level's swap entry in the
virtual_table, and it's all good.
And right now all allocations are atomic, which can also be improved,
as the swap table already has sleeping allocation support; it just needs
to be adapted.
The RCU lock protection convention can also be simplified.
But even without all that, this works pretty well. We can have a
"virtual swap" of any size with zero overhead, common stress tests are
showing very nice performance, ordinary swaps also have zero overhead,
and everything is runtime configurable.
But don't be too surprised if some corner cases are not well covered
yet, as most of the work is still focused on the infrastructure.
Link: https://lore.kernel.org/linux-mm/20260126065242.1221862-5-youngjun.park@lge.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 +
mm/swap.h | 44 +++++++++++++---
mm/swap_state.c | 35 ++++++++-----
mm/swap_table.h | 2 +
mm/swapfile.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++----
5 files changed, 199 insertions(+), 28 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b57a4a40f4fe..41d7eae56d65 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -284,6 +284,7 @@ struct swap_info_struct {
struct work_struct reclaim_work; /* reclaim worker */
struct list_head discard_clusters; /* discard clusters list */
struct plist_node avail_list; /* entry in swap_avail_head */
+ struct xarray cluster_info_pool; /* Xarray for ghost swap cluster info */
};
static inline swp_entry_t page_swap_entry(struct page *page)
diff --git a/mm/swap.h b/mm/swap.h
index 55aa6d904afd..7a4d1d939842 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -41,6 +41,13 @@ struct swap_cluster_info {
struct list_head list;
};
+struct swap_cluster_info_dynamic {
+ struct swap_cluster_info ci; /* Underlying cluster info */
+ unsigned int index; /* for cluster_index() */
+ struct rcu_head rcu; /* For kfree_rcu deferred free */
+ /* unsigned long *virtual_table; And we can easily have a virtual table */
+};
+
/* All on-list cluster must have a non-zero flag. */
enum swap_cluster_flags {
CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
@@ -51,6 +58,7 @@ enum swap_cluster_flags {
CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
CLUSTER_FLAG_FULL,
CLUSTER_FLAG_DISCARD,
+ CLUSTER_FLAG_DEAD, /* Ghost cluster pending kfree_rcu */
CLUSTER_FLAG_MAX,
};
@@ -84,9 +92,19 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
static inline struct swap_cluster_info *__swap_offset_to_cluster(
struct swap_info_struct *si, pgoff_t offset)
{
+ unsigned int cluster_idx = offset / SWAPFILE_CLUSTER;
+
VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
- return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+
+ if (si->flags & SWP_GHOST) {
+ struct swap_cluster_info_dynamic *ci_dyn;
+
+ ci_dyn = xa_load(&si->cluster_info_pool, cluster_idx);
+ return ci_dyn ? &ci_dyn->ci : NULL;
+ }
+
+ return &si->cluster_info[cluster_idx];
}
static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
@@ -98,7 +116,7 @@ static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entr
static __always_inline struct swap_cluster_info *__swap_cluster_lock(
struct swap_info_struct *si, unsigned long offset, bool irq)
{
- struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+ struct swap_cluster_info *ci;
/*
* Nothing modifies swap cache in an IRQ context. All access to
@@ -111,10 +129,24 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
*/
VM_WARN_ON_ONCE(!in_task());
VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
- if (irq)
- spin_lock_irq(&ci->lock);
- else
- spin_lock(&ci->lock);
+
+ rcu_read_lock();
+ ci = __swap_offset_to_cluster(si, offset);
+ if (ci) {
+ if (irq)
+ spin_lock_irq(&ci->lock);
+ else
+ spin_lock(&ci->lock);
+
+ if (ci->flags == CLUSTER_FLAG_DEAD) {
+ if (irq)
+ spin_unlock_irq(&ci->lock);
+ else
+ spin_unlock(&ci->lock);
+ ci = NULL;
+ }
+ }
+ rcu_read_unlock();
return ci;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 419419e18a47..1c3600a93ecd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -90,8 +90,10 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
struct folio *folio;
for (;;) {
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
if (!swp_tb_is_folio(swp_tb))
return NULL;
folio = swp_tb_to_folio(swp_tb);
@@ -113,8 +115,10 @@ bool swap_cache_has_folio(swp_entry_t entry)
{
unsigned long swp_tb;
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
return swp_tb_is_folio(swp_tb);
}
@@ -130,8 +134,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
{
unsigned long swp_tb;
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
if (swp_tb_is_shadow(swp_tb))
return swp_tb_to_shadow(swp_tb);
return NULL;
@@ -209,14 +215,14 @@ void __swap_cache_add_folio(struct swap_cluster_info *ci,
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
}
-static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
- swp_entry_t targ_entry, gfp_t gfp,
+static struct folio *__swap_cache_alloc(swp_entry_t targ_entry, gfp_t gfp,
unsigned int order, struct vm_fault *vmf,
struct mempolicy *mpol, pgoff_t ilx)
{
int err;
swp_entry_t entry;
struct folio *folio;
+ struct swap_cluster_info *ci;
void *shadow = NULL, *shadow_check = NULL;
unsigned long address, nr_pages = 1 << order;
unsigned int ci_off, ci_targ = swp_cluster_offset(targ_entry);
@@ -225,9 +231,12 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
ci_off = round_down(ci_targ, nr_pages);
/* First check if the range is available */
- spin_lock(&ci->lock);
- err = __swap_cache_check_batch(ci, entry, ci_off, ci_targ, nr_pages, &shadow);
- spin_unlock(&ci->lock);
+ err = -ENOENT;
+ ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+ if (ci) {
+ err = __swap_cache_check_batch(ci, entry, ci_off, ci_targ, nr_pages, &shadow);
+ swap_cluster_unlock(ci);
+ }
if (unlikely(err))
return ERR_PTR(err);
@@ -243,10 +252,13 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
return ERR_PTR(-ENOMEM);
/* Double check the range is still not in conflict */
- spin_lock(&ci->lock);
- err = __swap_cache_check_batch(ci, entry, ci_off, ci_targ, nr_pages, &shadow_check);
+ err = -ENOENT;
+ ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+ if (ci)
+ err = __swap_cache_check_batch(ci, entry, ci_off, ci_targ, nr_pages, &shadow_check);
if (unlikely(err) || shadow_check != shadow) {
- spin_unlock(&ci->lock);
+ if (ci)
+ swap_cluster_unlock(ci);
folio_put(folio);
/* If shadow changed, just try again */
@@ -256,13 +268,14 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
__swap_cache_add_folio(ci, folio, entry);
- spin_unlock(&ci->lock);
+ swap_cluster_unlock(ci);
/* With swap table, we must have a shadow, for memcg tracking */
WARN_ON(!shadow);
if (mem_cgroup_swapin_charge_folio(folio, vmf ? vmf->vma->vm_mm : NULL,
gfp, shadow_to_memcgid(shadow))) {
+ /* The folio pins the cluster */
spin_lock(&ci->lock);
__swap_cache_del_folio(ci, folio, shadow, false, false);
spin_unlock(&ci->lock);
@@ -305,13 +318,11 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp_mask,
{
int order, err;
struct folio *folio;
- struct swap_cluster_info *ci;
/* Always allow order 0 so swap won't fail under pressure. */
order = orders ? highest_order(orders |= BIT(0)) : 0;
- ci = __swap_entry_to_cluster(targ_entry);
for (;;) {
- folio = __swap_cache_alloc(ci, targ_entry, gfp_mask, order,
+ folio = __swap_cache_alloc(targ_entry, gfp_mask, order,
vmf, mpol, ilx);
if (!IS_ERR(folio))
return folio;
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 6d3d773e1908..867bcfff0e3c 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -260,6 +260,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
unsigned long swp_tb;
VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+ if (!ci)
+ return SWP_TB_NULL;
rcu_read_lock();
table = rcu_dereference(ci->table);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d054f40ec75f..f0682c8c8f53 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -404,6 +404,8 @@ static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
static inline unsigned int cluster_index(struct swap_info_struct *si,
struct swap_cluster_info *ci)
{
+ if (si->flags & SWP_GHOST)
+ return container_of(ci, struct swap_cluster_info_dynamic, ci)->index;
return ci - si->cluster_info;
}
@@ -708,6 +710,22 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
return;
}
+ if (si->flags & SWP_GHOST) {
+ struct swap_cluster_info_dynamic *ci_dyn;
+
+ ci_dyn = container_of(ci, struct swap_cluster_info_dynamic, ci);
+ if (ci->flags != CLUSTER_FLAG_NONE) {
+ spin_lock(&si->lock);
+ list_del(&ci->list);
+ spin_unlock(&si->lock);
+ }
+ swap_cluster_free_table(ci);
+ xa_erase(&si->cluster_info_pool, ci_dyn->index);
+ ci->flags = CLUSTER_FLAG_DEAD;
+ kfree_rcu(ci_dyn, rcu);
+ return;
+ }
+
__free_cluster(si, ci);
}
@@ -814,15 +832,17 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
* stolen by a lower order). @usable will be set to false if that happens.
*/
static bool cluster_reclaim_range(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
+ struct swap_cluster_info **pcip,
unsigned long start, unsigned int order,
bool *usable)
{
+ struct swap_cluster_info *ci = *pcip;
unsigned int nr_pages = 1 << order;
unsigned long offset = start, end = start + nr_pages;
unsigned long swp_tb;
spin_unlock(&ci->lock);
+ rcu_read_lock();
do {
swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
if (swp_tb_get_count(swp_tb))
@@ -831,7 +851,15 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0)
break;
} while (++offset < end);
- spin_lock(&ci->lock);
+ rcu_read_unlock();
+
+ /* Re-lookup: ghost cluster may have been freed while lock was dropped */
+ ci = swap_cluster_lock(si, start);
+ *pcip = ci;
+ if (!ci) {
+ *usable = false;
+ return false;
+ }
/*
* We just dropped ci->lock so cluster could be used by another
@@ -979,7 +1007,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
continue;
if (need_reclaim) {
- ret = cluster_reclaim_range(si, ci, offset, order, &usable);
+ ret = cluster_reclaim_range(si, &ci, offset, order,
+ &usable);
if (!usable)
goto out;
if (cluster_is_empty(ci))
@@ -1005,8 +1034,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
* should use a new cluster, and move the failed cluster to where it
* should be.
*/
- relocate_cluster(si, ci);
- swap_cluster_unlock(ci);
+ if (ci) {
+ relocate_cluster(si, ci);
+ swap_cluster_unlock(ci);
+ }
if (si->flags & SWP_SOLIDSTATE) {
this_cpu_write(percpu_swap_cluster.offset[order], next);
this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -1038,6 +1069,44 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
return found;
}
+static unsigned int alloc_swap_scan_dynamic(struct swap_info_struct *si,
+ struct folio *folio)
+{
+ struct swap_cluster_info_dynamic *ci_dyn;
+ struct swap_cluster_info *ci;
+ struct swap_table *table;
+ unsigned long offset;
+
+ WARN_ON(!(si->flags & SWP_GHOST));
+
+ ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_ATOMIC);
+ if (!ci_dyn)
+ return SWAP_ENTRY_INVALID;
+
+ table = swap_table_alloc(GFP_ATOMIC);
+ if (!table) {
+ kfree(ci_dyn);
+ return SWAP_ENTRY_INVALID;
+ }
+
+ spin_lock_init(&ci_dyn->ci.lock);
+ INIT_LIST_HEAD(&ci_dyn->ci.list);
+ rcu_assign_pointer(ci_dyn->ci.table, table);
+
+ if (xa_alloc(&si->cluster_info_pool, &ci_dyn->index, ci_dyn,
+ XA_LIMIT(1, DIV_ROUND_UP(si->max, SWAPFILE_CLUSTER) - 1),
+ GFP_ATOMIC)) {
+ swap_table_free(table);
+ kfree(ci_dyn);
+ return SWAP_ENTRY_INVALID;
+ }
+
+ ci = &ci_dyn->ci;
+ spin_lock(&ci->lock);
+ offset = cluster_offset(si, ci);
+ return alloc_swap_scan_cluster(si, ci, folio, offset);
+}
+
static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
{
long to_scan = 1;
@@ -1060,7 +1129,9 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
- spin_lock(&ci->lock);
+ ci = swap_cluster_lock(si, offset);
+ if (!ci)
+ goto next;
if (nr_reclaim) {
offset += abs(nr_reclaim);
continue;
@@ -1074,6 +1145,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
relocate_cluster(si, ci);
swap_cluster_unlock(ci);
+next:
if (to_scan <= 0)
break;
}
@@ -1136,6 +1208,12 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
goto done;
}
+ if (si->flags & SWP_GHOST) {
+ found = alloc_swap_scan_dynamic(si, folio);
+ if (found)
+ goto done;
+ }
+
if (!(si->flags & SWP_PAGE_DISCARD)) {
found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
if (found)
@@ -1375,7 +1453,8 @@ static bool swap_alloc_fast(struct folio *folio)
return false;
ci = swap_cluster_lock(si, offset);
- alloc_swap_scan_cluster(si, ci, folio, offset);
+ if (ci)
+ alloc_swap_scan_cluster(si, ci, folio, offset);
put_swap_device(si);
return folio_test_swapcache(folio);
}
@@ -1476,6 +1555,7 @@ int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
if (!si)
return 0;
+ /* Entry is in use (being faulted in), so its cluster is alive. */
ci = __swap_offset_to_cluster(si, offset);
ret = swap_extend_table_alloc(si, ci, gfp);
@@ -1996,6 +2076,7 @@ bool folio_maybe_swapped(struct folio *folio)
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ /* Folio is locked and in swap cache, so ci->count > 0: cluster is alive. */
ci = __swap_entry_to_cluster(entry);
ci_off = swp_cluster_offset(entry);
ci_end = ci_off + folio_nr_pages(folio);
@@ -2124,7 +2205,8 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
if (pcp_si == si && pcp_offset) {
ci = swap_cluster_lock(si, pcp_offset);
- offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
+ if (ci)
+ offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
}
if (offset == SWAP_ENTRY_INVALID)
offset = cluster_alloc_swap_entry(si, NULL);
@@ -2413,8 +2495,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
&vmf);
}
if (!folio) {
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
if (swp_tb_get_count(swp_tb) <= 0)
continue;
return -ENOMEM;
@@ -2560,8 +2644,10 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
* allocations from this area (while holding swap_lock).
*/
for (i = prev + 1; i < si->max; i++) {
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
i % SWAPFILE_CLUSTER);
+ rcu_read_unlock();
if (!swp_tb_is_null(swp_tb) && !swp_tb_is_bad(swp_tb))
break;
if ((i % LATENCY_LIMIT) == 0)
@@ -2874,6 +2960,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
struct swap_cluster_info *ci;
BUG_ON(si->flags & SWP_WRITEOK);
+ if (si->flags & SWP_GHOST)
+ return;
for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
ci = swap_cluster_lock(si, offset);
@@ -3394,10 +3482,47 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
unsigned long maxpages)
{
unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
- struct swap_cluster_info *cluster_info;
+ struct swap_cluster_info *cluster_info = NULL;
+ struct swap_cluster_info_dynamic *ci_dyn;
int err = -ENOMEM;
unsigned long i;
+ /* For SWP_GHOST files, initialize Xarray pool instead of static array */
+ if (si->flags & SWP_GHOST) {
+ /*
+ * Pre-allocate cluster 0 and mark slot 0 (header page)
+ * as bad so the allocator never hands out page offset 0.
+ */
+ ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_KERNEL);
+ if (!ci_dyn)
+ goto err;
+ spin_lock_init(&ci_dyn->ci.lock);
+ INIT_LIST_HEAD(&ci_dyn->ci.list);
+
+ nr_clusters = 0;
+ xa_init_flags(&si->cluster_info_pool, XA_FLAGS_ALLOC);
+ err = xa_insert(&si->cluster_info_pool, 0, ci_dyn, GFP_KERNEL);
+ if (err) {
+ kfree(ci_dyn);
+ goto err;
+ }
+
+ err = swap_cluster_setup_bad_slot(si, &ci_dyn->ci, 0, false);
+ if (err) {
+ struct swap_table *table;
+
+ xa_erase(&si->cluster_info_pool, 0);
+ table = (void *)rcu_dereference_protected(ci_dyn->ci.table, true);
+ if (table)
+ swap_table_free(table);
+ kfree(ci_dyn);
+ xa_destroy(&si->cluster_info_pool);
+ goto err;
+ }
+
+ goto setup_cluster_info;
+ }
+
cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
if (!cluster_info)
goto err;
@@ -3538,7 +3663,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
/* /dev/ghostswap: synthesize a ghost swap device. */
if (S_ISCHR(inode->i_mode) &&
imajor(inode) == MEM_MAJOR && iminor(inode) == DEVGHOST_MINOR) {
- maxpages = round_up(totalram_pages(), SWAPFILE_CLUSTER);
+ maxpages = round_up(totalram_pages(), SWAPFILE_CLUSTER) * 8;
si->flags |= SWP_GHOST | SWP_SOLIDSTATE;
si->bdev = NULL;
goto setup;
--
2.53.0
* Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile
2026-02-19 23:42 [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile Kairui Song via B4 Relay
` (14 preceding siblings ...)
2026-02-19 23:42 ` [PATCH RFC 15/15] mm, swap: allocate cluster dynamically for ghost swapfile Kairui Song via B4 Relay
@ 2026-02-21 8:15 ` Barry Song
2026-02-21 9:07 ` Kairui Song
15 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2026-02-21 8:15 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups
On Fri, Feb 20, 2026 at 7:42 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> NOTE for an RFC quality series: Swap table P4 is patch 1 - 12, and the
> dynamic ghost file is patch 13 - 15. Putting them together as RFC for
> easier review and discussions. Swap table P4 is stable and good to merge
> if we are OK with a few memcg reparent behavior (there is also a
> solution if we don't), dynamic ghost swap is yet a minimal proof of
> concept. See patch 15 for more details. And see below for Swap table 4
> cover letter (nice performance gain and memory save).
To be honest, I really dislike the name "ghost." I would
prefer something that reflects its actual functionality.
"Ghost" does not describe what it does and feels rather
arbitrary.
I suggest retiring the name "ghost" and replacing it with
something more appropriate. "vswap" could be a good option,
but Nhat is already using that name.
>
> This is based on the latest mm-unstable, swap table P3 [1] and patches
> [2] and [3], [4]. Sending this out early, as it might be helpful for us
> to get a cleaner picture of the ongoing efforts, make the discussions easier.
>
> Summary: With this approach, we can have an infinitely or dynamically
> large ghost which could be identical to "virtual swap", and support
> every feature we need while being *runtime configurable* with *zero
> overhead* for plain swap and keep the infrastructure unified. Also
> highly compatible with YoungJun's swap tiering [5], and other ideas like
> swap table compaction, swapops, as it aligns with a few proposals [6]
> [7] [8] [9] [10].
>
> In the past two years, most efforts have focused on the swap
> infrastructure, and we have made tremendous gains in performance,
> keeping the memory usage reasonable or lower, and also greatly cleaned
> up and simplified the API and conventions.
>
> Now the infrastructures are almost ready, after P4, implementing an
> infinitely or dynamically large swapfile can be done in a very easy to
> maintain and flexible way, code change is minimal and progressive
> for review, and makes future optimization like swap table compaction
> doable too, since the infrastructure is all the same for all swaps.
>
> The dynamic swap file is now using Xarray for the cluster info, and
> inside the cluster, it's all the same swap allocator, swap table, and
> existing infrastructures. A virtual table is available for any extra
> data or usage. See below for the benefits and what we can achieve.
>
> Huge thanks to Chris Li for the layered swap table and ghost swapfile
> idea, without whom the work here can't be archived. Also, thanks to Nhat
> for pushing and suggesting using an Xarray for the swapfile [11] for
> dynamic size. I was originally planning to use a dynamic cluster
> array, which requires a bit more adaptation, cleanup, and convention
> changes. But during the discussion there, I got the inspiration that
> Xarray can be used as the intermediate step, making this approach
> doable with minimal changes. Just keep using it in the future, it
> might not hurt too, as Xarray is only limited to ghost / virtual
> files, so plain swaps won't have any extra overhead for lookup or high
> risk of swapout allocation failure.
>
> I'm fully open and totally fine for suggestions on naming or API
> strategy, and others are highly welcome to keep the work going using
> this flexible approach. Following this approach, we will have all the
> following things progressively (some are already or almost there):
>
> - 8 bytes per slot memory usage, when using only plain swap.
> - And the memory usage can be reduced to 3 or only 1 byte.
> - 16 bytes per slot memory usage, when using ghost / virtual zswap.
> - Zswap can just use ci_dyn->virtual_table to free up it's content
> completely.
> - And the memory usage can be reduced to 11 or 8 bytes using the same
> code above.
> - 24 bytes only if including reverse mapping is in use.
> - Minimal code review or maintenance burden. All layers are using the exact
> same infrastructure for metadata / allocation / synchronization, making
> all API and conventions consistent and easy to maintain.
> - Writeback, migration and compaction are easily supportable since both
> reverse mapping and reallocation are prepared. We just need a
> folio_realloc_swap to allocate new entries for the existing entry, and
> fill the swap table with a reserve map entry.
> - Fast swapoff: Just read into ghost / virtual swap cache.
> - Zero static data (mostly due to swap table P4), even the clusters are
> dynamic (If using Xarray, only for ghost / virtual swap file).
> - So we can have an infinitely sized swap space with no static data
> overhead.
> - Everything is runtime configurable, and high-performance. An
> uncompressible workload or an offline batch workload can directly use a
> plain or remote swap for the lowest interference, memory usage, or for
> best performance.
> - Highly compatible with YoungJun's swap tiering, even the ghost / virtual
> file can be just a tier. For example, if you have a huge NBD that doesn't
> care about fragmentation and compression, or the workload is
> uncompressible, setting the workload to use NBD's tier will give you only
> 8 bytes of overhead per slot and peak performance, bypassing everything.
> Meanwhile, other workloads or cgroups can still use the ghost layer with
> compression or defragmentation using 16 bytes (zswap only) or 24 bytes
> (ghost swap with physical writeback) overhead.
> - No force or breaking change to any existing allocation, priority, swap
> setup, or reclaim strategy. Ghost / virtual swap can be enabled or
> disabled using swapon / swapoff.
>
> And if you consider these ops are too complex to set up and maintain, we
> can then only allow one ghost / virtual file, make it infinitely large,
> and be the default one and top tier, then it achieves the identical thing
> to virtual swap space, but with much fewer LOC changed and being runtime
> optional.
>
> Currently, the dynamic ghost files are just reported as ordinary swap files
> in /proc/swaps and we can have multiple ones, so users will have a full
> view of what's going on. This is a very easy-to-change design decision.
> I'm open to ideas about how we should present this to users. e.g., Hiding
> it will make it more "virtual", but I don't think that's a good idea.
Even if it remains visible in /proc/swaps, I would rather
not represent it as a real file in any filesystem. Putting
a "ghost" swapfile on something like ext4 seems unnatural.
>
> The size of the swapfile (si->max) is now just a number, which could be
> changeable at runtime if we have a proper idea how to expose that and
> might need some audit of a few remaining users. But right now, we can
> already easily have a huge swap device with no overhead, for example:
>
> free -m
> total used free shared buff/cache available
> Mem: 1465 250 927 1 356 1215
> Swap: 15269887 0 15269887
>
> And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
> /dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
> users, including ZRAM, won't observe any change.
Is /dev/ghostswap assumed to be a virtual block device or
something similar? If it is a block device, how is its size
related to si->size?
Looking at [PATCH RFC 14/15] mm, swap: add a special device
for ghost swap setup, it appears to be a character device.
This feels very odd to me. I’m not in favor of coupling the
ghost swapfile with a memdev character device.
A cdev should be a true character device.
>
> ===
>
> Original cover letter for swap table phase IV:
>
> This series unifies the allocation and charging process of anon and shmem,
> provides better synchronization, and consolidates cgroup tracking, hence
> dropping the cgroup array and improving the performance of mTHP by about
> ~15%.
>
> Still testing with build kernel under great pressure, enabling mTHP 256kB,
> on an EPYC 7K62 using 16G ZRAM, make -j48 with 1G memory limit, 12 test
> runs:
>
> Before: 2215.55s system, 2:53.03 elapsed
> After: 1852.14s system, 2:41.44 elapsed (16.4% faster system time)
>
> In some workloads, the speed gain is more than that since this reduces
> memory thrashing, so even IO-bound work could benefit a lot, and I no
> longer see any: "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying
> PF", it was shown from time to time before this series.
>
> Now, the swap cache layer ensures a folio will be the exclusive owner of
> the swap slot, then charge it, which leads to much smaller thrashing when
> under pressure.
>
> And besides, the swap cgroup static array is gone, so for example, mounting
> a 1TB swap device saves about 512MB of memory:
>
> Before:
> total used free shared buff/cache available
> Mem: 1465 854 331 1 347 610
> Swap: 1048575 0 1048575
>
> After:
> total used free shared buff/cache available
> Mem: 1465 332 838 1 363 1133
> Swap: 1048575 0 1048575
>
> It saves us ~512M of memory, we now have close to 0 static overhead.
>
> Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-0-f4e34be021a7@tencent.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20260213-memcg-privid-v1-1-d8cb7afcf831@tencent.com/ [2]
> Link: https://lore.kernel.org/linux-mm/20260211-shmem-swap-gfp-v1-1-e9781099a861@tencent.com/ [3]
> Link: https://lore.kernel.org/linux-mm/20260216-hibernate-perf-v4-0-1ba9f0bf1ec9@tencent.com/ [4]
> Link: https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/ [5]
> Link: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/ [6]
> Link: https://lwn.net/Articles/974587/ [7]
> Link: https://lwn.net/Articles/932077/ [8]
> Link: https://lwn.net/Articles/1016136/ [9]
> Link: https://lore.kernel.org/linux-mm/20260208215839.87595-1-nphamcs@gmail.com/ [10]
> Link: https://lore.kernel.org/linux-mm/CAKEwX=OUni7PuUqGQUhbMDtErurFN_i=1RgzyQsNXy4LABhXoA@mail.gmail.com/ [11]
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> Chris Li (1):
> mm: ghost swapfile support for zswap
>
> Kairui Song (14):
> mm: move thp_limit_gfp_mask to header
> mm, swap: simplify swap_cache_alloc_folio
> mm, swap: move conflict checking logic of out swap cache adding
> mm, swap: add support for large order folios in swap cache directly
> mm, swap: unify large folio allocation
> memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead
> memcg, swap: defer the recording of memcg info and reparent flexibly
> mm, swap: store and check memcg info in the swap table
> mm, swap: support flexible batch freeing of slots in different memcg
> mm, swap: always retrieve memcg id from swap table
> mm/swap, memcg: remove swap cgroup array
> mm, swap: merge zeromap into swap table
> mm, swap: add a special device for ghost swap setup
> mm, swap: allocate cluster dynamically for ghost swapfile
>
> MAINTAINERS | 1 -
> drivers/char/mem.c | 39 ++++
> include/linux/huge_mm.h | 24 +++
> include/linux/memcontrol.h | 12 +-
> include/linux/swap.h | 30 ++-
> include/linux/swap_cgroup.h | 47 -----
> mm/Makefile | 3 -
> mm/internal.h | 25 ++-
> mm/memcontrol-v1.c | 78 ++++----
> mm/memcontrol.c | 119 ++++++++++--
> mm/memory.c | 89 ++-------
> mm/page_io.c | 46 +++--
> mm/shmem.c | 122 +++---------
> mm/swap.h | 122 +++++-------
> mm/swap_cgroup.c | 172 ----------------
> mm/swap_state.c | 464 ++++++++++++++++++++++++--------------------
> mm/swap_table.h | 105 ++++++++--
> mm/swapfile.c | 278 ++++++++++++++++++++------
> mm/vmscan.c | 7 +-
> mm/workingset.c | 16 +-
> mm/zswap.c | 29 +--
> 21 files changed, 977 insertions(+), 851 deletions(-)
> ---
> base-commit: 4750368e2cd365ac1e02c6919013c8871f35d8f9
> change-id: 20260111-swap-table-p4-98ee92baa7c4
>
> Best regards,
> --
> Kairui Song <kasong@tencent.com>
>
>
Thanks
Barry
* Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile
2026-02-21 8:15 ` [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic " Barry Song
@ 2026-02-21 9:07 ` Kairui Song
2026-02-21 9:30 ` Barry Song
0 siblings, 1 reply; 19+ messages in thread
From: Kairui Song @ 2026-02-21 9:07 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups
On Sat, Feb 21, 2026 at 4:16 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Feb 20, 2026 at 7:42 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
>
> To be honest, I really dislike the name "ghost." I would
> prefer something that reflects its actual functionality.
> "Ghost" does not describe what it does and feels rather
> arbitrary.
Hi Barry,
That can be easily changed with a search and replace; I kept the
name because patch 13 comes directly from Chris and I simply didn't
rename it.
>
> I suggest retiring the name "ghost" and replacing it with
> something more appropriate. "vswap" could be a good option,
That looks good to me too. You can also check page 23 of my slides
from LSFMM last year to see how I imagined things would work out at
that time:
https://drive.google.com/file/d/1_QKlXErUkQ-TXmJJy79fJoLPui9TGK1S/view
The actual layout will be a bit different from that slide: the
redirect entries will live in the lower devices, and the virtual
device will have an extra virtual table to hold its redirect
entries. Still, I'm glad that plain swap keeps zero overhead, so
ZRAM or a high-performance NVMe device stays fast.
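For reference, the dynamic cluster wrapper can be reconstructed
from the hunks in patch 15 roughly as the sketch below. Only ci,
index and rcu are visible in the posted code (via container_of(),
xa_alloc() and kfree_rcu()); the virtual table member, its type and
the field order are assumptions based on the cover letter:

struct swap_cluster_info_dynamic {
	struct swap_cluster_info ci;	/* embedded; callers get back via container_of() */
	u32 index;			/* cluster index allocated from si->cluster_info_pool */
	struct rcu_head rcu;		/* dead clusters are freed with kfree_rcu(ci_dyn, rcu) */
	void *virtual_table;		/* assumed: extra per-cluster data, e.g. for zswap */
};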
> > Currently, the dynamic ghost files are just reported as ordinary swap files
> > in /proc/swaps and we can have multiple ones, so users will have a full
> > view of what's going on. This is a very easy-to-change design decision.
> > I'm open to ideas about how we should present this to users. e.g., Hiding
> > it will make it more "virtual", but I don't think that's a good idea.
>
> Even if it remains visible in /proc/swaps, I would rather
> not represent it as a real file in any filesystem. Putting
> a "ghost" swapfile on something like ext4 seems unnatural.
What do you think about this? Here is the output after this series:
# swapon
NAME           TYPE      SIZE   USED  PRIO
/dev/ghostswap ghost     11.5G  821M    -1
/dev/ram0      partition 1024G  9.9M    -1
/dev/vdb2      partition    2G  112K    -1
Or we can rename it to:
# swapon
NAME           TYPE      SIZE   USED  PRIO
/dev/xswap     xswap     11.5G  821M    -1
/dev/ram0      partition 1024G  9.9M    -1
/dev/vdb2      partition    2G  112K    -1
swapon /dev/xswap will enable this layer (for now I just hardcoded it
to be 8 times the size of total ram). swapoff /dev/xswap disables it.
We can also change the priority.
We can also hide it.
> > And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
> > /dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
> > users, including ZRAM, won't observe any change.
>
> /dev/ghostswap is assumed to be a virtual block device or
> something similar? If it is a block device, how is its size
> related to si->size?
It's not a real device, just a placeholder to make swapon usable
without any modification, for easier testing (some userspace
implementations don't work well with a dummy header). And it has
nothing to do with si->size.
>
> Looking at [PATCH RFC 14/15] mm, swap: add a special device
> for ghost swap setup, it appears to be a character device.
> This feels very odd to me. I’m not in favor of coupling the
> ghost swapfile with a memdev character device.
> A cdev should be a true character device.
No coupling at all; it's just a placeholder so swapon (the syscall)
knows it's a virtual device. It's simply an alternative to the dummy
header approach from Chris, so people can test this more easily.
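To make that concrete: patch 15 detects the placeholder inside the
swapon syscall with a check equivalent to the helper below
(is_ghost_swap_placeholder() is a name made up here for
illustration, the RFC open-codes it; DEVGHOST_MINOR comes from
patch 14). When it matches, swapon skips the block device and
on-disk header entirely, sets SWP_GHOST, leaves si->bdev NULL and,
for now, sizes maxpages to 8x total RAM:

/* Equivalent to the open-coded check in the patch 15 swapon hunk. */
static bool is_ghost_swap_placeholder(struct inode *inode)
{
	return S_ISCHR(inode->i_mode) &&
	       imajor(inode) == MEM_MAJOR &&
	       iminor(inode) == DEVGHOST_MINOR;
}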
si->size is just a number and any value can be given. I just
haven't decided whether we should pass the number to the kernel or
make it dynamic: e.g. set it to the total RAM size and increase it
by 2M every time a new cluster is used.
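A minimal sketch of the dynamic option, assuming si->max can simply
be bumped under the ghost device's own rules (ghost_swap_grow() is a
hypothetical helper, not part of the posted series):

/*
 * Hypothetical: instead of hardcoding 8x total RAM at swapon time,
 * grow the advertised size by one cluster (SWAPFILE_CLUSTER slots,
 * i.e. 2M with 4K pages) whenever alloc_swap_scan_dynamic() creates
 * a new cluster.
 */
static void ghost_swap_grow(struct swap_info_struct *si)
{
	WRITE_ONCE(si->max, READ_ONCE(si->max) + SWAPFILE_CLUSTER);
}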
* Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile
2026-02-21 9:07 ` Kairui Song
@ 2026-02-21 9:30 ` Barry Song
0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2026-02-21 9:30 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Hugh Dickins, Chris Li, Kemeng Shi,
Nhat Pham, Baoquan He, Johannes Weiner, Yosry Ahmed,
Youngjun Park, Chengming Zhou, Roman Gushchin, Shakeel Butt,
Muchun Song, Qi Zheng, linux-kernel, cgroups
On Sat, Feb 21, 2026 at 5:07 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Feb 21, 2026 at 4:16 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Feb 20, 2026 at 7:42 AM Kairui Song via B4 Relay
> > <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > To be honest, I really dislike the name "ghost." I would
> > prefer something that reflects its actual functionality.
> > "Ghost" does not describe what it does and feels rather
> > arbitrary.
>
> Hi Barry,
>
> That can be easily changed by "search and replace", I just kept the
> name since patch 13 is directly from Chris and I just didn't change
> it.
>
> >
> > I suggest retiring the name "ghost" and replacing it with
> > something more appropriate. "vswap" could be a good option,
>
> That looks good to me too, you can also check the slide from LSFMM
> last year page 23 to see how I imaged thing would workout at that
> time:
> https://drive.google.com/file/d/1_QKlXErUkQ-TXmJJy79fJoLPui9TGK1S/view
>
> The actual layout will be a bit different from that slide, since the
> redirect entry will be in the lower devices, the virtual device will
> have an extra virtual table to hold its redirect entry. But still I'm
> glad that plain swap still has zero overhead so ZRAM or high
> performance NVME is still good.
>
> > > Currently, the dynamic ghost files are just reported as ordinary swap files
> > > in /proc/swaps and we can have multiple ones, so users will have a full
> > > view of what's going on. This is a very easy-to-change design decision.
> > > I'm open to ideas about how we should present this to users. e.g., Hiding
> > > it will make it more "virtual", but I don't think that's a good idea.
> >
> > Even if it remains visible in /proc/swaps, I would rather
> > not represent it as a real file in any filesystem. Putting
> > a "ghost" swapfile on something like ext4 seems unnatural.
>
> How do you think about this? Here is the output after this sereis:
> # swapon
> NAME TYPE SIZE USED PRIO
> /dev/ghostswap ghost 11.5G 821M -1
> /dev/ram0 partition 1024G 9.9M -1
> /dev/vdb2 partition 2G 112K -1
I’d rather have a “virtual” block device, /dev/xswap, with
its size displayed as 11.5G via `ls -l filename`. This is
also more natural than relying on a cdev placeholder.
If
>
> Or we can rename it to:
> # swapon
> NAME TYPE SIZE USED PRIO
> /dev/xswap xswap 11.5G 821M -1
> /dev/ram0 partition 1024G 9.9M -1
> /dev/vdb2 partition 2G 112K -1
>
> swapon /dev/xswap will enable this layer (for now I just hardcoded it
> to be 8 times the size of total ram). swapoff /dev/xswap disables it.
> We can also change the priority.
>
> We can also hide it.
>
> > > And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
> > > /dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
> > > users, including ZRAM, won't observe any change.
> >
> > /dev/ghostswap is assumed to be a virtual block device or
> > something similar? If it is a block device, how is its size
> > related to si->size?
>
> It's not a real device, just a placeholder to make swapon usable
> without any modification for easier testing (some user space
> implementation doesn't work well with dummy header). And it has
> nothing to do with the si->size.
I understand it is a placeholder for swap, but if it appears
as /dev/ghostswap, users browsing /dev/ will see it as a
real cdev. A character device node in /dev is intended for
user read/write access.
Also, udev rules can act on an exported cdev. This couples
us to a lot of userspace behavior.
>
> >
> > Looking at [PATCH RFC 14/15] mm, swap: add a special device
> > for ghost swap setup, it appears to be a character device.
> > This feels very odd to me. I’m not in favor of coupling the
> > ghost swapfile with a memdev character device.
> > A cdev should be a true character device.
>
> No coupling at all, it's just a place holder so swapon (the syscall)
> knows it's a virtual device, which is just an alternative to the dummy
> header approach from Chris, so people can test it easier.
Using a cdev as a placeholder has introduced behavioral
coupling. For swap, it serves as a placeholder; for anything
outside swap, it behaves as a regular cdev.
>
> The si->size is just a number and any value can be given. I just
> haven't decided how we should pass the number to the kernel or just
> make it dynamic: e.g. set it to total ram size and increase by 2M
> every time a new cluster is used.