linux-mm.kvack.org archive mirror
* [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
@ 2025-11-09 12:49 Youngjun Park
  2025-11-09 12:49 ` [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Youngjun Park @ 2025-11-09 12:49 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: cgroups, linux-kernel, chrisl, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, youngjun.park, gunho.lee, taejoon.song

Hi all,

In constrained environments, there is a need to improve workload
performance by controlling swap device usage on a per-process or
per-cgroup basis. For example, one might want to direct critical
processes to faster swap devices (like SSDs) while relegating
less critical ones to slower devices (like HDDs or Network Swap).

The initial approach was to introduce a per-cgroup swap priority
mechanism [1]. However, review and discussion identified several
drawbacks:

a. There is a lack of concrete use cases for assigning a fine-grained,
   unique swap priority to each cgroup. 
b. The implementation complexity was high relative to the desired
   level of control.
c. Differing swap priorities between cgroups could lead to LRU
   inversion problems.

To address these concerns, I propose the "swap tiers" concept, 
originally suggested by Chris Li [2] and further developed through 
collaborative discussions. I would like to thank Chris Li and 
He Baoquan for their invaluable contributions in refining this 
approach, and Kairui Song, Nhat Pham, and Michal Koutný for their 
insightful reviews of earlier RFC versions.

Concept
-------
A swap tier is a grouping mechanism that assigns a "named id" to a
range of swap priorities. For example, all swap devices with a
priority of 100 or higher could be grouped into a tier named "SSD",
and all others into a tier named "HDD".

Cgroups can then select which named tiers they are permitted to use for
swapping via a new cgroup interface. This effectively restricts a
cgroup's swap activity to a specific subset of the available swap
devices.

Proposed Interface
------------------
1. Global Tier Definition: /sys/kernel/mm/swap/tiers

This file is used to define the global swap tiers and their associated
minimum priority levels.

- To add tiers:
  Format: + 'tier_name':'prio'[[,| ]'tier_name2':'prio']...
  Example:
  # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers

  There are several rules for defining tiers:
  - Priority ranges for tiers must not overlap.
  - The combination of all defined tiers must cover the entire valid
    priority range (DEF_SWAP_PRIO to SHRT_MAX) to ensure every swap device
    can be assigned to a tier.
  - A tier's prio value is its inclusive lower bound, covering
    priorities up to the next tier's prio. The highest tier extends to
    SHRT_MAX, and the lowest tier extends down to DEF_SWAP_PRIO.
  - If the specified tiers do not cover the entire priority range, the
    prio of the tier with the lowest specified value is lowered to
    DEF_SWAP_PRIO so the full range remains covered (see the example
    below).
  - The total number of tiers is limited.
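
  For instance, a sketch of this coverage rule (the tier name is
  illustrative, assuming no tiers are configured yet):
  # echo "+ FAST:100" > /sys/kernel/mm/swap/tiers
  "FAST" is the only tier, so its lower bound is pulled down from 100
  to DEF_SWAP_PRIO (-1) and the single tier covers the whole range.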

- To remove tiers:
  Format: - 'tier_name'[[,| ]'tier_name2']...
  Example:
  # echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers

  Note: A tier cannot be removed if it is currently in use by any
  cgroup or if any active swap device is assigned to it. This acts as
  a reference count to prevent disruption.
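
  Example (a sketch of the expected failure mode; the error value is
  taken from patch 2, which returns -EBUSY for a referenced tier, and
  the device name is illustrative):
  # swapon -p 150 /dev/sdb2        # device lands in the "SSD" tier
  # echo "- SSD" > /sys/kernel/mm/swap/tiers
  -bash: echo: write error: Device or resource busy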

- To show current tiers:
  Reading the file displays the currently configured tiers, their
  internal index, and the priority range they cover.
  Example:
  # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
  # cat /sys/kernel/mm/swap/tiers
  Name      Idx   PrioStart   PrioEnd
            0
  SSD       1     100         32767
  HDD       2     -1          99

  - `Name`: The name of the tier. The unnamed entry is a default tier.
  - `Idx`: The internal index assigned to the tier.
  - `PrioStart`: The starting priority of the range covered by this tier.
  - `PrioEnd`: The ending priority of the range covered by this tier.

Two special tiers are predefined:
- "": Represents the default inheritance behavior in cgroups.
- "zswap": Reserved for zswap integration.

2. Cgroup Tier Selection: memory.swap.tiers

This file controls which swap tiers are enabled for a given cgroup.

- Reading the file:
  The first line shows the operation that was written to the file.
  The second line shows the final, effective set of tiers after
  merging with the parent cgroup's configuration.

- Writing to the file:
  Format: [+|-] [[+|-]TIER_NAME]...
  - `+TIER_NAME`: Explicitly enables this tier for the cgroup.
  - `-TIER_NAME`: Explicitly disables this tier for the cgroup.
  - If a tier is not specified, its setting is inherited from the
    parent cgroup.
  - A standalone `+` at the beginning resets the configuration: it
    ignores the parent's settings, enables all globally defined tiers,
    and then applies the subsequent operations in the command.
  - A standalone `-` at the beginning also resets: it ignores the
    parent's settings, disables all tiers, and then applies subsequent
    operations.
  - The root cgroup defaults to an implicit `+`, enabling all swap
    devices.

  Example:
  # echo "+ -SSD -HDD" > /sys/fs/cgroup/my_cgroup/memory.swap.tiers
  This command first resets the cgroup's configuration to enable all
  tiers (due to the leading `+`), and then explicitly disables the
  "SSD" and "HDD" tiers.

Further Discussion and Open Questions
-------------------------------------
I seek feedback on this concept and have identified several key
points that require further discussion (though this is not an 
exhaustive list). This topic will also be presented at the upcoming 
Linux Plumbers Conference 2025 [3], and I would appreciate any 
feedback here on the list beforehand, or in person at the conference.

1.  The swap fast path utilizes a percpu cluster cache for efficiency.
    With swap tiers, this has been changed to a per-device per-cpu
    cluster cache (see the first patch in this series).
    An alternative approach would be to cache only the swap_info_struct 
    (si) per-tier per-cpu, avoiding cluster caching entirely while still 
    maintaining fast device acquisition without `swap_avail_lock`.
    Should we pursue this alternative, or is the current per-device 
    per-cpu cluster caching approach preferable?
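
    As a rough sketch of that alternative (hypothetical, not part of
    this series; MAX_SWAPTIER as defined in patch 2):

      /* Cache only the chosen device per tier, not cluster offsets. */
      struct percpu_tier_si {
              struct swap_info_struct *si[MAX_SWAPTIER];
      };
      static DEFINE_PER_CPU(struct percpu_tier_si, percpu_tier_si);

    On allocation, the cached si for the selected tier would be
    validated with get_swap_device_info() and used directly, falling
    back to the swap_avail_head scan (and refreshing the cache) when
    the entry is NULL or the device is exhausted.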

2.  Consistency with cgroup parent-child semantics: Unlike general
    resource distribution, tier selection may bypass parent
    constraints (e.g., a child can enable a tier disabled by its
    parent). Is this behavior acceptable?

3.  Per-cgroup swap tier limit: Is a `swap.tier.max` needed in
    addition to the existing `swap.max`?

4.  Parent-child tier mismatch: If a zombie memcg (child) uses a tier
    that is not available to its new parent, how should this be
    handled during recharging or reparenting? (This question is raised
    in the context of ongoing work to improve memcg reparenting and
    handle zombie memcgs [4, 5].)

5.  Tier mask calculation: What are the trade-offs between calculating
    the effective tier mask at runtime vs. pre-calculating it when the
    interface is written to?
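
    To make the trade-off concrete, here is a self-contained userspace
    model of the runtime merge, condensed from swap_tier_collect_mask()
    and the TIER_MASK encoding in patch 2 (a sketch, not kernel code):

      #include <stdio.h>

      #define TIER_ON_MASK  (1 << 0)
      #define TIER_OFF_MASK (1 << 1)
      #define TIER_MASK(idx, m) ((m) << ((idx) * 2))

      /* Walk child -> root; each level only fills in tiers that are
       * still unconfigured in the levels below it. */
      static int collect(const int *onoff, const int *mask, int depth)
      {
              int eff_onoff = 0, eff_mask = 0;

              for (int lvl = 0; lvl < depth; lvl++) {
                      int newly = (eff_mask | mask[lvl]) ^ eff_mask;

                      eff_onoff |= onoff[lvl] & newly;
                      eff_mask |= mask[lvl];
              }
              return eff_onoff;
      }

      int main(void)
      {
              /* Tier 1: child wrote "-T1", parent wrote "+T1"; child wins. */
              int onoff[] = { TIER_MASK(1, TIER_OFF_MASK),
                              TIER_MASK(1, TIER_ON_MASK) };
              int mask[]  = { TIER_MASK(1, TIER_ON_MASK | TIER_OFF_MASK),
                              TIER_MASK(1, TIER_ON_MASK | TIER_OFF_MASK) };

              printf("effective onoff: %#x\n", collect(onoff, mask, 2));
              return 0; /* prints 0x8: tier 1's OFF bit, i.e. disabled */
      }

    Pre-computing would instead store the merged value at write time,
    at the cost of invalidating every descendant whenever an ancestor's
    configuration changes.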

6.  If a swap tier configuration is applied to a memcg, should we
    migrate existing swap-out pages that are on devices not belonging
    to any of the cgroup's allowed tiers?

7.  Swap tiers could serve as a useful abstraction layer. What
    extended uses of swap tiers are worth discussing?

Any feedback on the overall concept, interface, and these specific
points would be greatly appreciated.

Best Regards,
Youngjun Park

References
----------
[1] https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
[2] https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
[3] https://lpc.events/event/19/abstracts/2296/
[4] https://lore.kernel.org/linux-mm/20230720070825.992023-1-yosryahmed@google.com/
[5] https://blogs.oracle.com/linux/post/zombie-memcg-issues

Youngjun Park (3):
  mm, swap: change back to use each swap device's percpu cluster
  mm: swap: introduce swap tier infrastructure
  mm/swap: integrate swap tier infrastructure into swap subsystem

 Documentation/admin-guide/cgroup-v2.rst |  32 ++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   4 +
 include/linux/swap.h                    |  16 +-
 mm/Kconfig                              |  13 +
 mm/Makefile                             |   1 +
 mm/memcontrol.c                         |  69 +++
 mm/page_io.c                            |  21 +-
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  93 ++++
 mm/swap_tier.c                          | 602 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  75 +++
 mm/swapfile.c                           | 169 +++----
 13 files changed, 987 insertions(+), 114 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 02dafa01ec9a00c3758c1c6478d82fe601f5f1ba
-- 
2.34.1




* [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster
  2025-11-09 12:49 [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
@ 2025-11-09 12:49 ` Youngjun Park
  2025-11-13  6:07   ` Kairui Song
  2025-11-09 12:49 ` [PATCH 2/3] mm: swap: introduce swap tier infrastructure Youngjun Park
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Youngjun Park @ 2025-11-09 12:49 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: cgroups, linux-kernel, chrisl, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, youngjun.park, gunho.lee, taejoon.song

This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
allocation fast path").

With the newly introduced swap tiers, the global percpu cluster causes
two issues:
1) It causes caching oscillation in the same order between different
   si, if two memcgs are each allowed to access only different si and
   both of them are swapping out.
2) It can cause priority inversion between swap devices. Imagine a
   case with two memcgs, memcg1 and memcg2. Memcg1 can access si A and
   B, where A is the higher priority device, while memcg2 can only
   access si B. Memcg2 could then write si B into the global percpu
   cluster, and memcg1 would take si B in the fast path even though
   si A is not exhausted.

Hence, to support swap tiers, revert commit 1b7e90020eb7 so that each
swap device uses its own percpu cluster.

Co-developed-by: Baoquan He <bhe@redhat.com>
Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 include/linux/swap.h |  13 +++-
 mm/swapfile.c        | 151 +++++++++++++------------------------------
 2 files changed, 56 insertions(+), 108 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38ca3df68716..90fa27bb7796 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,10 +250,17 @@ enum {
 #endif
 
 /*
- * We keep using same cluster for rotational device so IO will be sequential.
- * The purpose is to optimize SWAP throughput on these device.
+ * We assign a cluster to each CPU, so each CPU can allocate swap entry from
+ * its own cluster and swapout sequentially. The purpose is to optimize swapout
+ * throughput.
  */
+struct percpu_cluster {
+	local_lock_t lock; /* Protect the percpu_cluster above */
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+};
+
 struct swap_sequential_cluster {
+	spinlock_t lock; /* Serialize usage of global cluster */
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -278,8 +285,8 @@ struct swap_info_struct {
 					/* list of cluster that are fragmented or contented */
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
+	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
-	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 44eb6a6e5800..3bb77c9a4879 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -118,18 +118,6 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
-struct percpu_swap_cluster {
-	struct swap_info_struct *si[SWAP_NR_ORDERS];
-	unsigned long offset[SWAP_NR_ORDERS];
-	local_lock_t lock;
-};
-
-static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
-	.si = { NULL },
-	.offset = { SWAP_ENTRY_INVALID },
-	.lock = INIT_LOCAL_LOCK(),
-};
-
 /* May return NULL on invalid type, caller must check for NULL return */
 static struct swap_info_struct *swap_type_to_info(int type)
 {
@@ -497,7 +485,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 * Swap allocator uses percpu clusters and holds the local lock.
 	 */
 	lockdep_assert_held(&ci->lock);
-	lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+	lockdep_assert_held(&this_cpu_ptr(si->percpu_cluster)->lock);
 
 	/* The cluster must be free and was just isolated from the free list. */
 	VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci));
@@ -515,8 +503,8 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 */
 	spin_unlock(&ci->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
-		spin_unlock(&si->global_cluster_lock);
-	local_unlock(&percpu_swap_cluster.lock);
+		spin_unlock(&si->global_cluster->lock);
+	local_unlock(&si->percpu_cluster->lock);
 
 	table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL);
 
@@ -528,9 +516,9 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
 	 * could happen with ignoring the percpu cluster is fragmentation,
 	 * which is acceptable since this fallback and race is rare.
 	 */
-	local_lock(&percpu_swap_cluster.lock);
+	local_lock(&si->percpu_cluster->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
-		spin_lock(&si->global_cluster_lock);
+		spin_lock(&si->global_cluster->lock);
 	spin_lock(&ci->lock);
 
 	/* Nothing except this helper should touch a dangling empty cluster. */
@@ -642,7 +630,7 @@ static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 		/*
 		 * Delete the cluster from list to prepare for discard, but keep
-		 * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be
+		 * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster
 		 * pointing to it, or ran into by relocate_cluster.
 		 */
 		list_del(&ci->list);
@@ -939,12 +927,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	swap_cluster_unlock(ci);
-	if (si->flags & SWP_SOLIDSTATE) {
-		this_cpu_write(percpu_swap_cluster.offset[order], next);
-		this_cpu_write(percpu_swap_cluster.si[order], si);
-	} else {
+	if (si->flags & SWP_SOLIDSTATE)
+		this_cpu_write(si->percpu_cluster->next[order], next);
+	else
 		si->global_cluster->next[order] = next;
-	}
+
 	return found;
 }
 
@@ -1037,13 +1024,17 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	if (order && !(si->flags & SWP_BLKDEV))
 		return 0;
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		/* Fast path using per CPU cluster */
+		local_lock(&si->percpu_cluster->lock);
+		offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	} else {
 		/* Serialize HDD SWAP allocation for each device. */
-		spin_lock(&si->global_cluster_lock);
+		spin_lock(&si->global_cluster->lock);
 		offset = si->global_cluster->next[order];
-		if (offset == SWAP_ENTRY_INVALID)
-			goto new_cluster;
+	}
 
+	if (offset != SWAP_ENTRY_INVALID) {
 		ci = swap_cluster_lock(si, offset);
 		/* Cluster could have been used by another order */
 		if (cluster_is_usable(ci, order)) {
@@ -1058,7 +1049,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			goto done;
 	}
 
-new_cluster:
 	/*
 	 * If the device need discard, prefer new cluster over nonfull
 	 * to spread out the writes.
@@ -1121,8 +1111,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			goto done;
 	}
 done:
-	if (!(si->flags & SWP_SOLIDSTATE))
-		spin_unlock(&si->global_cluster_lock);
+	if (si->flags & SWP_SOLIDSTATE)
+		local_unlock(&si->percpu_cluster->lock);
+	else
+		spin_unlock(&si->global_cluster->lock);
 
 	return found;
 }
@@ -1303,43 +1295,8 @@ static bool get_swap_device_info(struct swap_info_struct *si)
 	return true;
 }
 
-/*
- * Fast path try to get swap entries with specified order from current
- * CPU's swap entry pool (a cluster).
- */
-static bool swap_alloc_fast(swp_entry_t *entry,
-			    int order)
-{
-	struct swap_cluster_info *ci;
-	struct swap_info_struct *si;
-	unsigned int offset, found = SWAP_ENTRY_INVALID;
-
-	/*
-	 * Once allocated, swap_info_struct will never be completely freed,
-	 * so checking it's liveness by get_swap_device_info is enough.
-	 */
-	si = this_cpu_read(percpu_swap_cluster.si[order]);
-	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
-	if (!si || !offset || !get_swap_device_info(si))
-		return false;
-
-	ci = swap_cluster_lock(si, offset);
-	if (cluster_is_usable(ci, order)) {
-		if (cluster_is_empty(ci))
-			offset = cluster_offset(si, ci);
-		found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
-		if (found)
-			*entry = swp_entry(si->type, found);
-	} else {
-		swap_cluster_unlock(ci);
-	}
-
-	put_swap_device(si);
-	return !!found;
-}
-
 /* Rotate the device and switch to a new cluster */
-static bool swap_alloc_slow(swp_entry_t *entry,
+static void swap_alloc_entry(swp_entry_t *entry,
 			    int order)
 {
 	unsigned long offset;
@@ -1356,10 +1313,10 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 			put_swap_device(si);
 			if (offset) {
 				*entry = swp_entry(si->type, offset);
-				return true;
+				return;
 			}
 			if (order)
-				return false;
+				return;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1378,7 +1335,6 @@ static bool swap_alloc_slow(swp_entry_t *entry,
 			goto start_over;
 	}
 	spin_unlock(&swap_avail_lock);
-	return false;
 }
 
 /*
@@ -1445,10 +1401,7 @@ int folio_alloc_swap(struct folio *folio)
 	}
 
 again:
-	local_lock(&percpu_swap_cluster.lock);
-	if (!swap_alloc_fast(&entry, order))
-		swap_alloc_slow(&entry, order);
-	local_unlock(&percpu_swap_cluster.lock);
+	swap_alloc_entry(&entry, order);
 
 	if (unlikely(!order && !entry.val)) {
 		if (swap_sync_discard())
@@ -2016,13 +1969,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	/* This is called for allocating swap entry, not cache */
 	if (get_swap_device_info(si)) {
 		if (si->flags & SWP_WRITEOK) {
-			/*
-			 * Grab the local lock to be complaint
-			 * with swap table allocation.
-			 */
-			local_lock(&percpu_swap_cluster.lock);
 			offset = cluster_alloc_swap_entry(si, 0, 1);
-			local_unlock(&percpu_swap_cluster.lock);
 			if (offset)
 				entry = swp_entry(si->type, offset);
 		}
@@ -2799,28 +2746,6 @@ static void free_cluster_info(struct swap_cluster_info *cluster_info,
 	kvfree(cluster_info);
 }
 
-/*
- * Called after swap device's reference count is dead, so
- * neither scan nor allocation will use it.
- */
-static void flush_percpu_swap_cluster(struct swap_info_struct *si)
-{
-	int cpu, i;
-	struct swap_info_struct **pcp_si;
-
-	for_each_possible_cpu(cpu) {
-		pcp_si = per_cpu_ptr(percpu_swap_cluster.si, cpu);
-		/*
-		 * Invalidate the percpu swap cluster cache, si->users
-		 * is dead, so no new user will point to it, just flush
-		 * any existing user.
-		 */
-		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cmpxchg(&pcp_si[i], si, NULL);
-	}
-}
-
-
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
@@ -2904,7 +2829,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	flush_work(&p->discard_work);
 	flush_work(&p->reclaim_work);
-	flush_percpu_swap_cluster(p);
 
 	destroy_swap_extents(p);
 	if (p->flags & SWP_CONTINUED)
@@ -2933,6 +2857,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
+	free_percpu(p->percpu_cluster);
+	p->percpu_cluster = NULL;
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -3308,7 +3234,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
-	int err = -ENOMEM;
+	int cpu, err = -ENOMEM;
 	unsigned long i;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
@@ -3318,14 +3244,27 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (!(si->flags & SWP_SOLIDSTATE)) {
+	if (si->flags & SWP_SOLIDSTATE) {
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err;
+
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
+
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+	} else {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 				     GFP_KERNEL);
 		if (!si->global_cluster)
 			goto err_free;
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
 			si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
-		spin_lock_init(&si->global_cluster_lock);
+		spin_lock_init(&si->global_cluster->lock);
 	}
 
 	/*
@@ -3607,6 +3546,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
+	free_percpu(si->percpu_cluster);
+	si->percpu_cluster = NULL;
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;
-- 
2.34.1




* [PATCH 2/3] mm: swap: introduce swap tier infrastructure
  2025-11-09 12:49 [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
  2025-11-09 12:49 ` [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
@ 2025-11-09 12:49 ` Youngjun Park
  2025-11-12 14:20   ` Chris Li
  2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Youngjun Park @ 2025-11-09 12:49 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: cgroups, linux-kernel, chrisl, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, youngjun.park, gunho.lee, taejoon.song

Swap tiers group swap devices by priority range. This allows
different memory cgroups to use different swap devices.

Swap tiers are identified by name and configured with priority ranges.
Two special tiers are built in: the default (unnamed) tier and the
"zswap" tier.

Interface:
* /sys/kernel/mm/swap/tiers
  - Write:
    - Add format: "+ tierName:prio,tierName2:prio..."
      Example: echo "+ SSD:100,HDD:-1" > /sys/kernel/mm/swap/tiers
    - Remove format: "- tierName,tierName2..."
      Example: echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers
    - Priority ranges merge when no cgroup references exist
    - The minimum priority becomes -1 when all configured tiers start
      above -1
  - Read:
    - Format: "Name             Idx   PrioStart   PrioEnd"
    - Shows configured tiers with their priority ranges

* /sys/fs/cgroup/*/memory.swap.tiers
  - Read: shows the written operations and the final merged result
  - Write: format "(+/-)tierName (+/-)tierName2..."
    - Example: echo "+ +SSD -HDD" > memory.swap.tiers
    - Override semantics (echo "" clears all settings)
    - + enables a tier, - disables a tier
    - No prefix inherits the parent's settings
    - A bare +/- affects all tiers (+ enables all, - disables all)
    - The root cgroup defaults to "+" (all swap devices enabled)

Special tiers:
- "" (default): for inheritance
- "zswap": for zswap integration

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  32 ++
 MAINTAINERS                             |   2 +
 include/linux/memcontrol.h              |   4 +
 include/linux/swap.h                    |   3 +
 mm/Kconfig                              |  13 +
 mm/Makefile                             |   1 +
 mm/swap.h                               |   4 +
 mm/swap_tier.c                          | 602 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  75 +++
 mm/swapfile.c                           |   5 +-
 10 files changed, 738 insertions(+), 3 deletions(-)
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3345961c30ac..ed8ef33b4d6a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1850,6 +1850,38 @@ The following nested keys are defined.
 	Swap usage hard limit.  If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.
 
+  memory.swap.tiers
+        A read-write nested-keyed file which exists on non-root
+        cgroups. The default is empty (inherits from parent).
+
+        This interface allows selecting which swap tiers a cgroup can
+        use for swapping out memory.
+
+        When read, the file shows two lines:
+          - The first line shows the operation string that was
+            written to this file.
+          - The second line shows the effective operation after
+            merging with parent settings.
+
+        When writing, the format is:
+          (+/-)(TIER_NAME) (+/-)(TIER_NAME) ...
+
+        Valid tier names are those configured in
+        /sys/kernel/mm/swap/tiers.
+
+        Each tier can be prefixed with:
+          +     Enable this tier (always on)
+          -     Disable this tier (always off)
+          none  Inherit from parent (incremental merge)
+
+        Special default tier operations (+ or - without tier name):
+          +     Ignore parent settings, enable all swap tiers, then
+                apply subsequent operations
+          -     Ignore parent settings, disable all swap tiers, then
+                apply subsequent operations
+          none  Inherit all tier settings from parent, then apply
+                subsequent operations incrementally
+
   memory.swap.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
diff --git a/MAINTAINERS b/MAINTAINERS
index fa519cbc4b88..3416d466d011 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16517,6 +16517,8 @@ F:	mm/swap.c
 F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
+F:	mm/swap_tier.c
F:	mm/swap_tier.h
 F:	mm/swapfile.c
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 966f7c1a0128..1224029620ed 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,6 +283,10 @@ struct mem_cgroup {
 	/* per-memcg mm_struct list */
 	struct lru_gen_mm_list mm_list;
 #endif
+#ifdef CONFIG_SWAP_TIER
+	int tiers_onoff;
+	int tiers_mask;
+#endif
 
 #ifdef CONFIG_MEMCG_V1
 	/* Legacy consumer-oriented counters */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 90fa27bb7796..8911eff9d37a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -271,6 +271,9 @@ struct swap_info_struct {
 	struct percpu_ref users;	/* indicate and keep swap device valid. */
 	unsigned long	flags;		/* SWP_USED etc: see above */
 	signed short	prio;		/* swap priority of this type */
+#ifdef CONFIG_SWAP_TIER
+	int tier_idx;
+#endif
 	struct plist_node list;		/* entry in swap_active_head */
 	signed char	type;		/* strange name for an index */
 	unsigned int	max;		/* extent of the swap_map */
diff --git a/mm/Kconfig b/mm/Kconfig
index e1fb11f36ede..78173ffe65d6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -163,6 +163,19 @@ endmenu
 
 endif
 
+config SWAP_TIER
+	default n
+	bool "Swap tier"
+	depends on SWAP && MEMCG && SYSFS
+	help
+          Swap tier is a concept proposed to group swap devices by
+          priority ranges, allowing cgroups to select and use a desired
+          swap_tier. This enables more flexible control over swap usage
+          across different workloads.
+          It is possible to assign a name for identifying a swap tier,
+          and to configure the priority range used for selecting swap
+          devices within that tier.
+
 menu "Slab allocator options"
 
 config SLUB
diff --git a/mm/Makefile b/mm/Makefile
index 00ceb2418b64..9e98799f2edf 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,6 +76,7 @@ ifdef CONFIG_MMU
 endif
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_SWAP_TIER)	+= swap_tier.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
diff --git a/mm/swap.h b/mm/swap.h
index d034c13d8dd2..b116282690c8 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -16,6 +16,10 @@ extern int page_cluster;
 #define swap_entry_order(order)	0
 #endif
 
+#define DEF_SWAP_PRIO  -1
+
+extern spinlock_t swap_lock;
+extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 
 /*
diff --git a/mm/swap_tier.c b/mm/swap_tier.c
new file mode 100644
index 000000000000..4301e3c766b9
--- /dev/null
+++ b/mm/swap_tier.c
@@ -0,0 +1,602 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/swap.h>
+#include <linux/rcupdate.h>
+#include <linux/memcontrol.h>
+#include <linux/plist.h>
+#include <linux/sysfs.h>
+#include <linux/sort.h>
+
+#include "swap_tier.h"
+
+/*
+ * struct swap_tier - Structure representing a swap tier.
+ *
+ * @name:		Name of the swap_tier.
+ * @prio_st:		Starting value of priority.
+ * @prio_end:		Ending value of priority.
+ * @list:		Linked list of active tiers.
+ * @inactive_list:	Linked list of inactive tiers.
+ * @kref:		Reference count (held by swap device and memory cgroup).
+ */
+static struct swap_tier {
+	char name[MAX_TIERNAME];
+	short prio_st;
+	short prio_end;
+	union {
+		struct plist_node list;
+		struct list_head inactive_list;
+	};
+	struct kref ref;
+} swap_tiers[MAX_SWAPTIER];
+
+static DEFINE_SPINLOCK(swap_tier_lock);
+
+/* Active swap priority list, reserved tier first, sorted in descending order */
+static PLIST_HEAD(swap_tier_active_list);
+/* Unused swap_tier object */
+static LIST_HEAD(swap_tier_inactive_list);
+
+#define TIER_IDX(tier)	((tier) - swap_tiers)
+
+#define for_each_active_tier(tier) \
+	plist_for_each_entry(tier, &swap_tier_active_list, list)
+
+#define for_each_tier(tier, idx) \
+	for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
+		idx++, tier = &swap_tiers[idx])
+
+static int nr_swaptiers = SWAP_TIER_RESERVED_CNT;
+
+#define SKIP_REPLACE_IDX	-1
+#define NULL_TIER               -1
+
+/*
+ * Naming Convention:
+ *   swap_tiers_*() - Public/exported functions
+ *   swap_tier_*()  - Private/internal functions
+ */
+
+static bool swap_tier_replace_marked(int idx)
+{
+	return idx != SKIP_REPLACE_IDX;
+}
+
+static bool swap_tier_is_active(void)
+{
+	return nr_swaptiers > SWAP_TIER_RESERVED_CNT;
+}
+
+static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
+{
+	if (tier->prio_st <= prio && tier->prio_end >= prio)
+		return true;
+
+	return false;
+}
+
+/* swap_tiers initialization */
+void swap_tiers_init(void)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER * 2);
+
+	for_each_tier(tier, idx) {
+		if (idx < SWAP_TIER_RESERVED_CNT) {
+			/* for display first */
+			plist_node_init(&tier->list, -SHRT_MAX);
+			plist_add(&tier->list, &swap_tier_active_list);
+			kref_init(&tier->ref);
+		} else {
+			INIT_LIST_HEAD(&tier->inactive_list);
+			list_add_tail(&tier->inactive_list, &swap_tier_inactive_list);
+		}
+	}
+
+	strscpy(swap_tiers[SWAP_TIER_DEFAULT].name, DEFAULT_TIER_NAME);
+#ifdef CONFIG_ZSWAP
+	strscpy(swap_tiers[SWAP_TIER_ZSWAP].name, ZSWAP_TIER_NAME);
+#endif
+}
+
+static bool swap_tier_alive(struct swap_tier *tier)
+{
+	lockdep_assert_held(&swap_tier_lock);
+	return (kref_read(&tier->ref) > 1) ? true : false;
+}
+
+static void swap_tier_get(struct swap_tier *tier)
+{
+	lockdep_assert_held(&swap_tier_lock);
+	kref_get(&tier->ref);
+}
+
+static void swap_tier_remove(struct swap_tier *tier);
+static void swap_tier_release(struct kref *ref)
+{
+	struct swap_tier *tier = container_of(ref, struct swap_tier, ref);
+
+	swap_tier_remove(tier);
+}
+
+static void swap_tier_put(struct swap_tier *tier)
+{
+	lockdep_assert_held(&swap_tier_lock);
+	kref_put(&tier->ref, swap_tier_release);
+}
+
+static int swap_tier_cmp(const void *p, const void *p2)
+{
+	const struct tiers_desc *tp = p;
+	const struct tiers_desc *tp2 = p2;
+
+	return tp2->prio_st - tp->prio_st;
+}
+
+static void swap_tier_get_mask(int mask)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		int tier_mask = TIER_MASK(TIER_IDX(tier), TIER_FULL_MASK);
+
+		if (mask & tier_mask)
+			swap_tier_get(tier);
+	}
+}
+
+static void swap_tier_put_mask(int mask)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	for_each_active_tier(tier) {
+		int tier_mask = TIER_MASK(TIER_IDX(tier), TIER_FULL_MASK);
+
+		if (mask & tier_mask)
+			swap_tier_put(tier);
+	}
+}
+
+static bool swap_tier_is_default(struct swap_tier *tier)
+{
+	int idx = TIER_IDX(tier);
+
+	return idx < SWAP_TIER_RESERVED_CNT ? true : false;
+}
+
+/*
+ * Check if tier priority range can be split. If input falls within a
+ * referenced range, splitting is not allowed and an error is returned.
+ * If priority start is the same but tier has no reference, mark
+ * SKIP_REPLACE_IDX to allow replacement at apply time.
+ */
+static int swap_tier_can_split_range(struct tiers_desc *desc)
+{
+	struct swap_tier *tier;
+	int ret = 0;
+
+	lockdep_assert_held(&swap_tier_lock);
+	desc->tier_idx = SKIP_REPLACE_IDX;
+
+	for_each_active_tier(tier) {
+		if (swap_tier_is_default(tier))
+			continue;
+
+		if (tier->prio_st > desc->prio_st)
+			continue;
+
+		/* If start priorities match, replace tier name */
+		if (tier->prio_st == desc->prio_st)
+			desc->tier_idx = TIER_IDX(tier);
+
+		if (swap_tier_alive(tier))
+			ret = -EBUSY;
+
+		break;
+	}
+
+	return ret;
+}
+
+static void swap_tier_prepare(struct tiers_desc *desc)
+{
+	struct swap_tier *tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (WARN_ON(list_empty(&swap_tier_inactive_list)))
+		return;
+
+	if (swap_tier_replace_marked(desc->tier_idx))
+		return;
+
+	tier = list_first_entry(&swap_tier_inactive_list,
+		struct swap_tier, inactive_list);
+
+	list_del(&tier->inactive_list);
+	strscpy(tier->name, desc->name, MAX_TIERNAME);
+	tier->prio_st = desc->prio_st;
+	plist_node_init(&tier->list, -tier->prio_st);
+	kref_init(&tier->ref);
+
+	plist_add(&tier->list, &swap_tier_active_list);
+	nr_swaptiers++;
+}
+
+static void swap_tier_apply(struct tiers_desc *desc, int nr)
+{
+	struct plist_node *pnode;
+	struct swap_info_struct *p = NULL;
+	struct swap_tier *tier;
+	int prio_end = SHRT_MAX;
+
+	lockdep_assert_held(&swap_lock);
+	lockdep_assert_held(&swap_tier_lock);
+
+	for (int i = 0; i < nr; i++) {
+		int idx = desc[i].tier_idx;
+
+		if (swap_tier_replace_marked(idx))
+			strscpy(swap_tiers[idx].name, desc[i].name, MAX_TIERNAME);
+	}
+
+	for_each_active_tier(tier) {
+		if (swap_tier_is_default(tier))
+			continue;
+
+		tier->prio_end = prio_end;
+		prio_end = tier->prio_st - 1;
+	}
+
+	/* TODO: use swapfiles ?*/
+	if (plist_head_empty(&swap_active_head))
+		return;
+
+	for_each_active_tier(tier) {
+		plist_for_each_continue(pnode, &swap_active_head) {
+			p = container_of(pnode, struct swap_info_struct, list);
+			if (p->tier_idx != NULL_TIER)
+				continue;
+			if (!swap_tier_prio_in_range(tier, p->prio))
+				break;
+			swap_tier_get(tier);
+		}
+	}
+}
+
+int swap_tiers_add(struct tiers_desc desc[], int nr)
+{
+	struct swap_tier *tier;
+	int ret = 0;
+
+	if (nr <= 0)
+		return -EINVAL;
+
+	sort(desc, nr, sizeof(desc[0]), swap_tier_cmp, NULL);
+
+	for (int i = 0; i < nr; i++) {
+		if (i > 0 && desc[i].prio_st == desc[i-1].prio_st)
+			return -EINVAL;
+
+		for (int j = i+1; j < nr; j++) {
+			if (!strcmp(desc[i].name, desc[j].name))
+				return -EINVAL;
+		}
+
+		for_each_active_tier(tier) {
+			if (!strcmp(desc[i].name, tier->name))
+				return -EALREADY;
+		}
+	}
+
+	spin_lock(&swap_lock);
+	spin_lock(&swap_tier_lock);
+
+	/* Tier set must cover all priorities */
+	if (!swap_tier_is_active() && desc[nr-1].prio_st != DEF_SWAP_PRIO)
+		desc[nr-1].prio_st = DEF_SWAP_PRIO;
+
+	if (nr + nr_swaptiers > MAX_SWAPTIER) {
+		ret = -ERANGE;
+		goto out;
+	}
+
+	for (int i = 0; i < nr; i++) {
+		ret = swap_tier_can_split_range(&desc[i]);
+		if (ret)
+			goto out;
+	}
+
+	for (int i = 0; i < nr; i++)
+		swap_tier_prepare(&desc[i]);
+
+	swap_tier_apply(desc, nr);
+out:
+	spin_unlock(&swap_tier_lock);
+	spin_unlock(&swap_lock);
+
+	return ret;
+}
+
+static void swap_tier_remove(struct swap_tier *tier)
+{
+	bool least_prio_merge = false;
+	struct plist_node *next;
+	struct swap_tier *merge_tier;
+
+	lockdep_assert_held(&swap_tier_lock);
+
+	if (tier->prio_st == DEF_SWAP_PRIO) {
+		least_prio_merge = true;
+		next = plist_prev(&tier->list);
+	} else
+		next = plist_next(&tier->list);
+
+	plist_del(&tier->list, &swap_tier_active_list);
+	INIT_LIST_HEAD(&tier->inactive_list);
+	list_add_tail(&tier->inactive_list, &swap_tier_inactive_list);
+	nr_swaptiers--;
+
+	if (nr_swaptiers == SWAP_TIER_RESERVED_CNT)
+		return;
+
+	merge_tier = container_of(next, struct swap_tier, list);
+	/*
+	 * When the first tier is removed, the rule that all priorities
+	 * must be covered should be maintained. The next tier inherits the
+	 * start value of the removed first tier.
+	 */
+	if (least_prio_merge)
+		merge_tier->prio_st = DEF_SWAP_PRIO;
+	else
+		merge_tier->prio_end = tier->prio_end;
+}
+
+int swap_tiers_remove(struct tiers_desc desc[], int nr)
+{
+	int cnt = 0;
+	int ret = 0;
+	struct swap_tier *tier;
+
+	for (int i = 0; i < nr; i++) {
+		for (int j = i+1; j < nr; j++) {
+			if (!strcmp(desc[i].name, desc[j].name))
+				return -EINVAL;
+		}
+	}
+
+	spin_lock(&swap_tier_lock);
+	if (nr <= 0 || nr > nr_swaptiers) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (int i = 0; i < nr; i++) {
+		for_each_active_tier(tier) {
+			if (swap_tier_is_default(tier))
+				continue;
+
+			if (!strcmp(desc[i].name, tier->name)) {
+				if (swap_tier_alive(tier)) {
+					ret = -EBUSY;
+					goto out;
+				}
+				desc[i].tier_idx = TIER_IDX(tier);
+				cnt++;
+			}
+		}
+	}
+
+	if (cnt != nr) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (int i = 0; i < nr; i++)
+		swap_tier_put(&swap_tiers[desc[i].tier_idx]);
+
+out:
+	spin_unlock(&swap_tier_lock);
+	return ret;
+}
+
+ssize_t swap_tiers_show_sysfs(char *buf)
+{
+	struct swap_tier *tier;
+	ssize_t len = 0;
+
+	spin_lock(&swap_tier_lock);
+	len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
+			 "Name", "Idx", "PrioStart", "PrioEnd");
+	for_each_active_tier(tier) {
+		if (swap_tier_is_default(tier))
+			len += sysfs_emit_at(buf, len, "%-16s %-5ld\n",
+					     tier->name,
+					     TIER_IDX(tier));
+		else {
+			len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
+					     tier->name,
+					     TIER_IDX(tier),
+					     tier->prio_st,
+					     tier->prio_end);
+		}
+		if (len >= PAGE_SIZE)
+			break;
+	}
+	spin_unlock(&swap_tier_lock);
+
+	return len;
+}
+
+static void swap_tier_show_mask(struct seq_file *m, int memcg_mask)
+{
+	struct swap_tier *tier;
+	int idx;
+
+	for_each_active_tier(tier) {
+		idx = TIER_IDX(tier);
+		int tier_on_mask = TIER_MASK(idx, TIER_ON_MASK);
+		int tier_off_mask = TIER_MASK(idx, TIER_OFF_MASK);
+
+		if (memcg_mask & tier_on_mask)
+			seq_printf(m, "+%s ", tier->name);
+		else if (memcg_mask & tier_off_mask)
+			seq_printf(m, "-%s ", tier->name);
+	}
+	seq_puts(m, "\n");
+}
+
+static bool swap_tier_has_default(int mask)
+{
+	return mask & DEFAULT_FULL_MASK;
+}
+
+/*
+ * TODO: Simplify tier mask collection design if possible.
+ *
+ * Current approach uses two separate fields per cgroup:
+ *   - tiers_onoff: 2-bit per tier to indicate on/off state
+ *   - tiers_mask:  2-bit per tier to indicate if the tier is configured
+ *
+ * Trade-off to consider:
+ *   - Pre-compute at cgroup setup time (faster runtime, complex invalidation)
+ *   - Keep runtime calculation (simpler, but repeated parent walks)
+ */
+static int swap_tier_collect_mask(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *p;
+	int onoff, mask, merge, new;
+
+	if (!memcg)
+		return DEFAULT_ON_MASK;
+
+	onoff = memcg->tiers_onoff;
+	mask = memcg->tiers_mask;
+
+	rcu_read_lock();
+	for (p = parent_mem_cgroup(memcg); p && !swap_tier_has_default(onoff);
+		p = parent_mem_cgroup(p)) {
+		merge = mask | p->tiers_mask;
+		new = merge ^ mask;
+		onoff |= p->tiers_onoff & new;
+		mask = merge;
+	}
+	rcu_read_unlock();
+
+	return onoff;
+}
+
+int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg)
+{
+	int onoff;
+
+	onoff = swap_tier_collect_mask(memcg);
+
+	/* Make common case fast */
+	if (onoff & DEFAULT_ON_MASK)
+		return onoff;
+	/*
+	 * Root memcg has DEFAULT_ON_MASK; defaults are covered.
+	 * Checking DEFAULT_OFF_MASK suffices; only each tier's ON bit is checked.
+	 * ON flag is inverted and compared with each swap device's OFF mask.
+	 */
+	return ~(onoff << 1);
+}
+
+void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_tier_lock);
+	if (memcg->tiers_onoff)
+		swap_tier_show_mask(m, memcg->tiers_onoff);
+	else
+		seq_puts(m, "\n");
+	swap_tier_show_mask(m, swap_tier_collect_mask(memcg));
+	spin_unlock(&swap_tier_lock);
+}
+
+void swap_tiers_assign(struct swap_info_struct *swp)
+{
+	struct swap_tier *tier;
+
+	spin_lock(&swap_tier_lock);
+	swp->tier_idx = NULL_TIER;
+
+	for_each_active_tier(tier) {
+		if (swap_tier_is_default(tier))
+			continue;
+		if (swap_tier_prio_in_range(tier, swp->prio)) {
+			swp->tier_idx = TIER_IDX(tier);
+			swap_tier_get(tier);
+			break;
+		}
+	}
+	spin_unlock(&swap_tier_lock);
+}
+
+void swap_tiers_release(struct swap_info_struct *swp)
+{
+	spin_lock(&swap_tier_lock);
+	if (swp->tier_idx != NULL_TIER)
+		swap_tier_put(&swap_tiers[swp->tier_idx]);
+	spin_unlock(&swap_tier_lock);
+}
+
+/* not incremental, but reset. */
+int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg)
+{
+	struct swap_tier *tier;
+	int ret = 0;
+	int tiers_mask = 0;
+	int tiers_onoff = 0;
+	int cnt = 0;
+
+	for (int i = 0; i < nr; i++) {
+		for (int j = i+1; j < nr; j++) {
+			if (!strcmp(desc[i].name, desc[j].name))
+				return -EINVAL;
+		}
+	}
+
+	spin_lock(&swap_tier_lock);
+	/* nr 0 is allowed for deletion */
+	if (nr < 0 || nr > nr_swaptiers) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (int i = 0; i < nr; i++) {
+		for_each_active_tier(tier) {
+			if (!strcmp(desc[i].name, tier->name)) {
+				tiers_mask |= TIER_MASK(TIER_IDX(tier), TIER_FULL_MASK);
+				tiers_onoff |= TIER_MASK(TIER_IDX(tier), desc[i].ops);
+				cnt++;
+				break;
+			}
+		}
+	}
+
+	if (cnt != nr) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	swap_tier_put_mask(memcg->tiers_mask);
+	swap_tier_get_mask(tiers_mask);
+	memcg->tiers_mask = tiers_mask;
+	memcg->tiers_onoff = tiers_onoff;
+out:
+	spin_unlock(&swap_tier_lock);
+	return ret;
+}
+
+void swap_tiers_put_mask(struct mem_cgroup *memcg)
+{
+	spin_lock(&swap_tier_lock);
+	swap_tier_put_mask(memcg->tiers_mask);
+	spin_unlock(&swap_tier_lock);
+}
diff --git a/mm/swap_tier.h b/mm/swap_tier.h
new file mode 100644
index 000000000000..f4b4e385638a
--- /dev/null
+++ b/mm/swap_tier.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SWAP_TIER_H
+#define _SWAP_TIER_H
+
+#include "swap.h"
+#include <linux/memcontrol.h>
+
+#ifdef CONFIG_SWAP_TIER
+
+enum swap_tiers_index {
+	SWAP_TIER_DEFAULT,
+#ifdef CONFIG_ZSWAP
+	SWAP_TIER_ZSWAP,
+#endif
+	SWAP_TIER_RESERVED_CNT,
+};
+
+#define MAX_TIERNAME			16
+#define MAX_SWAPTIER			12
+
+#define TIER_MASK(idx, mask)		((mask) << (idx) * 2)
+#define TIER_ON_MASK			(1 << 0)
+#define TIER_OFF_MASK			(1 << 1)
+#define TIER_FULL_MASK			(TIER_ON_MASK | TIER_OFF_MASK)
+
+#define DEFAULT_ON_MASK			TIER_MASK(SWAP_TIER_DEFAULT, TIER_ON_MASK)
+#define DEFAULT_OFF_MASK		TIER_MASK(SWAP_TIER_DEFAULT, TIER_OFF_MASK)
+#define DEFAULT_FULL_MASK		(DEFAULT_ON_MASK | DEFAULT_OFF_MASK)
+
+#define DEFAULT_TIER_NAME		""
+#define ZSWAP_TIER_NAME			"ZSWAP"
+
+struct tiers_desc {
+	char name[MAX_TIERNAME];
+	union {
+		signed short prio_st;
+		signed short ops;
+	};
+	int tier_idx;
+};
+
+void swap_tiers_init(void);
+int swap_tiers_add(struct tiers_desc desc[], int nr);
+int swap_tiers_remove(struct tiers_desc desc[], int nr);
+ssize_t swap_tiers_show_sysfs(char *buf);
+void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg);
+void swap_tiers_assign(struct swap_info_struct *swp);
+void swap_tiers_release(struct swap_info_struct *swp);
+int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg);
+void swap_tiers_put_mask(struct mem_cgroup *memcg);
+static inline bool swap_tiers_test_off(int tier_idx, int mask)
+{
+	return TIER_MASK(tier_idx, TIER_OFF_MASK) & mask;
+}
+int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
+#else
+static inline void swap_tiers_init(void)
+{
+}
+static inline void swap_tiers_assign(struct swap_info_struct *swp)
+{
+}
+static inline void swap_tiers_release(struct swap_info_struct *swp)
+{
+}
+static inline bool swap_tiers_test_off(int tier_off_mask, int mask)
+{
+	return false;
+}
+static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
+{
+	return 0;
+}
+#endif
+#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3bb77c9a4879..a5c90e419ff3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -63,7 +63,7 @@ static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
 			 enum swap_cluster_flags new_flags);
 
-static DEFINE_SPINLOCK(swap_lock);
+DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
 atomic_long_t nr_swap_pages;
 /*
@@ -74,7 +74,6 @@ atomic_long_t nr_swap_pages;
 EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
-#define DEF_SWAP_PRIO  -1
 unsigned long swapfile_maximum_size;
 #ifdef CONFIG_MIGRATION
 bool swap_migration_ad_supported;
@@ -89,7 +88,7 @@ static const char Unused_offset[] = "Unused swap offset entry ";
  * all active swap_info_structs
  * protected with swap_lock, and ordered by priority.
  */
-static PLIST_HEAD(swap_active_head);
+PLIST_HEAD(swap_active_head);
 
 /*
  * all available (active, not full) swap_info_structs
-- 
2.34.1




* [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
  2025-11-09 12:49 [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
  2025-11-09 12:49 ` [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
  2025-11-09 12:49 ` [PATCH 2/3] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2025-11-09 12:49 ` Youngjun Park
  2025-11-10 11:40   ` kernel test robot
                     ` (3 more replies)
  2025-11-12 13:34 ` [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
  2025-11-15  1:22 ` SeongJae Park
  4 siblings, 4 replies; 25+ messages in thread
From: Youngjun Park @ 2025-11-09 12:49 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: cgroups, linux-kernel, chrisl, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, youngjun.park, gunho.lee, taejoon.song

Integrate the swap tier infrastructure into the existing swap subsystem
to enable selective swap device usage based on tier configuration.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
 mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++
 mm/page_io.c    | 21 ++++++++++-
 mm/swap_state.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c   | 15 ++++++--
 4 files changed, 194 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfc986da3289..33c7cc069754 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "swap_tier.h"
 
 #include <linux/uaccess.h>
 
@@ -3730,6 +3731,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
+	swap_tiers_put_mask(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -3842,6 +3844,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		page_counter_init(&memcg->kmem, &parent->kmem, false);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem, false);
 #endif
+#ifdef CONFIG_SWAP_TIER
+		memcg->tiers_mask = 0;
+		memcg->tiers_onoff = 0;
+#endif
+
 	} else {
 		init_memcg_stats();
 		init_memcg_events();
@@ -3850,6 +3857,10 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #ifdef CONFIG_MEMCG_V1
 		page_counter_init(&memcg->kmem, NULL, false);
 		page_counter_init(&memcg->tcpmem, NULL, false);
+#endif
+#ifdef CONFIG_SWAP_TIER
+		memcg->tiers_mask = DEFAULT_FULL_MASK;
+		memcg->tiers_onoff = DEFAULT_ON_MASK;
 #endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
@@ -5390,6 +5401,56 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_SWAP_TIER
+static int swap_tier_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	swap_tiers_show_memcg(m, memcg);
+	return 0;
+}
+
+static ssize_t swap_tier_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct tiers_desc desc[MAX_SWAPTIER] = {};
+	char *pos = buf, *token;
+	int nr = 0;
+	int ret;
+
+	while ((token = strsep(&pos, " \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		if (nr >= MAX_SWAPTIER)
+			return -E2BIG;
+
+		if (token[0] != '+' && token[0] != '-')
+			return -EINVAL;
+
+		desc[nr].ops = (token[0] == '+') ? TIER_ON_MASK : TIER_OFF_MASK;
+
+		if (strlen(token) <= 1) {
+			strscpy(desc[nr].name, DEFAULT_TIER_NAME);
+			nr++;
+			continue;
+		}
+
+		if (strscpy(desc[nr].name, token + 1, MAX_TIERNAME) < 0)
+			return -EINVAL;
+
+		nr++;
+	}
+
+	ret = swap_tiers_get_mask(desc, nr, memcg);
+	if (ret)
+		return ret;
+
+	return nbytes;
+}
+#endif
+
 static struct cftype swap_files[] = {
 	{
 		.name = "swap.current",
@@ -5422,6 +5483,14 @@ static struct cftype swap_files[] = {
 		.file_offset = offsetof(struct mem_cgroup, swap_events_file),
 		.seq_show = swap_events_show,
 	},
+#ifdef CONFIG_SWAP_TIER
+	{
+		.name = "swap.tiers",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_tier_show,
+		.write = swap_tier_write,
+	},
+#endif
 	{ }	/* terminate */
 };
 
diff --git a/mm/page_io.c b/mm/page_io.c
index 3c342db77ce3..2b3b1154a169 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -26,6 +26,7 @@
 #include <linux/delayacct.h>
 #include <linux/zswap.h>
 #include "swap.h"
+#include "swap_tier.h"
 
 static void __end_swap_bio_write(struct bio *bio)
 {
@@ -233,6 +234,24 @@ static void swap_zeromap_folio_clear(struct folio *folio)
 	}
 }
 
+#if defined(CONFIG_SWAP_TIER) && defined(CONFIG_ZSWAP)
+static bool folio_swap_tier_zswap_test_off(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = folio_memcg(folio);
+	if (memcg)
+		return swap_tier_test_off(memcg->tiers_mask,
+			TIER_MASK(SWAP_TIER_ZSWAP, TIER_ON_MASK));
+
+	return false;
+}
+#else
+static bool folio_swap_tier_zswap_test_off(struct folio *folio)
+{
+	return false;
+}
+#endif
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.
@@ -272,7 +291,7 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
 	 */
 	swap_zeromap_folio_clear(folio);
 
-	if (zswap_store(folio)) {
+	if (folio_swap_tier_zswap_test_off(folio) || zswap_store(folio)) {
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		goto out_unlock;
 	}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3f85a1c4cfd9..2e5f65ff2479 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -25,6 +25,7 @@
 #include "internal.h"
 #include "swap_table.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
@@ -836,8 +837,100 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 }
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
 
+#ifdef CONFIG_SWAP_TIER
+static ssize_t tiers_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return swap_tiers_show_sysfs(buf);
+}
+
+static ssize_t tiers_store(struct kobject *kobj,
+				struct kobj_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct tiers_desc desc[MAX_SWAPTIER] = {};
+	int nr = 0;
+	char *data, *p, *token;
+	int ret = 0;
+	bool is_add = true;
+
+	if (!count)
+		return -EINVAL;
+
+	data = kmemdup_nul(buf, count, GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	p = data;
+
+	if (*p == '+')
+		p++;
+	else if (*p == '-') {
+		is_add = false;
+		p++;
+	} else
+		return -EINVAL;
+
+	while ((token = strsep(&p, ", \t\n")) != NULL) {
+		if (!*token)
+			continue;
+
+		if (nr >= MAX_SWAPTIER) {
+			ret = -E2BIG;
+			goto out;
+		}
+
+		if (is_add) {
+			char *name, *prio_str;
+			int prio;
+
+			name = strsep(&token, ":");
+			prio_str = token;
+
+			if (!name || !prio_str || !*name || !*prio_str) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (strscpy(desc[nr].name, name, MAX_TIERNAME) < 0) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (kstrtoint(prio_str, 10, &prio)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			desc[nr].prio_st = prio;
+		} else {
+			if (strscpy(desc[nr].name, token, MAX_TIERNAME) < 0) {
+				ret = -EINVAL;
+				goto out;
+			}
+			desc[nr].prio_st = 0;
+		}
+		nr++;
+	}
+
+	if (is_add)
+		ret = swap_tiers_add(desc, nr);
+	else
+		ret = swap_tiers_remove(desc, nr);
+
+out:
+	kfree(data);
+	return ret ? ret : count;
+}
+
+static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
+#endif
+
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+#ifdef CONFIG_SWAP_TIER
+	&tier_attr.attr,
+#endif
 	NULL,
 };
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a5c90e419ff3..8715a2d94140 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -49,6 +49,7 @@
 #include "swap_table.h"
 #include "internal.h"
 #include "swap.h"
+#include "swap_tier.h"
 
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
@@ -1296,7 +1297,8 @@ static bool get_swap_device_info(struct swap_info_struct *si)
 
 /* Rotate the device and switch to a new cluster */
 static void swap_alloc_entry(swp_entry_t *entry,
-			    int order)
+			    int order,
+			    int mask)
 {
 	unsigned long offset;
 	struct swap_info_struct *si, *next;
@@ -1304,6 +1306,8 @@ static void swap_alloc_entry(swp_entry_t *entry,
 	spin_lock(&swap_avail_lock);
 start_over:
 	plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
+		if (swap_tiers_test_off(si->tier_idx, mask))
+			continue;
 		/* Rotate the device and switch to a new cluster */
 		plist_requeue(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
@@ -1376,6 +1380,7 @@ int folio_alloc_swap(struct folio *folio)
 {
 	unsigned int order = folio_order(folio);
 	unsigned int size = 1 << order;
+	int mask;
 	swp_entry_t entry = {};
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -1400,8 +1405,8 @@ int folio_alloc_swap(struct folio *folio)
 	}
 
 again:
-	swap_alloc_entry(&entry, order);
-
+	mask = swap_tiers_collect_compare_mask(folio_memcg(folio));
+	swap_alloc_entry(&entry, order, mask);
 	if (unlikely(!order && !entry.val)) {
 		if (swap_sync_discard())
 			goto again;
@@ -2673,6 +2678,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 
 	/* Add back to available list */
 	add_to_avail_list(si, true);
+
+	swap_tiers_assign(si);
 }
 
 static void enable_swap_info(struct swap_info_struct *si, int prio,
@@ -2840,6 +2847,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
 	drain_mmlist();
+	swap_tiers_release(p);
 
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
@@ -4004,6 +4012,7 @@ static int __init swapfile_init(void)
 		swap_migration_ad_supported = true;
 #endif	/* CONFIG_MIGRATION */
 
+	swap_tiers_init();
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.34.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
  2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
@ 2025-11-10 11:40   ` kernel test robot
  2025-11-10 12:12   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-11-10 11:40 UTC (permalink / raw)
  To: Youngjun Park, akpm, linux-mm
  Cc: oe-kbuild-all, cgroups, linux-kernel, chrisl, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	nphamcs, bhe, baohua, youngjun.park, gunho.lee, taejoon.song

Hi Youngjun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20251109-215033
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251109124947.1101520-4-youngjun.park%40lge.com
patch subject: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
config: s390-randconfig-001-20251110 (https://download.01.org/0day-ci/archive/20251110/202511101938.muW3KLKk-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251110/202511101938.muW3KLKk-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511101938.muW3KLKk-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from mm/page_io.c:29:
   mm/swap_tier.h:71:1: error: expected identifier or '(' before '{' token
    {
    ^
>> mm/swap_tier.h:70:19: warning: 'swap_tiers_collect_compare_mask' declared 'static' but never defined [-Wunused-function]
    static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
   In file included from mm/swapfile.c:52:
   mm/swap_tier.h:71:1: error: expected identifier or '(' before '{' token
    {
    ^
   mm/swapfile.c: In function 'swap_alloc_entry':
   mm/swapfile.c:1309:29: error: 'struct swap_info_struct' has no member named 'tier_idx'
      if (swap_tiers_test_off(si->tier_idx, mask))
                                ^~
   In file included from mm/swapfile.c:52:
   mm/swapfile.c: At top level:
>> mm/swap_tier.h:70:19: warning: 'swap_tiers_collect_compare_mask' used but never defined
    static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
   In file included from mm/memcontrol.c:71:
   mm/swap_tier.h:71:1: error: expected identifier or '(' before '{' token
    {
    ^
   mm/memcontrol.c: In function 'mem_cgroup_free':
   mm/memcontrol.c:3734:2: error: implicit declaration of function 'swap_tiers_put_mask'; did you mean 'swap_tiers_release'? [-Werror=implicit-function-declaration]
     swap_tiers_put_mask(memcg);
     ^~~~~~~~~~~~~~~~~~~
     swap_tiers_release
   In file included from mm/memcontrol.c:71:
   mm/memcontrol.c: At top level:
>> mm/swap_tier.h:70:19: warning: 'swap_tiers_collect_compare_mask' declared 'static' but never defined [-Wunused-function]
    static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for OF_GPIO
   Depends on [n]: GPIOLIB [=y] && OF [=y] && HAS_IOMEM [=n]
   Selected by [y]:
   - REGULATOR_RT5133 [=y] && REGULATOR [=y] && I2C [=y] && GPIOLIB [=y] && OF [=y]


vim +70 mm/swap_tier.h

c17fe68325c921 Youngjun Park 2025-11-09  41  
c17fe68325c921 Youngjun Park 2025-11-09  42  void swap_tiers_init(void);
c17fe68325c921 Youngjun Park 2025-11-09  43  int swap_tiers_add(struct tiers_desc desc[], int nr);
c17fe68325c921 Youngjun Park 2025-11-09  44  int swap_tiers_remove(struct tiers_desc desc[], int nr);
c17fe68325c921 Youngjun Park 2025-11-09  45  ssize_t swap_tiers_show_sysfs(char *buf);
c17fe68325c921 Youngjun Park 2025-11-09  46  void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  47  void swap_tiers_assign(struct swap_info_struct *swp);
c17fe68325c921 Youngjun Park 2025-11-09  48  void swap_tiers_release(struct swap_info_struct *swp);
c17fe68325c921 Youngjun Park 2025-11-09  49  int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  50  void swap_tiers_put_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  51  static inline bool swap_tiers_test_off(int tier_idx, int mask)
c17fe68325c921 Youngjun Park 2025-11-09  52  {
c17fe68325c921 Youngjun Park 2025-11-09  53  	return TIER_MASK(tier_idx, TIER_OFF_MASK) & mask;
c17fe68325c921 Youngjun Park 2025-11-09  54  }
c17fe68325c921 Youngjun Park 2025-11-09  55  int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  56  #else
c17fe68325c921 Youngjun Park 2025-11-09  57  static inline void swap_tiers_init(void)
c17fe68325c921 Youngjun Park 2025-11-09  58  {
c17fe68325c921 Youngjun Park 2025-11-09  59  }
c17fe68325c921 Youngjun Park 2025-11-09  60  static inline void swap_tiers_assign(struct swap_info_struct *swp)
c17fe68325c921 Youngjun Park 2025-11-09  61  {
c17fe68325c921 Youngjun Park 2025-11-09  62  }
c17fe68325c921 Youngjun Park 2025-11-09  63  static inline void swap_tiers_release(struct swap_info_struct *swp)
c17fe68325c921 Youngjun Park 2025-11-09  64  {
c17fe68325c921 Youngjun Park 2025-11-09  65  }
c17fe68325c921 Youngjun Park 2025-11-09  66  static inline bool swap_tiers_test_off(int tier_off_mask, int mask)
c17fe68325c921 Youngjun Park 2025-11-09  67  {
c17fe68325c921 Youngjun Park 2025-11-09  68  	return false;
c17fe68325c921 Youngjun Park 2025-11-09  69  }
c17fe68325c921 Youngjun Park 2025-11-09 @70  static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
  2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
  2025-11-10 11:40   ` kernel test robot
@ 2025-11-10 12:12   ` kernel test robot
  2025-11-10 13:26   ` kernel test robot
  2025-11-12 14:44   ` Chris Li
  3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-11-10 12:12 UTC (permalink / raw)
  To: Youngjun Park, akpm, linux-mm
  Cc: oe-kbuild-all, cgroups, linux-kernel, chrisl, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	nphamcs, bhe, baohua, youngjun.park, gunho.lee, taejoon.song

Hi Youngjun,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20251109-215033
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251109124947.1101520-4-youngjun.park%40lge.com
patch subject: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
config: s390-randconfig-001-20251110 (https://download.01.org/0day-ci/archive/20251110/202511102053.StJy9JAB-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251110/202511102053.StJy9JAB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511102053.StJy9JAB-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/page_io.c:29:
>> mm/swap_tier.h:71:1: error: expected identifier or '(' before '{' token
    {
    ^
   mm/swap_tier.h:70:19: warning: 'swap_tiers_collect_compare_mask' declared 'static' but never defined [-Wunused-function]
    static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
   In file included from mm/swapfile.c:52:
>> mm/swap_tier.h:71:1: error: expected identifier or '(' before '{' token
    {
    ^
   mm/swapfile.c: In function 'swap_alloc_entry':
>> mm/swapfile.c:1309:29: error: 'struct swap_info_struct' has no member named 'tier_idx'
      if (swap_tiers_test_off(si->tier_idx, mask))
                                ^~
   In file included from mm/swapfile.c:52:
   mm/swapfile.c: At top level:
   mm/swap_tier.h:70:19: warning: 'swap_tiers_collect_compare_mask' used but never defined
    static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
   In file included from mm/memcontrol.c:71:
>> mm/swap_tier.h:71:1: error: expected identifier or '(' before '{' token
    {
    ^
   mm/memcontrol.c: In function 'mem_cgroup_free':
>> mm/memcontrol.c:3734:2: error: implicit declaration of function 'swap_tiers_put_mask'; did you mean 'swap_tiers_release'? [-Werror=implicit-function-declaration]
     swap_tiers_put_mask(memcg);
     ^~~~~~~~~~~~~~~~~~~
     swap_tiers_release
   In file included from mm/memcontrol.c:71:
   mm/memcontrol.c: At top level:
   mm/swap_tier.h:70:19: warning: 'swap_tiers_collect_compare_mask' declared 'static' but never defined [-Wunused-function]
    static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for OF_GPIO
   Depends on [n]: GPIOLIB [=y] && OF [=y] && HAS_IOMEM [=n]
   Selected by [y]:
   - REGULATOR_RT5133 [=y] && REGULATOR [=y] && I2C [=y] && GPIOLIB [=y] && OF [=y]


vim +71 mm/swap_tier.h

c17fe68325c921 Youngjun Park 2025-11-09  41  
c17fe68325c921 Youngjun Park 2025-11-09  42  void swap_tiers_init(void);
c17fe68325c921 Youngjun Park 2025-11-09  43  int swap_tiers_add(struct tiers_desc desc[], int nr);
c17fe68325c921 Youngjun Park 2025-11-09  44  int swap_tiers_remove(struct tiers_desc desc[], int nr);
c17fe68325c921 Youngjun Park 2025-11-09  45  ssize_t swap_tiers_show_sysfs(char *buf);
c17fe68325c921 Youngjun Park 2025-11-09  46  void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  47  void swap_tiers_assign(struct swap_info_struct *swp);
c17fe68325c921 Youngjun Park 2025-11-09  48  void swap_tiers_release(struct swap_info_struct *swp);
c17fe68325c921 Youngjun Park 2025-11-09  49  int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  50  void swap_tiers_put_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  51  static inline bool swap_tiers_test_off(int tier_idx, int mask)
c17fe68325c921 Youngjun Park 2025-11-09  52  {
c17fe68325c921 Youngjun Park 2025-11-09  53  	return TIER_MASK(tier_idx, TIER_OFF_MASK) & mask;
c17fe68325c921 Youngjun Park 2025-11-09  54  }
c17fe68325c921 Youngjun Park 2025-11-09  55  int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  56  #else
c17fe68325c921 Youngjun Park 2025-11-09  57  static inline void swap_tiers_init(void)
c17fe68325c921 Youngjun Park 2025-11-09  58  {
c17fe68325c921 Youngjun Park 2025-11-09  59  }
c17fe68325c921 Youngjun Park 2025-11-09  60  static inline void swap_tiers_assign(struct swap_info_struct *swp)
c17fe68325c921 Youngjun Park 2025-11-09  61  {
c17fe68325c921 Youngjun Park 2025-11-09  62  }
c17fe68325c921 Youngjun Park 2025-11-09  63  static inline void swap_tiers_release(struct swap_info_struct *swp)
c17fe68325c921 Youngjun Park 2025-11-09  64  {
c17fe68325c921 Youngjun Park 2025-11-09  65  }
c17fe68325c921 Youngjun Park 2025-11-09  66  static inline bool swap_tiers_test_off(int tier_off_mask, int mask)
c17fe68325c921 Youngjun Park 2025-11-09  67  {
c17fe68325c921 Youngjun Park 2025-11-09  68  	return false;
c17fe68325c921 Youngjun Park 2025-11-09  69  }
c17fe68325c921 Youngjun Park 2025-11-09  70  static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09 @71  {

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
  2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
  2025-11-10 11:40   ` kernel test robot
  2025-11-10 12:12   ` kernel test robot
@ 2025-11-10 13:26   ` kernel test robot
  2025-11-12 14:44   ` Chris Li
  3 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-11-10 13:26 UTC (permalink / raw)
  To: Youngjun Park, akpm, linux-mm
  Cc: llvm, oe-kbuild-all, cgroups, linux-kernel, chrisl, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, youngjun.park, gunho.lee,
	taejoon.song

Hi Youngjun,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Youngjun-Park/mm-swap-introduce-swap-tier-infrastructure/20251109-215033
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251109124947.1101520-4-youngjun.park%40lge.com
patch subject: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
config: riscv-randconfig-001-20251110 (https://download.01.org/0day-ci/archive/20251110/202511102100.y6n1mtle-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 996639d6ebb86ff15a8c99b67f1c2e2117636ae7)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251110/202511102100.y6n1mtle-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511102100.y6n1mtle-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/page_io.c:29:
>> mm/swap_tier.h:71:1: error: expected identifier or '('
      71 | {
         | ^
   1 error generated.
--
   In file included from mm/swapfile.c:52:
>> mm/swap_tier.h:71:1: error: expected identifier or '('
      71 | {
         | ^
>> mm/swapfile.c:1309:31: error: no member named 'tier_idx' in 'struct swap_info_struct'
    1309 |                 if (swap_tiers_test_off(si->tier_idx, mask))
         |                                         ~~  ^
   2 errors generated.


vim +71 mm/swap_tier.h

c17fe68325c921 Youngjun Park 2025-11-09  41  
c17fe68325c921 Youngjun Park 2025-11-09  42  void swap_tiers_init(void);
c17fe68325c921 Youngjun Park 2025-11-09  43  int swap_tiers_add(struct tiers_desc desc[], int nr);
c17fe68325c921 Youngjun Park 2025-11-09  44  int swap_tiers_remove(struct tiers_desc desc[], int nr);
c17fe68325c921 Youngjun Park 2025-11-09  45  ssize_t swap_tiers_show_sysfs(char *buf);
c17fe68325c921 Youngjun Park 2025-11-09  46  void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  47  void swap_tiers_assign(struct swap_info_struct *swp);
c17fe68325c921 Youngjun Park 2025-11-09  48  void swap_tiers_release(struct swap_info_struct *swp);
c17fe68325c921 Youngjun Park 2025-11-09  49  int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  50  void swap_tiers_put_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  51  static inline bool swap_tiers_test_off(int tier_idx, int mask)
c17fe68325c921 Youngjun Park 2025-11-09  52  {
c17fe68325c921 Youngjun Park 2025-11-09  53  	return TIER_MASK(tier_idx, TIER_OFF_MASK) & mask;
c17fe68325c921 Youngjun Park 2025-11-09  54  }
c17fe68325c921 Youngjun Park 2025-11-09  55  int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09  56  #else
c17fe68325c921 Youngjun Park 2025-11-09  57  static inline void swap_tiers_init(void)
c17fe68325c921 Youngjun Park 2025-11-09  58  {
c17fe68325c921 Youngjun Park 2025-11-09  59  }
c17fe68325c921 Youngjun Park 2025-11-09  60  static inline void swap_tiers_assign(struct swap_info_struct *swp)
c17fe68325c921 Youngjun Park 2025-11-09  61  {
c17fe68325c921 Youngjun Park 2025-11-09  62  }
c17fe68325c921 Youngjun Park 2025-11-09  63  static inline void swap_tiers_release(struct swap_info_struct *swp)
c17fe68325c921 Youngjun Park 2025-11-09  64  {
c17fe68325c921 Youngjun Park 2025-11-09  65  }
c17fe68325c921 Youngjun Park 2025-11-09  66  static inline bool swap_tiers_test_off(int tier_off_mask, int mask)
c17fe68325c921 Youngjun Park 2025-11-09  67  {
c17fe68325c921 Youngjun Park 2025-11-09  68  	return false;
c17fe68325c921 Youngjun Park 2025-11-09  69  }
c17fe68325c921 Youngjun Park 2025-11-09  70  static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
c17fe68325c921 Youngjun Park 2025-11-09 @71  {

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-09 12:49 [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (2 preceding siblings ...)
  2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
@ 2025-11-12 13:34 ` Chris Li
  2025-11-13  1:33   ` YoungJun Park
  2025-11-15  1:22 ` SeongJae Park
  4 siblings, 1 reply; 25+ messages in thread
From: Chris Li @ 2025-11-12 13:34 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

Hi Youngjun,

Sorry for the late reply, I have been super busy in the last two weeks
and I am still catching up on my backlogs.

Thanks for the patches. I notice that your cover letter does not have
[0/3] on it. One tool I have found useful is b4, which sends out a
series with the cover letter numbered as [0/N]. Just for your
consideration, it is not an ask; I can review patches not sent with
b4 just fine.
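
For example, the documented prep/send workflow looks roughly like
this (subcommand names are from b4's docs; exact flags may vary by
version):

  $ b4 prep -n swap-tiers    # start a series branch with a cover letter
  $ b4 prep --check          # run basic checks over the series
  $ b4 send                  # mail it out as [0/N]..[N/N]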

On Sun, Nov 9, 2025 at 4:50 AM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Hi all,
>
> In constrained environments, there is a need to improve workload
> performance by controlling swap device usage on a per-process or
> per-cgroup basis. For example, one might want to direct critical
> processes to faster swap devices (like SSDs) while relegating
> less critical ones to slower devices (like HDDs or Network Swap).
>
> Initial approach was to introduce a per-cgroup swap priority
> mechanism [1]. However, through review and discussion, several
> drawbacks were identified:
>
> a. There is a lack of concrete use cases for assigning a fine-grained,
>    unique swap priority to each cgroup.
> b. The implementation complexity was high relative to the desired
>    level of control.
> c. Differing swap priorities between cgroups could lead to LRU
>    inversion problems.
>
> To address these concerns, I propose the "swap tiers" concept,
> originally suggested by Chris Li [2] and further developed through
> collaborative discussions. I would like to thank Chris Li and
> He Baoquan for their invaluable contributions in refining this
> approach, and Kairui Song, Nhat Pham, and Michal Koutný for their
> insightful reviews of earlier RFC versions.
>
> Concept
> -------
> A swap tier is a grouping mechanism that assigns a "named id" to a
> range of swap priorities. For example, all swap devices with a
> priority of 100 or higher could be grouped into a tier named "SSD",
> and all others into a tier named "HDD".
>
> Cgroups can then select which named tiers they are permitted to use for
> swapping via a new cgroup interface. This effectively restricts a
> cgroup's swap activity to a specific subset of the available swap
> devices.
>
> Proposed Interface
> ------------------
> 1. Global Tier Definition: /sys/kernel/mm/swap/tiers
>
> This file is used to define the global swap tiers and their associated
> minimum priority levels.
>
> - To add tiers:
>   Format: + 'tier_name':'prio'[,|' ']'tier_name 2':'prio']...
>   Example:
>   # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers

I think much of this documentation-style content in the cover letter
should move into a kernel documentation commit, maybe
Documentation/mm/swap_tiers.rst.

Another suggestion is to use a "+SSD:100,+HDD:2,-SD" kind of flavor,
similar to the "cgroup.subtree_control" interface, which enables or
disables controllers. That way you can add and remove in one write.
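
For illustration, one write could then both define and remove tiers
in a single action (tier names here are hypothetical):

  # echo "+SSD:100,+HDD:2,-NET" > /sys/kernel/mm/swap/tiers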

>
>   There are several rules for defining tiers:
>   - Priority ranges for tiers must not overlap.

We can add a note suggesting that faster swap devices be given a
higher priority range, so that more swap page faults are likely to be
served by the faster devices.
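
For illustration, priorities could then be assigned at swapon time so
each device lands in the intended tier (device paths hypothetical):

  # swapon -p 150 /dev/nvme0n1p2   # prio 150 -> "SSD" tier (>= 100)
  # swapon -p 10  /dev/sda2        # prio 10  -> "HDD" tier (>= 2)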

>   - The combination of all defined tiers must cover the entire valid
>     priority range (DEF_SWAP_PRIO to SHRT_MAX) to ensure every swap device
>     can be assigned to a tier.
>   - A tier's prio value is its inclusive lower bound,
>     covering priorities up to the next tier's prio.
>     The highest tier extends to SHRT_MAX, and the lowest tier extends to DEF_SWAP_PRIO.
>   - If the specified tiers do not cover the entire priority range,
>     the priority of the tier with the lowest specified priority value
>     is set to SHRT_MIN
>   - The total number of tiers is limited.
>
> - To remove tiers:
>   Format: - 'tier_name'[,|' ']'tier_name2']...
>   Example:
>   # echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers

See above, make the '-SSD, -HDD' similar to the "cgroup.subtree_control"
>
>   Note: A tier cannot be removed if it is currently in use by any
>   cgroup or if any active swap device is assigned to it. This acts as
>   a reference count to prevent disruption.
>
> - To show current tiers:
>   Reading the file displays the currently configured tiers, their
>   internal index, and the priority range they cover.
>   Example:
>   # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
>   # cat /sys/kernel/mm/swap/tiers
>   Name      Idx   PrioStart   PrioEnd
>             0
>   SSD       1    100         32767
>   HDD       2     -1         99
>
>   - `Name`: The name of the tier. The unnamed entry is a default tier.
>   - `Idx`: The internal index assigned to the tier.
>   - `PrioStart`: The starting priority of the range covered by this tier.
>   - `PrioEnd`: The ending priority of the range covered by this tier.
>
> Two special tiers are predefined:
> - "": Represents the default inheritance behavior in cgroups.
This belongs in the memory.swap.tiers section.
"" is not a real tier name; it is just a wildcard that refers to all tiers.

> - "zswap": Reserved for zswap integration.

One thing I realize is that we might need each swap tier to have a
matching zswap tier. Otherwise, when we refer to zswap, there is no
way for the cgroup to select which backing swapfile zswap uses to
allocate the swap entry.

We can avoid this complexity by providing a dedicated ghost swapfile,
which only zswap can use to allocate swap entries.

> 2. Cgroup Tier Selection: memory.swap.tiers
>
> This file controls which swap tiers are enabled for a given cgroup.
>
> - Reading the file:
>   The first line shows the operation that was written to the file.
>   The second line shows the final, effective set of tiers after
>   merging with the parent cgroup's configuration.
>
> - Writing to the file:
>   Format: [+/-] [+|-][TIER_NAME]...
>   - `+TIER_NAME`: Explicitly enables this tier for the cgroup.
>   - `-TIER_NAME`: Explicitly disables this tier for the cgroup.
>   - If a tier is not specified, its setting is inherited from the
>     parent cgroup.
>   - A standalone `+` at the beginning resets the configuration: it
>     ignores the parent's settings, enables all globally defined tiers,
>     and then applies the subsequent operations in the command.
>   - A standalone `-` at the beginning also resets: it ignores the
>     parent's settings, disables all tiers, and then applies subsequent
>     operations.
>   - The root cgroup defaults to an implicit `+`, enabling all swap
>     devices.
>
>   Example:
>   # echo "+ -SSD -HDD" > /sys/fs/cgroup/my_cgroup/memory.swap.tiers
>   This command first resets the cgroup's configuration to enable all
>   tiers (due to the leading `+`), and then explicitly disables the
>   "SSD" and "HDD" tiers.
>
> Further Discussion and Open Questions
> -------------------------------------
> I seek feedback on this concept and have identified several key
> points that require further discussion (though this is not an
> exhaustive list). This topic will also be presented at the upcoming
> Linux Plumbers Conference 2025 [3], and I would appreciate any
> feedback here on the list beforehand, or in person at the conference.

All very good questions. Thanks for preparing that.

>
> 1.  The swap fast path utilizes a percpu cluster cache for efficiency.
>     In swap tiers, this has been changed to a per-device per-cpu
>     cluster cache. (See the first patch in this series.)
>     An alternative approach would be to cache only the swap_info_struct
>     (si) per-tier per-cpu, avoiding cluster caching entirely while still
>     maintaining fast device acquisition without `swap_avail_lock`.
>     Should we pursue this alternative, or is the current per-device
>     per-cpu cluster caching approach preferable?
>
> 2.  Consistency with cgroup parent-child semantics: Unlike general
>     resource distribution, tier selection may bypass parent
>     constraints (e.g., a child can enable a tier disabled by its
>     parent). Is this behavior acceptable?
>
> 3.  Per-cgroup swap tier limit: Is a `swap.tier.max` needed in
>     addition to the existing `swap.max`?

I really hope not.

>
> 4.  Parent-child tier mismatch: If a zombie memcg (child) uses a tier
>     that is not available to its new parent, how should this be
>     handled during recharging or reparenting? (This question is raised
>     in the context of ongoing work to improve memcg reparenting and
>     handle zombie memcgs [4, 5].)

I really want to avoid doing the reparenting.

Chris


>
> 5.  Tier mask calculation: What are the trade-offs between calculating
>     the effective tier mask at runtime vs. pre-calculating it when the
>     interface is written to?
>
> 6.  If a swap tier configuration is applied to a memcg, should we
>     migrate existing swap-out pages that are on devices not belonging
>     to any of the cgroup's allowed tiers?
>
> 7.  swap tier could be good abstraction layer. Discuss on extended usage of swap tiers.
>
> Any feedback on the overall concept, interface, and these specific
> points would be greatly appreciated.
>
> Best Regards,
> Youngjun Park
>
> References
> ----------
> [1] https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
> [2] https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
> [3] https://lpc.events/event/19/abstracts/2296/
> [4] https://lore.kernel.org/linux-mm/20230720070825.992023-1-yosryahmed@google.com/
> [5] https://blogs.oracle.com/linux/post/zombie-memcg-issues
>
> Youngjun Park (3):
>   mm, swap: change back to use each swap device's percpu cluster
>   mm: swap: introduce swap tier infrastructure
>   mm/swap: integrate swap tier infrastructure into swap subsystem
>
>  Documentation/admin-guide/cgroup-v2.rst |  32 ++
>  MAINTAINERS                             |   2 +
>  include/linux/memcontrol.h              |   4 +
>  include/linux/swap.h                    |  16 +-
>  mm/Kconfig                              |  13 +
>  mm/Makefile                             |   1 +
>  mm/memcontrol.c                         |  69 +++
>  mm/page_io.c                            |  21 +-
>  mm/swap.h                               |   4 +
>  mm/swap_state.c                         |  93 ++++
>  mm/swap_tier.c                          | 602 ++++++++++++++++++++++++
>  mm/swap_tier.h                          |  75 +++
>  mm/swapfile.c                           | 169 +++----
>  13 files changed, 987 insertions(+), 114 deletions(-)
>  create mode 100644 mm/swap_tier.c
>  create mode 100644 mm/swap_tier.h
>
> base-commit: 02dafa01ec9a00c3758c1c6478d82fe601f5f1ba
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] mm: swap: introduce swap tier infrastructure
  2025-11-09 12:49 ` [PATCH 2/3] mm: swap: introduce swap tier infrastructure Youngjun Park
@ 2025-11-12 14:20   ` Chris Li
  2025-11-13  2:01     ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Li @ 2025-11-12 14:20 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

First of all, for an RFC series, please include RFC in each patch subject as well.

For the real patch submission, please consider splitting it into
smaller chunks with incremental milestones. Only introduce the
functions needed for each milestone, rather than defining them all
together and then using them in later patches.

See some feedback as follows.

This patch is too big; to be continued.

Chris

On Sun, Nov 9, 2025 at 4:50 AM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Swap tier groups swap devices by priority ranges.
> This allows different memory groups to use different swap devices.
>
> Swap tiers are identified by names and configured with priority ranges.
> Two special tiers are built-in: default (no name) tier and "zswap" tier.
>
> Interface:
> * /sys/kernel/mm/swap/tiers
> - Write:
> - Add format: "+ tierName:prio,tierName2:prio..."
> - Example: echo "+ SSD:100,HDD:-1" > /sys/kernel/mm/swap/tiers
> - Remove format: "- tierName,tierName2..."
> - Example: echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers
> - Priority ranges merge when no cgroup references exist
> - Minimum priority becomes -1 when tiers are configured above -1
> - Read:
> - Format: "Name             Idx   PrioStart   PrioEnd"
> - Shows configured tiers with their priority ranges
>
> * /sys/fs/cgroup/*/memory.swap.tiers
> - Read: Shows written operations and final merged result
> - Write: Format "(+/-)tierName (+/-)tierName2..."
> - Example: echo "+ +SSD -HDD" > memory.swap.tiers
> - Override semantics (echo "" clears all settings)
> - + enables tier, - disables tier
> - No prefix inherits parent settings
> - +/- alone affects all tiers (+ enables all, - disables all)
> - Root cgroup defaults to "+" (all swap devices enabled)
>
> Special tiers:
> - "" (default): for inheritance
> - "zswap": for zswap integration
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  32 ++
>  MAINTAINERS                             |   2 +
>  include/linux/memcontrol.h              |   4 +
>  include/linux/swap.h                    |   3 +
>  mm/Kconfig                              |  13 +
>  mm/Makefile                             |   1 +
>  mm/swap.h                               |   4 +
>  mm/swap_tier.c                          | 602 ++++++++++++++++++++++++
>  mm/swap_tier.h                          |  75 +++
>  mm/swapfile.c                           |   5 +-
>  10 files changed, 738 insertions(+), 3 deletions(-)
>  create mode 100644 mm/swap_tier.c
>  create mode 100644 mm/swap_tier.h
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3345961c30ac..ed8ef33b4d6a 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1850,6 +1850,38 @@ The following nested keys are defined.
>         Swap usage hard limit.  If a cgroup's swap usage reaches this
>         limit, anonymous memory of the cgroup will not be swapped out.
>
> +  memory.swap.tiers
> +        A read-write nested-keyed file which exists on non-root
> +        cgroups. The default is empty (inherits from parent).
> +
> +        This interface allows selecting which swap tiers a cgroup can
> +        use for swapping out memory.
> +
> +        When read, the file shows two lines:
> +          - The first line shows the operation string that was
> +            written to this file.
> +          - The second line shows the effective operation after
> +            merging with parent settings.
> +
> +        When writing, the format is:
> +          (+/-)(TIER_NAME) (+/-)(TIER_NAME) ...
> +
> +        Valid tier names are those configured in
> +        /sys/kernel/mm/swap/tiers.
> +
> +        Each tier can be prefixed with:
> +          +    Enable this tier (always on)
> +          -     Disable this tier (always off)
> +          none  Inherit from parent (incremental merge)
> +
> +        Special default tier operations (+ or - without tier name):
> +          +    Ignore parent settings, enable all swap tiers, then
> +                apply subsequent operations
> +          -     Ignore parent settings, disable all swap tiers, then
> +                apply subsequent operations
> +          none  Inherit all tier settings from parent, then apply
> +                subsequent operations incrementally
> +
>    memory.swap.events
>         A read-only flat-keyed file which exists on non-root cgroups.
>         The following entries are defined.  Unless specified
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fa519cbc4b88..3416d466d011 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16517,6 +16517,8 @@ F:      mm/swap.c
>  F:     mm/swap.h
>  F:     mm/swap_table.h
>  F:     mm/swap_state.c
> +F:     mm/swap_tier.c
> +F:     mm/swap_tier.h
>  F:     mm/swapfile.c
>
>  MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 966f7c1a0128..1224029620ed 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -283,6 +283,10 @@ struct mem_cgroup {
>         /* per-memcg mm_struct list */
>         struct lru_gen_mm_list mm_list;
>  #endif
> +#ifdef CONFIG_SWAP_TIER

I think we don't need this CONFIG_SWAP_TIER; making it part of the
default swap support is fine. By default memory.swap.tiers is empty
and matches the previous swap behavior, so users who are not using
swap tiers don't need to do anything. I see no reason to have a
separate config option.

> +       int tiers_onoff;
> +       int tiers_mask;
> +#endif
>
>  #ifdef CONFIG_MEMCG_V1
>         /* Legacy consumer-oriented counters */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 90fa27bb7796..8911eff9d37a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -271,6 +271,9 @@ struct swap_info_struct {
>         struct percpu_ref users;        /* indicate and keep swap device valid. */
>         unsigned long   flags;          /* SWP_USED etc: see above */
>         signed short    prio;           /* swap priority of this type */
> +#ifdef CONFIG_SWAP_TIER
> +       int tier_idx;
> +#endif
>         struct plist_node list;         /* entry in swap_active_head */
>         signed char     type;           /* strange name for an index */
>         unsigned int    max;            /* extent of the swap_map */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e1fb11f36ede..78173ffe65d6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -163,6 +163,19 @@ endmenu
>
>  endif
>
> +config SWAP_TIER
Same here: I think we can remove SWAP_TIER and just build it in whenever swap is enabled.

> +       default n
> +       bool "Swap tier"
> +       depends on SWAP && MEMCG && SYSFS
> +       help
> +          Swap tier is a concept proposed to group swap devices by
> +          priority ranges, allowing cgroups to select and use a desired
> +          swap_tier. This enables more flexible control over swap usage
> +          across different workloads.
> +          It is possible to assign a name for identifying a swap tier,
> +          and to configure the priority range used for selecting swap
> +          devices within that tier.
> +
>  menu "Slab allocator options"
>
>  config SLUB
> diff --git a/mm/Makefile b/mm/Makefile
> index 00ceb2418b64..9e98799f2edf 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -76,6 +76,7 @@ ifdef CONFIG_MMU
>  endif
>
>  obj-$(CONFIG_SWAP)     += page_io.o swap_state.o swapfile.o
> +obj-$(CONFIG_SWAP_TIER)        += swap_tier.o
>  obj-$(CONFIG_ZSWAP)    += zswap.o
>  obj-$(CONFIG_HAS_DMA)  += dmapool.o
>  obj-$(CONFIG_HUGETLBFS)        += hugetlb.o hugetlb_sysfs.o hugetlb_sysctl.o
> diff --git a/mm/swap.h b/mm/swap.h
> index d034c13d8dd2..b116282690c8 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -16,6 +16,10 @@ extern int page_cluster;
>  #define swap_entry_order(order)        0
>  #endif
>
> +#define DEF_SWAP_PRIO  -1
> +
> +extern spinlock_t swap_lock;
> +extern struct plist_head swap_active_head;
>  extern struct swap_info_struct *swap_info[];
>
>  /*
> diff --git a/mm/swap_tier.c b/mm/swap_tier.c
> new file mode 100644
> index 000000000000..4301e3c766b9
> --- /dev/null
> +++ b/mm/swap_tier.c
> @@ -0,0 +1,602 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/swap.h>
> +#include <linux/rcupdate.h>
> +#include <linux/memcontrol.h>
> +#include <linux/plist.h>
> +#include <linux/sysfs.h>
> +#include <linux/sort.h>
> +
> +#include "swap_tier.h"
> +
> +/*
> + * struct swap_tier - Structure representing a swap tier.
> + *
> + * @name:              Name of the swap_tier.
> + * @prio_st:           Starting value of priority.
> + * @prio_end:          Ending value of priority.
> + * @list:              Linked list of active tiers.
> + * @inactive_list:     Linked list of inactive tiers.
> + * @ref:               Reference count (held by swap device and memory cgroup).
> + */
> +static struct swap_tier {
> +       char name[MAX_TIERNAME];
> +       short prio_st;
> +       short prio_end;

A lot of the code's complexity comes from this priority range having
both a start and an end. We should be able to keep just one of the
two, e.g. the high end of the priority, and keep all tiers in a
sorted list or array. The next tier's priority then indicates the
other end of the current tier, as in the sketch below.
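
A minimal sketch of that idea, assuming a plain array kept sorted in
descending order and storing only each tier's inclusive upper bound
(the layout and names here are illustrative, not the patch's):

#include <limits.h>

struct tier {
	char name[16];
	short prio_hi;	/* inclusive upper bound of this tier's range */
};

/*
 * tiers[] is sorted by prio_hi, descending. Tier i covers
 * (tiers[i + 1].prio_hi, tiers[i].prio_hi]; the last tier extends
 * down to SHRT_MIN, so the whole priority range stays covered.
 */
static int tier_of(const struct tier *tiers, int nr, short prio)
{
	for (int i = 0; i < nr; i++) {
		int lo = (i + 1 < nr) ? tiers[i + 1].prio_hi + 1 : SHRT_MIN;

		if (prio >= lo && prio <= tiers[i].prio_hi)
			return i;
	}
	return -1;	/* unreachable while the array covers the full range */
}

Removing a tier is then just deleting its array slot; the neighbor's
range widens automatically, with no prio_st/prio_end merge bookkeeping.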

> +       union {
> +               struct plist_node list;
> +               struct list_head inactive_list;
> +       };

Is this the list of swapfiles?
Why a union, and how do you tell which field of the union is currently valid?

> +       struct kref ref;
> +} swap_tiers[MAX_SWAPTIER];
> +
> +static DEFINE_SPINLOCK(swap_tier_lock);
> +
> +/* Active swap priority list, reserved tier first, sorted in descending order */
> +static PLIST_HEAD(swap_tier_active_list);
> +/* Unused swap_tier object */
> +static LIST_HEAD(swap_tier_inactive_list);
> +
> +#define TIER_IDX(tier) ((tier) - swap_tiers)
> +
> +#define for_each_active_tier(tier) \
> +       plist_for_each_entry(tier, &swap_tier_active_list, list)
> +
> +#define for_each_tier(tier, idx) \
> +       for (idx = 0, tier = &swap_tiers[0]; idx < MAX_SWAPTIER; \
> +               idx++, tier = &swap_tiers[idx])
> +
> +static int nr_swaptiers = SWAP_TIER_RESERVED_CNT;
> +
> +#define SKIP_REPLACE_IDX       -1
> +#define NULL_TIER               -1
> +
> +/*
> + * Naming Convention:
> + *   swap_tiers_*() - Public/exported functions
> + *   swap_tier_*()  - Private/internal functions
> + */
> +
> +static bool swap_tier_replace_marked(int idx)
> +{
> +       return idx != SKIP_REPLACE_IDX;
> +}
> +
> +static bool swap_tier_is_active(void)
> +{
> +       return nr_swaptiers > SWAP_TIER_RESERVED_CNT;
> +}
> +
> +static bool swap_tier_prio_in_range(struct swap_tier *tier, short prio)
> +{
> +       if (tier->prio_st <= prio && tier->prio_end >= prio)
> +               return true;
> +
> +       return false;
> +}
> +
> +/* swap_tiers initialization */
> +void swap_tiers_init(void)
> +{
> +       struct swap_tier *tier;
> +       int idx;
> +
> +       BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER * 2);
> +
> +       for_each_tier(tier, idx) {
> +               if (idx < SWAP_TIER_RESERVED_CNT) {
> +                       /* for display fisrt */
> +                       /* for display first */
> +                       plist_add(&tier->list, &swap_tier_active_list);
> +                       kref_init(&tier->ref);
> +               } else {
> +                       INIT_LIST_HEAD(&tier->inactive_list);
> +                       list_add_tail(&tier->inactive_list, &swap_tier_inactive_list);
> +               }
> +       }
> +
> +       strscpy(swap_tiers[SWAP_TIER_DEFAULT].name, DEFAULT_TIER_NAME);

The default tier is not a real tier, so it shouldn't show up in the
swap_tiers array. It is only a wildcard that memory.swap.tiers uses
to turn tiers on or off as a group; it is a wildcard pattern, not an
actual tier.
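
For example, under the proposed cgroup interface the wildcard only
appears as the bare +/- prefix (path as in the cover letter's
example):

  # echo "- +SSD" > /sys/fs/cgroup/my_cgroup/memory.swap.tiers

i.e. disable all tiers, then enable only "SSD".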

> +#ifdef CONFIG_ZSWAP
> +       strscpy(swap_tiers[SWAP_TIER_ZSWAP].name, ZSWAP_TIER_NAME);
> +#endif
> +}
> +
> +static bool swap_tier_alive(struct swap_tier *tier)
> +{
> +       lockdep_assert_held(&swap_tier_lock);
> +       return (kref_read(&tier->ref) > 1) ? true : false;
> +}
> +
> +static void swap_tier_get(struct swap_tier *tier)
> +{
> +       lockdep_assert_held(&swap_tier_lock);
> +       kref_get(&tier->ref);
> +}
> +
> +static void swap_tier_remove(struct swap_tier *tier);
> +static void swap_tier_release(struct kref *ref)
> +{
> +       struct swap_tier *tier = container_of(ref, struct swap_tier, ref);
> +
> +       swap_tier_remove(tier);
> +}
> +
> +static void swap_tier_put(struct swap_tier *tier)
> +{
> +       lockdep_assert_held(&swap_tier_lock);
> +       kref_put(&tier->ref, swap_tier_release);
> +}
> +
> +static int swap_tier_cmp(const void *p, const void *p2)
> +{
> +       const struct tiers_desc *tp = p;
> +       const struct tiers_desc *tp2 = p2;
> +
> +       return tp2->prio_st - tp->prio_st;
> +}
> +
> +static void swap_tier_get_mask(int mask)
> +{
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       for_each_active_tier(tier) {
> +               int tier_mask = TIER_MASK(TIER_IDX(tier), TIER_FULL_MASK);
> +
> +               if (mask & tier_mask)
> +                       swap_tier_get(tier);
> +       }
> +}
> +
> +static void swap_tier_put_mask(int mask)
> +{
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       for_each_active_tier(tier) {
> +               int tier_mask = TIER_MASK(TIER_IDX(tier), TIER_FULL_MASK);
> +
> +               if (mask & tier_mask)
> +                       swap_tier_put(tier);
> +       }
> +}
> +
> +static bool swap_tier_is_default(struct swap_tier *tier)
> +{
> +       int idx = TIER_IDX(tier);
> +
> +       return idx < SWAP_TIER_RESERVED_CNT ? true : false;
> +}
> +
> +/*
> + * Check if tier priority range can be split. If input falls within a
> + * referenced range, splitting is not allowed and an error is returned.
> + * If priority start is the same but tier has no reference, mark
> + * SKIP_REPLACE_IDX to allow replacement at apply time.
> + */
> +static int swap_tier_can_split_range(struct tiers_desc *desc)
> +{
> +       struct swap_tier *tier;
> +       int ret = 0;
> +
> +       lockdep_assert_held(&swap_tier_lock);
> +       desc->tier_idx = SKIP_REPLACE_IDX;
> +
> +       for_each_active_tier(tier) {
> +               if (swap_tier_is_default(tier))
> +                       continue;
> +
> +               if (tier->prio_st > desc->prio_st)
> +                       continue;
> +
> +               /* If start priorities match, replace tier name */
> +               if (tier->prio_st == desc->prio_st)
> +                       desc->tier_idx = TIER_IDX(tier);
> +
> +               if (swap_tier_alive(tier))
> +                       ret = -EBUSY;
> +
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +static void swap_tier_prepare(struct tiers_desc *desc)
> +{
> +       struct swap_tier *tier;
> +
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       if (WARN_ON(list_empty(&swap_tier_inactive_list)))
> +               return;
> +
> +       if (swap_tier_replace_marked(desc->tier_idx))
> +               return;
> +
> +       tier = list_first_entry(&swap_tier_inactive_list,
> +               struct swap_tier, inactive_list);
> +
> +       list_del(&tier->inactive_list);
> +       strscpy(tier->name, desc->name, MAX_TIERNAME);
> +       tier->prio_st = desc->prio_st;
> +       plist_node_init(&tier->list, -tier->prio_st);
> +       kref_init(&tier->ref);
> +
> +       plist_add(&tier->list, &swap_tier_active_list);
> +       nr_swaptiers++;
> +}
> +
> +static void swap_tier_apply(struct tiers_desc *desc, int nr)
> +{
> +       struct plist_node *pnode;
> +       struct swap_info_struct *p = NULL;
> +       struct swap_tier *tier;
> +       int prio_end = SHRT_MAX;
> +
> +       lockdep_assert_held(&swap_lock);
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       for (int i = 0; i < nr; i++) {
> +               int idx = desc[i].tier_idx;
> +
> +               if (swap_tier_replace_marked(idx))
> +                       strscpy(swap_tiers[idx].name, desc[i].name, MAX_TIERNAME);
> +       }
> +
> +       for_each_active_tier(tier) {
> +               if (swap_tier_is_default(tier))
> +                       continue;
> +
> +               tier->prio_end = prio_end;
> +               prio_end = tier->prio_st - 1;
> +       }
> +
> +       /* TODO: use swapfiles ?*/
> +       if (plist_head_empty(&swap_active_head))
> +               return;
> +
> +       for_each_active_tier(tier) {
> +               plist_for_each_continue(pnode, &swap_active_head) {
> +                       p = container_of(pnode, struct swap_info_struct, list);
> +                       if (p->tier_idx != NULL_TIER)
> +                               continue;
> +                       if (!swap_tier_prio_in_range(tier, p->prio))
> +                               break;
> +                       swap_tier_get(tier);
> +               }
> +       }
> +}
> +
> +int swap_tiers_add(struct tiers_desc desc[], int nr)
> +{
> +       struct swap_tier *tier;
> +       int ret = 0;
> +
> +       if (nr <= 0)
> +               return -EINVAL;
> +
> +       sort(desc, nr, sizeof(desc[0]), swap_tier_cmp, NULL);
> +
> +       for (int i = 0; i < nr; i++) {
> +               if (i > 0 && desc[i].prio_st == desc[i-1].prio_st)
> +                       return -EINVAL;
> +
> +               for (int j = i+1; j < nr; j++) {
> +                       if (!strcmp(desc[i].name, desc[j].name))
> +                               return -EINVAL;
> +               }
> +
> +               for_each_active_tier(tier) {
> +                       if (!strcmp(desc[i].name, tier->name))
> +                               return -EALREADY;
> +               }
> +       }
> +
> +       spin_lock(&swap_lock);
> +       spin_lock(&swap_tier_lock);
> +
> +       /* Tier set must cover all priorities */
> +       if (!swap_tier_is_active() && desc[nr-1].prio_st != DEF_SWAP_PRIO)
> +               desc[nr-1].prio_st = DEF_SWAP_PRIO;
> +
> +       if (nr + nr_swaptiers > MAX_SWAPTIER) {
> +               ret = -ERANGE;
> +               goto out;
> +       }
> +
> +       for (int i = 0; i < nr; i++) {
> +               ret = swap_tier_can_split_range(&desc[i]);
> +               if (ret)
> +                       goto out;
> +       }
> +
> +       for (int i = 0; i < nr; i++)
> +               swap_tier_prepare(&desc[i]);
> +
> +       swap_tier_apply(desc, nr);
> +out:
> +       spin_unlock(&swap_tier_lock);
> +       spin_unlock(&swap_lock);
> +
> +       return ret;
> +}
> +
> +static void swap_tier_remove(struct swap_tier *tier)
> +{
> +       bool least_prio_merge = false;
> +       struct plist_node *next;
> +       struct swap_tier *merge_tier;
> +
> +       lockdep_assert_held(&swap_tier_lock);
> +
> +       if (tier->prio_st == DEF_SWAP_PRIO) {
> +               least_prio_merge = true;
> +               next = plist_prev(&tier->list);
> +       } else
> +               next = plist_next(&tier->list);
> +
> +       plist_del(&tier->list, &swap_tier_active_list);
> +       INIT_LIST_HEAD(&tier->inactive_list);
> +       list_add_tail(&tier->inactive_list, &swap_tier_inactive_list);
> +       nr_swaptiers--;
> +
> +       if (nr_swaptiers == SWAP_TIER_RESERVED_CNT)
> +               return;
> +
> +       merge_tier = container_of(next, struct swap_tier, list);
> +       /*
> +        * When the first tier is removed, the rule that all priorities
> +        * must be covered should be maintained. The next tier inherits the
> +        * start value of the removed first tier.
> +        */
> +       if (least_prio_merge)
> +               merge_tier->prio_st = DEF_SWAP_PRIO;
> +       else
> +               merge_tier->prio_end = tier->prio_end;
> +}
> +
> +int swap_tiers_remove(struct tiers_desc desc[], int nr)
> +{
> +       int cnt = 0;
> +       int ret = 0;
> +       struct swap_tier *tier;
> +
> +       for (int i = 0; i < nr; i++) {
> +               for (int j = i+1; j < nr; j++) {
> +                       if (!strcmp(desc[i].name, desc[j].name))
> +                               return -EINVAL;
> +               }
> +       }
> +
> +       spin_lock(&swap_tier_lock);
> +       if (nr <= 0 || nr > nr_swaptiers) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       for (int i = 0; i < nr; i++) {
> +               for_each_active_tier(tier) {
> +                       if (swap_tier_is_default(tier))
> +                               continue;
> +
> +                       if (!strcmp(desc[i].name, tier->name)) {
> +                               if (swap_tier_alive(tier)) {
> +                                       ret = -EBUSY;
> +                                       goto out;
> +                               }
> +                               desc[i].tier_idx = TIER_IDX(tier);
> +                               cnt++;
> +                       }
> +               }
> +       }
> +
> +       if (cnt != nr) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       for (int i = 0; i < nr; i++)
> +               swap_tier_put(&swap_tiers[desc[i].tier_idx]);
> +
> +out:
> +       spin_unlock(&swap_tier_lock);
> +       return ret;
> +}
> +
> +ssize_t swap_tiers_show_sysfs(char *buf)
> +{
> +       struct swap_tier *tier;
> +       ssize_t len = 0;
> +
> +       spin_lock(&swap_tier_lock);
> +       len += sysfs_emit_at(buf, len, "%-16s %-5s %-11s %-11s\n",
> +                        "Name", "Idx", "PrioStart", "PrioEnd");
> +       for_each_active_tier(tier) {
> +               if (swap_tier_is_default(tier))
> +                       len += sysfs_emit_at(buf, len, "%-16s %-5ld\n",
> +                                            tier->name,
> +                                            TIER_IDX(tier));
> +               else {
> +                       len += sysfs_emit_at(buf, len, "%-16s %-5ld %-11d %-11d\n",
> +                                            tier->name,
> +                                            TIER_IDX(tier),
> +                                            tier->prio_st,
> +                                            tier->prio_end);
> +               }
> +               if (len >= PAGE_SIZE)
> +                       break;
> +       }
> +       spin_unlock(&swap_tier_lock);
> +
> +       return len;
> +}
> +
> +static void swap_tier_show_mask(struct seq_file *m, int memcg_mask)
> +{
> +       struct swap_tier *tier;
> +       int idx;
> +
> +       for_each_active_tier(tier) {
> +               idx = TIER_IDX(tier);
> +               int tier_on_mask = TIER_MASK(idx, TIER_ON_MASK);
> +               int tier_off_mask = TIER_MASK(idx, TIER_OFF_MASK);
> +
> +               if (memcg_mask & tier_on_mask)
> +                       seq_printf(m, "+%s ", tier->name);
> +               else if (memcg_mask & tier_off_mask)
> +                       seq_printf(m, "-%s ", tier->name);
> +       }
> +       seq_puts(m, "\n");
> +}
> +
> +static bool swap_tier_has_default(int mask)
> +{
> +       return mask & DEFAULT_FULL_MASK;
> +}
> +
> +/*
> + * TODO: Simplify tier mask collection design if possible.
> + *
> + * Current approach uses two separate fields per cgroup:
> + *   - tiers_onoff: 2-bit per tier to indicate on/off state
> + *   - tiers_mask:  2-bit per tier to indicate if the tier is configured
> + *
> + * Trade-off to consider:
> + *   - Pre-compute at cgroup setup time (faster runtime, complex invalidation)
> + *   - Keep runtime calculation (simpler, but repeated parent walks)
> + */
> +static int swap_tier_collect_mask(struct mem_cgroup *memcg)
> +{
> +       struct mem_cgroup *p;
> +       int onoff, mask, merge, new;
> +
> +       if (!memcg)
> +               return DEFAULT_ON_MASK;
> +
> +       onoff = memcg->tiers_onoff;
> +       mask = memcg->tiers_mask;
> +
> +       rcu_read_lock();
> +       for (p = parent_mem_cgroup(memcg); p && !swap_tier_has_default(onoff);
> +               p = parent_mem_cgroup(p)) {
> +               merge = mask | p->tiers_mask;
> +               new = merge ^ mask;
> +               onoff |= p->tiers_onoff & new;
> +               mask = merge;
> +       }
> +       rcu_read_unlock();
> +
> +       return onoff;
> +}
> +
> +int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg)
> +{
> +       int onoff;
> +
> +       onoff = swap_tier_collect_mask(memcg);
> +
> +       /* Make common case fast */
> +       if (onoff & DEFAULT_ON_MASK)
> +               return onoff;
> +       /*
> +        * Root memcg has DEFAULT_ON_MASK; defaults are covered.
> +        * Checking DEFAULT_OFF_MASK suffices; only each tier's ON bit is checked.
> +        * ON flag is inverted and compared with each swap device's OFF mask.
> +        */
> +       return ~(onoff << 1);
> +}
> +
> +void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg)
> +{
> +       spin_lock(&swap_tier_lock);
> +       if (memcg->tiers_onoff)
> +               swap_tier_show_mask(m, memcg->tiers_onoff);
> +       else
> +               seq_puts(m, "\n");
> +       swap_tier_show_mask(m, swap_tier_collect_mask(memcg));
> +       spin_unlock(&swap_tier_lock);
> +}
> +
> +void swap_tiers_assign(struct swap_info_struct *swp)
> +{
> +       struct swap_tier *tier;
> +
> +       spin_lock(&swap_tier_lock);
> +       swp->tier_idx = NULL_TIER;
> +
> +       for_each_active_tier(tier) {
> +               if (swap_tier_is_default(tier))
> +                       continue;
> +               if (swap_tier_prio_in_range(tier, swp->prio)) {
> +                       swp->tier_idx = TIER_IDX(tier);
> +                       swap_tier_get(tier);
> +                       break;
> +               }
> +       }
> +       spin_unlock(&swap_tier_lock);
> +}
> +
> +void swap_tiers_release(struct swap_info_struct *swp)
> +{
> +       spin_lock(&swap_tier_lock);
> +       if (swp->tier_idx != NULL_TIER)
> +               swap_tier_put(&swap_tiers[swp->tier_idx]);
> +       spin_unlock(&swap_tier_lock);
> +}
> +
> +/* not incremental, but reset. */
> +int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg)

Who is using this function? I can't find the user of this function in
this patch.
Please introduce this function together with the patch that uses it.
Don't introduce a function without a user.


> +{
> +       struct swap_tier *tier;
> +       int ret = 0;
> +       int tiers_mask = 0;
> +       int tiers_onoff = 0;
> +       int cnt = 0;
> +
> +       for (int i = 0; i < nr; i++) {
> +               for (int j = i+1; j < nr; j++) {
> +                       if (!strcmp(desc[i].name, desc[j].name))

These two nested for loops look suspicious. Again, because I don't see
the caller, I can't reason about what they are doing here.

Chris

> +                               return -EINVAL;
> +               }
> +       }
> +
> +       spin_lock(&swap_tier_lock);
> +       /* nr 0 is allowed for deletion */
> +       if (nr < 0 || nr > nr_swaptiers) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       for (int i = 0; i < nr; i++) {
> +               for_each_active_tier(tier) {
> +                       if (!strcmp(desc[i].name, tier->name)) {
> +                               tiers_mask |= TIER_MASK(TIER_IDX(tier), TIER_FULL_MASK);
> +                               tiers_onoff |= TIER_MASK(TIER_IDX(tier), desc[i].ops);
> +                               cnt++;
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       if (cnt != nr) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       swap_tier_put_mask(memcg->tiers_mask);
> +       swap_tier_get_mask(tiers_mask);
> +       memcg->tiers_mask = tiers_mask;
> +       memcg->tiers_onoff = tiers_onoff;
> +out:
> +       spin_unlock(&swap_tier_lock);
> +       return ret;
> +}
> +
> +void swap_tiers_put_mask(struct mem_cgroup *memcg)
> +{
> +       spin_lock(&swap_tier_lock);
> +       swap_tier_put_mask(memcg->tiers_mask);
> +       spin_unlock(&swap_tier_lock);
> +}
> diff --git a/mm/swap_tier.h b/mm/swap_tier.h
> new file mode 100644
> index 000000000000..f4b4e385638a
> --- /dev/null
> +++ b/mm/swap_tier.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SWAP_TIER_H
> +#define _SWAP_TIER_H
> +
> +#include "swap.h"
> +#include <linux/memcontrol.h>
> +
> +#ifdef CONFIG_SWAP_TIER
> +
> +enum swap_tiers_index {
> +       SWAP_TIER_DEFAULT,
> +#ifdef CONFIG_ZSWAP
> +       SWAP_TIER_ZSWAP,
> +#endif
> +       SWAP_TIER_RESERVED_CNT,
> +};
> +
> +#define MAX_TIERNAME                   16
> +#define MAX_SWAPTIER                   12
> +
> +#define TIER_MASK(idx, mask)           ((mask) << (idx) * 2)
> +#define TIER_ON_MASK                   (1 << 0)
> +#define TIER_OFF_MASK                  (1 << 1)
> +#define TIER_FULL_MASK                 (TIER_ON_MASK | TIER_OFF_MASK)
> +
> +#define DEFAULT_ON_MASK                        TIER_MASK(SWAP_TIER_DEFAULT, TIER_ON_MASK)
> +#define DEFAULT_OFF_MASK               TIER_MASK(SWAP_TIER_DEFAULT, TIER_OFF_MASK)
> +#define DEFAULT_FULL_MASK              (DEFAULT_ON_MASK | DEFAULT_OFF_MASK)
> +
> +#define DEFAULT_TIER_NAME              ""
> +#define ZSWAP_TIER_NAME                        "ZSWAP"
> +
> +struct tiers_desc {
> +       char name[MAX_TIERNAME];
> +       union {
> +               signed short prio_st;
> +               signed short ops;
> +       };
> +       int tier_idx;
> +};
> +
> +void swap_tiers_init(void);
> +int swap_tiers_add(struct tiers_desc desc[], int nr);
> +int swap_tiers_remove(struct tiers_desc desc[], int nr);
> +ssize_t swap_tiers_show_sysfs(char *buf);
> +void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg);
> +void swap_tiers_assign(struct swap_info_struct *swp);
> +void swap_tiers_release(struct swap_info_struct *swp);
> +int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg);
> +void swap_tiers_put_mask(struct mem_cgroup *memcg);
> +static inline bool swap_tiers_test_off(int tier_idx, int mask)
> +{
> +       return TIER_MASK(tier_idx, TIER_OFF_MASK) & mask;
> +}
> +int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg);
> +#else
> +static inline void swap_tiers_init(void)
> +{
> +}
> +static inline void swap_tiers_assign(struct swap_info_struct *swp)
> +{
> +}
> +static inline void swap_tiers_release(struct swap_info_struct *swp)
> +{
> +}
> +static inline bool swap_tiers_test_off(int tier_idx, int mask)
> +{
> +       return false;
> +}
> +static inline int swap_tiers_collect_compare_mask(struct mem_cgroup *memcg)
> +{
> +       return 0;
> +}
> +#endif
> +#endif
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3bb77c9a4879..a5c90e419ff3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -63,7 +63,7 @@ static void move_cluster(struct swap_info_struct *si,
>                          struct swap_cluster_info *ci, struct list_head *list,
>                          enum swap_cluster_flags new_flags);
>
> -static DEFINE_SPINLOCK(swap_lock);
> +DEFINE_SPINLOCK(swap_lock);
>  static unsigned int nr_swapfiles;
>  atomic_long_t nr_swap_pages;
>  /*
> @@ -74,7 +74,6 @@ atomic_long_t nr_swap_pages;
>  EXPORT_SYMBOL_GPL(nr_swap_pages);
>  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>  long total_swap_pages;
> -#define DEF_SWAP_PRIO  -1
>  unsigned long swapfile_maximum_size;
>  #ifdef CONFIG_MIGRATION
>  bool swap_migration_ad_supported;
> @@ -89,7 +88,7 @@ static const char Unused_offset[] = "Unused swap offset entry ";
>   * all active swap_info_structs
>   * protected with swap_lock, and ordered by priority.
>   */
> -static PLIST_HEAD(swap_active_head);
> +PLIST_HEAD(swap_active_head);
>
>  /*
>   * all available (active, not full) swap_info_structs
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
  2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
                     ` (2 preceding siblings ...)
  2025-11-10 13:26   ` kernel test robot
@ 2025-11-12 14:44   ` Chris Li
  2025-11-13  4:07     ` YoungJun Park
  3 siblings, 1 reply; 25+ messages in thread
From: Chris Li @ 2025-11-12 14:44 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

It seems you should introduce the tiers first. Allow users to define tiers.
Then the follow up patches use tiers defined here.

The patch order seems reversed to me.

See some feedback below, to be continued.

Chris


On Sun, Nov 9, 2025 at 4:50 AM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Integrate the swap tier infrastructure into the existing swap subsystem
> to enable selective swap device usage based on tier configuration.
>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
>  mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++
>  mm/page_io.c    | 21 ++++++++++-
>  mm/swap_state.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c   | 15 ++++++--
>  4 files changed, 194 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfc986da3289..33c7cc069754 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -68,6 +68,7 @@
>  #include <net/ip.h>
>  #include "slab.h"
>  #include "memcontrol-v1.h"
> +#include "swap_tier.h"
>
>  #include <linux/uaccess.h>
>
> @@ -3730,6 +3731,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
>         lru_gen_exit_memcg(memcg);
>         memcg_wb_domain_exit(memcg);
> +       swap_tiers_put_mask(memcg);
>         __mem_cgroup_free(memcg);
>  }
>
> @@ -3842,6 +3844,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>                 page_counter_init(&memcg->kmem, &parent->kmem, false);
>                 page_counter_init(&memcg->tcpmem, &parent->tcpmem, false);
>  #endif
> +#ifdef CONFIG_SWAP_TIER
> +               memcg->tiers_mask = 0;
> +               memcg->tiers_onoff = 0;
> +#endif
> +
>         } else {
>                 init_memcg_stats();
>                 init_memcg_events();
> @@ -3850,6 +3857,10 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  #ifdef CONFIG_MEMCG_V1
>                 page_counter_init(&memcg->kmem, NULL, false);
>                 page_counter_init(&memcg->tcpmem, NULL, false);
> +#endif
> +#ifdef CONFIG_SWAP_TIER
Again, don't need this config.

> +               memcg->tiers_mask = DEFAULT_FULL_MASK;

Is this  memcg->tiers_mask a cached value after evaluating the
swap.tiers onoff list by looking up the parent?

I was thinking of starting with always evaluating the tiers_mask. Then
we don't need to store it here. How do you indicate the tiers_mask is
out of date?

> +               memcg->tiers_onoff = DEFAULT_ON_MASK;
>  #endif
>                 root_mem_cgroup = memcg;
>                 return &memcg->css;
> @@ -5390,6 +5401,56 @@ static int swap_events_show(struct seq_file *m, void *v)
>         return 0;
>  }
>
> +#ifdef CONFIG_SWAP_TIER
> +static int swap_tier_show(struct seq_file *m, void *v)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +       swap_tiers_show_memcg(m, memcg);
> +       return 0;
> +}
> +
> +static ssize_t swap_tier_write(struct kernfs_open_file *of,
> +                               char *buf, size_t nbytes, loff_t off)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       struct tiers_desc desc[MAX_SWAPTIER] = {};
> +       char *pos = buf, *token;
> +       int nr = 0;
> +       int ret;
> +
> +       while ((token = strsep(&pos, " \t\n")) != NULL) {

Not allowing plain space " "? Comparing the pointer against NULL is redundant.

> +               if (!*token)
> +                       continue;
> +
> +               if (nr >= MAX_SWAPTIER)
> +                       return -E2BIG;
> +
> +               if (token[0] != '+' && token[0] != '-')
> +                       return -EINVAL;
> +
> +               desc[nr].ops = (token[0] == '+') ? TIER_ON_MASK : TIER_OFF_MASK;
> +
> +               if (strlen(token) <= 1) {
> +                       strscpy(desc[nr].name, DEFAULT_TIER_NAME);
> +                       nr++;
> +                       continue;
> +               }
> +
> +               if (strscpy(desc[nr].name, token + 1, MAX_TIERNAME) < 0)
> +                       return -EINVAL;
> +
> +               nr++;
I don't think you need this nr; you will reach the end of the
string anyway. What if the user specifies the same tier more than
once? It is not optimal, but the kernel should accept it.

OK, I see what is going on now: this whole desc thing can be greatly
simplified. You shouldn't need to maintain desc[nr]; that desc
array is the onoff mask in my mind. You just need to keep the tier
bits in order.

Notice that in memory.swap.tiers, except for the default tier pattern,
which is always the first one if it exists, the order of the + and -
tiers does not matter. You look up the tier name to find its tier mask
bit and just set the onoff bits for that tier.
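
Roughly what I have in mind, as an untested sketch (swap_tier_lookup()
is a hypothetical helper that maps a tier name to its index under
swap_tier_lock):

	int tiers_onoff = 0;

	while ((token = strsep(&pos, " \t\n"))) {
		int op, idx;

		if (!*token)
			continue;
		if (token[0] != '+' && token[0] != '-')
			return -EINVAL;
		op = (token[0] == '+') ? TIER_ON_MASK : TIER_OFF_MASK;
		idx = swap_tier_lookup(token + 1); /* "" maps to the default tier */
		if (idx < 0)
			return -EINVAL;
		/* The last mention of a tier wins, so duplicates are harmless. */
		tiers_onoff &= ~TIER_MASK(idx, TIER_FULL_MASK);
		tiers_onoff |= TIER_MASK(idx, op);
	}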

> +       }
> +
> +       ret = swap_tiers_get_mask(desc, nr, memcg);
> +       if (ret)
> +               return ret;
> +
> +       return nbytes;
> +}
> +#endif
> +
>  static struct cftype swap_files[] = {
>         {
>                 .name = "swap.current",
> @@ -5422,6 +5483,14 @@ static struct cftype swap_files[] = {
>                 .file_offset = offsetof(struct mem_cgroup, swap_events_file),
>                 .seq_show = swap_events_show,
>         },
> +#ifdef CONFIG_SWAP_TIER
> +       {
> +               .name = "swap.tiers",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = swap_tier_show,
> +               .write = swap_tier_write,
> +       },
> +#endif
>         { }     /* terminate */
>  };
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 3c342db77ce3..2b3b1154a169 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -26,6 +26,7 @@
>  #include <linux/delayacct.h>
>  #include <linux/zswap.h>
>  #include "swap.h"
> +#include "swap_tier.h"
>
>  static void __end_swap_bio_write(struct bio *bio)
>  {
> @@ -233,6 +234,24 @@ static void swap_zeromap_folio_clear(struct folio *folio)
>         }
>  }
>
> +#if defined(CONFIG_SWAP_TIER) && defined(CONFIG_ZSWAP)
> +static bool folio_swap_tier_zswap_test_off(struct folio *folio)
> +{
> +       struct mem_cgroup *memcg;
> +
> +       memcg = folio_memcg(folio);
> +       if (memcg)
> +               return swap_tier_test_off(memcg->tiers_mask,
> +                       TIER_MASK(SWAP_TIER_ZSWAP, TIER_ON_MASK));
> +
> +       return false;
> +}
> +#else
> +static bool folio_swap_tier_zswap_test_off(struct folio *folio)
> +{
> +       return false;
> +}
> +#endif
>  /*
>   * We may have stale swap cache pages in memory: notice
>   * them here and get rid of the unnecessary final write.
> @@ -272,7 +291,7 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
>          */
>         swap_zeromap_folio_clear(folio);
>
> -       if (zswap_store(folio)) {
> +       if (folio_swap_tier_zswap_test_off(folio) || zswap_store(folio)) {
>                 count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
>                 goto out_unlock;
>         }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3f85a1c4cfd9..2e5f65ff2479 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -25,6 +25,7 @@
>  #include "internal.h"
>  #include "swap_table.h"
>  #include "swap.h"
> +#include "swap_tier.h"
>
>  /*
>   * swapper_space is a fiction, retained to simplify the path through
> @@ -836,8 +837,100 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
>  }
>  static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
>
> +#ifdef CONFIG_SWAP_TIER
> +static ssize_t tiers_show(struct kobject *kobj,
> +                                    struct kobj_attribute *attr, char *buf)
> +{
> +       return swap_tiers_show_sysfs(buf);
> +}
> +
> +static ssize_t tiers_store(struct kobject *kobj,
> +                               struct kobj_attribute *attr,
> +                               const char *buf, size_t count)
> +{
> +       struct tiers_desc desc[MAX_SWAPTIER] = {};
> +       int nr = 0;
> +       char *data, *p, *token;
> +       int ret = 0;
> +       bool is_add = true;
> +
> +       if (!count)
> +               return -EINVAL;
> +
> +       data = kmemdup_nul(buf, count, GFP_KERNEL);
> +       if (!data)
> +               return -ENOMEM;
> +
> +       p = data;
> +
> +       if (*p == '+')
> +               p++;
> +       else if (*p == '-') {
> +               is_add = false;
> +               p++;
> +       } else
> +               return -EINVAL;
> +
> +       while ((token = strsep(&p, ", \t\n")) != NULL) {
> +               if (!*token)
> +                       continue;
> +
> +               if (nr >= MAX_SWAPTIER) {
> +                       ret = -E2BIG;
> +                       goto out;
> +               }
> +
> +               if (is_add) {
> +                       char *name, *prio_str;
> +                       int prio;
> +
> +                       name = strsep(&token, ":");
> +                       prio_str = token;
> +
> +                       if (!name || !prio_str || !*name || !*prio_str) {
> +                               ret = -EINVAL;
> +                               goto out;
> +                       }
> +
> +                       if (strscpy(desc[nr].name, name, MAX_TIERNAME) < 0) {
> +                               ret = -EINVAL;
> +                               goto out;
> +                       }
> +
> +                       if (kstrtoint(prio_str, 10, &prio)) {
> +                               ret = -EINVAL;
> +                               goto out;
> +                       }
> +
> +                       desc[nr].prio_st = prio;
> +               } else {
> +                       if (strscpy(desc[nr].name, token, MAX_TIERNAME) < 0) {
> +                               ret = -EINVAL;
> +                               goto out;
> +                       }
> +                       desc[nr].prio_st = 0;
> +               }
> +               nr++;
> +       }
> +
> +       if (is_add)
> +               ret = swap_tiers_add(desc, nr);
> +       else
> +               ret = swap_tiers_remove(desc, nr);
> +
> +out:
> +       kfree(data);
> +       return ret ? ret : count;
> +}
> +
> +static struct kobj_attribute tier_attr = __ATTR_RW(tiers);
> +#endif
> +
>  static struct attribute *swap_attrs[] = {
>         &vma_ra_enabled_attr.attr,
> +#ifdef CONFIG_SWAP_TIER
> +       &tier_attr.attr,
> +#endif
>         NULL,
>  };
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a5c90e419ff3..8715a2d94140 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -49,6 +49,7 @@
>  #include "swap_table.h"
>  #include "internal.h"
>  #include "swap.h"
> +#include "swap_tier.h"
>
>  static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
>                                  unsigned char);
> @@ -1296,7 +1297,8 @@ static bool get_swap_device_info(struct swap_info_struct *si)
>
>  /* Rotate the device and switch to a new cluster */
>  static void swap_alloc_entry(swp_entry_t *entry,
> -                           int order)
> +                           int order,
> +                           int mask)
>  {
>         unsigned long offset;
>         struct swap_info_struct *si, *next;
> @@ -1304,6 +1306,8 @@ static void swap_alloc_entry(swp_entry_t *entry,
>         spin_lock(&swap_avail_lock);
>  start_over:
>         plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
> +               if (swap_tiers_test_off(si->tier_idx, mask))
> +                       continue;
>                 /* Rotate the device and switch to a new cluster */
>                 plist_requeue(&si->avail_list, &swap_avail_head);
>                 spin_unlock(&swap_avail_lock);
> @@ -1376,6 +1380,7 @@ int folio_alloc_swap(struct folio *folio)
>  {
>         unsigned int order = folio_order(folio);
>         unsigned int size = 1 << order;
> +       int mask;
>         swp_entry_t entry = {};
>
>         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> @@ -1400,8 +1405,8 @@ int folio_alloc_swap(struct folio *folio)
>         }
>
>  again:
> -       swap_alloc_entry(&entry, order);
> -
> +       mask = swap_tiers_collect_compare_mask(folio_memcg(folio));
> +       swap_alloc_entry(&entry, order, mask);
>         if (unlikely(!order && !entry.val)) {
>                 if (swap_sync_discard())
>                         goto again;
> @@ -2673,6 +2678,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
>
>         /* Add back to available list */
>         add_to_avail_list(si, true);
> +
> +       swap_tiers_assign(si);
>  }
>
>  static void enable_swap_info(struct swap_info_struct *si, int prio,
> @@ -2840,6 +2847,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>         spin_lock(&swap_lock);
>         spin_lock(&p->lock);
>         drain_mmlist();
> +       swap_tiers_release(p);
>
>         swap_file = p->swap_file;
>         p->swap_file = NULL;
> @@ -4004,6 +4012,7 @@ static int __init swapfile_init(void)
>                 swap_migration_ad_supported = true;
>  #endif /* CONFIG_MIGRATION */
>
> +       swap_tiers_init();
>         return 0;
>  }
>  subsys_initcall(swapfile_init);
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-12 13:34 ` [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
@ 2025-11-13  1:33   ` YoungJun Park
  0 siblings, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-11-13  1:33 UTC (permalink / raw)
  To: Chris Li
  Cc: akpm, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

On Wed, Nov 12, 2025 at 05:34:05AM -0800, Chris Li wrote:

Hello Chris :)

> Thanks for the patches. I notice that your cover letter does not have
> [0/3] on it. One tool I found useful is using the b4 to send out
> patches in series. Just for your consideration, it is not an ask. I
> can review patches not sent out from b4 just fine.

I manually edited the cover letter title, but made a human error.
Thanks for the tip.
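
For what it's worth, letting the tooling number the cover letter should
avoid this kind of mistake in the future; for example, with plain git,

  $ git format-patch --cover-letter -v2 -3 --subject-prefix="RFC PATCH"
  $ git send-email v2-*.patch

generates the cover letter as [RFC PATCH v2 0/3] automatically.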
 
> On Sun, Nov 9, 2025 at 4:50 AM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > Hi all,
> >
> > In constrained environments, there is a need to improve workload
> > performance by controlling swap device usage on a per-process or
> > per-cgroup basis. For example, one might want to direct critical
> > processes to faster swap devices (like SSDs) while relegating
> > less critical ones to slower devices (like HDDs or Network Swap).
> >
> > Initial approach was to introduce a per-cgroup swap priority
> > mechanism [1]. However, through review and discussion, several
> > drawbacks were identified:
> >
> > a. There is a lack of concrete use cases for assigning a fine-grained,
> >    unique swap priority to each cgroup.
> > b. The implementation complexity was high relative to the desired
> >    level of control.
> > c. Differing swap priorities between cgroups could lead to LRU
> >    inversion problems.
> >
> > To address these concerns, I propose the "swap tiers" concept,
> > originally suggested by Chris Li [2] and further developed through
> > collaborative discussions. I would like to thank Chris Li and
> > He Baoquan for their invaluable contributions in refining this
> > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their
> > insightful reviews of earlier RFC versions.
> >
> > Concept
> > -------
> > A swap tier is a grouping mechanism that assigns a "named id" to a
> > range of swap priorities. For example, all swap devices with a
> > priority of 100 or higher could be grouped into a tier named "SSD",
> > and all others into a tier named "HDD".
> >
> > Cgroups can then select which named tiers they are permitted to use for
> > swapping via a new cgroup interface. This effectively restricts a
> > cgroup's swap activity to a specific subset of the available swap
> > devices.
> >
> > Proposed Interface
> > ------------------
> > 1. Global Tier Definition: /sys/kernel/mm/swap/tiers
> >
> > This file is used to define the global swap tiers and their associated
> > minimum priority levels.
> >
> > - To add tiers:
> >   Format: + 'tier_name':'prio'[,|' ']'tier_name 2':'prio']...
> >   Example:
> >   # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
> 
> I think a lot of the documentation-style content of this cover letter
> should move into a kernel documentation commit. Maybe
> Documentation/mm/swap_tiers.rst

I will create a Documentation file based on what is mentioned here.

> Another suggestion is to use a "+SSD:100,+HDD:2,-SD" kind of flavor,
> similar to the "cgroup.subtree_control" interface, which allows enabling
> and disabling in the same write. That way you can add and remove in a
> one-line action.

Your suggested format is more familiar. I have no objections and will
change it accordingly.
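
So, assuming the final syntax stays close to your example, adding and
removing in one write would look like:

  # echo "+SSD:100,+HDD:2,-SD" > /sys/kernel/mm/swap/tiers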

> >
> >   There are several rules for defining tiers:
> >   - Priority ranges for tiers must not overlap.
> 
> We can add that we suggest allocating a higher priority range for
> faster swap devices. That way more swap page faults will likely be
> served by faster swap devices.

It would be good to explicitly state this in the Documentation.
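
For example, the Documentation could carry a setup note along these
lines (device paths and priorities are arbitrary, for illustration):

  # swapon -p 100 /dev/nvme0n1p2   # fast device, falls into the "SSD" tier
  # swapon -p 10 /dev/sda2         # slow device, falls into the "HDD" tier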

> >   - The combination of all defined tiers must cover the entire valid
> >     priority range (DEF_SWAP_PRIO to SHRT_MAX) to ensure every swap device
> >     can be assigned to a tier.
> >   - A tier's prio value is its inclusive lower bound,
> >     covering priorities up to the next tier's prio.
> >     The highest tier extends to SHRT_MAX, and the lowest tier extends to DEF_SWAP_PRIO.
> >   - If the specified tiers do not cover the entire priority range,
> >     the priority of the tier with the lowest specified priority value
> >     is set to SHRT_MIN
> >   - The total number of tiers is limited.
> >
> > - To remove tiers:
> >   Format: - 'tier_name'[,|' ']'tier_name2']...
> >   Example:
> >   # echo "- SSD,HDD" > /sys/kernel/mm/swap/tiers
> 
> See above, make the '-SSD, -HDD' similar to the "cgroup.subtree_control"

Ack, as I said in the comment above. Thanks for the suggestion again.

> >   Note: A tier cannot be removed if it is currently in use by any
> >   cgroup or if any active swap device is assigned to it. This acts as
> >   a reference count to prevent disruption.
> >
> > - To show current tiers:
> >   Reading the file displays the currently configured tiers, their
> >   internal index, and the priority range they cover.
> >   Example:
> >   # echo "+ SSD:100,HDD:2" > /sys/kernel/mm/swap/tiers
> >   # cat /sys/kernel/mm/swap/tiers
> >   Name      Idx   PrioStart   PrioEnd
> >             0
> >   SSD       1    100         32767
> >   HDD       2     -1         99
> >
> >   - `Name`: The name of the tier. The unnamed entry is a default tier.
> >   - `Idx`: The internal index assigned to the tier.
> >   - `PrioStart`: The starting priority of the range covered by this tier.
> >   - `PrioEnd`: The ending priority of the range covered by this tier.
> >
> > Two special tiers are predefined:
> > - "": Represents the default inheritance behavior in cgroups.
> This belongs to the memory.swap.tiers section.
> "" is not a real tier name. It is just a wildcard to refer to all tiers.

I will manage it separately as a logical tier that is not exposed to
users, and also handle it at the code level.

> > - "zswap": Reserved for zswap integration.
>
> One thing I realize is that we might need each swap tier to have
> a matching zswap tier. Otherwise, when we refer to zswap, there is no
> way for the cgroup to select which backing swapfile this zswap
> uses for allocating the swap entry.

From the perspective of per-cgroup swap control,
if a ZSWAP tier is assigned and a cgroup selects that tier,
that selection determines whether zswap is used or not.

However, since the zswap backend does not know which tier it is linked
to, there could be a mismatch between the zswap tier (1), the backend
storage (2), and possibly another layer (3).  
This could lead to a contradiction where the cgroup-selected tier may
or may not correspond to the actual backend tier.  
Is this the correct understanding?

> We can avoid this complexity by providing a dedicated ghost swapfile,
> which only zswap can use to allocate swap entries.

From what I understood when you previously mentioned this concept,
the “ghost swapfile” is not a real swap device.
It exists conceptually so that zswap can operate as if there were a swap
device, while in reality only compressed swap entries are managed by
zswap itself. (Today zswap needs an actual swap device to back its
compressed entries.)

Considering both points above, could you please clarify the intended
direction?  

Are you suggesting removing the zswap tier entirely, or defining a
specific way to manage it?  

I would appreciate a bit more explanation on how you envision the zswap
tier being handled.

Best Regards,
Youngjun Park


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] mm: swap: introduce swap tier infrastructure
  2025-11-12 14:20   ` Chris Li
@ 2025-11-13  2:01     ` YoungJun Park
  0 siblings, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-11-13  2:01 UTC (permalink / raw)
  To: Chris Li
  Cc: akpm, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

On Wed, Nov 12, 2025 at 06:20:22AM -0800, Chris Li wrote:
> First of all, for an RFC series, please include RFC in each patch subject as well.
> 
> For the real patch submission, please consider splitting it into smaller
> chunks with incremental milestones.
> Only introduce the functions needed for each milestone; don't define them
> all together and then use them in later patches.
> 
> See some feedback as follows.
> 
> This patch is too big, to be continued.
> 
> Chris

Sure, I will take care of it and will do better on the real patch submissions.

> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 966f7c1a0128..1224029620ed 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -283,6 +283,10 @@ struct mem_cgroup {
> >         /* per-memcg mm_struct list */
> >         struct lru_gen_mm_list mm_list;
> >  #endif
> > +#ifdef CONFIG_SWAP_TIER
> 
> I think we don't need this CONFIG_SWAP_TIER. Making it part of the
> default swap is fine.
> By default the memory.swap.tiers is empty and matches the previous
> swap behavior. The user doesn't need to do anything if they are not
> using swap tiers. I see no reason to have a separate config option.

Okay, I will make it a default kernel option.

> > +       int tiers_onoff;
> > +       int tiers_mask;
> > +#endif
> >
> >  #ifdef CONFIG_MEMCG_V1
> >         /* Legacy consumer-oriented counters */
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 90fa27bb7796..8911eff9d37a 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -271,6 +271,9 @@ struct swap_info_struct {
> >         struct percpu_ref users;        /* indicate and keep swap device valid. */
> >         unsigned long   flags;          /* SWP_USED etc: see above */
> >         signed short    prio;           /* swap priority of this type */
> > +#ifdef CONFIG_SWAP_TIER
> > +       int tier_idx;
> > +#endif
> >         struct plist_node list;         /* entry in swap_active_head */
> >         signed char     type;           /* strange name for an index */
> >         unsigned int    max;            /* extent of the swap_map */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index e1fb11f36ede..78173ffe65d6 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -163,6 +163,19 @@ endmenu
> >
> >  endif
> >
> > +config SWAP_TIER
> Same, I think we can remove the SWAP_TIER, just turn it on when swap is enabled.

Ack

> > diff --git a/mm/swap.h b/mm/swap.h
> > index d034c13d8dd2..b116282690c8 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -16,6 +16,10 @@ extern int page_cluster;
> >  #define swap_entry_order(order)        0
> >  #endif
> >
> > +#define DEF_SWAP_PRIO  -1
> > +
> > +extern spinlock_t swap_lock;
> > +extern struct plist_head swap_active_head;
> >  extern struct swap_info_struct *swap_info[];
> >
> >  /*
> > diff --git a/mm/swap_tier.c b/mm/swap_tier.c
> > new file mode 100644
> > index 000000000000..4301e3c766b9
> > --- /dev/null
> > +++ b/mm/swap_tier.c
> > @@ -0,0 +1,602 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/swap.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/memcontrol.h>
> > +#include <linux/plist.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/sort.h>
> > +
> > +#include "swap_tier.h"
> > +
> > +/*
> > + * struct swap_tier - Structure representing a swap tier.
> > + *
> > + * @name:              Name of the swap_tier.
> > + * @prio_st:           Starting value of priority.
> > + * @prio_end:          Ending value of priority.
> > + * @list:              Linked list of active tiers.
> > + * @inactive_list:     Linked list of inactive tiers.
> > + * @kref:              Reference count (held by swap device and memory cgroup).
> > + */
> > +static struct swap_tier {
> > +       char name[MAX_TIERNAME];
> > +       short prio_st;
> > +       short prio_end;
> 
> I see that a lot of the complexity of the code comes from this priority
> range having both a start and an end.
> We should be able to keep just one of the two, e.g. the high end of the
> priority, then keep all tiers in a sorted list or array and use the
> next tier's priority to indicate the other end of the current tier.

After checking my code, I have decided to remove one of st and end.

> > +       union {
> > +               struct plist_node list;
> > +               struct list_head inactive_list;
> > +       };
> 
> Is this the list of swapfiles?
> Why a union? How does it indicate which field of the union is valid?

It is the swap_tier itself. The 'list' maintains active tiers, and
'inactive_list' maintains inactive tiers. A tier exists on either
'list' or 'inactive_list', and the code ensures that it is always on
exactly one of them.
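
To illustrate the invariant, the activation path looks roughly like
this (a simplified sketch, not the actual patch code):

	/* Called under swap_tier_lock. Exactly one union member is live,
	 * keyed off the list the tier was last added to. */
	static void swap_tier_activate(struct swap_tier *tier, short prio)
	{
		list_del(&tier->inactive_list);     /* leave the inactive state */
		plist_node_init(&tier->list, prio); /* reinitialize the union */
		plist_add(&tier->list, &swap_tier_active_list);
	}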

> > +
> > +/* swap_tiers initialization */
> > +void swap_tiers_init(void)
> > +{
> > +       struct swap_tier *tier;
> > +       int idx;
> > +
> > +       BUILD_BUG_ON(BITS_PER_TYPE(int) < MAX_SWAPTIER * 2);
> > +
> > +       for_each_tier(tier, idx) {
> > +               if (idx < SWAP_TIER_RESERVED_CNT) {
> > +                       /* for display first */
> > +                       plist_node_init(&tier->list, -SHRT_MAX);
> > +                       plist_add(&tier->list, &swap_tier_active_list);
> > +                       kref_init(&tier->ref);
> > +               } else {
> > +                       INIT_LIST_HEAD(&tier->inactive_list);
> > +                       list_add_tail(&tier->inactive_list, &swap_tier_inactive_list);
> > +               }
> > +       }
> > +
> > +       strscpy(swap_tiers[SWAP_TIER_DEFAULT].name, DEFAULT_TIER_NAME);
> 
> The default tier is not a real tier. It shouldn't show up in the
> swap_tiers array.
> The default tier is only a wildcard for memory.swap.tiers to select
> which tiers to turn on/off. It is a wildcard pattern, not an
> actual tier.

Yeah, as I commented in the previous mail, I will change it to a
logical concept.

> > +void swap_tiers_show_memcg(struct seq_file *m, struct mem_cgroup *memcg)
> > +{
> > +       spin_lock(&swap_tier_lock);
> > +       if (memcg->tiers_onoff)
> > +               swap_tier_show_mask(m, memcg->tiers_onoff);
> > +       else
> > +               seq_puts(m, "\n");
> > +       swap_tier_show_mask(m, swap_tier_collect_mask(memcg));
> > +       spin_unlock(&swap_tier_lock);
> > +}
> > +
> > +void swap_tiers_assign(struct swap_info_struct *swp)
> > +{
> > +       struct swap_tier *tier;
> > +
> > +       spin_lock(&swap_tier_lock);
> > +       swp->tier_idx = NULL_TIER;
> > +
> > +       for_each_active_tier(tier) {
> > +               if (swap_tier_is_default(tier))
> > +                       continue;
> > +               if (swap_tier_prio_in_range(tier, swp->prio)) {
> > +                       swp->tier_idx = TIER_IDX(tier);
> > +                       swap_tier_get(tier);
> > +                       break;
> > +               }
> > +       }
> > +       spin_unlock(&swap_tier_lock);
> > +}
> > +
> > +void swap_tiers_release(struct swap_info_struct *swp)
> > +{
> > +       spin_lock(&swap_tier_lock);
> > +       if (swp->tier_idx != NULL_TIER)
> > +               swap_tier_put(&swap_tiers[swp->tier_idx]);
> > +       spin_unlock(&swap_tier_lock);
> > +}
> > +
> > +/* not incremental, but reset. */
> > +int swap_tiers_get_mask(struct tiers_desc *desc, int nr, struct mem_cgroup *memcg)
> 
> Who is using this function? I can't find the user of this function in
> this patch.
> Please introduce this function together with the patch that uses it.
> Don't introduce a function without a user.

swapoff calls swap_tiers_release. I must improve the readability of my
RFC patch series.

> 
> > +{
> > +       struct swap_tier *tier;
> > +       int ret = 0;
> > +       int tiers_mask = 0;
> > +       int tiers_onoff = 0;
> > +       int cnt = 0;
> > +
> > +       for (int i = 0; i < nr; i++) {
> > +               for (int j = i+1; j < nr; j++) {
> > +                       if (!strcmp(desc[i].name, desc[j].name))
> 
> These two nested for loops look suspicious. Again, because I don't see

This is for ensuring unique tier name input.
I will think about how to simplify it.
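
One idea is to drop the O(n^2) scan and catch duplicates with a seen
bitmap while resolving the names under swap_tier_lock; a sketch, where
swap_tier_lookup() is a hypothetical name-to-index helper:

	int seen = 0;

	for (int i = 0; i < nr; i++) {
		int idx = swap_tier_lookup(desc[i].name);

		/* Reject unknown names and duplicates in one pass. */
		if (idx < 0 || (seen & TIER_MASK(idx, TIER_FULL_MASK)))
			return -EINVAL;
		seen |= TIER_MASK(idx, TIER_FULL_MASK);
	}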

> the caller, I can't reason about what they are doing here.

It is called when a swap tier is designated through the memcg
interface. I will introduce this logic together with the caller
implementation.

Best Regards,
Youngjun Park


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem
  2025-11-12 14:44   ` Chris Li
@ 2025-11-13  4:07     ` YoungJun Park
  0 siblings, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-11-13  4:07 UTC (permalink / raw)
  To: Chris Li
  Cc: akpm, linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

On Wed, Nov 12, 2025 at 06:44:23AM -0800, Chris Li wrote:
> It seems you should introduce the tiers first. Allow users to define tiers.
> Then the follow up patches use tiers defined here.
> 
> The patch order seems reversed to me.
> 
> See some feedback below, to be continued.
> 
> Chris

Ack. As I already mentioned, I will change it properly :D

> On Sun, Nov 9, 2025 at 4:50 AM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > Integrate the swap tier infrastructure into the existing swap subsystem
> > to enable selective swap device usage based on tier configuration.
> >
> > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > ---
> >  mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++
> >  mm/page_io.c    | 21 ++++++++++-
> >  mm/swap_state.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c   | 15 ++++++--
> >  4 files changed, 194 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index bfc986da3289..33c7cc069754 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -68,6 +68,7 @@
> >  #include <net/ip.h>
> >  #include "slab.h"
> >  #include "memcontrol-v1.h"
> > +#include "swap_tier.h"
> >
> >  #include <linux/uaccess.h>
> >
> > @@ -3730,6 +3731,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
> >  {
> >         lru_gen_exit_memcg(memcg);
> >         memcg_wb_domain_exit(memcg);
> > +       swap_tiers_put_mask(memcg);
> >         __mem_cgroup_free(memcg);
> >  }
> >
> > @@ -3842,6 +3844,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >                 page_counter_init(&memcg->kmem, &parent->kmem, false);
> >                 page_counter_init(&memcg->tcpmem, &parent->tcpmem, false);
> >  #endif
> > +#ifdef CONFIG_SWAP_TIER
> > +               memcg->tiers_mask = 0;
> > +               memcg->tiers_onoff = 0;
> > +#endif
> > +
> >         } else {
> >                 init_memcg_stats();
> >                 init_memcg_events();
> > @@ -3850,6 +3857,10 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >  #ifdef CONFIG_MEMCG_V1
> >                 page_counter_init(&memcg->kmem, NULL, false);
> >                 page_counter_init(&memcg->tcpmem, NULL, false);
> > +#endif
> > +#ifdef CONFIG_SWAP_TIER
> Again, don't need this config.

Ack.

> > +               memcg->tiers_mask = DEFAULT_FULL_MASK;
>
> Is this memcg->tiers_mask a cached value after evaluating the
> swap.tiers onoff list by looking up the parent?

It is updated when configured through the memcg interface.
The tiers_onoff field represents which tiers are turned on or off,
and tiers_mask is the mask that includes both on and off bits for
those tiers. This mask is used in the swap_tier_collect_mask logic to
avoid recalculating it every time.
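
To make the encoding concrete with the macros from the patch: a cgroup
configured with "+SSD -HDD" (say SSD is tier index 1 and HDD is tier
index 2) ends up with

	tiers_onoff = TIER_MASK(1, TIER_ON_MASK) | TIER_MASK(2, TIER_OFF_MASK);
	tiers_mask  = TIER_MASK(1, TIER_FULL_MASK) | TIER_MASK(2, TIER_FULL_MASK);

i.e. tiers_onoff records the direction and tiers_mask records which
2-bit lanes were configured at all.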

> I was thinking of starting with always evaluating the tiers_mask. Then
> we don't need to store it here. How do you indicate the tiers_mask is
> out of date?

swap_tier_collect_mask does not cache it. The effective mask is
calculated at swap I/O time. It only changes through the user
interface.

From your mention of “evaluation,” I understand you are referring to
a dynamically computed mask that depends on the parent’s settings.
However, in my implementation, tiers_mask is simply the mask of the
selected tiers. tiers_onoff indicates on/off state, and tiers_mask
represents the full mask (both on and off bits) needed for
swap_tier_collect_mask calculation. Therefore, tiers_mask can be
derived from tiers_onoff and could potentially be removed.
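
If tiers_mask were removed, the derivation would just widen each 2-bit
lane of tiers_onoff; a sketch (it relies on at most one of the two bits
being set per tier, which the setter enforces):

	static int tiers_full_mask(int onoff)
	{
		unsigned int on  = onoff & 0x55555555; /* ON bits, even positions */
		unsigned int off = onoff & 0xAAAAAAAA; /* OFF bits, odd positions */

		/* Every configured lane widens to TIER_FULL_MASK. */
		return on | (on << 1) | off | (off >> 1);
	}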

> > +               memcg->tiers_onoff = DEFAULT_ON_MASK;
> >  #endif
> >                 root_mem_cgroup = memcg;
> >                 return &memcg->css;
> > @@ -5390,6 +5401,56 @@ static int swap_events_show(struct seq_file *m, void *v)
> >         return 0;
> >  }
> >
> > +#ifdef CONFIG_SWAP_TIER
> > +static int swap_tier_show(struct seq_file *m, void *v)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> > +
> > +       swap_tiers_show_memcg(m, memcg);
> > +       return 0;
> > +}
> > +
> > +static ssize_t swap_tier_write(struct kernfs_open_file *of,
> > +                               char *buf, size_t nbytes, loff_t off)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +       struct tiers_desc desc[MAX_SWAPTIER] = {};
> > +       char *pos = buf, *token;
> > +       int nr = 0;
> > +       int ret;
> > +
> > +       while ((token = strsep(&pos, " \t\n")) != NULL) {
>
> Not allowing plain space " "? Comparing the pointer against NULL is redundant.

You mean to just allow " "?

I will change it to:
while ((token = strsep(&pos, " "))) {

> > +               if (!*token)
> > +                       continue;
> > +
> > +               if (nr >= MAX_SWAPTIER)
> > +                       return -E2BIG;
> > +
> > +               if (token[0] != '+' && token[0] != '-')
> > +                       return -EINVAL;
> > +
> > +               desc[nr].ops = (token[0] == '+') ? TIER_ON_MASK : TIER_OFF_MASK;
> > +
> > +               if (strlen(token) <= 1) {
> > +                       strscpy(desc[nr].name, DEFAULT_TIER_NAME);
> > +                       nr++;
> > +                       continue;
> > +               }
> > +
> > +               if (strscpy(desc[nr].name, token + 1, MAX_TIERNAME) < 0)
> > +                       return -EINVAL;
> > +
> > +               nr++;
> I don't think you need this nr; you will reach the end of the
> string anyway. What if the user specifies the same tier more than
> once? It is not optimal, but the kernel should accept it.
>
> OK, I see what is going on now: this whole desc thing can be greatly
> simplified. You shouldn't need to maintain desc[nr]; that desc
> array is the onoff mask in my mind. You just need to keep the tier
> bits in order.
>
> Notice that in memory.swap.tiers, except for the default tier pattern,
> which is always the first one if it exists, the order of the + and -
> tiers does not matter. You look up the tier name to find its tier mask
> bit and just set the onoff bits for that tier.

The desc array is currently used as an
intermediate structure before applying the bitmask in swap_tier.c.
but as you mentioned, it might be unnecessary. I will review this
part to see if it can be simplified.

Best Regards,
Youngjun Park


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster
  2025-11-09 12:49 ` [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
@ 2025-11-13  6:07   ` Kairui Song
  2025-11-13 11:45     ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Kairui Song @ 2025-11-13  6:07 UTC (permalink / raw)
  To: Youngjun Park
  Cc: akpm, linux-mm, cgroups, linux-kernel, chrisl, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

On Sun, Nov 9, 2025 at 8:54 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
> allocation fast path").
>
> Because in the newly introduced swap tiers, the global percpu cluster
> will cause two issues:
> 1) it will cause caching oscillation in the same order of different si
>    if two different memcg can only be allowed to access different si and
>    both of them are swapping out.
> 2) It can cause priority inversion on swap devices. Imagine a case where
>    there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B
>    and A is higher priority device. While memcg2 can only access si B.
>    Then memcg 2 could write the global percpu cluster with si B, then
>    memcg1 take si B in fast path even though si A is not exhausted.
>
> Hence in order to support swap tier, revert commit 1b7e90020eb7 to use
> each swap device's percpu cluster.
>
> Co-developed-by: Baoquan He <bhe@redhat.com>
> Suggested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>

Hi Youngjun, Baoquan, Thanks for the work on the percpu cluster thing.

It would be better if you could provide some benchmark results, since
the whole point of the global percpu cluster is to improve performance
and get rid of the swap slot cache.

I'm fine with a small regression, but we had better be aware of it. And
we can still figure out some other ways to optimize it, e.g. I remember
Chris once mentioned the idea of having a per-device slot cache that
is different from the original slot cache (swap_slot.c): the allocator
would be aware of it, so it would be much cleaner.

> ---
>  include/linux/swap.h |  13 +++-
>  mm/swapfile.c        | 151 +++++++++++++------------------------------
>  2 files changed, 56 insertions(+), 108 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 38ca3df68716..90fa27bb7796 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -250,10 +250,17 @@ enum {
>  #endif
>
>  /*
> - * We keep using same cluster for rotational device so IO will be sequential.
> - * The purpose is to optimize SWAP throughput on these device.
> + * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> + * its own cluster and swapout sequentially. The purpose is to optimize swapout
> + * throughput.
>   */
> +struct percpu_cluster {
> +       local_lock_t lock; /* Protect the percpu_cluster above */

I think you mean "below"?

> +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> +};
> +
>  struct swap_sequential_cluster {
> +       spinlock_t lock; /* Serialize usage of global cluster */
>         unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>
> @@ -278,8 +285,8 @@ struct swap_info_struct {
>                                         /* list of cluster that are fragmented or contented */
>         unsigned int pages;             /* total of usable pages of swap */
>         atomic_long_t inuse_pages;      /* number of those currently in use */
> +       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
>         struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
> -       spinlock_t global_cluster_lock; /* Serialize usage of global cluster */
>         struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>         struct block_device *bdev;      /* swap device or bdev of swap file */
>         struct file *swap_file;         /* seldom referenced */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 44eb6a6e5800..3bb77c9a4879 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -118,18 +118,6 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
>
>  atomic_t nr_rotate_swap = ATOMIC_INIT(0);
>
> -struct percpu_swap_cluster {
> -       struct swap_info_struct *si[SWAP_NR_ORDERS];
> -       unsigned long offset[SWAP_NR_ORDERS];
> -       local_lock_t lock;
> -};
> -
> -static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
> -       .si = { NULL },
> -       .offset = { SWAP_ENTRY_INVALID },
> -       .lock = INIT_LOCAL_LOCK(),
> -};
> -
>  /* May return NULL on invalid type, caller must check for NULL return */
>  static struct swap_info_struct *swap_type_to_info(int type)
>  {
> @@ -497,7 +485,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>          * Swap allocator uses percpu clusters and holds the local lock.
>          */
>         lockdep_assert_held(&ci->lock);
> -       lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
> +       lockdep_assert_held(&this_cpu_ptr(si->percpu_cluster)->lock);
>
>         /* The cluster must be free and was just isolated from the free list. */
>         VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci));
> @@ -515,8 +503,8 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>          */
>         spin_unlock(&ci->lock);
>         if (!(si->flags & SWP_SOLIDSTATE))
> -               spin_unlock(&si->global_cluster_lock);
> -       local_unlock(&percpu_swap_cluster.lock);
> +               spin_unlock(&si->global_cluster->lock);
> +       local_unlock(&si->percpu_cluster->lock);
>
>         table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | GFP_KERNEL);
>
> @@ -528,9 +516,9 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>          * could happen with ignoring the percpu cluster is fragmentation,
>          * which is acceptable since this fallback and race is rare.
>          */
> -       local_lock(&percpu_swap_cluster.lock);
> +       local_lock(&si->percpu_cluster->lock);
>         if (!(si->flags & SWP_SOLIDSTATE))
> -               spin_lock(&si->global_cluster_lock);
> +               spin_lock(&si->global_cluster->lock);
>         spin_lock(&ci->lock);
>
>         /* Nothing except this helper should touch a dangling empty cluster. */
> @@ -642,7 +630,7 @@ static bool swap_do_scheduled_discard(struct swap_info_struct *si)
>                 ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
>                 /*
>                  * Delete the cluster from list to prepare for discard, but keep
> -                * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be
> +                * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster
>                  * pointing to it, or ran into by relocate_cluster.
>                  */
>                 list_del(&ci->list);
> @@ -939,12 +927,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  out:
>         relocate_cluster(si, ci);
>         swap_cluster_unlock(ci);
> -       if (si->flags & SWP_SOLIDSTATE) {
> -               this_cpu_write(percpu_swap_cluster.offset[order], next);
> -               this_cpu_write(percpu_swap_cluster.si[order], si);
> -       } else {
> +       if (si->flags & SWP_SOLIDSTATE)
> +               this_cpu_write(si->percpu_cluster->next[order], next);
> +       else
>                 si->global_cluster->next[order] = next;
> -       }
> +
>         return found;
>  }
>
> @@ -1037,13 +1024,17 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         if (order && !(si->flags & SWP_BLKDEV))
>                 return 0;
>
> -       if (!(si->flags & SWP_SOLIDSTATE)) {
> +       if (si->flags & SWP_SOLIDSTATE) {
> +               /* Fast path using per CPU cluster */
> +               local_lock(&si->percpu_cluster->lock);
> +               offset = __this_cpu_read(si->percpu_cluster->next[order]);
> +       } else {
>                 /* Serialize HDD SWAP allocation for each device. */
> -               spin_lock(&si->global_cluster_lock);
> +               spin_lock(&si->global_cluster->lock);
>                 offset = si->global_cluster->next[order];
> -               if (offset == SWAP_ENTRY_INVALID)
> -                       goto new_cluster;
> +       }
>
> +       if (offset != SWAP_ENTRY_INVALID) {
>                 ci = swap_cluster_lock(si, offset);
>                 /* Cluster could have been used by another order */
>                 if (cluster_is_usable(ci, order)) {
> @@ -1058,7 +1049,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         goto done;
>         }
>
> -new_cluster:
>         /*
>          * If the device need discard, prefer new cluster over nonfull
>          * to spread out the writes.
> @@ -1121,8 +1111,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         goto done;
>         }
>  done:
> -       if (!(si->flags & SWP_SOLIDSTATE))
> -               spin_unlock(&si->global_cluster_lock);
> +       if (si->flags & SWP_SOLIDSTATE)
> +               local_unlock(&si->percpu_cluster->lock);
> +       else
> +               spin_unlock(&si->global_cluster->lock);
>
>         return found;
>  }
> @@ -1303,43 +1295,8 @@ static bool get_swap_device_info(struct swap_info_struct *si)
>         return true;
>  }
>
> -/*
> - * Fast path try to get swap entries with specified order from current
> - * CPU's swap entry pool (a cluster).
> - */
> -static bool swap_alloc_fast(swp_entry_t *entry,
> -                           int order)
> -{
> -       struct swap_cluster_info *ci;
> -       struct swap_info_struct *si;
> -       unsigned int offset, found = SWAP_ENTRY_INVALID;
> -
> -       /*
> -        * Once allocated, swap_info_struct will never be completely freed,
> -        * so checking it's liveness by get_swap_device_info is enough.
> -        */
> -       si = this_cpu_read(percpu_swap_cluster.si[order]);
> -       offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> -       if (!si || !offset || !get_swap_device_info(si))
> -               return false;
> -
> -       ci = swap_cluster_lock(si, offset);
> -       if (cluster_is_usable(ci, order)) {
> -               if (cluster_is_empty(ci))
> -                       offset = cluster_offset(si, ci);
> -               found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> -               if (found)
> -                       *entry = swp_entry(si->type, found);
> -       } else {
> -               swap_cluster_unlock(ci);
> -       }
> -
> -       put_swap_device(si);
> -       return !!found;
> -}
> -
>  /* Rotate the device and switch to a new cluster */
> -static bool swap_alloc_slow(swp_entry_t *entry,
> +static void swap_alloc_entry(swp_entry_t *entry,
>                             int order)

It seems you also changed the rotation rule here, so every allocation
of any order causes a swap device rotation? Before 1b7e90020eb7,
every 64 allocations caused a rotation, as we had the slot cache
(swap_slot.c). The global cluster makes the rotation happen once per
cluster, so the overhead is even lower on average. But now a per
allocation rotation seems a rather high overhead and may cause serious
fragmentation.
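
If the revert stays, one mitigation could be to batch the rotation
again instead of rotating on every allocation; a sketch (the percpu
counter and the batch size of 64 are arbitrary):

	static DEFINE_PER_CPU(unsigned int, swap_alloc_count);

	/* In swap_alloc_entry(), requeue only every 64th allocation
	 * instead of on each one. */
	if (!(this_cpu_inc_return(swap_alloc_count) % 64))
		plist_requeue(&si->avail_list, &swap_avail_head);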


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster
  2025-11-13  6:07   ` Kairui Song
@ 2025-11-13 11:45     ` YoungJun Park
  2025-11-14  1:05       ` Baoquan He
  0 siblings, 1 reply; 25+ messages in thread
From: YoungJun Park @ 2025-11-13 11:45 UTC (permalink / raw)
  To: Kairui Song
  Cc: akpm, linux-mm, cgroups, linux-kernel, chrisl, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	bhe, baohua, gunho.lee, taejoon.song

On Thu, Nov 13, 2025 at 02:07:59PM +0800, Kairui Song wrote:
> On Sun, Nov 9, 2025 at 8:54 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
> > allocation fast path").
> >
> > Because in the newly introduced swap tiers, the global percpu cluster
> > will cause two issues:
> > 1) it will cause caching oscillation in the same order of different si
> >    if two different memcg can only be allowed to access different si and
> >    both of them are swapping out.
> > 2) It can cause priority inversion on swap devices. Imagine a case where
> >    there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B
> >    and A is higher priority device. While memcg2 can only access si B.
> >    Then memcg 2 could write the global percpu cluster with si B, then
> >    memcg1 take si B in fast path even though si A is not exhausted.
> >
> > Hence in order to support swap tier, revert commit 1b7e90020eb7 to use
> > each swap device's percpu cluster.
> >
> > Co-developed-by: Baoquan He <bhe@redhat.com>
> > Suggested-by: Kairui Song <kasong@tencent.com>
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
>
> Hi Youngjun, Baoquan, Thanks for the work on the percpu cluster thing.

Hello Kairui,

> It will be better if you can provide some benchmark result since the
> whole point of global percpu cluster is to improve the performance and
> get rid of the swap slot cache.

After the RFC stage, I will try to prepare benchmark results.

> I'm fine with a small regression but we better be aware of it. And we
> can still figure out some other ways to optimize it. e.g. I remember
> Chris once mentioned an idea of having a per device slot cache, that
> is different from the original slot cache (swap_slot.c): the allocator
> will be aware of it so it will be much cleaner.

Ack, we will work on better optimization.

> > ---
> >  include/linux/swap.h |  13 +++-
> >  mm/swapfile.c        | 151 +++++++++++++------------------------------
> >  2 files changed, 56 insertions(+), 108 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 38ca3df68716..90fa27bb7796 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -250,10 +250,17 @@ enum {
> >  #endif
> >
> >  /*
> > - * We keep using same cluster for rotational device so IO will be sequential.
> > - * The purpose is to optimize SWAP throughput on these device.
> > + * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> > + * its own cluster and swapout sequentially. The purpose is to optimize swapout
> > + * throughput.
> >   */
> > +struct percpu_cluster {
> > +       local_lock_t lock; /* Protect the percpu_cluster above */
>
> I think you mean "below"?

This comment was originally written this way in the earlier code, and it
seems to refer to the percpu_cluster structure itself rather than the
fields below. But I agree it's a bit ambiguous. I'll just remove this
comment since the structure name is self-explanatory. Or change it to below. :)

> > +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > +};
> > +
> >
> > -/*
> > - * Fast path try to get swap entries with specified order from current
> > - * CPU's swap entry pool (a cluster).
> > - */
> > -static bool swap_alloc_fast(swp_entry_t *entry,
> > -                           int order)
> > -{
> > -       struct swap_cluster_info *ci;
> > -       struct swap_info_struct *si;
> > -       unsigned int offset, found = SWAP_ENTRY_INVALID;
> > -
> > -       /*
> > -        * Once allocated, swap_info_struct will never be completely freed,
> > -        * so checking it's liveness by get_swap_device_info is enough.
> > -        */
> > -       si = this_cpu_read(percpu_swap_cluster.si[order]);
> > -       offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> > -       if (!si || !offset || !get_swap_device_info(si))
> > -               return false;
> > -
> > -       ci = swap_cluster_lock(si, offset);
> > -       if (cluster_is_usable(ci, order)) {
> > -               if (cluster_is_empty(ci))
> > -                       offset = cluster_offset(si, ci);
> > -               found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> > -               if (found)
> > -                       *entry = swp_entry(si->type, found);
> > -       } else {
> > -               swap_cluster_unlock(ci);
> > -       }
> > -
> > -       put_swap_device(si);
> > -       return !!found;
> > -}
> > -
> >  /* Rotate the device and switch to a new cluster */
> > -static bool swap_alloc_slow(swp_entry_t *entry,
> > +static void swap_alloc_entry(swp_entry_t *entry,
> >                             int order)
>
> It seems you also changed the rotation rule here, so every allocation
> of any order causes a swap device rotation? Before 1b7e90020eb7,
> every 64 allocations caused a rotation, as we had the slot cache
> (swap_slot.c). The global cluster makes the rotation happen once per
> cluster, so the overhead is even lower on average. But now a per
> allocation rotation seems a rather high overhead and may cause serious
> fragmentation.

Yeah... The rotation rule has indeed changed. I remember the
discussion about rotation behavior:
https://lore.kernel.org/linux-mm/aPc3lmbJEVTXoV6h@yjaykim-PowerEdge-T330/

After that discussion, I've been thinking about the rotation.
Currently, the requeue happens after every priority list traversal, and this logic
is easily affected by changes.
When the rotation logic changes, the behavior change sometimes goes
unmentioned (as you noted for commit 1b7e90020eb7).

I'd like to share some ideas and hear your thoughts:

1. Getting rid of the same priority requeue rule
   - same priority devices get priority - 1 or + 1 after requeue
     (more add or remove as needed to handle any overlapping priority appropriately)

2. Requeue only when a new cluster is allocated
   - Instead of requeueing after every priority list traversal, we
     requeue only when a cluster is fully used
   - This might have some performance impact, but the rotation behavior
     would be similar to the existing one (though slightly different due
     to synchronization and logic processing changes)

Going further with these approaches, if we remove the requeue mechanism
entirely, we could potentially reduce synchronization overhead during
plist traversal. (degrade the lock)
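
To make idea 2 concrete, here is a minimal user-space sketch (names like
swap_dev and requeue_tail(), and the 256-slot cluster, are made up for
illustration, not the kernel's):

    #include <stdbool.h>
    #include <stdio.h>

    struct swap_dev {
        const char *name;
        int free_in_cluster;  /* slots left in the current cluster */
    };

    /* stand-in for plist_requeue() under swap_avail_lock */
    static void requeue_tail(struct swap_dev *dev)
    {
        printf("requeue %s among same-priority devices\n", dev->name);
    }

    static bool alloc_slot(struct swap_dev *dev)
    {
        if (--dev->free_in_cluster > 0)
            return true;
        dev->free_in_cluster = 256;  /* grab a fresh cluster */
        requeue_tail(dev);           /* rotate once per cluster */
        return true;
    }

    int main(void)
    {
        struct swap_dev a = { "sda", 256 };

        for (int i = 0; i < 1024; i++)  /* only 4 requeues for 1024 allocs */
            alloc_slot(&a);
        return 0;
    }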

I'm curious what you think about these possibilities?

Youngjun Park


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster
  2025-11-13 11:45     ` YoungJun Park
@ 2025-11-14  1:05       ` Baoquan He
  2025-11-14 15:52         ` Kairui Song
  0 siblings, 1 reply; 25+ messages in thread
From: Baoquan He @ 2025-11-14  1:05 UTC (permalink / raw)
  To: YoungJun Park
  Cc: Kairui Song, akpm, linux-mm, cgroups, linux-kernel, chrisl,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baohua, gunho.lee, taejoon.song

On 11/13/25 at 08:45pm, YoungJun Park wrote:
> On Thu, Nov 13, 2025 at 02:07:59PM +0800, Kairui Song wrote:
> > On Sun, Nov 9, 2025 at 8:54 PM Youngjun Park <youngjun.park@lge.com> wrote:
> > >
> > > This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
> > > allocation fast path").
> > >
> > > Because in the newly introduced swap tiers, the global percpu cluster
> > > will cause two issues:
> > > 1) it will cause caching oscillation in the same order of different si
> > >    if two different memcg can only be allowed to access different si and
> > >    both of them are swapping out.
> > > 2) It can cause priority inversion on swap devices. Imagine a case where
> > >    there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B
> > >    and A is higher priority device. While memcg2 can only access si B.
> > >    Then memcg 2 could write the global percpu cluster with si B, then
> > >    memcg1 take si B in fast path even though si A is not exhausted.
> > >
> > > Hence in order to support swap tier, revert commit 1b7e90020eb7 to use
> > > each swap device's percpu cluster.
> > >
> > > Co-developed-by: Baoquan He <bhe@redhat.com>
> > > Suggested-by: Kairui Song <kasong@tencent.com>
> > > Signed-off-by: Baoquan He <bhe@redhat.com>
> > > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> >
> > Hi Youngjun, Baoquan, Thanks for the work on the percpu cluster thing.
> 
> Hello Kairui,
> 
> > It will be better if you can provide some benchmark result since the
> > whole point of global percpu cluster is to improve the performance and
> > get rid of the swap slot cache.
> 
> After RFC stage,
> I will try to prepare benchmark results.
> 
> > I'm fine with a small regression but we better be aware of it. And we
> > can still figure out some other ways to optimize it. e.g. I remember
> > Chris once mentioned an idea of having a per device slot cache, that
> > is different from the original slot cache (swap_slot.c): the allocator
> > will be aware of it so it will be much cleaner.
> 
> Ack, we will work on better optimization.
> 
> > > ---
> > >  include/linux/swap.h |  13 +++-
> > >  mm/swapfile.c        | 151 +++++++++++++------------------------------
> > >  2 files changed, 56 insertions(+), 108 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 38ca3df68716..90fa27bb7796 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -250,10 +250,17 @@ enum {
> > >  #endif
> > >
> > >  /*
> > > - * We keep using same cluster for rotational device so IO will be sequential.
> > > - * The purpose is to optimize SWAP throughput on these device.
> > > + * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> > > + * its own cluster and swapout sequentially. The purpose is to optimize swapout
> > > + * throughput.
> > >   */
> > > +struct percpu_cluster {
> > > +       local_lock_t lock; /* Protect the percpu_cluster above */
> >
> > I think you mean "below"?
> 
> This comment was originally written this way in the earlier code, and it
> seems to refer to the percpu_cluster structure itself rather than the
> fields below. But I agree it's a bit ambiguous. I'll just remove this
> comment since the structure name is self-explanatory. Or change it to below. :)
> 
> > > +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > +};
> > > +
> > >
> > > -/*
> > > - * Fast path try to get swap entries with specified order from current
> > > - * CPU's swap entry pool (a cluster).
> > > - */
> > > -static bool swap_alloc_fast(swp_entry_t *entry,
> > > -                           int order)
> > > -{
> > > -       struct swap_cluster_info *ci;
> > > -       struct swap_info_struct *si;
> > > -       unsigned int offset, found = SWAP_ENTRY_INVALID;
> > > -
> > > -       /*
> > > -        * Once allocated, swap_info_struct will never be completely freed,
> > > -        * so checking it's liveness by get_swap_device_info is enough.
> > > -        */
> > > -       si = this_cpu_read(percpu_swap_cluster.si[order]);
> > > -       offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> > > -       if (!si || !offset || !get_swap_device_info(si))
> > > -               return false;
> > > -
> > > -       ci = swap_cluster_lock(si, offset);
> > > -       if (cluster_is_usable(ci, order)) {
> > > -               if (cluster_is_empty(ci))
> > > -                       offset = cluster_offset(si, ci);
> > > -               found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> > > -               if (found)
> > > -                       *entry = swp_entry(si->type, found);
> > > -       } else {
> > > -               swap_cluster_unlock(ci);
> > > -       }
> > > -
> > > -       put_swap_device(si);
> > > -       return !!found;
> > > -}
> > > -
> > >  /* Rotate the device and switch to a new cluster */
> > > -static bool swap_alloc_slow(swp_entry_t *entry,
> > > +static void swap_alloc_entry(swp_entry_t *entry,
> > >                             int order)
> >
> > It seems you also changed the rotation rule here, so every allocation
> > of any order causes a swap device rotation? Before 1b7e90020eb7,
> > every 64 allocations caused a rotation, as we had the slot cache
> > (swap_slot.c). The global cluster makes the rotation happen once per
> > cluster, so the overhead is even lower on average. But now a per
> > allocation rotation seems a rather high overhead and may cause serious
> > fragmentation.
> 
> Yeah... The rotation rule has indeed changed. I remember the
> discussion about rotation behavior:
> https://lore.kernel.org/linux-mm/aPc3lmbJEVTXoV6h@yjaykim-PowerEdge-T330/
> 
> After that discussion, I've been thinking about the rotation.
> Currently, the requeue happens after every priority list traversal, and this logic
> is easily affected by changes.
> When the rotation logic changes, the behavior change sometimes goes
> unmentioned (as you noted for commit 1b7e90020eb7).
> 
> I'd like to share some ideas and hear your thoughts:
> 
> 1. Getting rid of the same priority requeue rule
>    - same priority devices get priority - 1 or + 1 after requeue
>      (more add or remove as needed to handle any overlapping priority appropriately)
> 
> 2. Requeue only when a new cluster is allocated
>    - Instead of requeueing after every priority list traversal, we
>      requeue only when a cluster is fully used
>    - This might have some performance impact, but the rotation behavior
>      would be similar to the existing one (though slightly different due
>      to synchronization and logic processing changes)

2) sounds better to me, and the logic and code change is simpler.
> 
> Going further with these approaches, if we remove the requeue mechanism
> entirely, we could potentially reduce synchronization overhead during
> plist traversal. (degrade the lock)

Removing requeue may change behaviour. Swap devices of the same priority
should be taken round robin.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster
  2025-11-14  1:05       ` Baoquan He
@ 2025-11-14 15:52         ` Kairui Song
  2025-11-15  9:28           ` YoungJun Park
  0 siblings, 1 reply; 25+ messages in thread
From: Kairui Song @ 2025-11-14 15:52 UTC (permalink / raw)
  To: Baoquan He, YoungJun Park
  Cc: akpm, linux-mm, cgroups, linux-kernel, chrisl, hannes, mhocko,
	roman.gushchin, shakeel.butt, muchun.song, shikemeng, nphamcs,
	baohua, gunho.lee, taejoon.song

On Fri, Nov 14, 2025 at 9:05 AM Baoquan He <bhe@redhat.com> wrote:
> On 11/13/25 at 08:45pm, YoungJun Park wrote:
> > On Thu, Nov 13, 2025 at 02:07:59PM +0800, Kairui Song wrote:
> > > On Sun, Nov 9, 2025 at 8:54 PM Youngjun Park <youngjun.park@lge.com> wrote:
> > > >
> > > > This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
> > > > allocation fast path").
> > > >
> > > > Because in the newly introduced swap tiers, the global percpu cluster
> > > > will cause two issues:
> > > > 1) it will cause caching oscillation in the same order of different si
> > > >    if two different memcg can only be allowed to access different si and
> > > >    both of them are swapping out.
> > > > 2) It can cause priority inversion on swap devices. Imagine a case where
> > > >    there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B
> > > >    and A is higher priority device. While memcg2 can only access si B.
> > > >    Then memcg 2 could write the global percpu cluster with si B, then
> > > >    memcg1 take si B in fast path even though si A is not exhausted.
> > > >
> > > > Hence in order to support swap tier, revert commit 1b7e90020eb7 to use
> > > > each swap device's percpu cluster.
> > > >
> > > > Co-developed-by: Baoquan He <bhe@redhat.com>
> > > > Suggested-by: Kairui Song <kasong@tencent.com>
> > > > Signed-off-by: Baoquan He <bhe@redhat.com>
> > > > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > >
> > > Hi Youngjun, Baoquan, Thanks for the work on the percpu cluster thing.
> >
> > Hello Kairui,

...

> >
> > Yeah... The rotation rule has indeed changed. I remember the
> > discussion about rotation behavior:
> > https://lore.kernel.org/linux-mm/aPc3lmbJEVTXoV6h@yjaykim-PowerEdge-T330/
> >
> > After that discussion, I've been thinking about the rotation.
> > Currently, the requeue happens after every priority list traversal, and this logic
> > is easily affected by changes.
> > When the rotation logic changes, the behavior change sometimes goes
> > unmentioned (as you noted for commit 1b7e90020eb7).
> >
> > I'd like to share some ideas and hear your thoughts:
> >
> > 1. Getting rid of the same priority requeue rule
> >    - same priority devices get priority - 1 or + 1 after requeue
> >      (more add or remove as needed to handle any overlapping priority appropriately)
> >
> > 2. Requeue only when a new cluster is allocated
> >    - Instead of requeueing after every priority list traversal, we
> >      requeue only when a cluster is fully used
> >    - This might have some performance impact, but the rotation behavior
> >      would be similar to the existing one (though slightly different due
> >      to synchronization and logic processing changes)
>
> 2) sounds better to me, and the logic and code change is simpler.
>
> Removing requeue may change behaviour. Swap devices of the same priority
> should be taken round robin.

I agree. We definitely need balancing between devices of the same
priority, cluster based rotation seems good enough.

And I'm thinking if we can have a better rotation mechanism? Maybe
plist isn't the best way to do rotation if we want to minimize the
cost of rotation.
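
One shape that could take (purely illustrative, not kernel code): a node
per priority holding a circular list of same-priority devices, so a
rotation is a single cursor advance instead of a plist_requeue() walk:

    #include <stdio.h>

    struct pnode {
        int prio;
        int ndev;
        const char *dev[4];  /* same-priority devices */
        int cur;             /* round-robin cursor */
    };

    /* O(1) rotation: just advance the cursor */
    static const char *pick_and_rotate(struct pnode *p)
    {
        const char *d = p->dev[p->cur];

        p->cur = (p->cur + 1) % p->ndev;
        return d;
    }

    int main(void)
    {
        struct pnode n = { .prio = 100, .ndev = 2,
                           .dev = { "sda", "sdb" } };

        for (int i = 0; i < 4; i++)
            printf("cluster %d -> %s\n", i, pick_and_rotate(&n));
        return 0;
    }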


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-09 12:49 [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
                   ` (3 preceding siblings ...)
  2025-11-12 13:34 ` [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
@ 2025-11-15  1:22 ` SeongJae Park
  2025-11-15  9:44   ` YoungJun Park
  2025-11-15 15:13   ` Chris Li
  4 siblings, 2 replies; 25+ messages in thread
From: SeongJae Park @ 2025-11-15  1:22 UTC (permalink / raw)
  To: Youngjun Park
  Cc: SeongJae Park, akpm, linux-mm, cgroups, linux-kernel, chrisl,
	kasong, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, shikemeng, nphamcs, bhe, baohua, gunho.lee,
	taejoon.song

On Sun,  9 Nov 2025 21:49:44 +0900 Youngjun Park <youngjun.park@lge.com> wrote:

> Hi all,
> 
> In constrained environments, there is a need to improve workload
> performance by controlling swap device usage on a per-process or
> per-cgroup basis. For example, one might want to direct critical
> processes to faster swap devices (like SSDs) while relegating
> less critical ones to slower devices (like HDDs or Network Swap).
> 
> Initial approach was to introduce a per-cgroup swap priority
> mechanism [1]. However, through review and discussion, several
> drawbacks were identified:
> 
> a. There is a lack of concrete use cases for assigning a fine-grained,
>    unique swap priority to each cgroup. 
> b. The implementation complexity was high relative to the desired
>    level of control.
> c. Differing swap priorities between cgroups could lead to LRU
>    inversion problems.
> 
> To address these concerns, I propose the "swap tiers" concept, 
> originally suggested by Chris Li [2] and further developed through 
> collaborative discussions. I would like to thank Chris Li and 
> He Baoquan for their invaluable contributions in refining this 
> approach, and Kairui Song, Nhat Pham, and Michal Koutný for their 
> insightful reviews of earlier RFC versions.

I think the tiers concept is a nice abstraction.  I'm also interested in how
the in-kernel control mechanism will deal with tiers management, which is not
always simple.  I'll try to take time to read this series thoroughly.  Thank
you for sharing this nice work!

Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
the goal (control of swap device to use).  For example, extending existing
proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such
extension for MADV_PAGEOUT may be challenging, but it might be doable for
memory.reclaim and DAMOS_PAGEOUT.  Have you considered these kinds of options?
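
For instance (hypothetical illustration only: the cgroup path and the
"swap_tier=" key below are invented and do not exist in memory.reclaim
today):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/workload/memory.reclaim", "w");

        if (!f)
            return 1;
        /* reclaim 512M from this cgroup, hypothetically steering any
         * swapout to a named tier; only the "512M" part is real today */
        fprintf(f, "512M swap_tier=SSD");
        fclose(f);
        return 0;
    }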


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster
  2025-11-14 15:52         ` Kairui Song
@ 2025-11-15  9:28           ` YoungJun Park
  0 siblings, 0 replies; 25+ messages in thread
From: YoungJun Park @ 2025-11-15  9:28 UTC (permalink / raw)
  To: Kairui Song
  Cc: Baoquan He, akpm, linux-mm, cgroups, linux-kernel, chrisl,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baohua, gunho.lee, taejoon.song

On Fri, Nov 14, 2025 at 11:52:25PM +0800, Kairui Song wrote:
> On Fri, Nov 14, 2025 at 9:05 AM Baoquan He <bhe@redhat.com> wrote:
> > On 11/13/25 at 08:45pm, YoungJun Park wrote:
> > > On Thu, Nov 13, 2025 at 02:07:59PM +0800, Kairui Song wrote:
> > > > On Sun, Nov 9, 2025 at 8:54 PM Youngjun Park <youngjun.park@lge.com> wrote:
> > > > >
> > > > > This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
> > > > > allocation fast path").
> > > > >
> > > > > Because in the newly introduced swap tiers, the global percpu cluster
> > > > > will cause two issues:
> > > > > 1) it will cause caching oscillation in the same order of different si
> > > > >    if two different memcg can only be allowed to access different si and
> > > > >    both of them are swapping out.
> > > > > 2) It can cause priority inversion on swap devices. Imagine a case where
> > > > >    there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B
> > > > >    and A is higher priority device. While memcg2 can only access si B.
> > > > >    Then memcg 2 could write the global percpu cluster with si B, then
> > > > >    memcg1 take si B in fast path even though si A is not exhausted.
> > > > >
> > > > > Hence in order to support swap tier, revert commit 1b7e90020eb7 to use
> > > > > each swap device's percpu cluster.
> > > > >
> > > > > Co-developed-by: Baoquan He <bhe@redhat.com>
> > > > > Suggested-by: Kairui Song <kasong@tencent.com>
> > > > > Signed-off-by: Baoquan He <bhe@redhat.com>
> > > > > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > > >
> > > > Hi Youngjun, Baoquan, Thanks for the work on the percpu cluster thing.
> > >
> > > Hello Kairui,
> 
> ...
> 
> > >
> > > Yeah... The rotation rule has indeed changed. I remember the
> > > discussion about rotation behavior:
> > > https://lore.kernel.org/linux-mm/aPc3lmbJEVTXoV6h@yjaykim-PowerEdge-T330/
> > >
> > > After that discussion, I've been thinking about the rotation.
> > > Currently, the requeue happens after every priority list traversal, and this logic
> > > is easily affected by changes.
> > > When the rotation logic changes, the behavior change sometimes goes
> > > unmentioned (as you noted for commit 1b7e90020eb7).
> > >
> > > I'd like to share some ideas and hear your thoughts:
> > >
> > > 1. Getting rid of the same priority requeue rule
> > >    - same priority devices get priority - 1 or + 1 after requeue
> > >      (more add or remove as needed to handle any overlapping priority appropriately)
> > >
> > > 2. Requeue only when a new cluster is allocated
> > >    - Instead of requeueing after every priority list traversal, we
> > >      requeue only when a cluster is fully used
> > >    - This might have some performance impact, but the rotation behavior
> > >      would be similar to the existing one (though slightly different due
> > >      to synchronization and logic processing changes)
> >
> > 2) sounds better to me, and the logic and code change is simpler.
> >
> > Removing requeue may change behaviour. Swap devices of the same priority
> > should be taken round robin.
> 
> I agree. We definitely need balancing between devices of the same
> priority, cluster based rotation seems good enough.

Hello Kairui, Baoquan.
Thanks for your feedback. 

Okay, I will try to keep the current rotation logic working in the next
patch iteration.

Based on what Kairui suggested previously, we can keep the per-cpu si
cache alive. (However, since it could pick an si from unselected tiers,
it would have to exist per tier, per cpu.)

Or, following the current code structure, we could also consider
requeueing while holding swap_avail_lock when the cluster is consumed.
 
> And I'm thinking if we can have a better rotation mechanism? Maybe
> plist isn't the best way to do rotation if we want to minimize the
> cost of rotation.

I did some more ideation. (Although this may be a workable way, it is a
next-step idea; like I said, just ideation.)

I've been thinking about the inefficiencies of plist_requeue during
rotation, and of the plist_for_each_entry traversal structure itself.
There is also a small problem: we can end up selecting a lower priority
swap device while traversing the list, even when a higher priority swap
device gets inserted into the plist.

So anyway, here is my thinking...

- On the read side (alloc_swap_entry), manage it so only one swap
  device is obtained when selecting a swap device (grabbing a
  read_lock). The swap selection logic would not make any
  behavior-affecting change the way the current approach does; it would
  just look at the swap devices.

- On the write side, handle it appropriately using plist or some
  improved data structure (grabbing a write_lock).

- For rotation, instead of placing a plist per swap device, we could
  create something like a priority node. In this priority node
  structure, entries would be rotated each time a cluster is fully used.

- Also, with tiers introduced, since we only need to traverse the
  selected tier for each I/O, the current single swap_avail_list may
  not be suitable anymore. This could be changed to a per-tier
  structure.
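
A rough user-space sketch of that read/write split (all names are made
up; a pthread rwlock stands in for the degraded swap_avail_lock):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t avail_lock = PTHREAD_RWLOCK_INITIALIZER;
    static const char *devs[] = { "sda", "sdb" };
    static int head;  /* device the read side should take */

    /* read side: selection only looks at the devices, no requeue */
    static const char *select_dev(void)
    {
        pthread_rwlock_rdlock(&avail_lock);
        const char *d = devs[head];
        pthread_rwlock_unlock(&avail_lock);
        return d;
    }

    /* write side: rotate only when a cluster is fully used */
    static void on_cluster_exhausted(void)
    {
        pthread_rwlock_wrlock(&avail_lock);
        head = (head + 1) % 2;
        pthread_rwlock_unlock(&avail_lock);
    }

    int main(void)
    {
        printf("%s\n", select_dev());
        on_cluster_exhausted();
        printf("%s\n", select_dev());
        return 0;
    }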


Thanks,
YoungJun 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-15  1:22 ` SeongJae Park
@ 2025-11-15  9:44   ` YoungJun Park
  2025-11-15 16:56     ` SeongJae Park
  2025-11-15 15:13   ` Chris Li
  1 sibling, 1 reply; 25+ messages in thread
From: YoungJun Park @ 2025-11-15  9:44 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, linux-mm, cgroups, linux-kernel, chrisl, kasong, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, shikemeng,
	nphamcs, bhe, baohua, gunho.lee, taejoon.song

On Fri, Nov 14, 2025 at 05:22:45PM -0800, SeongJae Park wrote:
> On Sun,  9 Nov 2025 21:49:44 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> 
> > Hi all,
> > 
> > In constrained environments, there is a need to improve workload
> > performance by controlling swap device usage on a per-process or
> > per-cgroup basis. For example, one might want to direct critical
> > processes to faster swap devices (like SSDs) while relegating
> > less critical ones to slower devices (like HDDs or Network Swap).
> > 
> > Initial approach was to introduce a per-cgroup swap priority
> > mechanism [1]. However, through review and discussion, several
> > drawbacks were identified:
> > 
> > a. There is a lack of concrete use cases for assigning a fine-grained,
> >    unique swap priority to each cgroup. 
> > b. The implementation complexity was high relative to the desired
> >    level of control.
> > c. Differing swap priorities between cgroups could lead to LRU
> >    inversion problems.
> > 
> > To address these concerns, I propose the "swap tiers" concept, 
> > originally suggested by Chris Li [2] and further developed through 
> > collaborative discussions. I would like to thank Chris Li and 
> > He Baoquan for their invaluable contributions in refining this 
> > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their 
> > insightful reviews of earlier RFC versions.
> 
> I think the tiers concept is a nice abstraction.  I'm also interested in how
> the in-kernel control mechanism will deal with tiers management, which is not
> always simple.  I'll try to take time to read this series thoroughly.  Thank
> you for sharing this nice work!

Hi SeongJae,

Thank you for your feedback and interest in the swap tiers concept.
I appreciate your willingness to review this series.

> Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> the goal (control of swap device to use).  For example, extending existing
> proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such
> extension for MADV_PAGEOUT may be challenging, but it might be doable for
> memory.reclaim and DAMOS_PAGEOUT.  Have you considered these kinds of options?

Regarding your question about simpler approaches using memory.reclaim,
MADV_PAGEOUT, or DAMOS_PAGEOUT with swap device specification - I've
looked into this perspective after reading your comments. This approach
would indeed be one way to enable per-process swap device selection
from a broader standpoint.

However, for our use case, per-process granularity feels too fine-grained,
which is why we've been focusing more on the cgroup-based approach.

That said, if we were to aggressively consider the per-process approach
as well in the future, I'm thinking about how we might integrate it with
the tier concept (not just individual swap devices). During discussions with Chris Li, we also talked about
potentially tying this to per-VMA control (see the discussion at
https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/).
This concept could go beyond just selection at the cgroup layer.

Thanks,
YoungJun


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-15  1:22 ` SeongJae Park
  2025-11-15  9:44   ` YoungJun Park
@ 2025-11-15 15:13   ` Chris Li
  2025-11-15 17:24     ` SeongJae Park
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Li @ 2025-11-15 15:13 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Youngjun Park, akpm, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song

On Fri, Nov 14, 2025 at 5:22 PM SeongJae Park <sj@kernel.org> wrote:
>
> On Sun,  9 Nov 2025 21:49:44 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
>
> > Hi all,
> >
> > In constrained environments, there is a need to improve workload
> > performance by controlling swap device usage on a per-process or
> > per-cgroup basis. For example, one might want to direct critical
> > processes to faster swap devices (like SSDs) while relegating
> > less critical ones to slower devices (like HDDs or Network Swap).
> >
> > Initial approach was to introduce a per-cgroup swap priority
> > mechanism [1]. However, through review and discussion, several
> > drawbacks were identified:
> >
> > a. There is a lack of concrete use cases for assigning a fine-grained,
> >    unique swap priority to each cgroup.
> > b. The implementation complexity was high relative to the desired
> >    level of control.
> > c. Differing swap priorities between cgroups could lead to LRU
> >    inversion problems.
> >
> > To address these concerns, I propose the "swap tiers" concept,
> > originally suggested by Chris Li [2] and further developed through
> > collaborative discussions. I would like to thank Chris Li and
> > He Baoquan for their invaluable contributions in refining this
> > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their
> > insightful reviews of earlier RFC versions.
>
> I think the tiers concept is a nice abstraction.  I'm also interested in how
> the in-kernel control mechanism will deal with tiers management, which is not
> always simple.  I'll try to take time to read this series thoroughly.  Thank
> you for sharing this nice work!

Thank you for your interest. Please keep in mind that this patch
series is RFC. I suspect the current series will go through a lot of
overhaul before it gets merged in. I predict the end result will
likely have less than half of the code resemble what it is in the
series right  now.

> Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> the goal (control of swap device to use).  For example, extending existing

Simplicity is one of my primary design principles. The current design
is close to the simplest within the design constraints.

> proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such

In my mind that is a later phase. No, a per-VMA swapfile is not simpler
to use, nor is the API simpler to code. There are many more VMAs than
memcgs in the system, not even the same order of magnitude. It is a
higher burden for both user space and the kernel to maintain all the
per-VMA mappings. The VMA and mmap paths are much more complex to hack.
Doing it on the memcg level as the first step is the right approach.

> extension for MADV_PAGEOUT may be challenging, but it might be doable for
> memory.reclaim and DAMOS_PAGEOUT.  Have you considered these kinds of options?

Yes, as YoungJun points out, that has been considered here, but in a
later phase. Borrow the link in his email here:
https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/

Chris


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-15  9:44   ` YoungJun Park
@ 2025-11-15 16:56     ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2025-11-15 16:56 UTC (permalink / raw)
  To: YoungJun Park
  Cc: SeongJae Park, akpm, linux-mm, cgroups, linux-kernel, chrisl,
	kasong, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, shikemeng, nphamcs, bhe, baohua, gunho.lee,
	taejoon.song

On Sat, 15 Nov 2025 18:44:56 +0900 YoungJun Park <youngjun.park@lge.com> wrote:

> On Fri, Nov 14, 2025 at 05:22:45PM -0800, SeongJae Park wrote:
> > On Sun,  9 Nov 2025 21:49:44 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> > 
> > > Hi all,
> > > 
> > > In constrained environments, there is a need to improve workload
> > > performance by controlling swap device usage on a per-process or
> > > per-cgroup basis. For example, one might want to direct critical
> > > processes to faster swap devices (like SSDs) while relegating
> > > less critical ones to slower devices (like HDDs or Network Swap).
> > > 
> > > Initial approach was to introduce a per-cgroup swap priority
> > > mechanism [1]. However, through review and discussion, several
> > > drawbacks were identified:
> > > 
> > > a. There is a lack of concrete use cases for assigning a fine-grained,
> > >    unique swap priority to each cgroup. 
> > > b. The implementation complexity was high relative to the desired
> > >    level of control.
> > > c. Differing swap priorities between cgroups could lead to LRU
> > >    inversion problems.
> > > 
> > > To address these concerns, I propose the "swap tiers" concept, 
> > > originally suggested by Chris Li [2] and further developed through 
> > > collaborative discussions. I would like to thank Chris Li and 
> > > He Baoquan for their invaluable contributions in refining this 
> > > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their 
> > > insightful reviews of earlier RFC versions.
> > 
> > I think the tiers concept is a nice abstraction.  I'm also interested in how
> > the in-kernel control mechanism will deal with tiers management, which is not
> > always simple.  I'll try to take time to read this series thoroughly.  Thank
> > you for sharing this nice work!
> 
> Hi SeongJae,
> 
> Thank you for your feedback and interest in the swap tiers concept
> I appreciate your willingness to review this series.
> 
> Regarding your question about simpler approaches using memory.reclaim,
> MADV_PAGEOUT, or DAMOS_PAGEOUT with swap device specification - I've
> looked into this perspective after reading your comments. This approach
> would indeed be one way to enable per-process swap device selection
> from a broader standpoint.
> 
> > Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> > the goal (control of swap device to use).  For example, extending existing
> > proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> > DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such
> > extension for MADV_PAGEOUT may be challenging, but it might be doable for
> > memory.reclaim and DAMOS_PAGEOUT.  Have you considered these kinds of options?
> 
> Regarding your question about simpler approaches using memory.reclaim,
> MADV_PAGEOUT, or DAMOS_PAGEOUT with swap device specification - I've
> looked into this perspective after reading your comments. This approach
> would indeed be one way to enable per-process swap device selection
> from a broader standpoint.
> 
> However, for our use case, per-process granularity feels too fine-grained,
> which is why we've been focusing more on the cgroup-based approach.

Thank you for kindly sharing your opinion.  That all makes sense.  Nonetheless,
I think the limitation is only for MADV_PAGEOUT.

MADV_PAGEOUT would indeed have a limitation when applying it at the cgroup
level.  In case of memory.reclaim and DAMOS_PAGEOUT, however, I think it can
work at the cgroup level, since memory.reclaim exists per cgroup, and
DAMOS_PAGEOUT has knobs for cgroup level controls, including cgroup based
DAMOS filters and the per-node per-cgroup memory usage based DAMOS quota
goal.  Also, if needed for swap tiers, extending DAMOS seems doable, from my
perspective.

> 
> That said, if we were to aggressively consider the per-process approach
> as well in the future, I'm thinking about how we might integrate it with
> the tier concept(not just indivisual swap device). During discussions with Chris Li, we also talked about
> potentially tying this to per-VMA control (see the discussion at
> https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/).
> This concept could go beyond just selection at the cgroup layer.

Sounds interesting.  I once thought extending DAMOS for vma level control
(e.g., asking some DAMOS actions to target only vmas of specific names) could
be useful.  I have no real plan to do that at the moment due to
the absence of expected usage.  But if that could be used for swap tiers, I
would be happy to help.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-15 15:13   ` Chris Li
@ 2025-11-15 17:24     ` SeongJae Park
  2025-11-17 22:17       ` Chris Li
  0 siblings, 1 reply; 25+ messages in thread
From: SeongJae Park @ 2025-11-15 17:24 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Youngjun Park, akpm, linux-mm, cgroups,
	linux-kernel, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, nphamcs, bhe, baohua,
	gunho.lee, taejoon.song

On Sat, 15 Nov 2025 07:13:49 -0800 Chris Li <chrisl@kernel.org> wrote:

> On Fri, Nov 14, 2025 at 5:22 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Sun,  9 Nov 2025 21:49:44 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > > Hi all,
> > >
> > > In constrained environments, there is a need to improve workload
> > > performance by controlling swap device usage on a per-process or
> > > per-cgroup basis. For example, one might want to direct critical
> > > processes to faster swap devices (like SSDs) while relegating
> > > less critical ones to slower devices (like HDDs or Network Swap).
> > >
> > > Initial approach was to introduce a per-cgroup swap priority
> > > mechanism [1]. However, through review and discussion, several
> > > drawbacks were identified:
> > >
> > > a. There is a lack of concrete use cases for assigning a fine-grained,
> > >    unique swap priority to each cgroup.
> > > b. The implementation complexity was high relative to the desired
> > >    level of control.
> > > c. Differing swap priorities between cgroups could lead to LRU
> > >    inversion problems.
> > >
> > > To address these concerns, I propose the "swap tiers" concept,
> > > originally suggested by Chris Li [2] and further developed through
> > > collaborative discussions. I would like to thank Chris Li and
> > > He Baoquan for their invaluable contributions in refining this
> > > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their
> > > insightful reviews of earlier RFC versions.
> >
> > I think the tiers concept is a nice abstraction.  I'm also interested in how
> > the in-kernel control mechanism will deal with tiers management, which is not
> > always simple.  I'll try to take time to read this series thoroughly.  Thank
> > you for sharing this nice work!
> 
> Thank you for your interest. Please keep in mind that this patch
> series is RFC. I suspect the current series will go through a lot of
> overhaul before it gets merged in. I predict the end result will
> likely have less than half of the code resemble what it is in the
> series right  now.

Sure, I believe this work will greatly evolve :)

> 
> > Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> > the goal (control of swap device to use).  For example, extending existing
> Simplicity is one of my primary design principles. The current design
> is close to the simplest within the design constraints.

I agree the concept is very simple.  But, I was thinking there _could_ be
complexity for its implementation and required changes to existing code.
Especially I'm curious about how the control logic for tiers management would
be implemented in a simple but optimum and flexible way.  Hence I was lazily
thinking: what if we just let users do the control?

I'm not saying the tiers approach's control part implementation will be, or
is, complex or suboptimum.  I haven't read this series thoroughly yet.

Even if it is at the moment, as you pointed out, I believe it will evolve to a
simple and optimum one.  That's why I am willing to try to get time for reading
this series and learn from it, and contribute back to the evolution if I find
something :)

> 
> > proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> > DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such
> 
> In my mind that is a later phase. No, per VMA swapfile is not simpler
> to use, nor is the API simpler to code. There are much more VMA than
> memcg in the system, no even the same magnitude. It is a higher burden
> for both user space and kernel to maintain all the per VMA mapping.
> The VMA and mmap path is much more complex to hack. Doing it on the
> memcg level as the first step is the right approach.
> 
> > extension for MADV_PAGEOUT may be challenging, but it might be doable for
> > memory.reclaim and DAMOS_PAGEOUT.  Have you considered these kinds of options?
> 
> Yes, as YoungJun points out, that has been considered here, but in a
> later phase. Borrow the link in his email here:
> https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/

Thank you for kindly sharing your opinion and previous discussion!  I
understand you believe sub-cgroup (e.g., vma level) control of swap tiers can
be useful, but there is no expected use case, and you are concerned about its
complexity in terms of implementation and interface.  That all makes sense to
me.

Nonetheless, I'm not talking about sub-cgroup control.  As I also replied [1] to
Youngjun, a memory.reclaim and DAMOS_PAGEOUT based extension would work at the
cgroup level.  And from my humble perspective, doing the extension could be
doable, at least for DAMOS_PAGEOUT.

Hmm, I feel like my mail might be read like I'm suggesting you to use
DAMOS_PAGEOUT.  The decision is yours and I will respect it, of course.  I'm
saying this though, because I am uncautiously but definitely biased as DAMON
maintainer. ;)  Again, the decision is yours and I will respect it.

[1] https://lore.kernel.org/20251115165637.82966-1-sj@kernel.org


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-15 17:24     ` SeongJae Park
@ 2025-11-17 22:17       ` Chris Li
  2025-11-18  1:11         ` SeongJae Park
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Li @ 2025-11-17 22:17 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Youngjun Park, akpm, linux-mm, cgroups, linux-kernel, kasong,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, bhe, baohua, gunho.lee, taejoon.song

On Sat, Nov 15, 2025 at 9:24 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Sat, 15 Nov 2025 07:13:49 -0800 Chris Li <chrisl@kernel.org> wrote:
> > Thank you for your interest. Please keep in mind that this patch
> > series is RFC. I suspect the current series will go through a lot of
> > overhaul before it gets merged in. I predict the end result will
> > likely have less than half of the code resemble what it is in the
> > series right  now.
>
> Sure, I believe this work will greatly evolve :)

Yes, we can use any eyes that can help to review or spot bugs.

> > > Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> > > the goal (control of swap device to use).  For example, extending existing
> > Simplicity is one of my primary design principles. The current design
> > is close to the simplest within the design constraints.
>
> I agree the concept is very simple.  But, I was thinking there _could_ be
> complexity for its implementation and required changes to existing code.
> Especially I'm curious about how the control logic for tiers management would
> be implemented in a simple but optimum and flexible way.  Hence I was lazily
> thinking: what if we just let users do the control?

The selection of the swap device will be at the swap allocator. The
good news is that we just rewrote the whole swap allocator, so it is an
easier code base for us to work with than the previous swap allocator.
I can't imagine how to implement swap file selection on the previous
allocator; I am just glad that I don't need to worry about it.

Some feedback on the madvise API that selects one specific device.
That might sound simple, because you only need to remember one swap
file. However, the less than ideal part is that you are pinned to one
swap file: if that swap file is full, you are stuck. If that swap file
has been swapped off, you are stuck.

I believe that allowing selection of a tier class, e.g. a QoS aspect
of the swap latency expectation, is a better fit for what the user
really wants to do. So I see selecting a swapfile vs a swap tier as a
separate issue from how to select the swap device (madvise vs
memory.swap.tiers). Your argument is that selecting a tier is more
complex than selecting a swap file directly. I agree from an
implementation point of view. However the tiers offer better
flexibility and free users from swapfile pinning, e.g. round robin on
a few swap files of the same tier is better than pinning to one swap
file. That has been proven by Baoquan's benchmark tests.
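
To illustrate (illustrative only; tier_mask, the device table and the
helper are invented names, not this series' API): a memcg carries a
bitmask of allowed tiers, and allocation skips devices outside it,
round-robining within what remains:

    #include <stdio.h>

    struct dev { const char *name; int tier; };  /* 0 = SSD, 1 = HDD */

    static const char *alloc_from(const struct dev *d, int n,
                                  unsigned int tier_mask, int *rr)
    {
        for (int i = 0; i < n; i++) {
            const struct dev *c = &d[(*rr + i) % n];

            if (tier_mask & (1u << c->tier)) {
                *rr = (*rr + i + 1) % n;  /* round robin */
                return c->name;
            }
        }
        return NULL;  /* no allowed device */
    }

    int main(void)
    {
        struct dev devs[] = { {"ssd0", 0}, {"ssd1", 0}, {"hdd0", 1} };
        int rr = 0;

        /* this memcg is allowed only the SSD tier */
        for (int i = 0; i < 4; i++)
            printf("%s\n", alloc_from(devs, 3, 1u << 0, &rr));
        return 0;
    }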

Another piece of feedback is that user space isn't the primary one to
perform swap out via madvise PAGEOUT. A lot of swap happens due to the
cgroup memory usage hitting the memory cgroup limit, which triggers the
swap out from the memory cgroup that hit the limit. That is an existing
usage case and we need to select which swap file there anyway. If we
extend madvise for per-swapfile selection, that is a question that must
have an answer for native swap out (by the kernel, not madvise) anyway.
I can see that user space wants to set the POLICY about a VMA: if it
ever gets swapped out, what speed of swap file does it go to. That is a
follow up after we have the swapfile selection at the memory cgroup
level.

> I'm not saying the tiers approach's control part implementation will be, or
> is, complex or suboptimum.  I haven't read this series thoroughly yet.
>
> Even if it is at the moment, as you pointed out, I believe it will evolve to a
> simple and optimum one.  That's why I am willing to try to get time for reading
> this series and learn from it, and contribute back to the evolution if I find
> something :)
>
> >
> > > proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> > > DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such
> >
> > In my mind that is a later phase. No, a per-VMA swapfile is not simpler
> > to use, nor is the API simpler to code. There are many more VMAs than
> > memcgs in the system, not even the same order of magnitude. It is a
> > higher burden for both user space and the kernel to maintain all the
> > per-VMA mappings. The VMA and mmap paths are much more complex to hack.
> > Doing it on the memcg level as the first step is the right approach.
> >
> > > extension for MADV_PAGEOUT may be challenging, but it might be doable for
> > > memory.reclaim and DAMOS_PAGEOUT.  Have you considered these kinds of options?
> >
> > Yes, as YoungJun points out, that has been considered here, but in a
> > later phase. Borrow the link in his email here:
> > https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/
>
> Thank you for kindly sharing your opinion and previous discussion!  I
> understand you believe sub-cgroup (e.g., vma level) control of swap tiers can
> be useful, but there is no expected use case, and you concern about its
> complexity in terms of implementation and interface.  That all makes sense to
> me.

There is a usage request from Android wanting to protect some VMAs from
ever getting swapped out to slower tiers. Otherwise it can cause
jankiness. Still, I consider the cgroup swap file selection the more
common one.

> Nonetheless, I'm not talking about sub-cgroup control.  As I also replied [1] to
> Youngjun, a memory.reclaim and DAMOS_PAGEOUT based extension would work at the
> cgroup level.  And from my humble perspective, doing the extension could be
> doable, at least for DAMOS_PAGEOUT.

I would do it one thing at a time and start from the mem cgroup level
swap file selection e.g. "memory.swap.tiers". However, if you are
passionate about VMA level swap file selection, please feel free to
submit patches for it.

> Hmm, I feel like my mail might be read like I'm suggesting you to use
> DAMOS_PAGEOUT.  The decision is yours and I will respect it, of course.  I'm
> saying this though, because I am uncautiously but definitely biased as DAMON
> maintainer. ;)  Again, the decision is yours and I will respect it.
>
> [1] https://lore.kernel.org/20251115165637.82966-1-sj@kernel.org

Sorry I haven't read much about the DAMOS_PAGEOUT yet. After reading
the above thread, I still don't feel I have a good sense of
DAMOS_PAGEOUT. Who is the actual user that requested that feature and
what is the typical usage work flow and life cycle? BTW, I am still
considering the per VMA swap policy should happen after the
memory.swap.tiers given my current understanding.

Chris


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
  2025-11-17 22:17       ` Chris Li
@ 2025-11-18  1:11         ` SeongJae Park
  0 siblings, 0 replies; 25+ messages in thread
From: SeongJae Park @ 2025-11-18  1:11 UTC (permalink / raw)
  To: Chris Li
  Cc: SeongJae Park, Youngjun Park, akpm, linux-mm, cgroups,
	linux-kernel, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, shikemeng, nphamcs, bhe, baohua,
	gunho.lee, taejoon.song

On Mon, 17 Nov 2025 14:17:43 -0800 Chris Li <chrisl@kernel.org> wrote:

TLDR: my idea was only for proactive reclaim use cases, while you want to cover
reactive swap use cases.  I agree my idea cannot work for that.

I added more detailed comments below.

> On Sat, Nov 15, 2025 at 9:24 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Sat, 15 Nov 2025 07:13:49 -0800 Chris Li <chrisl@kernel.org> wrote:
> > > Thank you for your interest. Please keep in mind that this patch
> > > series is RFC. I suspect the current series will go through a lot of
> > > overhaul before it gets merged in. I predict the end result will
> > > likely have less than half of the code resemble what it is in the
> > > series right  now.
> >
> > Sure, I believe this work will greatly evolve :)
> 
> Yes, we can use any eyes that can help to review or spot bugs.
> 
> > > > Nevertheless, I'm curious if there are simpler and more flexible ways to achieve
> > > > the goal (control of swap device to use).  For example, extending existing
> > > Simplicity is one of my primary design principles. The current design
> > > is close to the simplest within the design constraints.
> >
> > I agree the concept is very simple.  But, I was thinking there _could_ be
> > complexity for its implementation and required changes to existing code.
> > Especially I'm curious about how the control logic for tiers management would
> > be implemented in a simple but optimum and flexible way.  Hence I was lazily
> > thinking: what if we just let users do the control?
> 
> The selection of the swap device will be at the swap allocator. The
> good news is that we just rewrote the whole swap allocator, so it is an
> easier code base for us to work with than the previous swap allocator.
> I can't imagine how to implement swap file selection on the previous
> allocator; I am just glad that I don't need to worry about it.
> 
> Some feedback on the madvise API that selects one specific device.
> That might sound simple, because you only need to remember one swap
> file. However, the less than ideal part is that you are pinned to one
> swap file: if that swap file is full, you are stuck. If that swap file
> has been swapped off, you are stuck.

I agree about the problem.

My idea was, however, letting each madvise() call decide which swap device
to use.  In that case, if a swap device is full, the user may try another
swap device, so no such stuck state would happen.

And because of the madvise() interface, I was saying doing such an extension for
madvise() could be challenging, while such extensions for memory.reclaim or
DAMOS_PAGEOUT may be much more doable.

> 
> I believe that allowing selection of a tier class, e.g. a QoS aspect
> of the swap latency expectation, is a better fit for what the user
> really wants to do. So I see selecting a swapfile vs a swap tier as a
> separate issue from how to select the swap device (madvise vs
> memory.swap.tiers). Your argument is that selecting a tier is more
> complex than selecting a swap file directly. I agree from an
> implementation point of view. However the tiers offer better
> flexibility and free users from swapfile pinning, e.g. round robin on
> a few swap files of the same tier is better than pinning to one swap
> file. That has been proven by Baoquan's benchmark tests.

I agree about the problem of pinning.  Nonetheless my idea was not pinning,
but just letting users select the swap device whenever they want to swap.
That is, the user may be able to do the round robin selection.  And in some
cases, they might want, and be able, to do an advanced selection that is
optimized for their special case.

> 
> Another piece of feedback is that user space isn't the primary one to
> perform swap out via madvise PAGEOUT. A lot of swap happens due to the
> cgroup memory usage hitting the memory cgroup limit, which triggers the
> swap out from the memory cgroup that hit the limit. That is an existing
> usage case and we need to select which swap file there anyway. If we
> extend madvise for per-swapfile selection, that is a question that must
> have an answer for native swap out (by the kernel, not madvise) anyway.
> I can see that user space wants to set the POLICY about a VMA: if it
> ever gets swapped out, what speed of swap file does it go to. That is a
> follow up after we have the swapfile selection at the memory cgroup
> level.

I fully agree.  My idea is basically extending proactive reclamation features.
It cannot cover these reactive reclaim cases.

I think this perfectly answers my question!

> 
> > I'm not saying the tiers approach's control part implementation will be, or
> > is, complex or suboptimum.  I haven't read this series thoroughly yet.
> >
> > Even if it is at the moment, as you pointed out, I believe it will evolve to a
> > simple and optimum one.  That's why I am willing to try to get time for reading
> > this series and learn from it, and contribute back to the evolution if I find
> > something :)
> >
> > >
> > > > proactive pageout features, such as memory.reclaim, MADV_PAGEOUT or
> > > > DAMOS_PAGEOUT, to let users specify the swap device to use.  Doing such
> > >
> > > In my mind that is a later phase. No, a per-VMA swapfile is not simpler
> > > to use, nor is the API simpler to code. There are many more VMAs than
> > > memcgs in the system, not even the same order of magnitude. It is a
> > > higher burden for both user space and the kernel to maintain all the
> > > per-VMA mappings. The VMA and mmap paths are much more complex to hack.
> > > Doing it on the memcg level as the first step is the right approach.
> > >
> > > > extension for MADV_PAGEOUT may be challenging, but it might be doable for
> > > > memory.reclaim and DAMOS_PAGEOUT.  Have you considered these kinds of options?
> > >
> > > Yes, as YoungJun points out, that has been considered here, but in a
> > > later phase. Borrow the link in his email here:
> > > https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/
> >
> > Thank you for kindly sharing your opinion and the previous discussion!  I
> > understand you believe sub-cgroup (e.g., VMA-level) control of swap tiers
> > can be useful, but there is no expected use case yet, and you are
> > concerned about its complexity in terms of implementation and interface.
> > That all makes sense to me.
> 
> There is a usage request from Android wanting to protect some VMAs from
> ever getting swapped out to slower tiers, since otherwise it can cause
> jankiness. Still, I consider cgroup-level swap file selection the more
> common case.

Thank you for sharing this interesting usage request!  And I agree that
cgroup-level requirements would be more common.

> 
> > Nonetheless, I'm not talking about sub-cgroup control.  As I also replied
> > [1] to Youngjun, a memory.reclaim or DAMOS_PAGEOUT based extension would
> > work at the cgroup level.  And in my humble opinion, the extension could
> > be doable, at least for DAMOS_PAGEOUT.
> 
> I would do one thing at a time and start from the memory cgroup level
> swap file selection, e.g. "memory.swap.tiers".

Makes sense, please proceed on your schedule :)
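
For concreteness, and as far as I understand the RFC (the exact syntax is of
course up to your series), the per-cgroup knob would be used roughly like:

  # hypothetical usage of the proposed per-cgroup interface
  echo "+SSD" > /sys/fs/cgroup/workload/memory.swap.tiers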

> However, if you are
> passionate about VMA level swap file selection, please feel free to
> submit patches for it.

I have no plan to do that at the moment.  I just wanted to hear your opinion
on my naive ideas :)  Thank you for sharing your opinions!

> 
> > Hmm, I feel like my mail might read as if I'm suggesting you use
> > DAMOS_PAGEOUT.  The decision is yours and I will respect it, of course.
> > I'm saying this because I am admittedly, and definitely, biased as the
> > DAMON maintainer. ;)  Again, the decision is yours and I will respect it.
> >
> > [1] https://lore.kernel.org/20251115165637.82966-1-sj@kernel.org
> 
> Sorry, I haven't read much about DAMOS_PAGEOUT yet. After reading the
> above thread, I still don't feel I have a good sense of it. Who is the
> actual user that requested that feature, and what is the typical usage
> workflow and life cycle?

In short, DAMOS_PAGEOUT is a proactive reclaim mechanism.  Users can ask a
kernel thread (called kdamond) to monitor the access pattern of the system
and reclaim pages with a specific access pattern (e.g., not accessed for
>= 2 minutes) as soon as they are found.  Some users, including AWS, are
already using it this way.
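
If a concrete picture helps, the easiest entry point is the DAMON_RECLAIM
module, which wires DAMOS_PAGEOUT up as a system-wide proactive reclaimer
(parameter names as in the admin guide; min_age is in microseconds):

  # reclaim regions that have been idle for >= 2 minutes
  echo 120000000 > /sys/module/damon_reclaim/parameters/min_age
  echo Y > /sys/module/damon_reclaim/parameters/enabled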

As I mentioned above, since it is a proactive method, it wouldn't cover the
reactive reclamation use case and cannot be an alternative to your swap
tiers work.

> BTW, given my current understanding, I still think the per-VMA swap
> policy should come after memory.swap.tiers.

I have no strong opinion, and that also makes sense to me :)


Thanks,
SJ

[...]


Thread overview: 25+ messages
2025-11-09 12:49 [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Youngjun Park
2025-11-09 12:49 ` [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster Youngjun Park
2025-11-13  6:07   ` Kairui Song
2025-11-13 11:45     ` YoungJun Park
2025-11-14  1:05       ` Baoquan He
2025-11-14 15:52         ` Kairui Song
2025-11-15  9:28           ` YoungJun Park
2025-11-09 12:49 ` [PATCH 2/3] mm: swap: introduce swap tier infrastructure Youngjun Park
2025-11-12 14:20   ` Chris Li
2025-11-13  2:01     ` YoungJun Park
2025-11-09 12:49 ` [PATCH 3/3] mm/swap: integrate swap tier infrastructure into swap subsystem Youngjun Park
2025-11-10 11:40   ` kernel test robot
2025-11-10 12:12   ` kernel test robot
2025-11-10 13:26   ` kernel test robot
2025-11-12 14:44   ` Chris Li
2025-11-13  4:07     ` YoungJun Park
2025-11-12 13:34 ` [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control Chris Li
2025-11-13  1:33   ` YoungJun Park
2025-11-15  1:22 ` SeongJae Park
2025-11-15  9:44   ` YoungJun Park
2025-11-15 16:56     ` SeongJae Park
2025-11-15 15:13   ` Chris Li
2025-11-15 17:24     ` SeongJae Park
2025-11-17 22:17       ` Chris Li
2025-11-18  1:11         ` SeongJae Park
